Migrating data from one system to another is fraught with numerous difficulties. Typically, data transfer between systems is performed in one of two methods, the ELT or ETL process. In other words, you must perform the following tasks in one of two orders of operation:
As you might guess from the above definitions, ETL stands for extract, transform and load, and refers to a process where the data is transformed before being loaded into the target system. This method requires a staging area in which the transformation can take place. On the other hand, ELT is a model where the raw data from a source is loaded into a target system directly, and any transformations required by downstream systems are performed within the target system, usually a data lake that can handle multiple forms of unstructured and structured data. One could ostensibly argue that then the target system is the staging area for the downstream processes, arguably making it an extended ETL process. There is no standard way in which all systems handle and transfer said data, let alone a standard way to handle the component stages.
Here we will cover some of the key difficulties that occur when migrating data, focusing on the transformation and loading part of the ETL/ELT process, which is where Flatirons Fuse resides and can help make life easier.
Perhaps the best way to frame this discussion is to highlight some of my personal experiences navigating the difficulties inherent in the transfer of data. A few years back, working as a contractor, I was hired for a migration project involving a legacy application that had been in use for approximately seven years. This project, an ETL migration for a rewrite of the application, was originally estimated to be a three-to-six-month transition. It turned out to be an ordeal of almost two years due to the number of complexities encountered in moving over half a decade of non-standardized, mal-formatted, or seldom-used but important legacy data.
Over time, it can become a huge challenge to determine a single source of truth for a piece of data. It might be copied across multiple locations, updated in some and not others, saved in a different format for transfer to an outside organization, or any number of potential issues. You can come across tables that are not used anymore, and data that has become denormalized (stored in multiple locations, typically to improve read performance and searchability). Denormalized data is supposed to be simple duplication of a single source of truth, but over time, due to poor data management practices, the data might no longer be equivalent. Tracing the lineage of such errors to determine the actual source and value of the data can range from a daunting task to near impossibility. If the data is siloed to a particular group within an organization, the data can be nearly unrecognizable from the original, resulting in problematic edge cases all over the place.
As you might have guessed from the previous section, strong data governance practices – ie; planning, avoiding data siloes, and guaranteeing that any data duplication is updated from a single source (ideally in an automated fashion), can go a long way in mitigating the chances of a disastrous data import. But you don’t always have the luxury of strong data governance in migration, particularly if more than one organization is involved. In a merger situation, for instance, or even the consolidation of data for several teams within an organization, the best practices of one side are sure to differ from those of the other.
This does not mean that the practices are necessarily better or worse, just that they are unequal. Conflict is unavoidable if the methods used, data sources, or formats are not the same. Whether consolidating data for use across multiple teams in a single organization or attempting to reconcile the data from two entirely different organizations, or even just a customer onboarding process, any migration progress must first begin with discovery. Discovery means a deep dive into the data, determining each source, determining the processes that might change the data, and tracing the data from source to destination, along with any transformations that occur.
In the discovery process, both sides must agree as to which sources are to be trusted and how to reconcile any conflicting data. Without this baseline, the migration process will be a harrowing experience.
Once you’ve gotten to know the data to be migrated, it becomes time to plan how to get it there. Planning and communication are key.
It is not as simple as dumping the data from one database to another and trusting SQL to sort out the differences. This is in fact a rather commonly seen mistake, usually as an initial attempt. The problem with this is that this is essentially an attempt at an ETL process without adequate support for the transformation part of the process.
For example, you might be merging very similar data from a CRM to an ERP system. Similar data runs through both systems, but perhaps with granular differences between them – there might not be a common key between one table on the CRM and another table on the other system. Simply merging the two together with a data dump and a SQL query might corrupt data irreparably on the target side, particularly if data is being over-written or appended to existing fields.
Standardizing on a single format, whether it be JSON, CSV, or another format, is not a solution in itself, because there are numerous considerations that must be handled specifically to each. For a successful data migration to occur, you must also define the data to be transferred and define the data itself to a standard recognized by the target system (ie standardized headers, record formats, etc).
Data migration requires not only a thorough examination of both source and target data, but also a consensus between teams or organizations on which pieces are important to each, what to do with extraneous data, how the system is expected to handle the more complex entries, and exactly which transformations need to occur. For example, if one system stores basic addresses with country codes etc, and the other uses a more in-depth format including GPS coordinates, a way to handle these differences must be defined.
You might also have to plan to migrate the data in stages, due to data dependencies and layers of dependencies. In SQL databases, for example, there are what are called foreign keys and indices. If you have an organization with many users, the organization itself, with its associated foreign keys and indices needs to be migrated first. If the users were migrated without the organization’s keys and indices, SQL would essentially crash, unable to recognize the new data or know where to put it. In other words, your migration plan has to include any layers of dependency and supply a validated order of operations to ensure the data is transferred successfully.
Imagine transferring the data from a common application, one with a defined structure, such as Microsoft Teams. The bulk of the data itself would be images, videos, and documents. However, the data may be tied to user accounts, which in turn are within teams, and if there are other organizations invited to share resources, well, you get the idea of the complexities involved. The hierarchical pyramid would need to be preserved in order for the migrated data to be preserved in the proper context.
Older file formats can complicate the transfer of data as well. If a SQL dump is performed on a data format from a previous generation, it can be difficult to form a frame of reference to determine the best way of transforming and mapping the data to the way it was used by the old application. Because the data is delivered without context, you might not even recognize that it came from Excel. This is one strong use case to use a bridge format, like CSV, to reframe the data for the new system. Even if the new system does not natively use CSV, the simplified tabular format can allow for the necessary transformation and mapping of records to occur. But even with a bridge format, both teams must agree on key concepts, and how to translate any conflicts between old and new.
As you can see, importing data is far from simple, and requires cooperation between legacy and new, between separate organizations, and between on-premises and cloud versions of an application, whatever the case may be. Core concepts must be preserved, organizational hierarchies maintained or redefined, and the data transformed to a usable format for the target system. The right tool can help, no question, but the right plan requires strong discovery, the definition of data standards, and planning. Flatirons Fuse can help not only with the transformation and loading of your data, but with the expertise required to manage any complications.