Data Extraction

Data Extraction, Transformation, and Migration Tools


Key words: data extraction, data scrubbing, data migration, data warehousing, database modeling



Data Extraction, Cleansing and Migration Tools

to support

Data Warehouse, Database Consolidation, and Systems Reengineering Projects 
 

Summary. This information is intended to facilitate data extraction, cleansing and migration methods selection, with particular emphasis on tools. The main body of the document briefly summarizes issues and makes recommendations. The appendix presents an overview of applicable commercial tools. Our recommendation is to select a suite of tools, each one of which meet specific needs.

The Issue. The process of developing several data warehouses and other database projects has highlighted the need to effectively and efficiently manage the extraction, cleansing, transformation and migration of data from legacy systems. Effectiveness is necessary whenever the data is of great value (which is usually). Efficiency is necessary, because the long term investment of resources in these activities can be high.

Why tools? Tools can help achieve technical efficiency, and can perform data quality tasks that would otherwise be impractical. Just as important, tools such as Prism also provide a structure to the migration process that can be leveraged as an effective management tool.

The Question. What tools and techniques are most appropriate to assist with data extraction, Cleansing and Migration to support Data Warehouse, Database Consolidation, and Systems Reengineering Projects, given the business problem and environment?

The Decision Process. The purpose is not to document a decision that has been made but to lay out the issues to facilitate discussion, and to make recommendations in support of the decision process.

There is not one decision to make, but several in a series. Most of the decisions are not about technical choices, but about management choices. On the face of it, the key decision seems to be: do we need a

  • Quick solution, to get information out ASAP; or a
  • Systematic solution to handle routine processes smoothly, efficiently, effectively.


 

But in fact, both are needed. In an emergency situation, the "quick solution" is necessary. Also, if it is not clear if the finished warehouse or database as initially conceived will meet needs, the "quick solution" functions as a test or prototype, that can be followed by a more systematic solution. However, systematic data extraction and cleaning solution of some sort will be needed for most problems (problems discussed in the *Data Quality Methods* and Procedures white paper). The only real question is when, and how capable and complicated it must be. A cost benefits analysis can help address how automated and capable the eventual solution should be. 
 

The essential decision will address the question: given a business need and set of environmental conditions, what is the best technical solution that will yield the best costs to benefits ratio? The source of information to that decision is the set of answers to the following questions about business needs and conditions:

  • The base of knowledge about system data is adequate - or - Is inadequate.
  • Extracts/loads are one-time - or - Ongoing/repetitive.
  • Focus of concern is quality of data content - or - Expedient delivery of data.
  • Data quality issues are specific (e.g. address, or person-name) - or - General (several categories).
  • Does a suitable tool exist - or - Do tools have to be acquired or created.
  • Is extraction/transport to be managed centrally - or - Managed through distributed control.
  • Is extraction/transport (and scheduling) to be controlled via Parameters - or - Custom coded extracts and transformations.

Figures 1 and 2 model a decision flow that incorporates the above questions. The purpose of this model is to demonstrate the rational behind the selection of a set of tools with varying strengths and capabilities. Figure 1 highlights the differences in tool applicability among three different tasks: on-line data entry, on-line querying, and batch transport. On-line data entry and querying are not the focus of this paper. Batch transport is expanded upon in Figure 2, featuring decisions relevant to batch data extraction, cleaning, transformation and transport.

Each branch in the diagram represents a decision branch, and each box represents a tool category. A specific recommended tool is listed in the diagram for each tool category. The sections below list other tools that fit within that category. Figure 2 also features three tool "super categories": Analysis, Data Quality and Scrubbing, and Extract Transport. 

Taking Action. Only a set of tools will meet all requirements. Unfortunately, there is no one tool package that addresses that extremely large number of issues pertaining to data extraction, cleaning, and transport. As illustrated in Figure 2, different tools specialize in addressing different issues. The following tools each fit a specific need that must be addressed.