Editor’s note: Wayne Yaddow is an independent consultant with over 20 years’ experience leading data migration/integration/ETL testing projects at organizations including J.P. Morgan Chase, Credit Suisse, Standard and Poor’s, AIG, Oppenheimer Funds, IBM, and Achieve3000. Additionally, Wayne has taught IIST (International Institute of Software Testing) courses on data warehouse, ETL, and data integration testing. He continues to lead numerous ETL testing and coaching projects on a consulting basis. You can contact him at email@example.com.
Data integrations are at the core of data warehousing, data migration, data synchronization, and data consolidation projects. In the past, most data integration projects involved data stored in databases. Today, organizations are being asked to integrate their database and structured records with data from documents, emails, log files, websites, social media, audio, or video files. This opens the door to countless opportunities for error — and requires a number of checkpoints to ensure data integrity.
Using data warehousing as an example, Figure 1 illustrates the primary checkpoints (testing points) in an end-to-end data quality testing process. It shows the points at which data should be verified as it is extracted from source, transformed for loads into target databases, aggregated for loads into data marts, consolidated, etc. Only after data owners and all other stakeholders confirm that a data integration was successful can the whole process be considered complete and ready for production.
Figure 1: Checkpoints that are necessary to audit and verify data quality in data integration projects. A data warehouse integration example is depicted in Figure 1.
Data testing skills and experience are paramount for project success
The data integration/migration testing lead and other hands-on data testers are typically the best prepared to perform the testing required to ensure data quality throughout all stages of a data extract, transform, and extract (ETL) life cycle. Unfortunately, the impulse to cut project costs is often strong, especially in the final delivery phase of data integration and data migration projects.
At this late phase of the project, a common mistake is to delegate testing responsibilities to resources with limited business and data testing skills. In recent years, there has been an evolving trend towards expecting data analysts, data engineers, business analysts, ETL developers, and even business users to plan and conduct much of the data integration, data migration, and data warehouse tests. Doing so may be risky.
No matter whom you designate to perform your data integration testing, be sure that they possess the following skills:
- A firm understanding of data warehouse and database concepts
- The ability to develop strategies, test plans, and test cases specific to data integration and the enterprise’s business
- The ability to create compelling ETL test cases and scenarios based on ETL database loading technology and business requirements
- Advanced skills with SQL queries, stored procedures, and test scripting to delete, define, load, and merge data for tests
- In-depth understanding of ETL development tools
- Experience with automated testing for ETL processes and other source-to-target verifications
- Advanced knowledge of project business data and metadata (data sources, data tables, data dictionary, business terminology)
- Innovative skills with data profiling and associated methods and tools
- Understanding of data models, data mapping documents, ETL design, and ETL coding
- The ability to communicate effectively with data engineers, database designers, and developers
- Experience testing multiple DB systems, (e.g., Oracle, SQL Server, DB2, Postgresql)
- Troubleshooting of ETL (e.g., Informatica/DataStage) sessions, workflows, and logs
- The ability to deploy database code to test environments
- Experience with Microsoft Excel and a variety of data quality verification tools
- Effective use of defect management and tools
Data integration projects can fail for many reasons: Poor data architecture, inconsistently-defined data, inability to combine data from different data sources, missing and inaccurate data values, inconsistent use of data fields, unacceptable query performance, and so forth.
These project risks can be diminished with well-trained and motivated testers who provide ongoing support from the earliest phases of the data integration development. I hope that this article helps you take the first steps towards that end.