Editor’s note: Wayne Yaddow is an independent consultant with over 20 years’ experience leading data migration/integration/ETL testing projects at organizations including J.P. Morgan Chase, Credit Suisse, Standard and Poor’s, AIG, Oppenheimer Funds, IBM, and Achieve3000. Additionally, Wayne has taught IIST (International Institute of Software Testing) courses on data warehouse, ETL, and data integration testing. He continues to lead numerous ETL testing and coaching projects on a consulting basis. You can contact him at firstname.lastname@example.org. [ Read more of Wayne’s BI/DWH testing blogs ]
Decisions in today’s organizations have become increasingly data-driven and real-time, so the systems that support business decisions must be of exceptional quality. People sometimes confuse testing data warehouses that produce business intelligence (BI) reports with backend or database testing or with testing the BI reports themselves. Data warehouse testing is much more complex and diverse. Nearly everything in BI applications involves the data that “drives” intelligent decision making.
Data integrity can be compromised at all DWH/BI phases: when data is created, integrated, moved, or transformed. However, testing of data warehouses is usually deferred until late in the cycle. If testing is shortchanged (e.g., due to schedule overruns or limited resource availability), there’s a high risk that critical data integrity issues may slip through the verification efforts. Even if thorough testing is performed, it’s difficult and costly to address any data integrity issues exposed by this late-cycle testing. At this phase, the cause of the error can be anything from a data quality issue stemming from when the data enters the data warehouse, to a data processing issue caused by a malfunction of the business logic along the layers of the data warehouse and its BI components. This is a painstakingly tedious task and often consumes considerable resources.
My next few blogs outline strategies and best practices for catching data integrity issues as early as possible—to reduce the resources required to address them and ensure that data
Identifying Quality Issues During the DWH Design Phase
A first level of testing and validation begins with the formal acceptance of the logical data model and “low level ETL design” (LLD). All further testing and validation will be based on the understanding of each of the data elements in the model.
Data elements that are created through a transformation or summary process must be clearly identified and calculations for each of these data elements must be clearly documented and easily interpreted.
During LLD reviews and updates, special consideration should be given to typical modeling scenarios that exist in the project. For example:
- Verify that many-to-many attribute relationships are clarified and resolved
- Verify the types of keys that are used: surrogate keys and/or natural keys
- Verify that business analysts/DBA’s review with ETL architect and developers (application), the lineage and business rules for extracting, transforming, and loading the data warehouse
- Verify that all transformation rules, summarization rules, and matching and consolidation rules have clear specifications
- Confirm that specified transformations, business rules and cleansing described in low level design (LLD) and application logic specifications have been coded correctly in ETL, Java, and SQL used for data loads
- Verify that ETL procedures are documented to monitor and control data extraction, transformation, and loading. The procedures should describe how to handle exceptions and program failures
- Verify that data consolidation of duplicate or merged data is properly handled
- Verify that samplings of domain transformations will be utilized to confirm they are properly changed
- Ensure unique values exist for primary and foreign key fields between the source data and the data loaded to the warehouse
- Validate that target data types are as specified in the design and/or the data model
- Verify that data field types and formats are specified and implemented
- Verify that default values are specified for fields where needed
- Verify that processing for invalid field values in the source are defined
- Verify that expected ranges of field values are specified
- Verify that all keys generated by the ETL “sequence generator” are identified
- Verify that slowly-changing dimensions are described