Editor’s note: Wayne Yaddow is an independent consultant with over 20 years’ experience leading data migration/integration/ETL testing projects at organizations including J.P. Morgan Chase, Credit Suisse, Standard and Poor’s, AIG, Oppenheimer Funds, IBM, and Achieve3000. Additionally, Wayne has taught IIST (International Institute of Software Testing) courses on data warehouse, ETL, and data integration testing. He continues to lead numerous ETL testing and coaching projects on a consulting basis. You can contact him at email@example.com.
In my previous blog, I looked at why and how to test a data warehouse project. This time, let’s focus on how to build an end-to-end data warehouse testing strategy and test plan. Ultimately, the success of a data warehouse solution is highly dependent upon your ability to plan, design and execute a set of effective tests that expose issues with data inconsistency, data quality, data security, the ETL process, performance, business flow accuracy, and the end user experience.
Overall, the primary focus of testing should be on the end-to-end ETL process. This includes validating the loading of all required rows, correct execution of all transformations and successful completion of the cleansing operation. The team also needs to thoroughly test SQL queries, stored procedures or queries that produce aggregate or summary tables. Finally, with Digital Transformation initiatives focusing on delivering exceptional customer engagement, it is becoming increasingly critical for the test team to design and execute tests that accurately mirror the customer perspective.
Building an end-to-end data warehouse testing strategy
The following figure shows a representative data warehouse implementation from identification of source data (lower left) to report and portal reporting (upper left). In between, several typical phases of the end to end data warehouse development process are depicted (for example, source extract to staging, dimension data to the operational data store (ODS), fact data to the data warehouse and report and portal functions extracting data for display and reporting). The graphic illustrates that all data load programs and resulting data loads should be verified throughout the end-to-end QA process.
Figure 1: End-to-End Data Warehouse Process and Associated Testing
An end-to-end data warehouse test strategy documents a high-level understanding of the anticipated testing workflow. The strategy will be used to verify that the data warehouse system meets its design specifications and other requirements.
Successfully executing this strategy requires the following skills:
- Understanding fundamental concepts of data warehousing and its place in an information management environment
- Understanding how the testing process fits into data warehouse development process
- Development of data warehouse test strategies, test plans and test cases – what they are and how to develop them, specifically for data warehouses and decision support systems
- Creating effective test cases and scenarios based on business and user requirements for the data warehouse
- Participating in reviews of the data models, data mapping documents, ETL design and ETL coding; providing feedback to designers and developers
- Participating in the change management process and documenting relevant changes to decision support requirements
Additionally, you will want to follow core best practices such as:
- Formal QA data track verifications should begin early in the ETL design and data load process and continue through deployment and into production.
- Testers should be given early access to the ETL development environment so they can assess the quality of early data loads and offer valuable feedback to development teams. Such early access can dramatically aid preparations for formal testing and identify issues early.
- Where projects utilize offshore or contract test teams, they may discover the need for more adequate and representative samples of data (production data, if possible) for test planning and test case design.
- For all project stakeholders, data models, database design documents (LLD’s), ETL design and data source to target mapping documents need to be kept in sync until transition.
- Data warehouse test automation (particularly for regression testing) and associated tools are critical for supporting agile and iterative development processes.
Writing an effective data warehouse test plan
An effective test plan is the cornerstone for the entire data warehouse testing effort. The plan will help test engineers validate and verify data requirements from end to end (source to target data warehouse). A primary purpose of a formal test program is to verify data requirements as stated in the:
- Business requirements document
- Data models for source and target schemas
- Source to target mappings
- ETL design documents
As requirement documents specifications are the “what” for ETL development, the test plan can serve as the “what” for the test process. The test plan describes how the QA staff will verify that the data warehouse meets requirements. Properly constructed, the test plan is a contract between the QA team and all other project stakeholders.
In addition to the data requirements, the test planning effort should also consider:
- Configuration management system
- Project schedule
- Data quality verification process
- Incident and error handling system
- QA staff resources estimates and training needs
- Testing environment budget and plan
- Test tools
- Test objectives
- QA roles and responsibilities
- Test deliverables
- Test tasks
- Defect reporting requirements
- Entrance criteria that should be met before formal testing commences
- Exit criteria that should be met before formal testing is completed
Common data warehouse issues to watch for
As you design and execute the test plan, think about identifying issues such as:
- Inadequate ETL and stored procedures (use design documentation to aid in test planning)
- Field values that are null when specified as “Not Null”
- Field constraints and SQL that are not coded correctly for the ETL tool
- Excessive ETL errors after entry to formal QA
- Source data that does not meet table mapping specifications (ex., dirty data)
- Source to target mappings that 1) were not reviewed before implementation, 2) are incorrect or 2) are not consistently maintained throughout the development lifecycle
- Data models that are not adequately maintained during the development lifecycle
- Duplicate field values (in either source or target data) that were defined as DISTINCT in mapping specifications
- ETL SQL / transformation errors leading to missing rows and invalid field values
- Constraint violations in source (perhaps could be found through data profiling)
- Target data that is incorrectly stored in nonstandard formats
- Primary or foreign key values that are incorrect for important relationship linkages