Editor’s note: Wayne Yaddow is an independent consultant with over 20 years’ experience leading data migration/integration/ETL testing projects at organizations including J.P. Morgan Chase, Credit Suisse, Standard and Poor’s, AIG, Oppenheimer Funds, IBM, and Achieve3000. Additionally, Wayne has taught IIST (International Institute of Software Testing) courses on data warehouse, ETL, and data integration testing. He continues to lead numerous ETL testing and coaching projects on a consulting basis. You can contact him at firstname.lastname@example.org.
Now that every company strives to be a data-driven company, data quality is more critical than ever. According to Gartner, the average company loses $8.2B USD annually as a result of poor data quality. Moreover, a recent Experian study found that 4 in 5 (83%) businesses see data as an integral part of forming a business strategy, yet they suspect 30% of their contact and prospect data may be inaccurate. With ‘improving the customer experience’ being called out as a top priority for 2018, the research also reports that 69% believe inaccurate data is undermining their ability to provide this.
Testing data warehouse implementations has become an imperative for instilling trust in the data that the business relies upon. Complex business rules and transformation logic are built using ETL logic, and thus demand diligent and thorough testing.
Why test data warehouse projects?
Here are just a few of the many reasons to thoroughly test the data warehouse and apply a QA process that is specific to data and ETL testing:
- There is often a massive amount of source data from varied types of data repositories
- The quality of source data cannot be assumed and should be profiled and cleaned
- Source data may be inconsistent and/or redundant
- Many source data records may be rejected; ETL / stored procedure logs will contain messages that must be acted upon
- Required source field values may be missing
- Source data history, business rules and audits of source data may not be available
- Enterprise-wide data knowledge and business rules may not be available to verify data
- Since data ETLs must often pass through multiple phases before loading into the data warehouse, extraction, transformation and loading components must be thoroughly tested to ensure that the variety of data behaves as expected, within each phase
- Heterogeneous sources of data (e.g., mainframe, spreadsheets, Unix files) will be updated asynchronously over time then incrementally loaded.
- Transaction-level traceability will be difficult to attain in a data warehouse
- The data warehouse will be a strategic enterprise resource and heavily replied upon.
What does the data warehouse verification involve?
A good understanding of data modeling and source-to-target data mappings equip the tester with information to develop an appropriate test strategy. Hence, it’s important that during the project’s requirement analysis phase, the QA team works to understand the data warehouse implementation to the greatest extent. Data warehouse testing strategies will, in most cases, be a consortium of several smaller strategies. This is due to the nature of data warehouse implementations.
Different stages of the data warehouse implementation (source data profiling, data warehouse design, ETL development, data loading and transformations, etc.), require the testing team’s participation and support. Unlike some traditional testing, test execution does not start at the end of the data warehouse implementation. In short, test execution itself has multiple phases and is staggered throughout the life cycle of the data warehouse implementation.
Verification across data warehouse project testing phases can include:
- Data completeness: Ensure that all expected data is loaded by means of each ETL procedure.
- Data transformations: Ensure that all data to be transformed is completed correctly according to business rules and design specifications.
- Data quality: Ensure that the ETL process correctly rejects, substitutes default values, corrects, ignores and reports invalid data.
- Performance and scalability: Ensure that data loads and queries perform within expected time frames and that the technical architecture is scalable.
- Integration Testing: Ensure that the ETL process functions well with other upstream and downstream processes.
- User-acceptance testing: Ensure the data warehousing solution meets users’ current expectations and anticipates their future expectations.
- Regression testing: Ensure existing functionality remains intact each time a new release of ETL code and data is completed.