Editor’s note: Wayne Yaddow is an independent consultant with over 20 years’ experience leading data migration/integration/ETL testing projects at organizations including J.P. Morgan Chase, Credit Suisse, Standard and Poor’s, AIG, Oppenheimer Funds, IBM, and Achieve3000. Additionally, Wayne has taught IIST (International Institute of Software Testing) courses on data warehouse, ETL, and data integration testing. He continues to lead numerous ETL testing and coaching projects on a consulting basis. You can contact him at email@example.com.
Data warehouses are becoming increasingly large, increasingly complex, and increasingly important to the businesses that implement them. They are growing in size and complexity because:
- They are drawing more data from more numerous and diverse sources across the enterprise, creating larger, richer assemblages of both alphanumeric and financial information.
- They are being leveraged by a wider range of users to support a greater number of decisions that impact the bottom line every day.
For those reasons, it’s essential that businesses rigorously ensure the quality of the data in their data warehouses. If they fail to do this, users will make faulty decisions based on incorrect data. Over time, their confidence in the data will erode to the point where they won’t use the business intelligence tools and other applications that rely on the data warehouse — this means that huge IT investments will be wasted. Just as important, any business using financial data in its data warehouse must be able to withstand the scrutiny of auditors and regulators.
In other words, without effective data quality management measures in place to support their data warehouses, businesses will remain highly vulnerable to operational, financial and regulatory risks. Unfortunately, few companies have adequate safeguards in place for their data warehouses today. They may have conventional tools in place for validating certain types of data (such as customer names and addresses) once they’re in the data warehouse, but they often lack the controls necessary for…
- Preventing bad data from getting there in the first place
- Properly validating financial data
- Discovering and remediating the root causes of chronic data quality problems
- Documenting data quality management measures to third parties (such as auditors and regulators)
My next few blog posts identify some primary data warehouse testing challenges and offer approaches and guidelines to help you address them. The best practices and test methodology presented here are based on practical experience verifying DWH/BI applications.
Identifying tests and documentation for data warehouse test planning
Because data warehouse testing is different from most software testing, a best practice is to break the testing and validation process into several well-defined, high-level focal areas for data warehouse projects. Doing so allows targeted planning for each focus area, such as integration and data validation.
- Data validation includes reviewing the ETL mapping encoded in the ETL tool as well as reviewing samples of the data loaded into the test environment.
- Integration testing tasks include reviewing and accepting the logical data model captured with your data modeling tool (such as ERwin or your tool of choice), converting the models to actual physical database tables in the test environment, creating the proper indexes, and testing the ETL programs created by your ETL tool or procedures.
- System testing involves increasing the volume of the test data to be loaded, estimating and measuring load times, and placing data into either a high-volume test area or into the user acceptance testing (UAT) and, later, production environments.
- Regression testing ensures that existing functionality remains intact each time a new release of ETL code and data is completed.
- Performance and scalability tests ensure that data loads and queries perform within expected time frames and that the technical architecture is scalable.
- Acceptance testing includes verifying that the data model is complete enough to meet the reporting needs of the specific project, reviewing summary table designs, validating the data actually loaded into the production data warehouse environment, reviewing the daily upload procedures, and, finally, verifying application reports.
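To make the data validation focus area above concrete, here is a minimal sketch of an automated reconciliation test that compares row counts and a column checksum between a source staging table and its warehouse target. The table and column names are hypothetical, and SQLite stands in for the real databases so the example is self-contained:

```python
import sqlite3

# In-memory database simulating both a source extract and its warehouse target.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE stg_orders (order_id INTEGER, amount REAL)")
cur.execute("CREATE TABLE dw_orders  (order_id INTEGER, amount REAL)")
rows = [(1, 100.0), (2, 250.5), (3, 75.25)]
cur.executemany("INSERT INTO stg_orders VALUES (?, ?)", rows)
cur.executemany("INSERT INTO dw_orders VALUES (?, ?)", rows)

def profile(table):
    # Row count plus a simple aggregate "checksum" over a numeric column --
    # a cheap first-pass check before row-by-row comparison.
    cur.execute(f"SELECT COUNT(*), ROUND(SUM(amount), 2) FROM {table}")
    return cur.fetchone()

src, tgt = profile("stg_orders"), profile("dw_orders")
assert src == tgt, f"source {src} does not reconcile with target {tgt}"
print("counts and checksums match:", src)
```

In practice, such a test would run against the real staging and warehouse schemas after each ETL release, with one profile query per mapped table.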
Few organizations discard the databases on which new or changed applications are based, so it is important to have reliable database models and data mappings when your data warehouse is first developed, and to keep them current as changes occur. Consider developing the following documents that most data warehouse testers need:
The backbone of a successful BI solution is an accurate and well-defined source-to-target mapping of each metric and the dimensions used. Source-to-target data mapping helps designers, developers, and testers understand where each data source is and how it transitioned to its final displayed form. Source-to-target mappings should identify the original source column names for each source table and file, any filter conditions or transformation rules used in the ETL processes, the destination column names in the data warehouse or data mart, and the definitions used in the repository (RPD file) for the metric or dimension. This helps you derive a testing strategy focused more on the customized elements of the solution.
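One way to make a source-to-target mapping directly usable by testers is to capture it as structured data rather than prose, so each mapping rule can drive an automated test case. The sketch below mirrors the fields recommended above; every table, column, and definition shown is a hypothetical example:

```python
# Each entry records one source-to-target mapping rule: source location,
# any filter or transformation applied in ETL, the warehouse destination,
# and the repository (RPD) definition if the target is a metric.
mapping = [
    {
        "source_table": "CRM.CUSTOMER",
        "source_column": "CUST_POSTAL_CD",
        "filter": "ACTIVE_FLAG = 'Y'",
        "transformation": "TRIM and left-pad to 5 characters",
        "target_table": "DW.DIM_CUSTOMER",
        "target_column": "ZIP_CODE",
        "metric_definition": None,  # dimension: no RPD metric definition
    },
    {
        "source_table": "SALES.ORDER_LINE",
        "source_column": "EXT_PRICE",
        "filter": None,
        "transformation": "SUM grouped by order",
        "target_table": "DW.FACT_SALES",
        "target_column": "ORDER_AMOUNT",
        "metric_definition": "SUM(FACT_SALES.ORDER_AMOUNT)",
    },
]

# A tester can iterate over the document to generate one test per rule.
test_ids = [
    f"{r['source_table']}.{r['source_column']} -> "
    f"{r['target_table']}.{r['target_column']}"
    for r in mapping
]
for tid in test_ids:
    print(tid)
```

Keeping the mapping in a machine-readable form also makes it easy to spot gaps: any warehouse column with no mapping entry is a candidate for a missing requirement.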
Data warehouse models are crucial to the success of your data warehouse. If they are incorrect or nonexistent, your warehouse effort will likely lose credibility. All project leaders should take the necessary time to develop the data warehouse data model. For most data warehouses, a multi-month building effort with a highly experienced data warehouse modeler may be needed after the detailed business requirements are defined. Data modeling may be the most important skill on your data warehouse team, so only a very experienced modeler should build the model.
The data architecture and model is the blueprint of any data warehouse; understanding it helps you grasp the bigger picture of a data warehouse. The model helps stakeholders understand the key relationships between the major and critical data sources.
We stress the importance of getting your data model right because fixing it might require a great deal of effort, in addition to stalling your project. I’ve seen projects with models so corrupted that it almost makes sense to start from scratch.
Several factors contribute to information quality problems. Changes in source systems often require code changes in the ETL process. For example, in a particular financial institution, the ETL process corresponding to the credit risk data warehouse has approximately 25 releases each quarter. Even with appropriate quality assurance methods, there is always room for error. The following types of potential defects can occur when ETL processes change:
- Extraction logic may exclude certain types of data, particularly cases that were not covered by testing.
- Transformation logic may aggregate different types of data (e.g., car loan and boat loan) into a single category (e.g., car loan). In some cases, transformation logic may exclude certain types of data, resulting in incomplete records in the data warehouse.
- ETL processes may fail due to system errors or transformation errors, resulting in incomplete data loading. System errors may arise when source systems or extracts are unavailable or when source data is in an incorrect format; transformation errors may likewise result from incorrect formats.
- Outdated, incomplete, or incorrect reference and lookup data leads to errors in the data warehouse. For example, errors in the sales commission rate table may result in erroneous commission calculations.
- Data quality issues including source system data that is incomplete or inconsistent. For example, a customer record in the source system may be missing a ZIP code. Similarly, a source system related to sales may use an abbreviation of the product names in its database. Incompleteness and inconsistency in source system data will lead to quality issues in the data warehouse.
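The completeness and consistency problems in the last bullet are among the easiest to catch with automated profiling checks. Here is a minimal sketch covering both cases from the text, a customer record missing a ZIP code and a sales source that abbreviates product names; all sample records and field names are hypothetical:

```python
customers = [
    {"id": 1, "name": "Acme Corp", "zip": "10001"},
    {"id": 2, "name": "Globex", "zip": None},  # incomplete source record
]
sales = [
    {"sku": "A1", "product": "Widget Deluxe"},
    {"sku": "A1", "product": "Wdgt Dlx"},      # abbreviated product name
]

# Completeness check: flag records missing a required attribute.
missing_zip = [c["id"] for c in customers if not c["zip"]]

# Consistency check: one SKU should map to exactly one product name.
names_by_sku = {}
for row in sales:
    names_by_sku.setdefault(row["sku"], set()).add(row["product"])
inconsistent = sorted(sku for sku, names in names_by_sku.items()
                      if len(names) > 1)

print("customers missing ZIP:", missing_zip)
print("SKUs with conflicting names:", inconsistent)
```

Run against source extracts before each load, checks like these stop incomplete or inconsistent records at the door rather than letting them surface later in reports.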