Editor’s note: Wayne Yaddow is an independent consultant with over 20 years’ experience leading data migration/integration/ETL testing projects at organizations including J.P. Morgan Chase, Credit Suisse, Standard and Poor’s, AIG, Oppenheimer Funds, IBM, and Achieve3000. Additionally, Wayne has taught IIST (International Institute of Software Testing) courses on data warehouse, ETL, and data integration testing. He continues to lead numerous ETL testing and coaching projects on a consulting basis. You can contact him at firstname.lastname@example.org.
This is the fourth blog in a series on Identifying Data Integrity Issues at Every DWH Phase.
Before looking into data quality problems during data staging, we need to know how the ETL system handles data rejections, substitutions, cleansing, and enrichment. To ensure success in testing data quality, include as many data scenarios as possible. Typically, data quality rules are defined during design; for example:
- Reject non-numeric data to be loaded in a numeric field
- Load fields with their full contents — i.e., ensure data fields are not truncated or trimmed except as specified in the data mappings
- Ensure that no duplicate records or field contents are loaded
- Substitute NULL if a decimal field has non-numeric data
- Verify and correct the “city/state fields” if necessary based on the ZIP code
- Ensure that data integrity constraint violations are properly handled (e.g., records rejected or values replaced)
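A few of these rules can be sketched as a staging-validation routine. This is a minimal illustration, assuming dict-shaped source records with hypothetical `id`, `name`, and `amount` fields; note that it flags would-be truncation as a rejection rather than silently trimming, which is one possible design choice:

```python
from decimal import Decimal, InvalidOperation

def clean_amount(value):
    """Rule: substitute NULL when a decimal field holds non-numeric data."""
    try:
        return Decimal(value)
    except (InvalidOperation, TypeError):
        return None  # non-numeric data replaced with NULL

def validate_rows(rows, max_name_len=50):
    """Apply sample data quality rules; return (accepted, rejected) lists."""
    seen_ids = set()
    accepted, rejected = [], []
    for row in rows:
        # Rule: reject non-numeric data destined for a numeric field
        if not str(row["id"]).isdigit():
            rejected.append((row, "non-numeric id"))
            continue
        # Rule: no duplicate records on the business key
        if row["id"] in seen_ids:
            rejected.append((row, "duplicate id"))
            continue
        seen_ids.add(row["id"])
        # Rule: reject rather than truncate fields wider than the target
        if len(row["name"]) > max_name_len:
            rejected.append((row, "name exceeds target width"))
            continue
        row["amount"] = clean_amount(row["amount"])
        accepted.append(row)
    return accepted, rejected
```

Each rejection carries a reason code, which feeds the error reporting discussed below.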
For another example, consider the following business rule: “If no match is found in the look-up table, then load the record nonetheless, but report the error in the error log/report.” Depending on the data quality rules of the application being tested, scenarios to test might include NULL key values, duplicate records in source data, and invalid data types in fields (e.g., alphabetic characters in a decimal field).
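That look-up rule might look like the following sketch, assuming a hypothetical `PRODUCT_LOOKUP` dimension mapping and a conventional “unknown” surrogate key for misses:

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("etl.lookup")

# Hypothetical look-up table mapping source codes to dimension keys
PRODUCT_LOOKUP = {"A1": 100, "B2": 200}

def resolve_product_key(row, unknown_key=-1):
    """Business rule: load the record even when the look-up fails,
    but report the miss in the error log."""
    code = row["product_code"]
    if code in PRODUCT_LOOKUP:
        row["product_key"] = PRODUCT_LOOKUP[code]
    else:
        row["product_key"] = unknown_key  # 'unknown member' surrogate key
        log.warning("No look-up match for product_code=%r; loaded with key %d",
                    code, unknown_key)
    return row
```

Testing this rule means verifying both halves: the record arrives in the target with the unknown-member key, and the miss appears in the error log/report.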
A data cleansing process may be executed in the data staging area to improve the correctness of the data warehouse. The staging and DWH load phases are among the most crucial points of data warehousing, where responsibility for data quality efforts is concentrated. Staging is a prime ETL phase for validating data quality from the source and for auditing and cleaning data issues. Other factors that can contribute to data quality problems during staging and subsequent ETL runs include:
- Conflicting business rules used by various data sources
- The inability to schedule extracts on time, or within the allotted time interval when updating the DWH
- Inability to capture all changes (e.g., inserts, updates, deletes) in source files
- Absence of an effective and centralized source metadata repository
- Misinterpretation of slowly changing dimensions (SCDs) in ETL code
- Errors in the transformation or substitution values for NULL values in the ETL process
- Absence of an automated or effective unit-testing facility in ETL tools
- Dearth of effective error reporting, validation, and metadata updates in ETL code
- Inappropriate ETL process for data insert/update/delete functions
- Loss of data during the ETL process (rejected records, dropped fields)
- An inability to restart the ETL process from checkpoints without losing data
- Lack of automatic and effective data defect and correction functions in the ETL code
- Inability to incorporate profile, cleansing, and ETL tools to compare and reconcile data and associated metadata
- Misaligned primary and foreign key strategies for the same type of entity (e.g., one table stores customer information using the Social Security Number as the key, another uses the CustomerID as the key, and still another uses a surrogate key)
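Several of these factors — lost records, dropped fields, failed restarts — can be caught with a simple source-to-target reconciliation check after each load. The sketch below is illustrative only, assuming dict-shaped extract and staging rows with a shared key and a numeric measure used as a control total:

```python
def reconcile(source_rows, target_rows, key="id", measure="amount"):
    """Compare row counts, key sets, and a control total between an
    extract and its staged load; return a dict of discrepancies."""
    src_keys = {r[key] for r in source_rows}
    tgt_keys = {r[key] for r in target_rows}
    return {
        "row_count_diff": len(source_rows) - len(target_rows),
        "missing_keys": sorted(src_keys - tgt_keys),       # lost in the load
        "unexpected_keys": sorted(tgt_keys - src_keys),    # not in the extract
        "control_total_diff": (
            sum(r[measure] for r in source_rows)
            - sum(r[measure] for r in target_rows)
        ),
    }
```

A non-zero result in any field is a signal to audit the corresponding ETL step — for example, a positive `row_count_diff` with unreported rejections points to silent data loss.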
It is always important to review detailed test scenarios with business users and technical designers to ensure that everyone is in agreement. Data quality rules applied to the data will usually be invisible to users once an application is in production; users will see only what has been loaded into the data warehouse. For this reason, it is important that what is done with invalid data is reported to users for verification. These data quality reports present valuable information that sometimes reveals systematic errors.
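One simple way to surface systematic errors is to aggregate rejections by reason before publishing the report. A minimal sketch, assuming rejections are collected as (record, reason) pairs as in the validation example earlier:

```python
from collections import Counter

def quality_report(rejections):
    """Summarize rejected records by reason, most frequent first,
    so recurring (systematic) errors stand out to reviewers."""
    counts = Counter(reason for _, reason in rejections)
    return "\n".join(f"{reason}: {n}" for reason, n in counts.most_common())
```

A reason that dominates the report (e.g., hundreds of “duplicate id” rejections from one source) usually indicates a systematic defect upstream rather than isolated bad records.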