Editor’s note: Wayne Yaddow is an independent consultant with over 20 years’ experience leading data migration/integration/ETL testing projects at organizations including J.P. Morgan Chase, Credit Suisse, Standard and Poor’s, AIG, Oppenheimer Funds, IBM, and Achieve3000. Additionally, Wayne has taught IIST (International Institute of Software Testing) courses on data warehouse, ETL, and data integration testing. He continues to lead numerous ETL testing and coaching projects on a consulting basis. You can contact him at firstname.lastname@example.org.
This is the third blog in a series on Identifying Data Integrity Issues at Every DWH Phase. [part 1 | part 2]
The quality of data can be compromised depending upon how data is received, entered at source, integrated, maintained, processed, and loaded. Data may be impacted by numerous design and coding processes that load data into your data warehouse environment, most of which may affect its quality to some extent. Despite all efforts at data verification, there may still exist a certain percentage of dirty data.
Residual dirty data that is discovered in the ETL process should be reported in audit/error logs to include reasons for each failure.
Data quality issues can arise (and should be discovered) at every stage of data warehouse development (see the figure below). Those issues may originate in data sources, data models, source data profiling, data staging and the many other ETLs. The following depicts likely development phases where data quality problems may be found:
- Database schema design and development – the data model and mappings (e.g., failure to review and gain approvals for the schemas and mappings before implementing into code)
- Source data profiling (e.g., failure to find errors that should be found by means of data profiling)
- Source data and data stage loading (e.g., failures in ETL logic or code resulting in dropped records)
- Data cleansing and transformations (e.g., failures to correctly implement business rules)
Common phases for DWH/BI data movement (e.g., ETLs)