Editor’s note: Wayne Yaddow is an independent consultant with over 20 years’ experience leading data migration/integration/ETL testing projects at organizations including J.P. Morgan Chase, Credit Suisse, Standard and Poor’s, AIG, Oppenheimer Funds, IBM, and Achieve3000. Additionally, Wayne has taught IIST (International Institute of Software Testing) courses on data warehouse, ETL, and data integration testing. He continues to lead numerous ETL testing and coaching projects on a consulting basis. You can contact him at email@example.com.
This is the second blog in a series on Identifying Data Integrity Issues at Every DWH Phase.
Errors in data sources often have a profound effect on the DWH and resulting BI reports that use the data. These are the files and tables that are coming from the systems and applications that are often outside of the enterprise DWH team’s control (see below). Changes to these applications could come without any previous warning. Therefore, the continuous monitoring of source data quality is a requirement. Poor quality data of the input files is one of the most common causes of business intelligence project failure.
Data sources may have their own methods of extracting and storing data along with their controls of data quality. Diversity of data may contribute to data quality problems if not properly resolved before loading into the data warehouse. Sources may be exposed to some types of unsecured access; this makes them unreliable, which ultimately may contribute to poor data quality. Data sources may also experience other kinds of problems, such as data from legacy systems for which that same data may not be accompanied with meta-data that properly describes them. The sources of erroneous data include data entry error (e.g., an incorrect data update error introduced by a human or computer).
Following are conditions that may cause data quality problems associated with data sources:
- Inappropriate selection of candidate data sources (not suitable for the DWH/BI project)
- Sources which do not comply with data warehouse business rules
- Misunderstandings of inter-dependencies among data sources resulting in duplicate records
- Unanticipated changes in source applications, not communicated to the DWH/BI team
- Use of unexpected formats in data sources (Ex., ASCII, EBCDIC)
- Using hyphens to indicate negative numbers
- Using different data types for similar columns (e.g., addressID is defined as numeric in one table and a string in another)
- Orphaned, missing, or incorrect relational keys
- Delimiters in a flat file also used as a valid character in a field
- Updating data warehouse tables with outdated source data
- Conflicting information present in what should be related data sources
- Lack of quality validation routines where source data is created or when it is loaded to the DWH
- Lack of business ownership, policy, and planning of enterprise source data
- Inconsistent/incorrect data formatting (e.g., the name of a person is stored in one table in the format “First-name Surname” and in another table in the format “Surname, First-name”)
- Missing or additional columns in source files
- Missing or misspelled values in data sources
- Inconsistent use of special characters (e.g., a date may have different special characters to separate the year, month, and day)