Editor’s note: Wayne Yaddow is an independent consultant with over 20 years’ experience leading data migration/integration/ETL testing projects at organizations including J.P. Morgan Chase, Credit Suisse, Standard and Poor’s, AIG, Oppenheimer Funds, IBM, and Achieve3000. Additionally, Wayne has taught IIST (International Institute of Software Testing) courses on data warehouse, ETL, and data integration testing. He continues to lead numerous ETL testing and coaching projects on a consulting basis. You can contact him at wyaddow@gmail.com.
Numerous data quality (DQ) cleansing tools are available to assess and cleanse source data before any ETL process is run for data integrations, data migrations, data warehouse loads, and analytic applications.
This blog outlines the need to test data after it has been cleansed to ensure that the expected results were achieved.
What is Data Cleansing / Data Cleaning?
Data cleansing, or data cleaning, is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Cleansing refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. Data cleansing is often performed interactively with data cleaning tools or as batch processing through scripting.
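As a minimal illustration, here is a short Python sketch that detects and corrects a few common defects in a small record set. The field names, correction lists, and placeholder values are hypothetical; in practice this logic would be rule-driven inside a DQ tool or ETL job.

```python
# Hypothetical raw records with typical defects: a dummy phone value,
# a misspelled city, inconsistent gender codes, and stray whitespace.
raw_records = [
    {"name": " John Smith ", "city": "Munnich", "gender": "M", "phone": "9999-999999"},
    {"name": "Ann Jones", "city": "Munich", "gender": "female", "phone": "089-1234567"},
]

CITY_CORRECTIONS = {"Munnich": "Munich"}               # assumed reference corrections
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}
DUMMY_PHONES = {"9999-999999", "0000-000000"}          # known placeholder values

def cleanse(record):
    """Return a corrected copy of one record; unrecognized or dummy values become None."""
    rec = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
    rec["city"] = CITY_CORRECTIONS.get(rec["city"], rec["city"])
    rec["gender"] = GENDER_CODES.get(rec["gender"].lower())
    rec["phone"] = None if rec["phone"] in DUMMY_PHONES else rec["phone"]
    return rec

cleansed = [cleanse(r) for r in raw_records]
print(cleansed)
```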
Data Quality (DQ) Cleansing Tools
There are a number of commercial data quality (DQ) cleansing tools on the market to choose from. In addition, IT teams sometimes develop and customize their own data cleansing programs to avoid tool acquisition costs or because the commercial products do not yet provide the required capabilities.
Some DQ cleansing tools test target data after ETLs are complete, reconciling uncleansed source data against cleansed, transformed, or enriched target data.
Testing “cleansed data” with automated tools often relies on capabilities such as data profiling and the wide variety of data transformation verifications those tools provide. A primary requirement for automated ETL data cleansing test tools is support for verifying data transformations after ETLs have completed.
Common Data Cleansing Categories
The following data cleansing categories are listed in order of marketplace need and Gartner’s Magic Quadrant assessments:
- Parsed data – decomposed data.
- Merged, linked, grouped, or matched data – data that has been linked or merged with related data entries within or across datasets, using a variety of techniques such as rules, algorithms, metadata.
- Standardizing and cleaning data – applying business standards or business rules to modify data to specific formats, values, and layouts.
- De-duplicating data – removing the many kinds of duplicate data that can exist in source or target data before, during, or after ETLs (a de-duplication sketch follows this list).
- Enriching data – integration of a variety of sourced data to improve completeness and add value.
- Performance, scalability – data changes to deliver suitable throughput and response times to satisfy performance SLAs.
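To make the de-duplication category concrete, the sketch below (Python, with hypothetical records and a deliberately simple match key) collapses near-identical entries; real matching typically uses the rule-, algorithm-, and metadata-based techniques noted above.

```python
# Hypothetical records where the same person appears more than once.
records = [
    {"id": 1, "name": "John Smith", "email": "JSMITH@example.com "},
    {"id": 2, "name": "J. Smith", "email": "jsmith@example.com"},
    {"id": 3, "name": "Paula Miller", "email": "pmiller@example.com"},
]

def match_key(rec):
    """Normalize the matching attribute; here the email, trimmed and lower-cased."""
    return rec["email"].strip().lower()

deduped = {}
for rec in records:
    # Keep the first record seen for each key; a production rule might instead
    # merge fields or prefer the most complete or most recent record.
    deduped.setdefault(match_key(rec), rec)

print(list(deduped.values()))  # ids 1 and 2 collapse to a single entry
```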
The Case for Verifying the Results of Data Cleansing Efforts
Data profiling focuses on the analysis of individual attributes/columns of data. Profiling derives information such as the data type, length, value range, variance, uniqueness, discrete values and their frequency, occurrence of null values, and typical string patterns (e.g., for phone numbers), providing an exact view of the quality aspects of each attribute of interest for cleansing.
As issues are discovered through data profiling tools and a need for cleaning is identified, that data will be cleansed. Subsequently, it is necessary to verify that the cleansing was complete and correct throughout. If little or no testing is done, some data values may be only partially corrected, or not corrected at all.
Table 1: Examples of how attribute metadata can aid in detecting data quality problems.
| Issues | Metadata | Examples of Values |
| --- | --- | --- |
| Illegal values | cardinality | e.g., cardinality of Gender > 2 may indicate a problem |
| Illegal values | max, min | max or min outside of the permissible range |
| Illegal values | variance, deviation | variance or deviation of statistical values should not be higher than a threshold |
| Misspellings | attribute values | sorting on values often brings misspelled values next to correct values |
| Missing values | null values | percentage/number of null values |
| Missing values | attribute values + default values | presence of a default value may indicate the real value is missing |
| Varying value representations | attribute values | comparing attribute values of a column in one table against those in a column of another table |
| Duplicates | cardinality + uniqueness | attribute cardinality = number of rows should hold |
| Duplicates | attribute values | sorting values by number of occurrences; more than one occurrence indicates duplicates |
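The metadata checks in Table 1 are straightforward to script. The following Python sketch (with hypothetical column values and thresholds) computes cardinality, null percentage, and min/max, and flags the corresponding issues; the profiling features of DQ and test tools report the same kinds of metrics.

```python
# Profiling-style checks mirroring Table 1 (hypothetical column values).
gender = ["M", "F", "M", "f", None, "X"]   # cardinality above 2 suggests illegal values
ages = [34, 41, 29, 250, None, 38]         # 250 falls outside the permissible range

def profile(values, min_ok=None, max_ok=None, max_cardinality=None):
    """Return simple metrics for one column plus the issues they indicate."""
    non_null = [v for v in values if v is not None]
    metrics = {
        "null_pct": 100.0 * (len(values) - len(non_null)) / len(values),
        "cardinality": len(set(non_null)),
        "min": min(non_null),
        "max": max(non_null),
    }
    issues = []
    if max_cardinality is not None and metrics["cardinality"] > max_cardinality:
        issues.append("cardinality above expected limit - possible illegal values")
    if min_ok is not None and metrics["min"] < min_ok:
        issues.append("minimum below permissible range")
    if max_ok is not None and metrics["max"] > max_ok:
        issues.append("maximum above permissible range")
    return metrics, issues

print(profile(gender, max_cardinality=2))
print(profile(ages, min_ok=0, max_ok=120))
```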
Some examples of dirty data are shown in Table 2. These give insights into the complexity of analysis and the types of cleansing that will be necessary, particularly when millions of records are involved and the variety of dirty data is extensive.
Table 2: Common examples of data to be cleansed.
| Scope | Problem | Dirty Data | Reasons/Remarks |
| --- | --- | --- | --- |
| Attribute | Missing values | phone=9999-999999 | unavailable values during data entry (dummy values or null) |
| Attribute | Misspellings | city=Munnich | usually typos, phonetic errors |
| Attribute | Cryptic values, abbreviations | experience=B; occupation=DB Prog. | definitions not provided in the data dictionary |
| Attribute | Embedded values | name=K. Jones 12.02.70 New York | multiple values entered in one attribute (e.g., in a free-form field) |
| Attribute | Misfielded values | city=France | |
| Attribute | Referential integrity error | name=Dave Smith, dept=137 | referenced dept (137) not defined |
| Record | Violated attribute dependencies | city=Seattle, zip=77777 | city and zip code should correspond |
| Record type | Word transpositions | name1=“J. Smith”, name2=“Miller P.” | usually in a free-form field |
| Record type | Duplicated records | emp1=(name=”John Smith”,…); emp2=(name=”J. Smith”,…) | same employee represented twice due to data entry errors |
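Two of the record-level problems in Table 2, referential integrity errors and duplicated records, can be detected with simple set operations, as in the Python sketch below (hypothetical employee and department tables; the duplicate match key is deliberately crude).

```python
from collections import defaultdict

# Hypothetical source tables after cleansing.
departments = [{"dept_id": 101, "name": "Finance"}, {"dept_id": 102, "name": "IT"}]
employees = [
    {"emp_id": 1, "name": "Dave Smith", "dept_id": 137},  # undefined department
    {"emp_id": 2, "name": "John Smith", "dept_id": 101},
    {"emp_id": 3, "name": "J. Smith", "dept_id": 101},    # likely duplicate of emp 2
]

# Referential integrity: every employee dept_id must exist in departments.
valid_depts = {d["dept_id"] for d in departments}
orphans = [e for e in employees if e["dept_id"] not in valid_depts]
print("referential integrity violations:", orphans)

# Duplicated records: group by a crude match key (last name + department).
groups = defaultdict(list)
for e in employees:
    key = (e["name"].split()[-1].lower(), e["dept_id"])
    groups[key].append(e["emp_id"])
duplicates = {k: ids for k, ids in groups.items() if len(ids) > 1}
print("possible duplicates:", duplicates)
```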
A Sampling of Data Cleansing Verifications in ETL Testing Tools
DWH/BI teams often develop in-house test tools and processes to verify their data cleansing efforts. Commercial data quality assessment tools typically provide neither testing tools nor guidance for verifying the correctness of cleansing operations. The costs can be high as data testing tools are developed (and tested), then modified for other ongoing cleansing efforts.
For data cleansing projects, many ETL testing tools (e.g., Tricentis Tosca BI/DWH Testing) allow verification of data transformations/cleansing with minimal programming skills (e.g., SQL, stored procedures, etc.).
Test tool functions should primarily be chosen to support verification of source-to-target data cleansing and transformation tests – either during ETLs (by interfacing with the ETL tool) or directly after ETLs have completed.
A few of the major 1) data cleansing and 2) general data transformation test categories follow. This list is essentially in order of need, based on my research and experience.
- Tests using data profiling to verify that data has been cleansed or business rules have been applied.
- Tests of arithmetic conversions: perform tests of arithmetic operations (add, multiply, etc.) on values in source table fields or ETL lookups that are then loaded into target fields. Check the precision, accuracy, and format of the results.
- Tests of data type and format conversions: test the conversion of data from the source data type or length to another data type or length in the target.
- Tests of merged, decomposed, enriched, grouped, or linked data: supporting (among other things) tests of joins and merges of data from a variety of source types or like sources.
- Testing derived target values: test the computation of derived values in target columns using business formulas (e.g., averages, totals); see the source-to-target sketch after this list.
- Testing target data default values: when a source value is missing (null, empty string, etc.), verify that the correct default was applied to the target using business rules.
- Verifying ETL record rejects and related exception processing: verify that source records which should be rejected during the ETL process have been rejected, and that an ETL incident/error log entry was created for every record that requires one (see the reject verification sketch after this list).
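As referenced in the list above, here is a minimal source-to-target verification sketch in Python using the standard sqlite3 module. The table names, columns, derivation rule (annual_salary = monthly_salary * 12), and default value ('UNKNOWN') are all assumptions for illustration; in a real project the equivalent SQL would run against the actual source and target databases or through an ETL testing tool.

```python
import sqlite3

# A minimal source-to-target verification sketch (assumed table layouts).
# It checks a derived target column (annual_salary = monthly_salary * 12)
# and a default ('UNKNOWN') applied when the source region is missing.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src (id INTEGER, monthly_salary REAL, region TEXT);
    CREATE TABLE tgt (id INTEGER, annual_salary REAL, region TEXT);
    INSERT INTO src VALUES (1, 1000.0, 'EU'), (2, 2500.0, NULL);
    INSERT INTO tgt VALUES (1, 12000.0, 'EU'), (2, 30000.0, 'UNKNOWN');
""")

# Derived-value check: any row where the target differs from source * 12 is a defect.
bad_derivations = con.execute("""
    SELECT s.id FROM src s JOIN tgt t ON t.id = s.id
    WHERE ROUND(t.annual_salary, 2) <> ROUND(s.monthly_salary * 12, 2)
""").fetchall()

# Default-value check: NULL source regions must become 'UNKNOWN' in the target.
bad_defaults = con.execute("""
    SELECT s.id FROM src s JOIN tgt t ON t.id = s.id
    WHERE s.region IS NULL AND t.region <> 'UNKNOWN'
""").fetchall()

assert not bad_derivations, f"derived-value mismatches: {bad_derivations}"
assert not bad_defaults, f"default-value mismatches: {bad_defaults}"
print("derived-value and default-value checks passed")
```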
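A similar sketch for reject and exception verification follows. The reject rule and the reject log format are hypothetical; the point is to compare the set of records that business rules say must be rejected against the set the ETL actually rejected and logged.

```python
# A sketch of reject verification (hypothetical reject rule and log format):
# every source record that violates a load rule must appear in the reject log
# written by the ETL run, and no valid record may be rejected.
source_rows = [
    {"id": 1, "amount": 100.0},
    {"id": 2, "amount": None},   # violates NOT NULL rule -> should be rejected
    {"id": 3, "amount": -5.0},   # violates amount >= 0 rule -> should be rejected
]
etl_reject_log = [{"id": 2, "reason": "amount is null"}]  # as written by the ETL job

def should_reject(row):
    """Business rules that determine which source rows the ETL must reject."""
    return row["amount"] is None or row["amount"] < 0

expected_rejects = {r["id"] for r in source_rows if should_reject(r)}
actual_rejects = {r["id"] for r in etl_reject_log}

missing_rejects = expected_rejects - actual_rejects    # loaded or dropped without a log entry
unexpected_rejects = actual_rejects - expected_rejects # valid rows rejected by mistake
print("missing reject log entries:", missing_rejects)  # {3} in this example
print("unexpectedly rejected rows:", unexpected_rejects)
```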
Every data cleansing effort should end with a plan for automating repetitive data quality auditing and testing tasks. The goal should be to have ongoing procedures to check data for accuracy using reliable and efficient tools. It is essential to make data cleansing, and follow-up testing of the cleaned data, regular parts of your data analysis process.
[Read more data warehouse / BI testing blogs by Wayne Yaddow]