BI / DWH Testing

Using ETL Testing Tools to Test Data After It Has Been Cleansed

Editor’s note: Wayne Yaddow is an independent consultant with over 20 years’ experience leading data migration/integration/ETL testing projects at organizations including J.P. Morgan Chase, Credit Suisse, Standard and Poor’s, AIG, Oppenheimer Funds, IBM, and Achieve3000. Additionally, Wayne has taught IIST (International Institute of Software Testing) courses on data warehouse, ETL, and data integration testing. He continues to lead numerous ETL testing and coaching projects on a consulting basis. You can contact him at wyaddow@gmail.com.

Numerous data quality (DQ) cleansing tools are available to assess and cleanse source data before any ETL process is run for data integrations, data migrations, data warehouse loads, and analytic applications.

This blog outlines the need to test data after it has been cleansed, to ensure that the expected results were achieved.

What is Data Cleansing / Data Cleaning?

Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Cleansing refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. Data cleansing is often performed interactively with data cleansing tools or as batch processing through scripting.

Data Quality (DQ) Cleansing Tools

There are a number of commercial data quality (DQ) cleansing tools on the market to choose from. In addition, IT teams sometimes develop and customize their own data cleansing programs to avoid tool acquisition costs or because the commercial products do not yet provide the required capabilities.

Some DQ cleansing tools test target data after ETLs are complete: reconciling uncleaned source data against cleansed, transformed, or enriched target data.

Testing “cleansed data” with automated tools often relies on the data profiling functions and the wide variety of data transformation verifications that those tools provide. A primary need for automated ETL data cleansing test tools is to support data transformation verification after ETLs have completed.
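As a simple illustration, the sketch below reconciles an uncleansed source extract against the cleansed target after an ETL run. It assumes illustrative file and column names (src_customers.csv, tgt_customers.csv, rejects.csv, customer_id) and that the ETL writes a reject file; it is not the API of any particular testing tool.

```python
# Minimal sketch: reconcile an uncleansed source extract against the cleansed,
# loaded target. File and column names are illustrative only.
import pandas as pd

source   = pd.read_csv("src_customers.csv")   # raw extract before cleansing
target   = pd.read_csv("tgt_customers.csv")   # cleansed/transformed target load
rejected = pd.read_csv("rejects.csv")         # ETL reject file (assumed to exist)

# 1. Row-count reconciliation: every source row should be loaded or rejected
#    (assuming the ETL does not merge or split rows).
assert len(source) == len(target) + len(rejected), "row counts do not reconcile"

# 2. Key reconciliation: no source key should silently disappear.
missing_keys = (
    set(source["customer_id"])
    - set(target["customer_id"])
    - set(rejected["customer_id"])
)
assert not missing_keys, f"keys lost without a reject record: {missing_keys}"
```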

Common Data Cleansing Categories

The following data cleansing categories are listed in the order of marketplace need and Gartner’s Magic Quadrant assessments:

  1. Parsed data – decomposed data.
  2. Merged, linked, grouped, or matched data – data that has been linked or merged with related data entries within or across datasets, using a variety of techniques such as rules, algorithms, and metadata.
  3. Standardizing and cleaning data – applying business standards or business rules to modify data to specific formats, values, and layouts (see the sketch after this list).
  4. De-duplicating data – removing the many varieties of duplicate data existing in source or target data before, during, or after ETLs.
  5. Enriching data – integrating data from a variety of sources to improve completeness and add value.
  6. Performance, scalability – data changes to deliver suitable throughput and response times to satisfy performance SLAs.
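To make categories 3 and 4 concrete, here is a minimal pandas sketch of standardization and de-duplication; the DataFrame and column names are illustrative rather than taken from any particular tool.

```python
# Sketch of two cleansing categories: standardization (3) and de-duplication (4).
import pandas as pd

df = pd.DataFrame({
    "name":  ["John Smith ", "j. smith", "Ann Lee"],
    "phone": ["(212) 555-0101", "212.555.0101", "646-555-0199"],
})

# Standardize: trim whitespace, normalize case, keep digits only for phone.
df["name"]  = df["name"].str.strip().str.title()
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)

# De-duplicate on the standardized phone number, keeping the first occurrence.
cleansed = df.drop_duplicates(subset="phone", keep="first")
print(cleansed)
```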

The Case for Verifying the Results of Data Cleansing Efforts

Data profiling focuses on the analysis of individual attributes/columns of data. Profiling derives information such as the data type, length, value range, variance, uniqueness, discrete values and their frequency, occurrence of null values, typical string pattern (e.g., for phone numbers), etc., providing an exact view of the various quality aspects of each attribute of interest for cleansing.

As issues are discovered through data profiling tools and a need for cleaning is identified, that data will be cleansed. Subsequently, it will be necessary to verify that the cleansing was complete and correct throughout. If little or no testing is done, some data values may be only partially corrected, or not corrected at all.
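One way to approach this verification, sketched below under the assumption that the data fits in pandas, is to profile the same columns before and after cleansing and compare the two reports; the profile() helper and file names (customers_raw.csv, customers_cleansed.csv) are illustrative.

```python
# Column-level profiling before and after cleansing, covering a few of the
# attributes mentioned above (type, null rate, range, distinct values).
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return one profiling row per column."""
    rows = []
    for col in df.columns:
        s = df[col]
        rows.append({
            "column":   col,
            "dtype":    str(s.dtype),
            "null_pct": round(s.isna().mean() * 100, 2),
            "distinct": s.nunique(dropna=True),
            "min":      s.min(skipna=True) if s.dtype.kind in "ifM" else None,
            "max":      s.max(skipna=True) if s.dtype.kind in "ifM" else None,
        })
    return pd.DataFrame(rows)

# Profile the data before and after cleansing, then compare the two reports
# to confirm the cleansing had the intended effect (e.g., null_pct dropped).
before = profile(pd.read_csv("customers_raw.csv"))
after  = profile(pd.read_csv("customers_cleansed.csv"))
print(before.merge(after, on="column", suffixes=("_before", "_after")))
```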

Table 1: Examples of how attribute metadata can aid in detecting data quality problems.

| Issues | Metadata | Examples of Values |
| --- | --- | --- |
| Illegal values | cardinality | cardinality for Gender > 2 may indicate a problem |
| | max, min | max or min outside of the permissible range |
| | variance, deviation | variance or deviation of statistical values should not be higher than a threshold |
| Misspellings | attribute values | sorting on values often brings misspelled values next to correct values |
| Missing values | null values | percentage/number of null values |
| | attribute values + default values | presence of a default value may indicate the real value is missing |
| Varying value representations | attribute values | comparing the attribute values of a column in one table against those in a column of another table |
| Duplicates | cardinality + uniqueness | attribute cardinality = # rows should hold |
| | attribute values | sorting values by number of occurrences; more than one occurrence indicates duplicates |
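The heuristics in Table 1 translate directly into automated checks. The sketch below expresses a few of them in pandas (illegal values via cardinality, missing values via null percentage, duplicates via cardinality versus row count); the file name, column names, and threshold are illustrative assumptions.

```python
# Turn a few Table 1 heuristics into automated checks.
import pandas as pd

df = pd.read_csv("customers_cleansed.csv")   # illustrative file name

issues = []

# Illegal values: a gender column with cardinality > 2 may indicate a problem.
if df["gender"].nunique(dropna=True) > 2:
    issues.append("gender has more than 2 distinct values")

# Missing values: flag columns whose null percentage exceeds a threshold.
null_pct = df.isna().mean() * 100
issues += [f"{col}: {pct:.1f}% null" for col, pct in null_pct.items() if pct > 5.0]

# Duplicates: for a key column, cardinality should equal the row count.
if df["customer_id"].nunique() != len(df):
    issues.append("customer_id is not unique across rows")

print("\n".join(issues) or "no issues detected")
```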

Some examples of dirty data are shown in Table 2. These give insights into the complexity of analysis and the types of cleansing that will be necessary, particularly when millions of records are involved and the variety of dirty data is extensive.

Table 2: Common examples of data to be cleansed.

| Scope | Problem | Dirty Data | Reasons/Remarks |
| --- | --- | --- | --- |
| Attribute | Missing values | phone=9999-999999 | unavailable values during data entry (dummy values or null) |
| | Misspellings | city=Munnich | usually typos, phonetic errors |
| | Cryptic values, abbreviations | experience=B; occupation=DB Prog. | definitions not provided in data dictionary |
| | Embedded values | name=K. Jones 12.02.70 New York | multiple values entered in one attribute (e.g., in a free-form field) |
| | Misfielded values | city=France | |
| | Referential integrity error | name=Dave Smith, dept=137 | referenced dept (137) not defined |
| Record | Violated attribute dependencies | city=Seattle, zip=77777 | city and zip code should correspond |
| Record type | Word transpositions | name1=“J. Smith”, name2=“Miller P.” | usually in a free-form field |
| | Duplicated records | emp1=(name=”John Smith”, …); emp2=(name=”J. Smith”, …) | same employee represented twice due to data entry errors |
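A few of the Table 2 patterns can likewise be detected programmatically. The following sketch uses small illustrative DataFrames (emp, zip_ref) to flag dummy “missing value” placeholders, a violated city/zip dependency, and probable duplicate records.

```python
# Detect a few dirty-data patterns from Table 2.
import pandas as pd

emp = pd.DataFrame({
    "name":  ["John Smith", "J. Smith", "Dave Miller"],
    "phone": ["9999-999999", "212-555-0101", "646-555-0199"],
    "city":  ["Seattle", "Seattle", "Munnich"],
    "zip":   ["77777", "98101", "10001"],
})

# Dummy placeholders often stand in for missing values (phone=9999-999999).
dummies = emp[emp["phone"].str.fullmatch(r"9+-9+")]

# Violated attribute dependency: city and zip should agree with reference data.
zip_ref = pd.DataFrame({"zip": ["98101"], "ref_city": ["Seattle"]})
checked = emp.merge(zip_ref, on="zip", how="left")
bad_zip = checked[checked["ref_city"].isna() | (checked["ref_city"] != checked["city"])]

# Probable duplicates: same surname and city, e.g. "John Smith" vs "J. Smith".
emp["surname"] = emp["name"].str.split().str[-1].str.lower()
dupes = emp[emp.duplicated(subset=["surname", "city"], keep=False)]

print(dummies, bad_zip, dupes, sep="\n\n")
```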

A Sampling of Data Cleansing Verifications in ETL Testing Tools

DWH/BI teams often develop test tools and processes in-house to verify their data cleansing efforts, because commercial data quality assessment tools typically provide neither testing functions nor guidance for verifying the correctness of cleansing operations. The costs can be high as these data testing tools are developed (and themselves tested), then modified for other ongoing cleansing efforts.

For data cleansing projects, many ETL testing tools (e.g., Tricentis Tosca BI/DWH Testing) allow verification of data transformations/cleansing with minimal programming skills (e.g., SQL, stored procedures, etc.).

Test tool functions should primarily be chosen to support verification of source-to-target data cleansing and transformation tests – either during ETLs (by interfacing with the ETL tool) or directly after ETLs have completed.

A few of the major 1) data cleansing and 2) general data transformation test categories follow. The list is essentially in order of need, based on my research and experience.

  1. Tests using data profiling to verify that data has been cleansed or business rules have been applied.
  2. Tests of arithmetic conversions: perform tests of arithmetic operations (add, multiply, etc.) on values in source table fields or ETL lookups which are then loaded into target fields. Check the precision, accuracy, and format of the results (a sketch covering categories 2, 5, and 6 follows this list).
  3. Tests of data type and format conversions: test the conversion of data from the source data type or length to another data type or length in the target.
  4. Tests of merged, decomposed, enriched, grouped, or linked data: supporting (among other things) tests of joins and merges of data from a variety of source types or from like sources.
  5. Testing derived target values: test the computation of derived values in target columns using business formulas (e.g., averages, totals).
  6. Testing target data default values: when a source value is missing (null, empty string, etc.), verify that a correct default was applied to the target using business rules.
  7. Verifying ETL record rejects and related exception processing: verify that source records which should be rejected during the ETL process have in fact been rejected, and that an ETL incident/error log entry is created for every record that should produce one.
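As referenced in category 2 above, here is a minimal sketch of how categories 2, 5, and 6 might be verified once source and target data sit side by side; the source/target frames, column names, and the business rule (total = quantity × unit price, with missing quantity defaulted to 0) are illustrative assumptions, not a specific tool’s behavior.

```python
# Verify arithmetic conversion (2), a derived target value (5), and a
# target default value (6) against illustrative source and target data.
import pandas as pd

source = pd.DataFrame({
    "order_id": [1, 2, 3],
    "qty":      [2, 5, None],
    "unit_eur": [10.0, 3.5, 8.0],
})
target = pd.DataFrame({
    "order_id":  [1, 2, 3],
    "total_eur": [20.0, 17.5, 0.0],
    "qty":       [2, 5, 0],          # ETL is expected to default missing qty to 0
})

check = source.merge(target, on="order_id", suffixes=("_src", "_tgt"))

# (2)/(5) Arithmetic conversion and derived value: total = qty * unit price,
# with a missing qty treated as 0 before the multiplication.
expected_total = (check["qty_src"].fillna(0) * check["unit_eur"]).round(2)
assert (expected_total == check["total_eur"]).all(), "derived totals do not match"

# (6) Default values: a null source qty should arrive in the target as 0.
assert (check.loc[check["qty_src"].isna(), "qty_tgt"] == 0).all(), "default not applied"

print("transformation checks passed")
```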

Every data cleansing effort should end with a plan for automating repetitive data quality auditing and testing tasks. The goal should be to have ongoing procedures to check data for accuracy using reliable and efficient tools. It is essential to make data cleansing, and follow-up testing of the cleaned data, regular parts of your data analysis process.
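One way to make that follow-up testing repeatable, assuming a Python environment, is to express the checks as a pytest suite that runs after every cleansing or ETL cycle; the file and column names below are illustrative.

```python
# Repeatable data quality checks expressed as pytest tests.
import pandas as pd
import pytest

@pytest.fixture(scope="module")
def customers():
    # Illustrative cleansed output; point this at the real target extract.
    return pd.read_csv("customers_cleansed.csv")

def test_no_nulls_in_key(customers):
    assert customers["customer_id"].notna().all()

def test_key_is_unique(customers):
    assert customers["customer_id"].is_unique

def test_gender_domain(customers):
    assert set(customers["gender"].dropna().unique()) <= {"F", "M"}
```

Run with pytest on a schedule or in a CI job so that regressions in the cleansed data surface immediately rather than at the next manual review.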

[Read more data warehouse / BI testing blogs by Wayne Yaddow]