be_ixf;ym_201910 d_15; ct_250
image

BI / DWH Testing

Managing Data Warehouse/BI Data Risks Through Data Reconciliation and Data Lineage Processes

Editor’s note: Wayne Yaddow is an independent consultant with over 20 years’ experience leading data migration/integration/ETL testing projects at organizations including J.P. Morgan Chase, Credit Suisse, Standard and Poor’s, AIG, Oppenheimer Funds, IBM, and Achieve3000.  Additionally, Wayne has taught  IIST (International Institute of Software Testing) courses on data warehouseETL, and data integration testing. He continues to lead numerous ETL testing and coaching projects on a consulting basis. You can contact him at wyaddow@gmail.com.

This blog, which explains how a robust and flexible solution for data validation and reconciliation supports decision integrity, is the second article in a 3-part blog series. [Read Part 1]

International Laws Require Data Reconciliation and Lineage

European Union (EU) law requires that all financial firms conduct regular reconciliations of their front office records against financial databases and data integrations reported to EU regulators.

Due to the EU obligations under MiFID II, (i.e., the EU Markets in Financial Instruments Directive 2004), significant data risk is inherent in the process of data integration and reporting that could result in possible loss of BI “decision integrity”. All firms must implement a robust and flexible solution for data validation and reconciliation. MiFID II covers nearly the entire universe of financial instruments.

Data risk is an increasing problem in the financial industry due to the number of processes data is exposed to between its source and its target destinations. Under MiFID II, reporting data may need to pass through a number of external firm’s databases before reaching regulators. Each step frequently includes both manual and automated data transformation and enrichment layers, potentially introducing errors.

The further removed a transaction is from its final target, the greater the likelihood that errors will occur. It also becomes increasingly difficult to track these errors and reconcile data back to data sources (i.e., data lineage). The consequence is higher costs through manual data reconciliations, and the potential for enquiries and penalties due to inadequate controls.

What are Data Reconciliation and Data Lineage Processes?

Data reconciliation processes and tools are used to during testing phases during a data integration where the target data is compared against original and ongoing transformed source data to ensure that the integration (e.g., ETL) architecture has moved and/or transformed data correctly. [3]

Broadly speaking, data reconciliation is an action of:

  • Conducting vital (quality) checks after data loads/ETLs: [2]
    • Metadata correctness – sources to targets
    • Data integration completeness – row and column counts
    • Target data uniqueness – ex., no duplicate rows nor columns
    • The referential integrity of both source and target data – ex., primary and foreign key relationships
    • The correctness of data aggregations – source to target and transformations
  • Comparing data points to detect errors in data movements and transformations 
  • Identifying data sets (e.g., source and target) where there should be no differences
  • Comparing and verifying metadata from sources to targets (e.g., data types, data lengths, min/max values, precision, and much more)

Data reconciliation is central to achieving those goals. The stepped-up pace of the market and more demanding client expectations have made a high quality data reconciliation and data lineage assessment process a competitive necessity.

Candidate applications for data reconciliation can be based on these influences:

  • The complexity of the data extraction processes (e.g. extraction of delta data vs. full load)
  • Defined and regular human interventions
  • Number of modifications to standard extraction processes
  • Frequent changes to business rules

Data lineage processes and tools help organizations understand where all data originated from, how it gets from point to point, how it is changed/transformed, and where it may be at any time. [3]

Through data lineage, organizations can understand what happens to data as it travels through various pipelines (e.g., spreadsheets, files, tables, views, ETL processes, reports); as a result, more informed business decisions ensue. Data lineage clarifies the paths from database tables, through ETL processes, to all the reports that depend on the data; data lineage enables you to double-check all reports to ensure they have everything required.

Data lineage enables organizations to trace sources of specific business data for the purposes of tracking errors, implementing  changes in processes, and implementing system integrations to save substantial amounts of time and resources, thereby tremendously improving BI efficiency and “decision integrity”. And, without data lineage, data stewards will be unable to perform the root cause analysis necessary to identify and resolve data quality issues.

Organizations can not advance without understanding the story of that data that it uses continuously for decision making integrity. Data lineage gives visibility while greatly simplifying the ability to trace errors back to the root cause in a data analytics process.[3]

Finally, data lineage is often represented visually to reveal the data flow/movement from its source to destination via various changes. Also, how data is transformed along the way, how the representation and parameters change, and how the data splits or converges after each ETL or incorporation into reports.

An Example of Data Warehouse/BI Data Reconciliation

Generally, data reconciliation for information systems is an action of:

  1. Comparing specific data points
  2. Identifying differences in source and target data points where there should be none
  3. Troubleshooting differences

Example: Data warehouse ETL’s integrate product data from two sources: “Source A” and “Source B” (Figure 2), Each Source A and B record may have a different product number (Prod_num) for the same product name (ex., Mutual Fund A). [1]

In the image below, ETL’s create the “Data Warehouse” tables that allow for only one product number (Prod_num) per product. In case of a data discrepancy, the value from Source A will have precedence over the value from Source B.

Due to these implemented business rules, the Data Warehouse tables and BI reports will contain the Prod_num for product (Product: Mutual Fund A, Prod_num: 123) the same as in Source A, while the Prod_num from Source B will be logged as an exception – perhaps to be corrected in the future.

In Figure 1, the reconciliation of data between the Data Warehouse and the Source B Exceptions Log will be compared:

  1. Product Name and Product_num from the Source B Exception Log with the Data Warehouse tables
  2. For records that do not match, and the Prod_num from the Data Warehouse equals Prod_num in Source A, the reconciliation succeeds

Note: If Product_num exceptions in Source B had not been captured and logged, Data Warehouse Prod_nums and BI reports would not all reconcile with those from Source A.

Example of Data Reconciliation flow for a sample ETL integration

In this example the reconciliation processes are concerned with the data in the final reports, together with all the ETL exceptions, are accurately reconcilable with the Prod_nums in the original Source A.

Challenges with Data Reconciliation and Data Lineage Processes

A few challenges prevent widespread deployment of data reconciliation systems include a lack of services for a) securely and accurately generating data reconciliation information within a computing system, b) securely coordinating that collection within distributed systems, and c) understanding and controlling the storage and computational overheads of managing such processes.

  • Data reconciliation implementation efforts depend on the complexity of the data, the number of sources, and the number of metrics to reconcile.  It is desirable to prioritize the metrics based on business requirements.
  • Maintainability, supportability, and developing reconciliations to problem areas are some key matters that should be addressed by an IT support team.
  • Data reconciliation is an often overlooked and undervalued activity, seen as a routine—albeit necessary. However, done correctly, reconciliation is crucial to two of the most important determinants of your organization’s success: front-office performance and client satisfaction.
  • Unfortunately, reconciliation at many firms remains a laborious, time-consuming and risk-laden process, where the staff is forced to search through reams of spreadsheets and manually reconcile some data manually.
  • Automating reconciliation processes and implementing exception-based workflow best practices offers businesses easy wins in the quest for greater productivity, performance, and efficiency.

Next up: Part 3 will cover “Setting Up Data Warehouse/BI Projects to Support Data Reconciliation”

[1] Rachid Mousine, “Design Essentials for Reconciliation Data Warehouse”, Formation Data Pty Ltd.

[2] “Running the Vital Checks Database Wizard”, https://documentation.tricentis.com/tosca/1220/en/content/tosca_bi/bi_vital_filetodb.htm Tricentis.com

[3] Wikipedia, “Data Lineage”, “Data Validation and Reconciliation”, Wikipedia.org