Editor’s note: Wayne Yaddow is an independent consultant with over 20 years’ experience leading data migration/integration/ETL testing projects at organizations including J.P. Morgan Chase, Credit Suisse, Standard & Poor’s, AIG, Oppenheimer Funds, IBM, and Achieve3000. Additionally, Wayne has taught IIST (International Institute of Software Testing) courses on data warehouse, ETL, and data integration testing. He continues to lead numerous ETL testing and coaching projects on a consulting basis. You can contact him at email@example.com.
Data warehousing (DW) and business intelligence (BI) projects are a high priority for many organizations as they seek to empower more and better data-driven decisions and actions throughout their enterprise. These groups want to expand their user base for BI, analytics, and data discovery so fewer users are making uninformed decisions. At the same time, users are demanding high-quality and often complex BI reports.
However, data warehouse projects are highly intricate and fundamentally risky. Among their many tasks, the project manager who leads the data warehouse team must identify all data quality risks associated with a particular data warehouse implementation. The goal of this process is to document essential information relating to project risk.
This article focuses on helping organizations sidestep QA problems that many other data warehouse projects have experienced. Tips offered here will help ensure satisfaction as data warehouse (and DM/A) teams plan and implement new functions and capabilities. These time-tested recommendations may save significant money, time, and staff resources and improve results from the data warehouse application being developed. An extended version of this paper was first published on the Data Warehouse Institute website (TDWI.org).
Failure to introduce QA testing early in the project
During the initial phases of data warehouse/BI projects, the focus is often on BI requirements and the data-related work needed to build the operational data store, enterprise data warehouse, and application reporting infrastructure. Too often, overall project testing and data quality are overlooked.
Teams generally appreciate the importance of data quality, but as data warehouse requirements and design progress, the overwhelming focus on data modeling, data capture, and ETL design may cause them to lose sight of it. Eventually, issues such as these arise: “Target data is not reconcilable with sources”; “duplicate data abounds”; “aggregations and report drill-downs are not correct.”
Ultimately, the success of a data warehouse is highly dependent on an ability to plan, design, and execute a set of effective tests that expose early and ongoing issues: issues with data inconsistencies, data quality, data security, the ETL process, performance, business-flow accuracy, and the end-user experience.
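Tests that expose reconciliation and duplication issues can often start small. The sketch below shows the idea in Python against an in-memory SQLite database; the table and column names are illustrative placeholders, not part of any real schema.

```python
import sqlite3

# Minimal sketch: reconcile row counts between a source table and its
# warehouse target, and flag duplicate keys in the target.
# Schema and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (order_id INTEGER, amount REAL);
    CREATE TABLE dw_orders  (order_id INTEGER, amount REAL);
    INSERT INTO src_orders VALUES (1, 10.0), (2, 20.0), (3, 30.0);
    INSERT INTO dw_orders  VALUES (1, 10.0), (2, 20.0), (2, 20.0);
""")

src_count = conn.execute("SELECT COUNT(*) FROM src_orders").fetchone()[0]
tgt_count = conn.execute("SELECT COUNT(*) FROM dw_orders").fetchone()[0]
dupes = conn.execute("""
    SELECT order_id FROM dw_orders
    GROUP BY order_id HAVING COUNT(*) > 1
""").fetchall()

issues = []
if src_count != tgt_count:
    issues.append(f"row count mismatch: source={src_count}, target={tgt_count}")
if dupes:
    issues.append(f"duplicate keys in target: {[d[0] for d in dupes]}")
print(issues or "reconciled")
```

In practice such checks would run against real source and target connections and compare checksums or column aggregates as well, but the pattern of issue-collecting queries stays the same.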
Many data warehouse teams debate when to start testing as they’re developing new software. For most DW projects, software testing should start as soon as the design and requirements are baselined. An early start to QA provides several advantages that improve the overall efficacy of software testing. Involvement at the beginning of a project makes testers more effective by enabling them to learn about the product and business rules they will be testing, which allows them to design test plans and test cases based on desired outcomes as much as on the code they are presented with.
During the design and requirements phase, testers can work with developers to determine what aspects of a design are testable and what areas will have a higher risk. This knowledge will help prevent testing errors and better equip testers to design test cases and identify defects.
Implementing a successful data warehouse project is difficult. It requires a balance of many factors: strong business involvement, thorough data analysis, a scalable system and data architecture, comprehensive program and data governance, high-quality data, established standards and processes, and excellent communication and project management.
Failure to adequately profile and validate source data before loading it into the data warehouse
Analysts’ studies have consistently shown that over 75 percent of data warehouse and data integration project teams have overrun their schedules or budgets or otherwise experienced project failure. Why the high failure rate?
Inadequate source data quality is the root cause of failure across a wide range of corporate data warehouse initiatives. Profiling and validating all source data upfront can generate significant benefits.
The traditional approach to data warehouse projects follows these basic steps:
- Analyze the business, user, and technical requirements of the project
- Analyze the available internal and external data sources
- Identify and analyze a set of data sources from legacy systems, operational systems, and external sources to determine their relevance to the requirements of the target database
It may be folly to assume you know your source data before beginning to design your target data warehouse. The main weakness of the traditional data integration approach is its assumption that the data an application requires is actually available from the data sources. Major corporations have spent millions of dollars on data integration projects, only to learn that the source data will not support the target model, whether they built the model themselves or an enterprise application vendor defined it. Because the process is made up of a series of disjointed steps, usually executed manually by independent teams of programmers, the discontinuity between the steps often leads to disaster.
Source data profiling and related analysis should be conducted for each data source using column analysis, table analysis, primary and foreign key analysis, cross-table analysis, and so on. The essential up-front work lies in the profiling exercises that determine minimum, maximum, mean, mode, percentiles, standard deviations, and duplicate counts, in addition to metadata such as data types, lengths, null values, and string patterns. This metadata can then be used to discover potential issues, including prohibited values, misspellings, missing values, and varying value representations.
Failing to implement widespread automated testing
With the advent of DevOps for data warehouses, organizations are releasing new applications faster than ever—sometimes on demand or multiple times a day. However, numerous businesses are still using manual ETL test processes for highly visible or customer-facing applications. That translates into a risk to customer loyalty, the brand, confidential data, and worse. Even with new automation tools entering the marketplace, ETL and data profiling testing today is still generally accomplished manually.
Automating ETL tests allows frequent smoke and regression testing without much user intervention. Automated testing of trusted code after each new database build can save measurable time and cost.
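One common pattern for such automated smoke tests is a table of data-quality rules, each expressed as SQL that must return zero rows, run unattended after every build. A minimal sketch, again with an illustrative schema and placeholder rules:

```python
import sqlite3

def build_connection():
    # Stand-in for a freshly built warehouse; schema and data are
    # illustrative placeholders only.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE fact_sales (sale_id INTEGER, region TEXT, amount REAL);
        INSERT INTO fact_sales VALUES (1, 'EMEA', 100.0), (2, 'APAC', 250.0);
    """)
    return conn

# Each check is a query that must return zero rows to pass.
SMOKE_CHECKS = {
    "no_null_keys": "SELECT * FROM fact_sales WHERE sale_id IS NULL",
    "no_negative_amounts": "SELECT * FROM fact_sales WHERE amount < 0",
    "no_duplicate_keys": """
        SELECT sale_id FROM fact_sales GROUP BY sale_id HAVING COUNT(*) > 1
    """,
}

def run_smoke_suite(conn):
    """Run every check; True means the rule passed (no offending rows)."""
    return {name: len(conn.execute(sql).fetchall()) == 0
            for name, sql in SMOKE_CHECKS.items()}

results = run_smoke_suite(build_connection())
print(results)
```

Because each rule is just data, new checks can be added without touching the runner, and the same suite can be wired into a CI pipeline to gate each deployment.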
A decision to implement automated tools for data warehouse testing depends on a budget that supports additional spending to meet advanced testing requirements. If vendor-supplied test automation is deemed cost-prohibitive, consider test tools built and maintained in-house; even homegrown automation is a greater advantage than no test automation at all.
When developing scenarios for test automation, evaluate your complete set of test scenarios to determine the best candidates for automation based on risk and value:

- Which types of defects would cause you to stop an integration or deployment?
- Which types of tests exercise critical, core functionality?
- Which tests cover areas of the application that have historically been known to fail?
- Which tests provide information not already covered by other tests in the pipeline?
In the end, test automation saves time and money, and more important, business users will appreciate the quality of BI deliverables and accept the data from the data platform solution as a “single version of the truth.”
Failure to influence and support adequate project change management
Change is constant on any data warehouse project. Regardless of industry, there eventually comes a need for new requirements and other changes. A drive for continuous improvement and new DW requirements commonly initiates the need for a change in the project scope or deliverables.
Change management is an important component in the success of data warehouse initiatives, but how often is change management short-changed? According to Forrester’s Q1 2014 Global BI Maturity Survey, fully half of those surveyed believe that their processes for managing change based on new data warehouse data and functionality are not well established and do not function smoothly.
Testers need the ability to reference critical documents involved in changes—documentation such as business requirements, design and technical specifications, data mapping documents, ETL job flows, and more. The ability to identify and link these documents to the overall change management process and to prepare test plans from them is critical for effective quality assurance.
Change for BI initiatives often comes from multiple sources, including business owners or other stakeholders’ requests as well as impacts from source system changes. The QA team should participate in how changes are logged, managed, prioritized, and updated and then should ensure all changes are tested. If your organization has a change-tracking tool already, it is a good idea to make use of it.
Data warehouse and BI/analytics initiatives tend to budget for technology and implementation, but change management and post-go-live adoption activities are often underfunded or overlooked entirely. Good change management facilitates communication from the outset, moving users from resistance to acceptance and even excitement, which increases buy-in and greatly enhances successful adoption of the new functionality.
Patrick Meehan, research director in Gartner’s CIO Research Group, believes that most failing data warehouse/BI (and DM/A) projects are delayed or go over budget as a result of poor communication of new and changing requirements between management and IT teams.
When your organization is rolling out a data warehouse initiative, chances of success are greatly enhanced when change management is an integral part of the initiative.