Editor’s note: Wayne Yaddow is an independent consultant with over 20 years’ experience leading data migration/integration/ETL testing projects at organizations including J.P. Morgan Chase, Credit Suisse, Standard & Poor’s, AIG, Oppenheimer Funds, IBM, and Achieve3000. Additionally, Wayne has taught IIST (International Institute of Software Testing) courses on data warehouse, ETL, and data integration testing. He continues to lead numerous ETL testing and coaching projects on a consulting basis. You can contact him at firstname.lastname@example.org.
This is the second blog in a three-part series on the top BI/data warehouse testing challenges. The first blog focused on effective test planning for BI/data warehouse testing.
Well-planned extraction, transformation, and load testing should be a high priority and appropriately focused. Reconciliation of warehouse data with source system data (data that feeds the data warehouse) is critical to ensure business users have confidence that all reported information is accurate according to business requirements and data mapping rules.
Why is ETL testing so complex? Most of the action occurs behind the scenes; output displayed in a report or as a message to an interfacing system is just the tip of the iceberg. There is always more underlying data behind such reports and messages. The combinations can be virtually endless depending on the type of application and business/functional logic. Furthermore, enterprises are dependent on various business rules as well as different types of data (such as transactional, master, and reference data) that must be considered.
The environments to be tested are complex and heterogeneous. Multiple programming languages, databases, data sources, data targets, and reporting environments are often all integral parts of the solution. Information for test planning can be found in functional specifications, requirements, use cases, user stories, legacy applications, and test models—but are they complete and accurate? What about conflicts in input specifications? Many data types must be managed, including files, tables, queues, streams, views, and structured and unstructured datasets.
How do you represent test data? How do the testers prepare test data? How do they place it where it belongs? How much time will this take, and how much time should it take?
You can perform testing on many different levels, and you should define them as part of your ETL testing strategy. Examples include the following:
Constraint testing
The objective of constraint testing is to validate unique constraints, primary keys, foreign keys, indexes, and relationships. Test scripts should include these validation points. ETL test cases can be developed to validate constraints during the loading of the warehouse. If you decide to add constraint validations to the ETL process, the ETL code must validate all business rules and relational data requirements. If you automate testing in any way, ensure that the setup is done correctly and maintained throughout the ever-changing requirements process. An alternative to automation is to use manual queries. For example, you can create SQL queries to cover all test scenarios and execute these tests manually.
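As a sketch of what such manual constraint queries can look like, the Python snippet below runs two typical checks against an in-memory SQLite database — duplicate primary keys and orphaned foreign keys. The table and column names (`dim_customer`, `fact_sales`, and so on) are illustrative, not from any particular warehouse:

```python
import sqlite3

def find_duplicate_keys(conn, table, key_col):
    """Return key values that appear more than once (a uniqueness violation)."""
    rows = conn.execute(
        f"SELECT {key_col} FROM {table} GROUP BY {key_col} HAVING COUNT(*) > 1"
    ).fetchall()
    return [r[0] for r in rows]

def find_orphan_foreign_keys(conn, child, fk_col, parent, pk_col):
    """Return foreign-key values in the child table with no matching parent row."""
    rows = conn.execute(
        f"SELECT c.{fk_col} FROM {child} c "
        f"LEFT JOIN {parent} p ON c.{fk_col} = p.{pk_col} "
        f"WHERE p.{pk_col} IS NULL"
    ).fetchall()
    return [r[0] for r in rows]
```

Both queries should return empty result sets after every load; anything else is a defect to investigate.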
Source-to-target count comparisons
The objective of “count” test scripts is to determine if the record counts in corresponding source data match the expected record counts in the data warehouse target. Counts should not always match; for example, duplicates may be dropped or data may be cleansed to remove unwanted records or field data. Some ETL processes are capable of capturing record count information (such as records read, records written, or records in error). When the ETL process can capture that level of detail and create a list of the counts, allow it to do so. It is always a good practice to use SQL queries to double-check the source-to-target counts.
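A minimal double-check along these lines might look as follows in Python with SQLite; the staging and warehouse table names are hypothetical, and `expected_rejects` stands in for whatever drop count the ETL process reports (duplicates removed, cleansing rejects, and so on):

```python
import sqlite3

def reconcile_counts(conn, source_table, target_table, expected_rejects=0):
    """Compare source and target row counts, allowing for rows the ETL
    is documented to drop (duplicates, cleansing rejects)."""
    src = conn.execute(f"SELECT COUNT(*) FROM {source_table}").fetchone()[0]
    tgt = conn.execute(f"SELECT COUNT(*) FROM {target_table}").fetchone()[0]
    return {"source": src, "target": tgt,
            "reconciles": src - expected_rejects == tgt}
```

When the counts fail to reconcile, the gap between `source - expected_rejects` and `target` tells you how many records to hunt for.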
Source-to-target data validation
No ETL process is smart enough to perform source-to-target field-to-field validation. This piece of the testing cycle is the most labor-intensive and requires the most thorough analysis of the data. There are a variety of tests you can perform during source-to-target validation. For example, verify that:
- Primary and foreign keys were correctly generated
- Not-null fields were populated properly
- Each target field is free from improper data truncation
- Target table data types and formats are as specified
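The last three checks in the list can be scripted. The sketch below (field names and length limits are illustrative) flags nulls in mandatory fields and string values that exceed the target column length — i.e., values that would be silently truncated on load:

```python
def validate_target_rows(rows, not_null_fields, max_lengths):
    """Flag nulls in mandatory fields and values longer than the target
    column allows. rows: list of dicts mapping field name -> loaded value."""
    errors = []
    for i, row in enumerate(rows):
        for field in not_null_fields:
            if row.get(field) is None:
                errors.append((i, field, "null in not-null field"))
        for field, limit in max_lengths.items():
            value = row.get(field)
            if isinstance(value, str) and len(value) > limit:
                errors.append((i, field, "exceeds target length"))
    return errors
```

In practice you would drive `not_null_fields` and `max_lengths` from the data mapping document rather than hard-coding them.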
Transformations of source data and application of business rules
These tests verify all possible outcomes of the transformation rules, default values, and straight moves, as specified in the business requirements and technical specification document.
Users who access information from a data warehouse must be assured that the data has been properly collected and integrated from its various sources, transformed to remove inconsistencies, and stored in formats that conform to business rules. Examples of transformation testing include the following:
- Table look-ups: When a code is found in a source field, does the system access the proper table and return the correct data to populate the target table?
- Arithmetic calculations: Were all arithmetic calculations and aggregations performed correctly? When the numeric value of a field is recalculated, the recalculation should be tested. When a field is calculated and automatically updated by the system, the calculation must be confirmed. In particular, ensure that when business rules are applied, no data field exceeds its boundaries (value limits).
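The three checks above — look-ups, recomputed aggregations, and boundary limits — can each be sketched as a small verification function. Everything here is illustrative (the reference table, the default value, the tolerance) and would come from your mapping rules in practice:

```python
def lookup_code(code, reference, default="UNKNOWN"):
    """Table look-up: translate a source code via a reference table,
    falling back to a documented default for unmapped codes."""
    return reference.get(code, default)

def aggregate_matches(detail_values, loaded_total, tolerance=0.005):
    """Arithmetic check: recompute an aggregation from detail rows and
    compare it to the total the ETL loaded."""
    return abs(sum(detail_values) - loaded_total) <= tolerance

def within_bounds(value, low, high):
    """Boundary check: confirm a derived field stays within its value limits."""
    return low <= value <= high
```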
Batch sequence dependency testing
Data warehouse ETLs are essentially a series of processes that execute in a defined order. Dependencies often exist among the various processes, so executing them in the wrong order can load inaccurate data into the warehouse. It is therefore critical to maintain the integrity of your data. The testing process must include multiple iterations of the end-to-end execution of the entire batch sequence, with data continually verified and checked for integrity throughout.
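One small automatable piece of this testing is asserting that a given batch run order respects the documented job dependencies. A sketch (job names are hypothetical; the dependency map would come from your batch schedule):

```python
def order_respects_dependencies(run_order, prerequisites):
    """Check that each job in the batch runs only after all of its
    prerequisites have completed.
    prerequisites: job name -> set of jobs that must finish first."""
    completed = set()
    for job in run_order:
        if not prerequisites.get(job, set()) <= completed:
            return False  # job ran before one of its prerequisites
        completed.add(job)
    return True
```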
Job restart testing
In a production environment, ETL jobs and processes fail for a variety of reasons (for example, database-related or connectivity failures). Jobs can fail when only partially executed. A good design allows for the ability to restart any job from its failure point. Although this is more of a design suggestion, every ETL job should be developed and tested for restart capability.
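One common way to get restart capability — offered here as an illustrative pattern, not a prescription — is a checkpoint of completed steps that the batch consults on rerun, so only the failed step and its successors execute again:

```python
def run_with_restart(steps, checkpoint):
    """Run ETL steps in order, skipping any already recorded in the
    checkpoint, so a failed batch can resume from its failure point.
    steps: list of (name, callable); checkpoint: set of completed names."""
    for name, step in steps:
        if name in checkpoint:
            continue  # completed in a previous run; do not re-execute
        step()
        checkpoint.add(name)
```

Restart testing then means deliberately failing a step mid-batch, rerunning, and verifying that no completed step executes twice (important when steps are not idempotent).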
Process validation
During process validation, your testing team will identify additional data cleansing needs and identify consistent error patterns that might be averted by modifying the ETL code. It is the responsibility of your validation team to identify any and all suspect records. Once a record has been both data and process validated and the script has passed, the ETL process is functioning correctly.
Views testing
You should test views created with table values to ensure the attributes specified in the views are correct and the data loaded in the target table matches what is displayed in the views.
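A view-versus-table comparison can be expressed as a symmetric set difference: rows in the view but not the table, plus rows in the table but not the view, should both be empty. A sketch with SQLite and hypothetical names:

```python
import sqlite3

def view_matches_table(conn, view, table, cols):
    """Verify a reporting view returns exactly the rows loaded in the
    target table (the symmetric difference of the two result sets is empty)."""
    col_list = ", ".join(cols)
    only_in_view = conn.execute(
        f"SELECT {col_list} FROM {view} EXCEPT SELECT {col_list} FROM {table}"
    ).fetchall()
    only_in_table = conn.execute(
        f"SELECT {col_list} FROM {table} EXCEPT SELECT {col_list} FROM {view}"
    ).fetchall()
    return only_in_view == [] and only_in_table == []
```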
Sampling
These tests involve selecting a representative portion of the data to be loaded into target tables. Predicted results, based on mapping documents or related specifications, will be matched to the actual results obtained from the data loaded. Comparison will be verified to ensure that the predictions match the data loaded into the target table.
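A minimal sampling harness might look like this. The mapping rule here is whatever transformation the specification documents; using a fixed seed keeps the sample reproducible so a failing comparison can be rerun on the same rows:

```python
import random

def sample_and_compare(source_rows, target_by_key, mapping_rule,
                       sample_size, seed=0):
    """Select a reproducible sample of source rows, predict the target row
    from the documented mapping rule, and report mismatches against what
    was actually loaded. target_by_key: key -> loaded target row."""
    rng = random.Random(seed)  # fixed seed keeps the sample repeatable
    sample = rng.sample(source_rows, min(sample_size, len(source_rows)))
    mismatches = []
    for row in sample:
        predicted = mapping_rule(row)
        actual = target_by_key.get(predicted["id"])
        if actual != predicted:
            mismatches.append((predicted, actual))
    return mismatches
```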
Duplicate testing
You must test for duplicates at each stage of the ETL process and in the final target table. This testing involves checks for duplicate rows and checks for multiple rows with the same primary key, neither of which can be allowed.
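Because these checks run at every stage, including flat-file stages with no database engine behind them, it can help to have them in plain code as well as SQL. A sketch covering both conditions (duplicate keys and fully duplicated rows), with rows represented as dicts:

```python
from collections import Counter

def duplicate_keys(rows, key_fields):
    """Return key tuples that occur in more than one row, with their counts."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}

def count_duplicate_rows(rows):
    """Count rows that are exact duplicates of an earlier row."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return sum(n - 1 for n in counts.values() if n > 1)
```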
Performance testing
This is the most important aspect after data validation. Performance testing should check that the ETL process completes within the load window specified in the business requirements.
Volume testing
Verify that the system can process the maximum expected data volume for a given cycle in the expected time. This testing is sometimes overlooked or conducted late in the process. However, it should not be delayed.
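At its simplest, a performance or volume check is a timed run compared against the load window. The wrapper below is an illustrative sketch; in a real test, `job` would launch the ETL at maximum expected volume and the window would come from the business requirements:

```python
import time

def completes_within_window(job, window_seconds):
    """Run an ETL job and report whether it finished inside the specified
    load window, along with the measured elapsed time."""
    start = time.monotonic()  # monotonic clock is immune to system-time changes
    job()
    elapsed = time.monotonic() - start
    return elapsed <= window_seconds, elapsed
```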
Connectivity tests
As its name suggests, this involves testing the upstream and downstream interfaces and the intra-data-warehouse connectivity. Testing should examine the exact transactions between these interfaces. For example, if the design approach is to extract the files from a source system, test extracting a file from the system, not just the connectivity.
Negative testing
Check whether the application fails where it should fail, given invalid inputs and out-of-boundary scenarios.
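A negative test asserts that bad input is rejected, not merely that good input is accepted. A minimal sketch (the validation rule and its limits are illustrative):

```python
def accept_value(value, low, high):
    """Reject nulls and out-of-boundary values rather than loading them."""
    if value is None:
        raise ValueError("null not allowed")
    if not (low <= value <= high):
        raise ValueError(f"value {value!r} outside [{low}, {high}]")
    return value

def fails_as_expected(value, low, high):
    """Negative-test helper: True only if the input is rejected."""
    try:
        accept_value(value, low, high)
    except ValueError:
        return True
    return False
```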
Operational readiness testing (ORT)
This is the final phase of testing; it focuses on verifying the deployment of software and the operational readiness of the application. During this phase you will test:
- The solution deployment
- The overall technical deployment “checklist” and timeframes
- The security of the system, including user authentication and authorization and user access levels
Evolving needs of your business and changes in source systems will drive continuous change in the data warehouse schema and the data being loaded. Hence, development and testing processes must be clearly defined, followed by impact analysis and strong alignment between development, operations, and the business.