In the area of software development and application lifecycle management continuous and automated testing is proven practice. Yet in an area in which quality is key a structured approach in test management and test automation has still not become widespread: When it comes to the testing of data warehouses and business intelligence systems, one still encounters a high level of unstructured testing, manual test effort and individual, development-driven test activities.
The new Tricentis Tosca BI testing module extends the Tosca test suite with capabilities to meet the requirements arising from testing projects in the DWH/BI space. Watch Reeeliance consultant, Christian Stahn’s recent talk on finding out how you could leverage your DWH and BI testing activities by a defined and repeatable automated test approach.
Here’s the full transcript
Today, I have the pleasure to introduce you to automated BI and Data Warehouse testing with Tosca BI. My name is Christian Stahn, I’m working as a consultant for Reeeliance. And maybe before we go to the topic, just a short introduction of who we are and what we do. As mentioned, we are a consulting company in the area of information management, so we do business intelligence and data warehouse project. And here our consulting services focusing on three areas. So, on the top part we have the strategy definition. So here we help our client to define how to set course for a successful future, and future proof a way to organize and develop their information architecture.
So it’s about what should I even do, what’s my information management roadmap to define the projects, and to initiate this project. Then in the upper area of organizational platform, we focus on defining the guidelines on an organizational level. How to do the projects we defined on the strategy definition which focuses for example on enterprise architecture, mainly of course on the application and data architecture. But also, when it comes to how to set up the organization, how to setup processes, and what governance rules and process needs to be in place to build the right environment for the actual projects. And in the projects or in that area of project execution, we’re not so much the ones doing the implementation, so that’s the whole build part and run part is kind of skipped. So we focus rather on these classical software engineering disciplines and say “okay, how do they need to be adapted, to be adjusted to suit the needs that we have in testing BI and data warehouse projects.
So it’s project and project management, requirement management. Large part of it is data modeling. And of course also, if I have defined my information requirements, so what should my analytical applications provide. How can I ensure that they actually do this for the whole area of quality assurance and testing, which is the subject of that speech today.
Now maybe a short disclaimer, as Alain mentioned today, it’s rather this theory part to do a bit of the motivation, why its needed and what are different topics in the area to consider while there will be a more hands on part tomorrow and a workshop session and of course on our booth here in the exhibition area, you’re all very welcome to see also the tool in action.
So a short view on the agenda, I’ll talk a bit about the challenges that we have in BI and data warehouse projects and of course focusing on the area of testing and say, in general, what are things we have to consider when it comes to testing of BI solutions and data warehouses. And finally, I’ll give you a first idea how end to end testing with Tosca BI for automated testing can be done.
So, let’s have a look here. What risks and challenges do we have in BI projects? So, of course it’s not like back in the day that there is a BI project on the green field. Whenever we come now to a client, we usually focus on the situation that there is already not just one, but many BI or data warehouse solutions in place. We have a very heterogeneous architecture which also means that this knowledge about the different systems is rather scattered and widespread across different people which poses some demands on the testing. Because of this very heterogeneous architecture, there are many different technologies in place when it comes to the end to end testing, like really testing all layers of my architecture, testing all the different systems involved, mainly on the data entry side then very heterogeneous situation. That demands the testers of having are also very broad technical skills.
In addition to the technical skills, domain knowledge is also required to create tests and to define what needs to be tested on the BI level – especially on the front end. There is also a huge need for insight into the business logic.
Another important part here is the area of data. I mean, we are talking about data quality and many clients we have, have special requirements when it comes to the quality of data, certain governance rules they have to comply with especially when it comes to area of insurance and banking which are all very strong factors to consider when it comes to the testing of those solutions.
Testing, as mentioned, domain knowledge is required to say “okay, is the information I am provided the front end correct?” It goes hand in hand with the very high cost of testing because I need very skilled people to do that.
Moreover, in the area of BI, we have this pressure to deliver new iterations or new functionality to the market in shorter periods, so the need for DevOps and Agile practices is also given in here.
This is kind of a very high level view on what we face when it comes to testing in BI projects.
When you look at a very high level picture of how data warehouse or BI architecture could look like, usually we have data coming from a source system and it’s transformed across different layers to finally go to the consumer. So we usually have a kind of staging area where the data is loaded initially, then we have a central area where that information is stored. Then, for different stakeholders and different needs we have in this data area subject oriented areas of my data warehouse. And finally on the report or analysis area, we have then the activity of the data consumption. This is where the end user comes into place. Most of you know this curve of the costs.
The later the defect is found, the more expensive it is to actually fix this.
This can also be seen in the area of BI and data warehousing on the different layers of my architecture. Usually, when I find in my report that something is wrong, it’s very costly to go back all the chain to find where the error has been introduced. The goal, of course, needs to be also here to find it, the error, in a stage where it’s still a bit cheaper to actually correct this defect.
The goal must be to test all layers – not just the upper layers, and to test as early as possible
Now, the question is, to what extent is a company able to answer the question: what’s the quality on the different layers? So, who of you are in the area of business intelligence and data warehousing? Okay, some are. so, maybe you can answer for yourself how good are you in saying “okay, where exactly, until which layer can I rely exactly on the quality and where are most of the issues being introduced?”
We tried to, from the experience we have in our projects, to kind of asses and define the maturity in data warehouse testing that we face. A bit related to the CMMI level. When it comes to the state initial, that’s what we, of course, see at every client. So, everybody does manual tests on the front end layer. Whenever the BI team develops new reports, it develops a new application, the final business units that in the beginning defined the information requirement tests the results. They are checking reports, they are doing analysis in cubes and to see whether the data is right. That’s what we always found but it’s very manually and it’s at a very late stage.
When it comes to the level of managed, that means CMMI classifies it managed on a project level. Here, we also have defined unit tests by the developers which is, of course, something that we also see in all projects. The BI developers building this solution, of course, do their own tests and here in that managed layer, we see there is also at least some project level [of defined 00:08:37] test management approach. The developers are required to document their tests and to document their findings. But still, it’s rather manual and very individual.
This is what we see mostly when it comes to testing for our clients. More advanced clients also, we could say assigned this to the level of defined. There is already a sort of test automation. Usually, these are individually built frameworks that do their job but, again, it’s not so much the case that it’s a company wide approached and a very standardized approach. It’s still pretty individual and in combination with the manual business and also some defined regression tests. What we tried to define, to classify this that it’s rather proactive than reactive but of course there is still some more advanced levels that you could advance to. When it comes to the CMMI level of quantitatively managed, that means that the process is measured and controls needs or, of course, demands the process to be more strict and more defined. What we can see is that we have arrived at this level when, for example, the whole process of defining the tests and defining what needs to be tested and to measure the coverage, etc., is done by a very structured approach.
For example, there is base testing that you maybe also have seen as it’s supported by Tosca BI, to really prioritize the different requirements and say “okay, which area should I focus on when it comes to testing?”. You find a managed repository of regression tests that means that we have a set, a repository of repeatable tests that can be used to, when I have data iterations or additional testing parts that I can use this again to see, okay, that I have reached a certain level of quality, I do not fall under that level.
Finally, an optimized version. That’s when we really have it in the end to end view and that can be maybe seen two fold.
End-to-end means that BI and data warehouse is an integrated part of the business processes themselves, meaning there is a proactive data governance
For example, that there is a certain data stewardship that if the application changes, which of course always or in most cases has an impact on my reporting that I am proactively informed and can consider this in the evolution of my data warehouse. Of course, optimize would also mean that the whole testing part and development part goes into the direction of continuous integration, of continuous delivery, the whole DevOps topic that Tricentis is currently pushing.
How can an organization improve now to a more advanced and more mature level? This is again the picture from the beginning but this rising slope has and can also not be labeled by the cost of the defects but also by the cost of testing and by the complexity. As mentioned before, the data that’s here delivered like in a raw material from the source system is being transformed to fulfill the need and the goal to have a centralized and harmonized reporting. I want to see my client just once and not as many different types of clients and attributes of clients as I have from the different source systems. With that transformation, complexity is added and, as I said a few times already, to understand this logic that has been introduced on these different layers requires more skilled testers.
What should be your goal and what can help in increasing my test coverage?
That would mean that I, maybe, have to try to, on the lower levels, automate these tests and I have to automate the test creation and also the test execution to test early and that way to free my other time of my valuable resources for more complex business rules that are not so easy to test and especially not so easy to automate.
Let’s add another perspective to the picture. Also, BI projects are no longer around like back in the stone age. Like in other areas also in BI we have the pressure to deliver analytic applications and improvements to applications in a more timely manner. Instead of building the application layer by layer, I build the source, I build the stage layer, and another layer, I still try to have the full development cycle from design, design build test and, go to production. That’s what is meant by implement here. I try to do it across all layers in short iterations. What we see with our clients, we still have the good old waterfall approach with very long project cycles but also here, agile BI is no longer becoming more and more widespread and that’s being used and even if we see a lot of iterative approaches.
The duration of this iteration has a different length, if you like, but we see this and so, of course I think that’s nothing new for all of you that attend the accelerate when it comes to iteration. The amount of tests or the functionality that still needs to be tested increases by each iteration. How can this be covered? I need to have automated regression tests to still be able to cover this. That requires two things. I need to be able to do automation and when it comes to changes to the data, I have the means to maintain my test cases in a very efficient way.
Let’s have a look at what Tosca BI then finally offers to fulfill these needs that I have just shown. Here, we see again, a slightly adjusted picture of a general BI or data warehouse architecture. We, again, have our data sources which could be relational database systems: SAP Hana, Adobe, … etc. any kind of data sources that deliver data and information to my data warehouse. Data ingestion is a kind of different name for this whole staging area. Somehow this data goes into that whole environment that I call my data warehouse via different technologies be it classical ETL, via gloss, change data, capture, etc. and from that layer it goes through the next layers that in the data storage builds a kind of corporate memory or the core to store and to have a historical version of my data. Again, it could be different approaches when it comes to the data architectures. All the classical, 3NF modeling of a data warehouse, more advanced data vault technology or anything else, the operational data store, etc.
Business views and business logic could be, for example when it has a data vault approach, could be the business vault, could be sets of data quality rules, business rules, etc. that are applied when I proceed to the upper layers to the information consumption. If the classical view of data can still be described even if does not physically have facts and dimension describing these different facts to query the data.
We could have different reporting front ends accessing the data on different layers. You see a little bit how we tried to depict it in the arrows. It’s no longer that strict that we fulfill the way, go through all these layers and have everything in the data. Especially if there is a project of data and the sale service BI is really having tools and front ends to consume the data already on this raw level which, of course, makes this a little bit more difficult to test.
Focusing on the classical data: What does Tosca BI offer when it comes to testing?
The Tosca BI is an extension to the Tosca testing suite and offers a set of functionalities which we tried to cluster into different types of tests. The first area of test is the area of prescreening. What is the goal for this? Here we try to check the data if it’s even worth loading and to ensure that not already at this very early stage the process being interrupted or broken. This, again, reads from meta data and checks if the data is in the format that I have defined and that the data that’s delivered to us has to adhere to. These are meta data checks, format checks to see whether it’s in the right structure. Could be, for example, simple things like field tests or data types, etc. that we can check beforehand, before it goes to the next step.
The most important part(s) to testing using Tricentis Tosca BI are vital checks and field tests
These are tests on table and column level of the different tables. The general approach here is that I use the information that I have from my database which is the meta data to see, okay, what are the tests that I have to actually carry out and what are my test objects? By the help of wizards, I can read my DDLs, I can read the structure of my database and generate tests automatically. For example, they could be completeness tests that are simple row count, or to have a minimum or maximum values or sum checks on that level of completeness. But I can also read more advanced information from my database like uniqueness, integrity, constraints, etc. and use this information to generate my test case. That’s what Tosca does.
I can define and if you attend the workshop tomorrow, you can see you can read the meta data from a source, from a target, you can map it and say “okay, for example, this is a table that needs to be from the source, that needs to be completely also in the target and I automatically generate these test cases”. You can have this, basically, on all layers of our data warehouse. These table tests are rather straightforward and for the field tests I can also introduce quite a lot of individual logic so that I can, for example, define that a certain field, be it for example, a front key is that every of these values needs to be present in the dimensional table defined in this value, for example.
While these tests are being done on individual tables and rows, what also can be used is the whole area that’s called reconciliation tests where we compare to entire data sets. It could be on some level but we also offer, or the tool offers, the capability to do even a row by row comparison for every single field and every single row. That might be useful for some extent. For example, as described here that I have certain subsets that exported from my data warehouse to ensure that this is complete.
While this was rather on the backend layers, as you know Tosca BI is a front end test automation tool so I can also combine these two different rules. Here, in the report testing area, I can use the whole UI test automation. For example, enter, log into the front end, open a parametrized report or predefined report and compare it to the underlying layers in my database. For example, if I have an environment with this as my core data warehouse and the calculations are on the top layer, then I have, for example, Cognos as a front end tool and I have in between my meta data layer with a framework matcher and to ensure that here everything is correct, all joined conditions are defined correct, etc. I could, for example, have certain reports being opened by Tosca, the data being read and checked against the data marked. Combining all this together with the test that I have done on the previous errors have already given me quite good view and quite good feeling of the quality of my data.
There’s one final part that also should be mentioned. The whole area of monitoring. What Tosca also does well, it goes rather from the project view maybe more to the maintenance view that I kind of have a profiling and monitor the development of my environment. For example, I could store the usual number of rows that is added by every load, etc. I can have some indicators showing me or giving me a hint that maybe something in the back end is not right. For example, I can also check drop run times from certain lock tables in the database, etc. It offers me quite broad functionality to also support me in the run area.
Now I got carried away a little and see we just have two minutes left. Maybe one final thought that I want to give you.
The question is: can I automate everything with Tricentis Tosca BI?
If you ask the Tricentis guys they would say “of course you can”. Question is if you should. Automating tests is always an effort. I have different ways how to do it. I have an input, I have an output in my data warehouse and it undergoes various transformations and so what would be a mean to validate their output? I could rebuild the whole complex transformation that happens throughout the layers with very complex SQLs. That costs time and it’s most of the time also error prone again. It would have to be tested, so maybe that’s not the best approach but for some, and certain special areas, that could be a valid approach.
Another thing would be to have reference data that I can check my output against a defined set and that’s, of course, something that if you think back of the previous slides with the reconciliation that I can very easily do. If I have tested and verified test data, I have a very easy tool at hand to compare this data against my reference data. But the key, and with this single test, single capabilities that Tosca BI offers, I have another option and we call it the “divide and conquer” principle.
By separating the more complex transformation into individual tests and across the different layers to this table is complete and this field and that column is correct, etc., and combining all this, I can have a broad set of rather simple, easy to be defined and easy to be maintained tests to finally give me the whole idea, the answer, if the output is the desired output.
Okay, time is up. Maybe very short summary.
What can I do with Tricentis Tosca BI?
I can reduce the test generation effort by reading meta data and generating a large set of test cases on my own and still being able to adapt them and change them according to my needs. The maintenance effort is reduced, I can also update these tests automatically if something in the database changes and by that means I can reduce the effort to create and run the tests, especially on the lower levels. This way, as mentioned again, can free my valuable resources to focus with manual tests and with, again, then in the manual approach on the more complex parts of my data warehouse to be tested.
If you would like to find out more about BI and DWH, watch our Tricentis-Reeeliance joint webinar, called ‘BI Testing: Poor Data Quality Now Means Big Data Problems Later’