Test Gap Analysis turned out to be a surprisingly hot topic at Accelerate 2017. From Tricentis Founder Wolfgang Platz’s keynote presentation, to a standing-room-only technical talk by CQSE Founder Dr. Elmar Juergens, to attendee conversation on the floor—everyone was talking about Test Gap Analysis.
Is it because this fresh approach to testing new code faster and more efficiently seems so perfectly suited for Continuous Testing in Agile and DevOps processes? Or because it’s one of the few practices that truly bridges the gap between developers and testers?
You can learn more and decide for yourself if you join us in Vienna this October. Dr. Juergens will be presenting “Test Impact Analysis: How to Find New Bugs 100x Faster” (in a much larger room this time—now that we know there’s a ton of interest). We’ll also be featuring hands-on demonstrations of how Test Impact Analysis works with Tricentis Tosca, the industry-leading Continuous Testing platform.
If you missed last year’s introduction, here’s your chance to get up to speed before Dr. Juergens presents the latest innovations this October.
Here’s the full transcript:
In long-lived software systems, most of the defects in production occur in those areas that changed since the last release. There’s a whole body of research that tries to find out which areas are most bug-prone, and one of the key results is that most of the bugs are in those areas that changed since the last release. As testers, this probably won’t come as a surprise for you.
What was really surprising for us, however, is how much code goes through testing entirely unexecuted. How much code, and how much changed code, gets shipped entirely untested. In this presentation, I want to outline test gap analysis as a tool that allows you to find untested changes before you ship them into production.
Since I want to present our experiences with test gap analysis, just a short background on me, on where those experiences come from.
I have two professional roles. On the one hand, I work as a postdoc researcher in the research group at the Technical University of Munich. And in that role, I’m most interested in what is relevant to measure, because one of the key fallacies in research on software quality is that we measure a lot of things that are easy to measure but not relevant to know. And in the researcher role, I’m interested in what is really relevant to know.
In the Founder role of CQSE, I’m interested in how to have an impact, because just installing a tool and running it as an analysis doesn’t help anything. And here I’m interested in how can we build analyses that really have a positive impact on how we build and deliver software.
My third role is that I’m a Junior Fellow of the Gesellschaft für Informatik, the German Computer Scientists’ Foundation. And I’m trying to have more exchange between research and practice. I’m especially happy to be here today.
Here are a couple of the companies we work with. This is not meant as an advert. The main message I want to send to you is that test gap analysis and our experiences with it are not limited to one programming language, one organization type, or one type of software. You see a couple of business information systems. You see a couple of embedded software producers, and we have some teams that are very agile, like the LV 1871 here. They had Kent Beck himself in house. And we work for some government agencies, like the Bavarian police, where “agile” is maybe not the first thing that comes to mind when you think about how they develop software. My main message here is that what those companies have in common is that software is critical for them and that they have long-lived software systems that they need to test, and they can’t always test everything. They need to make conscious resource allocation decisions, and test gap analysis helps them to focus a significant part of the tests on those areas they changed.
Why Did We Start Researching Test Gap Analysis?
Why do we think it’s important? We made a study with Munich Re, which I can disclose, since we published it at a scientific conference. And the very simple illustration means this is the entire system, and this is the part that changed since the last release. This is what they covered in their tests, and now we have tested changes and we have untested changes. And in the study, we quantified these. And we analyzed two entire releases. And in the first release, 15% of the code that existed was modified since the last release, and half of those 15% went through testing unexecuted. On the second release, it was even worse. 60% went through.
And then what we did was we monitored the system for a long time. Every single bug that was reported by the users was traced back to the source code, and we analyzed if that source code was changed or not, and if it was tested or not. And, not surprisingly, field fault probability was five times higher for untested changes. This is not the interesting part of the study. I think the interesting part is how much code slipped through, because Munich Re has a very structured testing process.
And this was a study we did in retrospect, which is interesting from a scientific perspective, but not so interesting for the people involved because they say that it’s not useful to know that in hindsight.
What is Test Gap Analysis?
So what we built is test gap analysis.
It’s a hybrid analysis in the sense that we use a static analysis to determine what changed. For that, we connect directly to the version control system to recognize every single change a developer makes to any branch of the software. And we have a second part, which is a dynamic analysis, which traces test execution of all testing activities that take place, meaning manual tests and automated tests, Tricentis Tosca tests and unit tests and so on. And then we throw that information together to see where are untested changes.
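To make the combination of the two analyses concrete, here is a minimal sketch in Python. It assumes that both the change detection and the test tracing have already been reduced to sets of method names; the names `GapReport` and `find_test_gaps` and the sample data are hypothetical and not taken from any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GapReport:
    """Result of intersecting change data with execution data."""
    tested_changes: frozenset
    untested_changes: frozenset

    @property
    def change_coverage(self) -> float:
        """Share of changed methods that were executed in some test."""
        total = len(self.tested_changes) + len(self.untested_changes)
        return len(self.tested_changes) / total if total else 1.0


def find_test_gaps(changed_methods, executed_methods) -> GapReport:
    """Split the changed methods into tested and untested ones."""
    changed = frozenset(changed_methods)
    executed = frozenset(executed_methods)
    return GapReport(
        tested_changes=changed & executed,    # changed and executed
        untested_changes=changed - executed,  # changed, never executed
    )


# Hypothetical input: methods changed since the last release (static part)
# and methods executed by any manual or automated test (dynamic part).
changed = {"Invoice.total", "Invoice.add_item", "Customer.rename", "Report.export"}
executed = {"Invoice.total", "Customer.rename", "Login.check"}

report = find_test_gaps(changed, executed)
print(sorted(report.untested_changes))
print(f"change coverage: {report.change_coverage:.0%}")
```

The untested changes are the test gaps. Unchanged code that was executed (`Login.check` here) does not affect the result.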
To be better able to show examples from practice, I want to use a tree map, which Wolfgang also mentioned in the keynote today. The basic idea of a tree map is that we use boxes to depict components of a system. In that case, it’s a business information system, and every box here represents a business component or technical component (your IBase classes, dialogs and so on).
And now I go in deeper, and every small box represents a class. The surface area of the rectangle corresponds to the size of the corresponding class in lines of code.
And now I go in one level further. Now the small boxes are methods, the executable part of a system. And it still holds, the bigger the surface area, the bigger the corresponding methods in lines of code. Layout is performed by an algorithm according to package structure. Now, I use these tree maps to show changes, execution, and test gaps.
In this system, developers worked for half a year. This was pre-agile transition in that organization; 20 developers worked on roughly 2 million lines of code for half a year. Gray means the code that did not change. Red means new code, and orange means code that was there before but changed. So we see that not a lot of change took place here, and a lot of changed code appeared here.
I used the same visualization for execution. Now everything that is gray was not executed during testing. Green means executed during testing. This visualization includes both manual and automated tests, and not only from a single version, because what we often see is so-called cumulative testing. Maybe you have a two-week test period: for the first three days you execute tests on one version, then you get a new patch and you continue testing, but you don’t repeat all the tests. So we aggregate the data in such a way that a method is only green here if it did not change after the test.
And I think the interesting view now is the test gaps. I used the same tree map, and the colors now mean the following. Gray means unchanged, so this is the code that did not change since the last release. Everything colored is new or changed code. And only if it’s green was it executed during testing, either in a manual test or an automated test.
What we are interested in seeing is whether there are big red chunks. And here we see that in this system, in this test phase, there were entire components which were not tested at all, or almost not at all. We use test gap analysis in a way that it’s computed fully automatically, so that testers, test managers, developers, and so on get fresh test gap tree maps every single day. That lets them make a conscious decision on whether they want to let code slip through untested or whether they still want to do some tests.
And in this system, we spoke with the developers and the testers about what happened. This system had one contractor doing the development work and another contractor doing the testing. And the test contractor had a test factory in India which did exactly the testing they were told to do. The program management bug was that only regression tests were transmitted to the test factory. They didn’t get any tests for the new code, which is a project management error, not a fault of the test factory. In this case it was critical, and they decided not to release it like this. Instead they did an exploratory test where they took requirements engineers and end users and let them use the system without structured test cases, relying on their knowledge of the domain to report bugs if the system behaved differently than expected. And then after three weeks of testing with 23 exploratory testers, we got this map. At least we didn’t have the big red chunks, but we still had some areas which were untested. They were then inspected and further testing took place.
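The aggregation rule for cumulative testing can be sketched as follows. This is a simplified illustration that assumes versions are comparable integers and that the last change of each method is known; `cumulative_coverage` and the sample data are hypothetical.

```python
def cumulative_coverage(last_change, executions):
    """Aggregate execution data recorded on several versions.

    last_change: dict mapping method name -> version in which the method
                 was last modified (assumed to be comparable integers).
    executions:  iterable of (version_under_test, method name) pairs
                 collected from all test runs in the test period.

    A method counts as covered only if some test executed it on a version
    at or after its last change. Older executions are stale and ignored.
    """
    covered = set()
    for tested_version, method in executions:
        # Methods missing from last_change are treated as never changed.
        if tested_version >= last_change.get(method, 0):
            covered.add(method)
    return covered


# Hypothetical test period: version 1 is tested first, then patch version 2
# arrives and testing continues without repeating all tests.
last_change = {"Order.submit": 1, "Order.cancel": 2, "Cart.add": 1}
executions = [
    (1, "Order.submit"),
    (1, "Order.cancel"),  # stale: Order.cancel changed again in version 2
    (2, "Cart.add"),
]

print(sorted(cumulative_coverage(last_change, executions)))
```

`Order.cancel` stays out of the covered set even though a test executed it, because that execution predates its last change.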
Limitations of Test Gap Analysis
Now, I said I’m a researcher, and I also want to outline the limitations of such approaches. And our goal is not to have a 100 percent change coverage, per se, because there might be areas where it’s not problematic if we don’t test them.
For example, in this system, there was a lot of new code here which we inspected with the developers. It was not tested, but it was not a problem because it was not yet reachable. Remember they had six-month release cycles, and some of the developers prepared code for the next release which was not yet reachable. It’s questionable whether that’s a good development practice, but from a testing point of view, that’s not problematic.
Same here. Here was code that was not reachable anymore. From a development point of view, that should be deleted. But from a testing point of view it’s not problematic if it doesn’t work, because it can’t be executed. The code here, however, was critical. It was not used by anyone because it was hidden under your eye. They retested that.
And the second limitation I want to stress, is that even if we had 100 percent change coverage, this would not mean that we had no bugs. As always, testing can only show the presence of bugs. We can never prove the absence of bugs.
And I always try to say “executed during tests” for the green areas, not “tested”, because test coverage as an instrument is a bit coarse. We know that it was executed during tests, but we don’t know how thoroughly it was really tested. So for all the green areas, there can still be bugs inside that we haven’t found. But for all the orange and red areas, we know that every single bug in there cannot have been found because it was never executed.
On Analysis Granularity
On which level of granularity do we do this profiling? This depends a bit on the technology. We can do it on a statement level. But again, for all the red ones, since the entire method was never called, we know that every single statement and every single branch and every single path has not been executed. For the green ones, especially for the big ones, we don’t know how thoroughly they were executed. We can use profilers that do a statement- or branch-level analysis, but they often have a higher performance impact. So it’s a bit of a trade-off between low performance impact with a more coarse-grained analysis and more depth of information.
The funny point is, I only ever get this question in presentations like this one. When a company uses that on their own system, they typically have enough of those big gaps that they don’t want to know about smaller gaps, which would only increase the number.
Different Ways of Using Test Gap Analysis
Now, I want to cover some rough use cases, because there are different ways of using the test gap analysis practice. One is a hot fix test.
This is an example from an SAP system. A patch was deployed and now you want to make sure that the patch doesn’t introduce a new bug, because the customer’s already a little itchy. And the first thing you usually do is make sure that the bug’s really fixed. And with test gap analysis we can see that in this case, of the eight methods that the patch affected, only three were covered.
And now we put a tester and a developer together. We have them inspect the remaining changes and determine what further tests we should do to make sure that we didn’t break anything on top of that. We do this until we reach a completely green state.
I want to come back to the question asked before.
If we know those test gaps, what can we do to determine which test cases we would need to run to execute those test gaps? Or how critical are they?
And there are two different answers to that. The first one is the one that requires less technology: we simply put a tester and a developer together in one room, using the tool and just looking at the code. If you click in the tool on the tree map, you directly jump to the code and have the version history information and the ticket information, so you see in which context a change was made. And this, from our experience, works pretty well.
The second solution is to throw more technology at the problem. And our tool, Teamscale, is integrated with a version control system (to know all the changes), it is integrated with the ticket system (to know in which context a change was made). And many developers specify the ticket for every commit they do.
For example, this is a log from our own system, and the ticket 9838 was touched in all those blue lines. What we can do is analyze the log to find all the methods that were changed in the context of a ticket. Then we can match the test gap information against that and compute ticket coverage for every single ticket in our system. Be that a bug, a change request, a user story, an epic, whatever that’s called: as long as it’s managed in a ticket system, we can compute the ticket coverage. What percent of the code implemented for the ticket was executed during testing? This is not yet the test case you would need to run, but it’s at least closer to the functionality and makes a prioritization decision easier. Because maybe you have more test gaps than you can cover in your test phase. Then maybe you want to prioritize them according to risk, and knowing the ticket a gap belongs to is a significant step towards getting closer to risk.
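A rough sketch of ticket coverage in Python follows. It assumes that commit messages reference tickets as `#` followed by digits and that each commit is already mapped to the methods it changed; both are assumptions made for illustration. It also counts methods rather than lines of code, which is a simplification.

```python
import re
from collections import defaultdict

# Assumed commit-message convention: tickets are referenced as "#<digits>".
TICKET_PATTERN = re.compile(r"#(\d+)")


def methods_per_ticket(commits):
    """Group changed methods by the ticket IDs named in commit messages.

    commits: iterable of (commit message, set of changed method names).
    """
    by_ticket = defaultdict(set)
    for message, changed_methods in commits:
        for ticket_id in TICKET_PATTERN.findall(message):
            by_ticket[ticket_id].update(changed_methods)
    return by_ticket


def ticket_coverage(commits, executed_methods):
    """Return ticket ID -> share of its changed methods executed in tests."""
    executed = set(executed_methods)
    return {
        ticket: len(methods & executed) / len(methods)
        for ticket, methods in methods_per_ticket(commits).items()
        if methods  # skip tickets whose commits changed no methods
    }


# Hypothetical version control log and execution data.
commits = [
    ("#9838 fix rounding", {"Invoice.total", "Invoice.round"}),
    ("#9838 add debug hook", {"Invoice.debug"}),
    ("#9901 new export", {"Report.export"}),
]
executed = {"Invoice.total", "Report.export"}

for ticket, ratio in sorted(ticket_coverage(commits, executed).items()):
    print(f"ticket {ticket}: {ratio:.0%} of changed methods executed")
```

Sorting tickets by ascending coverage gives a first, coarse ordering for deciding which gaps to look at first.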
Test Gap Analysis and Security
Does test gap analysis contribute to security? No, not at all. We do have incremental security analyses in our tool that discover SQL injection defects, for example, by doing incremental string-taint analysis. And I’m happy to answer questions on that, but that’s beyond the scope of this talk.
Beyond Test Gap Analysis
I want to give a short sneak preview beyond test gap analysis.
One problem we often see with companies that are successful with automated testing is that the test suite grows, and executing all tests takes longer and longer. This increases the time between introducing a bug and having a failing test case, which lowers the value of the test suite.
And we are in a good position to fix that, because we know which code changed and we know which test covered which code. And we can for some languages, and with a Tricentis Tosca integration, capture the test coverage per test case.
If, for example, you test on a daily or hourly basis, we don’t need to always execute all tests. We can decide which tests will most quickly find bugs introduced by the changes since the last test run.
We do this in two steps. The first one is to select tests. Imagine every single dot is a test case, and we can remove all the non-impacted tests (meaning those tests that don’t execute any of the changes made since the last run) because we know they can’t find a bug. This cuts down the number of tests. The second step is to prioritize the sequence of the remaining ones in order to find a new bug as early as possible.
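The selection step can be sketched in a few lines of Python, assuming per-test coverage is available as a mapping from test name to the set of methods it executed. The function name and the data are hypothetical.

```python
def select_impacted_tests(coverage_per_test, changed_methods):
    """Keep only tests that execute at least one changed method.

    coverage_per_test: dict mapping test name -> set of methods the test
                       executed in an earlier, fully recorded run.
    changed_methods:   methods changed since the last test run.

    Returns a dict mapping each impacted test to the changed methods it hits.
    """
    changed = set(changed_methods)
    return {
        test: methods & changed
        for test, methods in coverage_per_test.items()
        if methods & changed  # non-empty overlap means the test is impacted
    }


# Hypothetical per-test coverage and change set.
coverage_per_test = {
    "test_checkout": {"Cart.add", "Order.submit"},
    "test_login": {"Login.check"},
    "test_refund": {"Order.cancel", "Invoice.total"},
}
changed = {"Order.submit", "Invoice.total"}

print(sorted(select_impacted_tests(coverage_per_test, changed)))
```

One caveat of this sketch: a test that is new and has no recorded coverage yet would be dropped, so in practice such tests would need to be selected unconditionally.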
How does that look? If we ran them all, we would know which ones pass and which ones (the red ones) fail. So, the ideal sequence would be to run the failing ones first. We don’t know that, but we can approximate it so that the failing ones at least run early. And if you want to break the build on the first failure, this significantly cuts down the execution time. We’ve run an empirical study on this, on a couple of open source systems and on our own system, Teamscale. In those studies, we find 90 percent of the bugs in 5 percent of the time, so we find a lot already during continuous testing. You still have to do a big run every week or month to have an accurate mapping and to find the 10 percent of bugs that slipped through. But you find a substantial amount in a very short time. And this is why we’re working on a tighter integration with Tricentis Tosca… to automate that full test.
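The talk does not say how the ordering is approximated. As one possible illustration, the sketch below uses a greedy "additional coverage" heuristic: always run next the test that executes the most changed methods not yet executed by an earlier test. This is a common prioritization heuristic and not necessarily what Teamscale does; all names and data are hypothetical.

```python
def prioritize(coverage_per_test, changed_methods):
    """Order impacted tests by greedy additional coverage of changed code."""
    changed = set(changed_methods)
    # Restrict each test to the changed methods it executes and drop
    # tests that touch no change at all.
    remaining = {
        test: set(methods) & changed
        for test, methods in coverage_per_test.items()
    }
    remaining = {test: hit for test, hit in remaining.items() if hit}

    uncovered = set(changed)
    order = []
    while remaining:
        # Pick the test that adds the most not-yet-executed changed methods.
        # Iterating in sorted order makes ties resolve alphabetically.
        best = max(sorted(remaining), key=lambda t: len(remaining[t] & uncovered))
        order.append(best)
        uncovered -= remaining.pop(best)
    return order


# Hypothetical per-test coverage and change set.
coverage_per_test = {
    "test_a": {"m1"},
    "test_b": {"m1", "m2", "m3"},
    "test_c": {"m3", "m4"},
    "test_d": {"x"},  # touches no changed method, so it is not scheduled
}
changed = {"m1", "m2", "m3", "m4"}

print(prioritize(coverage_per_test, changed))
```

A real implementation would likely also weigh test runtime, so that cheap tests covering many changes run first; that is left out here to keep the sketch short.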
[Editor’s note: Here’s a preview of that integration…]