AI in Software Testing
How to Test Learning Systems
The question that plagues most of us testers in the era of AI is, “Do we need to change our current test approaches to test learning systems?” In short, the simple answer is, “Yes, but only slightly!” Why is that? The past has shown several times that new technologies (e.g., mobile devices, cloud technologies) significantly affect how we deliver software. Unsurprisingly, software testing needs to go along with these changes, so these technologies also demand (at least) some slight adaptations to our current approaches to software testing.
Bottom Line. As technology advances, so must our test approaches.
On closer inspection, this becomes obvious. Learning systems are already our constant companions in our private lives, so it’s not surprising that these learning systems will also influence our daily testing routines. Learning systems (e.g., deep neural networks) are most probably the next big thing in software testing. But, bear in mind that this technology is just another technology. It’s not black magic.
What Do We Mean by “Testing”
Let’s first figure out what exactly we mean by “testing” to seriously talk about the impact of learning systems on testing. The purpose of testing is to close the gap between what we know and what we don’t know to reveal problems in our software. The reason we test won’t change, but how we do it will. How do we close this knowledge gap? We do it in two ways: We test software by checking (formal testing) and by exploring (exploratory testing).
Checking means evaluating software by applying algorithmic decision rules to specific observations of a software product (Michael Bolton). A check just provides an answer to the question, “Does this assertion pass or fail?” This implies that a check (e.g., performed through an automated test case) provides a binary result (true or false, yes or no, zero or one). That’s our current understanding. A check is machine decidable, and for that reason, checking is often referred to as formal testing.
Checking can be done either by machines or by humans. When it is done by machines, we call it machine checking, and when it is done by humans, we simply call it human checking. Checking is all about confirming existing beliefs. This way of looking at testing tends to focus on what we know (Maaret Pyhäjärvi).
In contrast to checking (formal testing), exploratory testing is a way of thinking, much more than it is a body of simple mechanical checks. Here, we want to figure out whether there is a problem in a product rather than asking if certain assertions pass or fail. This kind of judgment usually requires a variety of human activities (e.g., questioning, study, modeling, inference).
Therefore, exploratory testing means evaluating a software product by learning it through exploration and experimentation (Michael Bolton). This implies that exploratory testing is not about creating test cases. It is certainly not about the number of test cases you create. It’s about performing experiments (James Bach). Hence, exploratory testing is the basis for formal testing. It’s all about analyzing potential risks. So, this way of looking at testing tends to focus on what we don’t know.
What does this mean for testing learning systems?
For testing learning systems, it means just two things.
First, we need to learn. We must acquire knowledge about the internal workings of these learning systems (e.g., deep neural networks) to enable us to reveal problems in these systems through powerful testing experiments in the future. Here, you already apply exploratory testing because you focus on things you don’t know about the system.
In addition, you should challenge the assumptions that everybody (e.g., developers) holds about the system without any empirical evidence. This might enable you to reveal fundamental problems arising from incorrect assumptions about the system. This might sound like a hard and daunting task. But see the opportunities rather than the difficulties or threats: accept this new challenge and take the chance. We are more than convinced that, through thoughtful exploration and experimentation, you will become an even more powerful tester.
One way to break into the field of AI is to first talk to your developer friends to understand the specifics of the learning system you are about to test. In addition, take online courses about AI to get the big picture. We highly recommend the online course at deeplearning.ai by Andrew Ng (Stanford University). No worries; we don’t have a vested interest in promoting this course. It’s worth a shot. Don’t hold back, and don’t be afraid. It’s not rocket science (by any means). It’s your call.
Second, we need to change how we check. Our current testing techniques are mainly based on fixed inputs and outputs. Testers are hard-wired to believe that, given some inputs, the output will be constant until the software undergoes some changes. This is no longer true in learning systems. In these systems, there is no exactness anymore. The output is no longer fixed; it will change over time. It will evolve as the system is fed more and more data over time (Deepika Mamnani).
This implies that we need to break out of our established testing patterns. It sounds like we must throw our current test approaches away and start all over again. No worries; that’s not the case. Our current test approaches are still needed. They are still applicable. We just need to slightly adapt them to test these systems. Instead of testing static systems, we are now also asked to test dynamic systems. And that’s not a big deal. Trust us.
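The shift from fixed outputs to evolving outputs can be illustrated with a minimal sketch. The function names, the scores, and the accepted range of 0.75 to 0.90 are hypothetical and chosen only for illustration; they do not come from any real system.

```python
def check_exact(actual: float, expected: float) -> bool:
    """Classic check: passes only if the output equals one fixed value."""
    return actual == expected


def check_in_range(actual: float, low: float, high: float) -> bool:
    """Adapted check: passes if the output lies inside an accepted range."""
    return low <= actual <= high


# Hypothetical scores from the same system after two rounds of training.
score_after_round_1 = 0.81
score_after_round_2 = 0.84

# The exact check breaks as soon as the system has learned something new.
print(check_exact(score_after_round_2, score_after_round_1))  # False

# The range check still passes for both rounds.
print(check_in_range(score_after_round_1, 0.75, 0.90))  # True
print(check_in_range(score_after_round_2, 0.75, 0.90))  # True
```

The check is still a check with a binary result. What changes is the assertion: it describes an accepted range instead of a single constant.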
How Systems Learn
Before we dive into the nitty-gritty of testing learning systems, we need to understand two basic techniques through which these systems can learn.
- Supervised Learning. In the following sections, we will focus on supervised learning systems because supervised learning is by far the most frequently used technique across a wide range of industry use cases. For supervised learning, you have input data and the corresponding output data. Here, the job of the learning system is to learn the mapping function from the input data to the output data. The goal is to approximate this mapping function so well that, when you have new input data, you can predict the output data with high accuracy.
- Unsupervised Learning. In contrast, for unsupervised learning systems, you have only the input data. This technique is mainly used to understand the data structure and/or the data distribution. The goal is to learn more about the data. Unsupervised learning systems are often described as systems that teach themselves.
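The difference between the two techniques can be sketched in a few lines. This is a toy illustration, not how production systems are built: the one-dimensional data (an invented "weight in kg" feature), the nearest-neighbour rule, and the two-group clustering are all assumptions made for the example.

```python
from statistics import mean

# Supervised: every input comes with its expected output (the label).
labelled = [(3.0, "cat"), (4.0, "cat"), (5.0, "cat"),
            (20.0, "dog"), (25.0, "dog"), (30.0, "dog")]


def predict_nearest(x: float, examples: list[tuple[float, str]]) -> str:
    """Return the label of the training example closest to x."""
    return min(examples, key=lambda example: abs(example[0] - x))[1]


def two_means(values: list[float], rounds: int = 10) -> tuple[float, float]:
    """Unsupervised: group unlabelled values around two centers.

    Assumes the values contain at least two distinct numbers.
    """
    low, high = min(values), max(values)
    for _ in range(rounds):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        low, high = mean(near_low), mean(near_high)
    return low, high


# Supervised: predict the output for new input data.
print(predict_nearest(6.0, labelled))   # cat
print(predict_nearest(18.0, labelled))  # dog

# Unsupervised: only the inputs are used; the structure is discovered.
inputs_only = [weight for weight, _ in labelled]
print(two_means(inputs_only))  # (4.0, 25.0)
```

In the supervised case, there is an expected output to compare against. In the unsupervised case, the result is a description of the data (here, two group centers), so a tester has to judge whether the discovered structure is plausible.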
How to Verify that a Learning System Works
Let’s get more concrete. The above notion (learn, adapt) is at the heart of an experience report from Angie Jones (Twitter, Automation Engineer). The article shows how to test an advertising application that uses a deep neural network to determine the best ad that a business could provide to a specific customer in real time. The entire article is worth reading, but here we mention a few noteworthy items, enrich them with our own thoughts, and add findings from other experience reports. The steps required to verify that a learning system works as intended are as follows.
As outlined above, the first thing you need to do is to understand how the system processes data, how it learns, and how it utilizes information in the form of data to make future decisions. This helps you to determine the range of correctness.
Before you can test your learning system, you must give it some food to learn. This food is data. The next thing you need to do is to create a training data set. From a traditional point of view, this means that you relate the inputs to the expected outputs. In this training data set, you have the input data (e.g., images) labelled with the expected output (e.g., It’s a cat). By doing this, you set an expectation for what the system must know.
Note that, by bombarding your learning system with training data, you allow the learning system to create its internal model of reality. It will look for patterns in the data. The more data you provide and the more variations you have in your training data, the more the learning system will refine its model. This means that, by training the system, you put the system into a specific knowledge state. From a pure testing perspective, it’s like configuring your learning system. But bear in mind that, by providing the data, you already create an expectation of which patterns the learning system must know.
Our advice is to start simple and then incrementally increase the richness of the data after each test. In the end, the training data set needs to be representative of the real-world use of the system. Note that these systems are usually very sensitive to the provided data, so you can get fooled easily.
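One simple way to look at representativeness is to compare how often each label occurs in the training data with how often you expect it in real-world use. The sketch below assumes that such expected shares are known; the labels, the shares, and the tolerance of 0.10 are hypothetical.

```python
from collections import Counter


def label_shares(labels: list[str]) -> dict[str, float]:
    """Return the fraction of examples carrying each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def underrepresented(labels: list[str],
                     expected_shares: dict[str, float],
                     tolerance: float = 0.10) -> list[str]:
    """List labels whose share in the training data falls short of the
    expected real-world share by more than the tolerance."""
    shares = label_shares(labels)
    return sorted(
        label for label, expected in expected_shares.items()
        if shares.get(label, 0.0) < expected - tolerance
    )


# Hypothetical training set: 8 cat images and 2 dog images.
training_labels = ["cat"] * 8 + ["dog"] * 2

# Assumed real-world use: cats and dogs appear equally often.
expected = {"cat": 0.5, "dog": 0.5}

print(underrepresented(training_labels, expected))  # ['dog']
```

A result like this does not prove that the system will fail on dogs, but it tells you where to add data before the next test.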
In accordance with your training data set, you then must develop a test data set to check whether or not the system managed to learn the expected patterns. Note that the system won’t necessarily return a binary result (e.g., “Yes, there is a cat in the image!”). In enterprise applications, we usually don’t check for binary things such as cats, dogs, or any other animals in images. It’s more like, “Is the best ad shown to a specific customer based on the customer’s buying behavior?” But how do we verify this? Let’s have a look at an example below to bring some light into this darkness.
Train Service Example
Imagine you want to implement demand-based pricing for a train service. Your goal is to encourage riders to use the train during non-peak times. So, you want a system that dynamically adjusts the pricing in real time to make it financially attractive for the riders. As such, you want a system that convinces the riders to consider riding when the trains are less crowded.
Such a system usually has different pricing strategies and tries to optimize two things. It tries to balance the ridership throughout the day, and it tries to increase the total revenue from the ridership. Now, you might ask, “Why not just use traditional mathematical rule-based optimization?” It turns out that these rule-based systems aren’t practical anymore because of the sheer complexity of these scenarios. So, the goal of the learning system is to reach a state of (1) spread-out ridership and (2) revenue that at least covers the costs.
Now, imagine that the system has already been trained. How do you check that this system optimizes these two factors based on some given input? For example, imagine that you provide a test data set in which a bunch of people take a train from some location to some other location at the same price and at the peak time (e.g., 8am). But that’s not what the railway agency wants.
The system is now asked to suggest different travel times and different prices to the riders in such a way that the ridership spreads out and that the total revenue at least covers the costs for the railway agency. This implies that the system will return a ridership and price distribution. For example, imagine that the system (based on the provided input) suggests that 60% of the people should take the train at the peak time (e.g., 8am) for the highest possible price, 25% should take the train at a non-peak time (e.g., 7am) at a lower price (e.g., at a 10% discount), and 15% should take the train at a non-peak time (e.g., 9am) at an even lower price (e.g., at a 15% discount). These are fake numbers; don’t confuse them with reality. Nevertheless, the question is how to check whether that’s the optimal solution. Well, you can’t. You simply can’t check whether that’s the best solution that can be achieved. You can only check whether the system performed the desired optimization based on your expectations.
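Checking a suggestion against expectations, rather than against an unknown optimum, might look like the sketch below. It reuses the fake distribution from the text (60%, 25%, 15% with discounts of 0%, 10%, 15%). Everything else is an assumption made for illustration: 1,000 riders, a full price of 10.0, costs of 9,000.0, and an upper limit of 70% for the share of any single departure.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    departure: str   # departure time, e.g. "8am"
    share: float     # fraction of riders assigned to this departure
    discount: float  # fraction taken off the full price


def total_revenue(slots: list[Slot], riders: int, full_price: float) -> float:
    """Sum the revenue over all departures."""
    return sum(riders * s.share * full_price * (1 - s.discount) for s in slots)


def check_suggestion(slots: list[Slot], riders: int, full_price: float,
                     costs: float, max_share: float) -> dict[str, bool]:
    """Compare a suggested distribution with the tester's expectations."""
    return {
        # The shares must describe all riders.
        "shares_add_up": abs(sum(s.share for s in slots) - 1.0) < 1e-9,
        # No single departure may carry more than the accepted share.
        "spread_ok": max(s.share for s in slots) <= max_share,
        # Revenue must at least cover the costs.
        "revenue_ok": total_revenue(slots, riders, full_price) >= costs,
    }


suggestion = [
    Slot("8am", 0.60, 0.00),
    Slot("7am", 0.25, 0.10),
    Slot("9am", 0.15, 0.15),
]

result = check_suggestion(suggestion, riders=1000, full_price=10.0,
                          costs=9000.0, max_share=0.70)
print(result)
```

With these assumed numbers, all three expectations hold. Note that none of the checks says anything about optimality; they only state that the suggestion stays within what you expect.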
Based on the way you’ve trained the system, you derive expectations about how the system should react to new input data. For example, you could define a range for the average ridership spread and a range for the average revenue you expect for new input data. Then you could compare these expectations to the result of the system. You don’t check for binary results (e.g., cats in images) anymore. It’s more like measuring results in statistical terms. You measure the statistical significance of the results to establish that the system doesn’t randomly distribute the ridership and adjust the prices. The system must do this according to the learned patterns. Although that’s fuzzy, it’s the only statement you can make at the end of your testing day.
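Measuring results in statistical terms could be sketched as follows. Two assumptions are made here that the text does not prescribe: "random" is modelled as an even split of riders across the departures, and a chi-square goodness-of-fit test is used as the significance test. The rider counts, the repeated measurements, and the expected range are hypothetical.

```python
import math


def chi_square_uniform(observed: list[int]) -> float:
    """Chi-square statistic against an even split across all departures."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)


def p_value_three_slots(statistic: float) -> float:
    """p-value for exactly three departures (2 degrees of freedom).

    For 2 degrees of freedom, the chi-square survival function
    reduces to exp(-x / 2).
    """
    return math.exp(-statistic / 2)


def within_range(values: list[float], low: float, high: float) -> bool:
    """True if the mean of the measured values lies in the expected range."""
    return low <= sum(values) / len(values) <= high


# Hypothetical result: riders per departure (8am, 7am, 9am).
riders_per_slot = [600, 250, 150]
statistic = chi_square_uniform(riders_per_slot)

# A small p-value means an even, random-looking split is implausible.
print(p_value_three_slots(statistic) < 0.05)  # True

# Hypothetical peak-time shares measured on three new input data sets,
# compared with an expected range of 50% to 70%.
peak_shares = [0.60, 0.58, 0.63]
print(within_range(peak_shares, 0.50, 0.70))  # True
```

The significance test only shows that the distribution is unlikely to be random. Whether it follows the learned patterns is covered by the range expectations you derived from the training data.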
Never forget that your primary role as a tester is to provide information to others (e.g., developers, product owners) to enable them to make informed decisions. Writing a test report is like summarizing all the activities highlighted above. You should include what you know about the learning system and what you expect the system to do in certain situations to allow others to spot incorrect assumptions you’ve probably made.
You should provide detailed information about how you trained the system and what your expectations were to allow others to decide whether or not the training data is meaningful enough. You should include information about how you verified your expectations to allow others to interpret your (statistical) conclusions. All you have to do is communicate the level of confidence that the system works as intended.
The magic behind testing learning systems is not black magic. There is no magic at all behind the scenes. Let us conclude with a wise statement from Angie Jones: “I learned that no matter how smart machines become, having blind faith in them is not wise. While we should embrace new technology, it’s imperative that we (as testers) not allow others to convince us that any software has advanced beyond the point of needing testing. We must continue to question, interrogate, and advocate for our customers that we serve.”