GDPR Test Data Management


How GDPR Impacts Test Data Management

Test data challenges are nothing new to software testers. Manual testers spend 50 – 75% of their effort on finding and preparing appropriate test data. Moreover, when teams automate testing, full control of test data becomes mandatory.

Now, the enforcement of the General Data Protection Regulation (GDPR) is further complicating the test data challenge by placing constraints on using production data for testing purposes. Many vendors are suggesting that the GDPR can be addressed by simply deploying products that assist with data generation, data extraction and masking. But is it just a question of finding the right tool? What about the challenges of consistent data provisioning in multiple systems with data redundancy—in other words, test data management in enterprise system landscapes?

Watch as Tricentis Test Data Management Guru, Franca-Sofia Fehrenbach, talks about strategies for getting the exact test data you need, without putting your organization at risk of violating the GDPR.

Here’s the full transcript

Hello everyone. Welcome to my talk about the General Data Protection Regulation and test data management. My name is Franca-Sofia Fehrenbach, and I would like to show you my agenda first. First, I will point out some important facts about test data management: why are we facing so many test data management issues these days, and how can we solve our problems? The second topic will be the GDPR itself. I will point out the most important facts you need to know about the General Data Protection Regulation, and what consequences it will have on test data management. Last but not least, I will talk about the test data management challenges, and I will explain how to establish a proper TDM strategy.

Why are we facing so many Test Data Management issues these days?

We at Tricentis have been talking about it for years, but acceptance in the industry was limited. One of the reasons is that, for manual testing, the testers themselves dealt with test data management: they prepared and created the test data that fit their tests.

This process of creating appropriate test data for your tests is really time consuming. It's actually so time consuming that most testers spend up to 75% of their manual test effort just on test data preparation. They use only a quarter of their time for actually executing their tests.

As long as people test manually, they won’t be aware of the test data issue and the effort they really invest in this topic.

Now, you heard Wolfgang’s keynote, right? Everyone goes for automation. As soon as you tackle test automation, you also need to prepare your test data properly. Because if you don’t prepare your test data for your test automation, you will run into a phenomenon that is already very famous: false positives. Can anyone explain to me what false positives actually mean? Come on, what does it mean?

False positives occur when a test case produces a fail result, but there’s actually nothing wrong with the application under test.

Let’s have a look at this number: 75%. 75% false positives in an enterprise test portfolio is the average number, meaning 75 out of 100 test cases go wrong without pointing to any error. There can be three reasons for that. The first reason is maintenance: if the test cases are not properly maintained, you will have a lot of false positives. The second reason is insufficient test data. This is why we are here too. The third reason is that the test environment is not sufficiently available. You will run into these three problems sequentially.

First you will have, of course, the lack of maintenance issue. If you are able to solve this problem, you will run into a test data management problem. If you are able to solve this problem as well, you still need to deal with the test environment.

The GDPR drives this TDM problem even further. General Data Protection Regulation. I want to explain three main keywords which I will use. First, the data subject: a data subject is an identified or identifiable person. A data controller is a person or organization who is in possession of personal data, and determines the purposes for which, and the manner in which, any personal data are processed. Last but not least, the data processor: any person who processes the data on behalf of the data controller.

Nine things you really need to know about the GDPR. First, it is replacing the EU’s Data Protection Directive, which has been in force since 1995. This directive does not really account anymore for the way we are using technology and data today, so the EU came up with a new regulation. It’s a regulation, not a directive, so what does that mean? Directives are implemented at the national level, whereas regulations are passed at the EU level and automatically become legally binding for each member state.

Yeah, there’s no way around it anymore. There are not a lot of differences in interpretation anymore; it is a clear regulation, and it applies to every member state. It captures a lot of personal data. What does “a lot” mean? There has been a long-running debate regarding which data should be considered personal. The GDPR states that personal data does not only include any information relating to a data subject, but also includes online identifiers, such as IP addresses and unique device identifiers.

It has extraterritoriality. What does that mean? Extraterritoriality means it does not only apply to organizations within the EU, but also to organizations outside the EU. It has a global effect. Of course, not all organizations outside the EU, only organizations that deliver services or goods to any EU data subject.

It applies to data processors. The current EU legislation only applies to data controllers and does not affect data processors. The GDPR wants to make clear that data processors also play a critical role in protecting EU citizens’ data, and therefore it introduces several rules to support this.

Data breaches must be notified. If you’re an organization and you notice that you have a data breach, you have a duty to report it to a supervisory authority. This must happen within 72 hours.

Data exports aren’t getting any easier. You have to make sure that you protect the data. Businesses in general are prohibited from transferring data outside the EU if the data protection is not sufficient. Of course, what is sufficient? There is a long-running debate about it, but still, you need to take care with any data you want to transfer outside the EU.

There are massive fines. They can be up to 4% of global annual turnover. Take the 2016 Tesco Bank breach as an example: had the GDPR been in effect at the time of the breach, Tesco could potentially have been liable for fines of up to 1.9 billion pounds. 4% of global annual turnover can really hurt.

Last but not least, individuals’ rights are strengthened. This is a huge topic. There are a few changes in data subject rights, and I want to take a closer look at them. We start with the right to access. Under the current EU legislation, individuals have the right to access, block, delete, and change their own data. This will stay the same when the GDPR is in place, but the GDPR expands these rights. What does that mean?

As a data subject, you are allowed to obtain the data controller’s confirmation of whether your data is processed or not, and even where it is processed and for what purpose.

Next, the right to be forgotten, also known as the right to erasure. That means basically you can send a request to an organization and tell them you want your data deleted, and they need to delete it. It doesn’t matter if there is any compelling reason for its deletion or not.

Data portability. You are allowed to receive the personal data concerning you, and you are also allowed to have your data transferred, copied, or moved from one IT environment to another. This, of course, must happen in a safe and secure way.

Breach notification, I already mentioned it. An organization has the duty to notify the data breach within 72 hours to a supervisory authority.

Privacy by design. This topic has a long history. The GDPR makes clear that privacy needs to be built natively into products and services. This will also affect development, not only the testing world. Okay, when will this be enforced? On the 25th of May, next year.

I have a quick slide about the GDPR in comparison to other countries. Definition of personal data: in the EU, personal data means any information related to a natural person. In the US, it means anything that can be used to contact or distinguish a person. In Australia, any information about an identified individual. There are still some differences.

Collection and processing. In the EU, under the GDPR, what matters is what is stored, where it is processed, when it was collected, who uses it, and for what purpose. In the US: pre-collection notice and opt-out for use. In Australia: why is it collected, who uses it, and what for? This is pretty similar across all of them.

Breach notification, as I already said twice: within 72 hours. In the US, notification is simply required, with no further definition. In Australia, it is not even a legal obligation.

Okay, so how will it affect a tester? Just assume you’re a tester, and you have a specific authority level. The question here is, is it enough?

Can this tester do his job with the authority level he currently has?

As you know, a tester needs a certain authority level to do his job, but a high authority level grants access to personal data as well. Access to personal data violates the GDPR restrictions.

You also need to consider the authority level of the testers. I will talk about that later, in the last section, when I discuss dynamic masking processes, where you need to consider the authority levels of the users.

A real-world example: back in 2015, Kiddicare, a company from the UK, wanted to create a new website. Of course they wanted to test it, and they had a testing environment for that. They had a security breach, and this breach exposed real customer data; 800,000 customers were affected. Everything got exposed: names, addresses, contact details. When something like this happens under the GDPR, you can imagine how high the fines will be.

What consequences will it have on Test Data Management?

Of course, productive data is no longer applicable for testing if it contains any personal data. You need to pseudonymize or anonymize your test data. What does that mean? How can we design a good test data management strategy and still comply with the GDPR rules?

When we come to a customer and the customer has TDM issues, what is the first thing he does? He panics, and he looks for the right tool to solve his problems. This won’t help. The right tool alone won’t help. It is all about the concept. You need the right concept, and it’s not easy, we know that.

Having a good TDM concept is really difficult and you need to think about it, but it’s worth it, and you need to do it.

There are different scenarios with sensitive data, and these different scenarios also require different methods of data protection. The primary purpose of data is to work with it in production. The GDPR of course cannot prohibit that. You need your data in production, but there is something for that, we call it dynamic masking. I already mentioned it earlier.

There’s also the secondary use, secondary use means you use your data for testing purposes. There are two different approaches. Use the synthetic test data generation, or you use the persistent masking. I will now explain the two different masking techniques. First I will start with dynamic masking, and then the persistent masking.

Masking strategy: dynamic masking. When you use the dynamic masking procedure, let’s call it that, you change the appearance of the data but not the data itself. You leave the production database untouched. You anonymize the data depending on the authority level of the user.

As an example, and I know we already used this example, but it helps to understand it: in the production database on the left, you have the real values, of course. Some users need to work with the production data because they have to, which is okay under the GDPR. The analyst, for instance, gets the real data. An IT administrator, for instance, just needs the last four digits.

Through a dynamic masking process, you give him just the last four digits. Offshore support doesn’t need to know the real numbers at all, so you mask everything; you completely change the values. Dynamic masking is nothing more than reacting to the authority level of the user and changing the data for its purpose.
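The role-based views described above can be sketched in a few lines. This is a minimal illustration of the idea, not a real dynamic masking product (which would typically sit at a database proxy layer); the role names and masking rules are assumptions for the example.

```python
# Minimal sketch of dynamic masking: the stored value never changes,
# only the view returned to the user, based on their authority level.
# Role names ("analyst", "it_admin", ...) are hypothetical.
def mask_card(card_number: str, role: str) -> str:
    """Return a view of the card number appropriate to the user's authority level."""
    if role == "analyst":
        # Works with production data, so sees the full value.
        return card_number
    if role == "it_admin":
        # Needs only the last four digits.
        return "*" * (len(card_number) - 4) + card_number[-4:]
    # Offshore support (or anyone else): fully masked.
    return "*" * len(card_number)

print(mask_card("4556123498761234", "it_admin"))  # ************1234
```

The production value is passed in unchanged each time; the masking decision happens per request, per user.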

Persistent masking is completely different, of course. You permanently change the data from the production database. You anonymize the data, and if a user then works with this anonymized database, there is no risk left, because there is no sensitive data anymore. If there is a data breach, no real production data is exposed anymore, so what could happen? This is persistent masking.

What can this look like? Well, it’s pretty easy: you just change the data. Of course, there are things you need to be careful about, but in theory you just change the data and anonymize it. You have several techniques for that. You can shuffle values. You can substitute values: use a Willis for every Smith, or a Davis for every Jones. You can use constant values for a city, and for credit card numbers there are of course special techniques you can use. Keep in mind, it should be a one-way process. It must not be possible, of course, to go from the anonymized database back to the production database.
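As a rough sketch of the substitution technique: the surname mapping follows the talk's Smith-to-Willis and Jones-to-Davis example, and a one-way hash picks replacements for other names so the original cannot be recovered from the output. The fallback name pool and the hashing scheme are assumptions for illustration, not any specific product's algorithm.

```python
import hashlib

# Persistent masking via deterministic substitution: the same input always
# maps to the same replacement (so redundant copies stay consistent), but
# the mapping is one-way -- you cannot recover the original from the result.
SUBSTITUTIONS = {"Smith": "Willis", "Jones": "Davis"}  # example from the talk
FALLBACK_POOL = ["Miller", "Baker", "Carter", "Fisher"]  # hypothetical pool

def mask_surname(name: str) -> str:
    if name in SUBSTITUTIONS:
        return SUBSTITUTIONS[name]
    # One-way: a hash of the original selects a replacement from the pool.
    digest = hashlib.sha256(name.encode()).digest()
    return FALLBACK_POOL[digest[0] % len(FALLBACK_POOL)]

print(mask_surname("Smith"))  # Willis
```

Determinism matters when the same person appears in several redundant systems: each copy must be masked to the same replacement value, or the masked data becomes inconsistent across the landscape.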

How does this fit into GDPR?

The four key points. Risk reduction, you minimize the sensitive data footprint. Consent, use customer data for the purpose it was intended. Security, ensuring only authorized individuals have access to sensitive data. This was basically the dynamic masking. Erasure, removing an individual’s association to a data record, so one way, not both ways.

How can we properly implement TDM?

Now let’s talk about a TDM implementation. Basically, we have two sides: the supply side and the administer side. First we focus on the supply side, and later on the other one.

On the supply side you have two different approaches. You have the top-down approach, which is basically the data extraction and masking process itself, and you have synthetic test data generation, the bottom-up approach. What do you think is the best way to create your test data? Of course, you try to create your test data synthetically as much as possible and just fill the gaps with top-down generated data. Why? Because there are several challenges you have to face when you work with the top-down approach.
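A bottom-up, synthetic generator can be as simple as the following sketch: every field is fabricated from scratch, so no production data (and therefore no GDPR-relevant personal data) is involved. The field names and value pools here are hypothetical.

```python
import random

# Hypothetical bottom-up (synthetic) test data generator: customers are
# built entirely from fabricated value pools, never from production data.
FIRST_NAMES = ["Anna", "Ben", "Clara", "David"]
LAST_NAMES = ["Miller", "Baker", "Carter"]

def generate_customer(seed: int) -> dict:
    rng = random.Random(seed)  # seeded, so test runs are reproducible
    return {
        "first_name": rng.choice(FIRST_NAMES),
        "last_name": rng.choice(LAST_NAMES),
        # Fabricated IBAN-like string: country prefix plus 20 digits.
        "iban": "DE" + "".join(str(rng.randint(0, 9)) for _ in range(20)),
    }

# Generate a fresh batch for a daily test run -- no golden copy needed.
customers = [generate_customer(i) for i in range(100)]
```

Seeding each record makes the data reproducible across runs, which also addresses the golden-copy problem mentioned above: you can always regenerate exactly the data a failing test consumed.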

Some of these challenges are listed here. You need extensive knowledge of all databases and systems. It is really complex and time consuming, and you have a high risk of failure. The test data might not be in the golden copy, and if it is in the golden copy, you might not have enough for a daily test run. Last but not least, you need to extract the right data. I want to go into exactly this topic.

Because at an enterprise level, you have a lot of systems, a variety of systems, and they all interact with each other. These systems have a high level of redundancy, and these redundancies also need to be considered in the masking process. Usually, people have something like an enterprise integration layer in between, which takes care of orchestration and data delivery and makes sure that your data is consistent.

Now, when it comes to masking, it can be really tricky, because you also need to consider these redundancies in the different systems. You basically have two options to deal with it. You can rebuild the whole orchestration logic, but then you have a maintenance nightmare, because whenever something changes in the bus, you need to change your masking process as well and adapt to the changes.

Option two is to let the enterprise integration layer play in your favor. Meaning, you use the orchestration logic that is already there and feed data through the APIs. The enterprise integration layer that is already in place then takes care of data consistency, and you don’t have to take care of it yourself.

I’m running out of time, so I need to hurry a bit. The right side: you have to administer test data as well. Administering test data basically means you need a stateful test data engine. What is stateful test data? Complex test processes require test data in certain well-defined states. When registering test data in TDM, a state needs to be assigned to it. Test data is, of course, altered by the test itself, which we also call consumption of test data. What can this look like?

On the bottom, you have your stateful test data management. For your test, you read stateful test data that fits your test, you use this test data to test the system under test, and afterwards you register the state changes in your TDM.
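The read/consume/register cycle described above might look like the following toy stateful test data engine. The class and method names are illustrative assumptions, not any specific tool's API.

```python
# Toy stateful test data engine: records are registered with a state, a test
# reads a record in the state it requires, and after the test consumes the
# data, the state change is registered back into the engine.
class StatefulTDM:
    def __init__(self) -> None:
        self._records: list[dict] = []

    def register(self, state: str, data: dict) -> None:
        """Register test data together with its well-defined state."""
        self._records.append({"state": state, "data": data})

    def read(self, state: str) -> dict:
        """Find test data in the required state for the next test run."""
        for record in self._records:
            if record["state"] == state:
                return record
        raise LookupError(f"no test data in state {state!r}")

    def update_state(self, record: dict, new_state: str) -> None:
        """Register the state change after the test has consumed the data."""
        record["state"] = new_state

tdm = StatefulTDM()
tdm.register("order_created", {"order_id": 42})
record = tdm.read("order_created")       # test reads data in the needed state
tdm.update_state(record, "order_paid")   # test executed; state advances
```

The key point is the round trip: because the test reports the new state back, the next test that needs an order in state "order_paid" can find one, instead of failing with a false positive on stale data.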

Okay, a very important slide. Let me go back a bit, and just think about it. On the left side we had the bottom-up approach and the top-down approach: synthetic data generation as the bottom-up approach, and masking and extraction of data as the top-down approach.

Is it really possible to cover a lot of risk using just the synthetic test data generation approach?

I prepared an extra slide for that, because yes, it is possible. This slide is about the risk coverage that can be provided by tests using synthetic test data. As you can see, the risk coverage is really high: 97% for transportation, 98% for retail, 92% for insurance, 96% for telco. Only for banks is it a bit lower, due to data history. Synthetic test data generation is really an approach you should consider, and it goes hand in hand with the GDPR, because there is no risk when you don’t use real customer production data.

The core conclusion is: whenever you can go synthetic, do it. The only constraints are automating the provisioning of test data and dealing with historical data, especially when it comes to banks. TDM is more about the right concept than the right tools. Think about it.

As a summary: test data management is the next big thing to solve. As soon as you start automating test cases without properly prepared test data, you will have a high level of false positives. Productive data is no longer applicable for testing if it contains personal data.

Last but not least, TDM is more about the right concept than the right tools. Whenever you can go synthetic, do it.

Of course, with our architecture, we are happy to help. We provide a stateful test data management engine, and we can also automate the synthetic approach: we can automate the provisioning of test data. We are also able to integrate other tools if you want to use the top-down approach. Thank you.

If you would like to learn more about GDPR, register for our webinar ‘How GDPR Impacts Test Data Management: From Masking to Synthetic Data Generation’.