Testing generative AI systems and red teaming: An introductory guide


Simona Domazetoska

Product Marketing Manager

Date: Apr. 16, 2024

The topic of testing AI and ensuring its responsibility, safety, and security has never been more urgent. Controversy and incidents of AI misuse have increased 26-fold since 2021, highlighting growing concerns. As users quickly find out, AI tools are not infallible; they can make mistakes, display overconfidence, and lack critical questioning.

The reality of the market is that AI is prone to error. This is exactly why testing AI is crucial. But how do we test AI? How does testing traditional AI systems differ from testing generative AI? What is red teaming and what are some methods we can implement to test a non-deterministic system? Stick around — we’ll cover these topics and more!

For the sake of clarity, this blog will center its AI discussion on developing and testing applications within a DevOps framework — be it web, ERP, CRM, or mobile apps. We’re setting aside topics such as medical advancements, autonomous vehicles, and robotics, despite their importance. Our focus is primarily on the IT and enterprise application domain. So, let’s dive in.

At the heart of AI: Testing

At the heart of AI is the testing discipline. Consider that we test it for myriad reasons — to see if it works, to see if it can plan a vacation, to see if we can trick it into poor behavior. By and large, AI is all about testing. If anything, AI will elevate the testing discipline because testing is so core and fundamental to AI. Human testers are at the center of this shift; we need experienced human testers more than ever to steer that process. To learn more about skills you need to adopt to stay ahead of this curve, check out our article “Will AI take over software testing jobs?”

What is generative AI-based testing?

Generative AI-based testing refers to the methodologies and practices used in evaluating the functionality, reliability, and safety of generative AI systems. These systems, which generate new content or data based on the inputs they receive, require rigorous testing to ensure they perform as intended and do not produce harmful or unintended outcomes.

Below, we’ll explore several market trends that are driving the adoption of generative AI-based testing.


Legislation

The rising demand for generative AI-based testing is, in part, a response to legislative changes. For instance, the European Union’s AI Act mandates red teaming in certain scenarios to ensure AI systems are secure and reliable. Similarly, companies in the U.S. that interact with federal government agencies must comply with the Executive Order on AI, which includes stringent testing requirements. Moreover, companies like Microsoft are pushing for red teaming as part of their effort to mitigate copyright issues and ensure the responsible deployment of AI technologies.

Harm and risk

The urgency of generative AI testing is underscored by incidents where AI systems have failed to eliminate prejudice, despite training. OpenAI’s chatbots, for example, have demonstrated the use of racist stereotypes after anti-racism training, according to a report by New Scientist. Microsoft encountered a significant setback in 2016 when its chatbot exhibited racist behavior, showcasing the dangers AI can pose in online interactions. Furthermore, Google faced criticism when its AI tool generated offensive historical images, highlighting the fallibility of AI systems in understanding complex social contexts.

Data privacy, security, and fallibility

The International Monetary Fund has identified six areas of generative AI risk: embedded bias, hallucinations, use of synthetic data, explainability, data privacy, and cybersecurity. Because these systems are trained on a corpus of human-generated data without acknowledging its source, copyright and privacy issues are at the forefront of this debate. Data poisoning attacks, in which training data for AI models is intentionally tampered with, also pose a real threat to generative AI models such as OpenAI’s, potentially leading to incorrect or harmful outputs. The exact number of data poisoning cases is not known, but these attacks highlight the susceptibility of AI systems to such manipulation.

A McKinsey Global Survey on AI highlighted that while 40% of organizations plan to increase AI investments, 53% recognize cybersecurity as a related risk. Yet only a minority are taking steps to mitigate these risks. This indicates a significant gap in the readiness of organizations to deal with the cybersecurity implications of generative AI​.

What is the difference between traditional AI and generative AI?

Traditional AI, often called deterministic AI or narrow AI, is an AI system that performs specific tasks it was trained to do. Commonly, this category encompasses supervised machine learning systems or rule-based systems that rely on preprogrammed rules or predefined instructions to complete an action, thereby facilitating decision-making through logical reasoning. As a result, traditional AI excels in solving well-defined tasks. For instance, an example of narrow AI would be a visual-based AI system trained to identify, interact with, and automate elements in applications. These systems are great at performing tasks that are predictable or repetitive and for which they have been tailored, yet they lack the capability to innovate or adjust to unfamiliar scenarios and systems beyond their initial programming.

In contrast, generative AI has ushered in a new era, shifting towards more broad, adaptable, and dynamic systems. Characterized by their non-deterministic nature, these systems can produce varying outcomes from identical inputs, attributed to their inherent randomness or variability in processing. Generative AI, as its name suggests, can generate novel content. These AI systems are commonly trained on large data sets and rely on unsupervised machine learning methods to predict patterns and analyze data so they can generate unique outputs that mimic that data or human creativity.

How does testing a traditional AI system differ from testing a generative AI system?

Simply put, testing traditional or narrow AI systems tends to be more straightforward because their outputs can be predetermined and validated. For instance, you can reward a traditional AI system when it recognizes an object correctly and penalize it when it doesn’t. Consequently, testing unfolds in a much more controlled manner. However, this doesn’t imply that testing these systems is simple; it’s crucial to assess potential biases within the data, such as representation, recency, or selection bias. You can also apply exploratory testing methods to uncover edge cases, or simulate real-world usage to help identify why the system might not perform as expected.

Testing generative AI systems presents a greater challenge due to their inherently non-deterministic and variable outputs. Another major difference is that generative AI systems require red teaming, and testing traditional AI doesn’t.

Traditional AI testing does not require red teaming

Can a traditional AI system leak sensitive data and information? No. It is structurally incapable of doing so. Traditional AI systems generally do not require red teaming, the practice in which teams probe a generative AI system with malicious prompts to identify vulnerabilities or harmful, inappropriate behavior. This is because traditional AI systems are designed to handle data in a deterministic manner, strictly following predefined rules and operations. These systems cannot autonomously access or expose sensitive information like social security numbers because they do not have the capability to retain or recreate data not explicitly programmed into their output. Unlike generative AI, which can synthesize new content that might inadvertently reveal confidential data, traditional AI is developed with a narrow scope of functionality, limiting its output to specific tasks without the ability to go beyond its explicit directives.

Key differences


Output determinism

  • Traditional: Deterministic system: for a given input, output is the same and is known
  • Generative: Non-deterministic / probabilistic system: for a given input, output almost always varies

Learning mechanism

  • Traditional: Typically trained on labeled data. Relies on explicit rules defined by human programmers, limiting its ability to learn from new data without manual intervention
  • Generative: Can learn and improve over time through deep learning, analyzing large amounts of data, identifying patterns, and making predictions

Data and tasks

  • Traditional: Handles structured data and tasks that require precise and deterministic decision-making
  • Generative: Excels at processing and understanding large amounts of unstructured data, such as images, videos, and text


Adaptability

  • Traditional: Rigid and struggles to adapt to new, unforeseen situations without manual changes to its programming
  • Generative: More flexible and capable of adapting to novel scenarios by learning from large and diverse datasets

Creativity and autonomy

  • Traditional: Lacks creative capacity and autonomy, focusing on logical reasoning and specific problem-solving within well-defined parameters
  • Generative: Capable of generating content autonomously, exhibiting a level of creativity and outputs that mimic human creativity in various fields

Ethical and security concerns

  • Traditional: Concerns are related to privacy and data security in automated systems, but less so with generating deceptive content
  • Generative: It can generate realistic and potentially misleading content, posing risks to privacy, security, and the spread of misinformation

Testing method

  • Traditional: Validating general quality of the AI system to ensure it behaves as intended (benchmarking). Checking for biases in the data that can influence the quality
  • Generative: Applying red teaming, checking for potential harms or the capability to extract sensitive data (e.g., PII). (See list of techniques below.)

Why is it challenging to test generative AI systems?

There are several traits common to generative AI systems that complicate testing:

  • Non-deterministic: A single input can yield a variety of outputs. Think about the human brain: if you ask it the same question twice, will it always produce the same answer? Generative AI systems are similar. They are more unpredictable, rendering traditional testing methods inadequate for generative AI.
  • Lack of transparency on how AI models learn: Unlike traditional or rule-based systems, where the logic and rules are explicitly defined, generative AI systems ‘learn’ from vast amounts of data and are designed to improve over time. However, research has shown that they don’t consistently improve in all aspects in the same way. How they derive their learning and apply it to new data can be less transparent, making it hard to explain their reasoning.
  • Resource intensive: Another challenge is resource consumption. A generative AI model performs billions of calculations every time it processes an input and returns a result. If you are going to automate testing, you need to limit cost as much as possible without compromising quality. Only use the live model if your test goal cannot be accomplished without it.
  • Lack of ability to automate testing: Because of the inherent unpredictability of the outputs of generative AI systems, evaluating the nuances of their generated content often requires subjective human judgment, which cannot be easily replicated with automated tests.
  • Evolving domain: The continuous evolvement and change of the field of generative AI means that testing techniques and protocols have to be regularly updated, and testers must stay ahead of the latest developments to ensure that evaluations remain relevant and effective.
  • Ethical considerations: Evaluating and mitigating ethical issues without a clear ethical framework or standardized benchmarks is a daunting task, as it involves not only technical proficiency but also deep understanding of ethical considerations and societal norms that are subject to debate and change.
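The non-determinism point above can be made concrete with a small sketch. Here `stub_generate` is a stand-in for a real model (not any particular API): instead of asserting an exact answer, which breaks when outputs vary, you assert a property that every sampled output must satisfy.

```python
import random

def stub_generate(prompt: str) -> str:
    """Stand-in for a generative model: the same prompt yields varying output."""
    templates = [
        "Paris is the capital of France.",
        "The capital of France is Paris.",
        "France's capital city is Paris.",
    ]
    return random.choice(templates)

def property_check(prompt: str, required_terms: list[str], runs: int = 20) -> bool:
    """Exact-match assertions fail on non-deterministic output, so instead
    we sample repeatedly and assert a property every answer must satisfy."""
    for _ in range(runs):
        answer = stub_generate(prompt)
        if not all(term.lower() in answer.lower() for term in required_terms):
            return False
    return True
```

A property like “the answer always names Paris” holds across phrasings; asserting one exact string would not.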

How do you test a generative AI system? Two important starting points: benchmarking and red teaming

Testing a generative AI system through benchmarking

How do you ensure the overall quality of the AI system? One of the most important questions to ask when you’re releasing any AI-infused application to the public is whether it does the job it was intended for. For instance, if a bot cannot answer basic questions, it cannot be deployed because it’s not doing its job and it can therefore impact your business bottom line.

An essential step in this process involves establishing benchmarks for the AI system’s intended capabilities. This means defining a set of questions or tasks for your AI system. Remember, you don’t start with building a bot and then verifying it against a benchmark. Benchmarking comes first, and building a bot comes second. The process requires close collaboration between testers, product managers, and AI engineers to establish benchmarks that are relevant and challenging enough to ensure the system can handle real-world applications.

Considerations for effective benchmarking include:

  • Defining benchmarks: Start by defining a comprehensive set of benchmarks tailored to the system’s intended capabilities. This should be done before the system is developed to guide the design process.
  • Establishing metrics: How good is the AI at performing certain tasks? You might answer, “pretty good.” But how do we quantify “good”? An essential part of this process involves identifying a set of quality metrics that can be measured in terms of percentages, not simply passes and fails. For example, let’s say you have an AI-powered testing tool that suggests tests based on your requirements. How good is the AI at generating tests? Quantifying this is paramount. Google published an effective guide on metrics describing how good its PaLM 2 Transformer-based model was at reasoning, code generation, memorization, translation, language proficiency, and question answering. (See the evaluation section in their article.)
  • Diversity of data: Ensure that the data used for benchmarking reflects a wide diversity, including data from various regions, demographic groups, or various products if necessary. This diversity is crucial for assessing the system’s ability to perform well across different contexts.
  • Primary role of tester: It is the job of the tester to determine what the benchmark should be and to ensure that quality metrics are in place. This involves ensuring alignment between product management and AI engineers regarding these quality standards.
  • Exploratory testing: Besides automated benchmarking tests, incorporate exploratory testing to uncover unexpected behaviors or weaknesses in the AI system.
  • Regular monitoring: Continuously monitor the system for errors, biases, or hallucinations in the outputs. This helps in maintaining the quality and reliability of the AI system over time.
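The considerations above can be sketched as a minimal benchmark harness. Everything here is illustrative: `generate` stands in for whatever system you are testing, and keyword-based scoring is one deliberately simple metric (a percentage, not a pass/fail) among many you might define.

```python
def run_benchmark(generate, cases):
    """Score a generative system against predefined benchmark cases.
    `generate` is the system under test; each case pairs a prompt with
    keywords an acceptable answer must contain. Returns a percentage
    rather than a simple pass/fail, per the metrics guidance above."""
    passed = 0
    for prompt, required in cases:
        answer = generate(prompt).lower()
        if all(kw.lower() in answer for kw in required):
            passed += 1
    return 100.0 * passed / len(cases)

# Hypothetical benchmark set for a customer-support bot
CASES = [
    ("How do I reset my password?", ["reset", "password"]),
    ("What are your support hours?", ["hours"]),
]
```

Defining `CASES` before the bot is built is exactly the “benchmarking first” ordering described above.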

Applying red teaming for testing generative AI systems

Red teaming is a critical security and testing practice for generative AI systems, where the stakes are higher. In instances where sensitive data becomes exposed or harm is inflicted, the repercussions for a business can be severe, leading to financial losses, reputational damage, and more. Unlike traditional AI, where quality validation or benchmarking is common, red teaming specifically focuses on generative AI systems to ensure they are robust against attacks and data leaks, thus safeguarding against reputational damage.

Let’s take a closer look at red teaming: what it is, who is part of the team, how to plan for it, and what techniques can be used. As a widely discussed topic, red teaming is reshaping the software testing and development industry, highlighting its importance in advancing standards.

What is red teaming? Key considerations

“Red teaming” is a term that dates to the Cold War era, where a team of trained military personnel (red team) attempted to find ways to beat the home team (blue team). The idea was to allow a motivated team of experts to try to find holes in the strategy, with full knowledge of the training, tactics, and expertise of the home team, to increase the effectiveness of the defense.

The AI red team is responsible for identifying specific risks that an implementation of generative AI may cause and exploring the fallout. These risks, known as harms, are vectors by which the AI can act in a way that may cause harm to the organization or its customers, or act outside of the program’s intent.

What is not part of red teaming?

Validating the quality of the responses is not part of red teaming. Like security and penetration testing, the content and nature of the response is unimportant, while the exposure to risk factors is. For example, an AI tasked with generating stories for children that fails to appropriately rhyme its verse may be a poor implementation, but it is not a harm. An AI that gives children stories that cause them nightmares is a harm.

Personas involved in red teaming

Two personas come into play in red teaming, distinguished by their approach to discovering harms:

  • Benign: The benign persona is not an ally of the blue team, but a persona who is exploring potential avenues of harm but not through adversarial means. This means they are using the software as the designer intended, or within the realm of normal user behavior, and are testing whether the software under normal operation will exercise the harms outlined in the charter.

It’s important to note that harms that can be exercised in a benign way are often more damaging than ones found in an adversarial way. For example, an adversarial red teamer might get your LLM to say that it hates humans only by tricking it into printing a Python script that outputs the statement. This is far less damaging than the LLM simply volunteering that statement to a normal user.

  • Adversarial: The adversarial persona uses any means necessary to force the AI to misbehave. This may include prompt hacking, token overflowing, context forcing and distraction, or even more security-focused approaches like prompt injection or modifying the accessible code.

Adversarial personas are much more likely to uncover exercised harms, but the harms identified will often be of lower impact due to the lengths that users had to go to retrieve them. An exception to this case is global harms, such as data leakage, where the ‘user’ is expected to be adversarial — in this case a hacker who is attempting to extract information.

It is important to have outside influence in red teaming, since it guards against the confirmation bias that naturally occurs when you are invested in a product’s success. The red team can, and in some cases should, include members of the product team, but must also comprise members of other teams.

Setting up and planning for red teaming

Establishing a comprehensive red teaming framework is crucial for a thorough evaluation of system vulnerabilities. At Tricentis, we’ve developed a structured approach to red teaming that involves several key steps:

  1. Design a charter: Hold a session to create a red team working charter, outlining some of the harms that you expect to find. Leave at least 30% of your timespan free to explore more harms as the need arises.
  2. Assign roles: Clearly define roles within the team—benign, adversarial, and advisory—and consider rotating these roles periodically to leverage diverse perspectives. It is vital for each team member to receive an open book detailing the prompt, techniques, and various strategies to attack the system effectively.
  3. Document everything: Evidence of each harm must be documented, including the whole sequence. Often, understanding the context of exercising a harm is just as crucial as the final prompt that was used.
  4. Outline a scope: It is beneficial to run focused sessions so that the harms outlined for the session are effectively exercised and time is not diverted to other, seemingly more promising avenues. Do not be too open-ended with the harms addressed in each round of red teaming.
  5. Timeboxing: While scope is flexible, time is not. Specify dedicated time per session and number of sessions before the red team exercise to ensure focus is not lost and that the exercise has a conclusion.
  6. Test in rounds: As the exercise progresses, you will learn. Regroup after each round and share your information, using this to inform your next round of testing.
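Step 3 (document everything) can be supported by even a very simple record per finding, capturing the whole prompt sequence rather than just the final prompt. The fields below are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class HarmFinding:
    """One documented harm from a red-team session. The full prompt
    sequence matters: the context of exercising a harm is often just
    as important as the final prompt that triggered it."""
    harm: str                      # e.g. "PII extraction"
    persona: str                   # "benign" or "adversarial"
    prompt_sequence: list = field(default_factory=list)
    impact: str = "unassessed"

    def add_step(self, prompt: str, response: str) -> None:
        """Append one prompt/response exchange to the evidence trail."""
        self.prompt_sequence.append({"prompt": prompt, "response": response})
```

A record like this also makes the regroup between rounds (step 6) easier, since findings can be shared in a uniform shape.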


Techniques for testing generative AI systems and identifying harms

Below are examples of different techniques you can employ in your red teaming efforts for testing generative AI systems. While not an exhaustive list, these methods have proven invaluable at Tricentis for assessing the resilience of our own generative AI-driven products and capabilities.

Data extraction

It’s important to remember that generative AI systems produce large volumes of output, and it’s easy to get overwhelmed when combing through it. Your job as a tester is to find ways to extract sensitive data, such as internal data or PII, that may have inadvertently been incorporated into the LLM. A core question is whether the AI was trained on any internal or customer-sensitive data. If so, testers must be adept at extracting such data from generative AI systems. This could involve obtaining information that should not be available to users or the public. For instance:

  • Extracting meta-prompts and training data: Imagine a situation where someone figures out how to make the AI reveal the instructions or examples it was given during its training, which are not meant to be public. This could reveal sensitive insights into how the AI operates or was programmed.
  • Accessing information from unintended sources: This could happen if the AI, through some manipulation, starts to pull information from databases or sources it was not intended to access. For example, this occurs if an AI designed to provide customer support starts accessing and revealing financial records.
  • Extracting PII: This is a scenario where the AI is manipulated into divulging personal user information, like addresses or phone numbers, or even more sensitive data like passwords or security keys.
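A practical starting point is an automated scan of model outputs for obvious PII patterns, so a red-team run can flag responses that leak sensitive data. This is a naive sketch; the patterns are illustrative and real red teams use far broader detectors:

```python
import re

# Naive patterns for obvious PII leaks; real detectors are far broader.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scan_for_pii(output: str) -> list[str]:
    """Return the PII categories detected in a model output."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(output)]
```

Any non-empty result is a candidate harm to document, along with the full prompt sequence that produced it.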

Prompt overflowing

“Prompt overflow,” a technique where a red team intentionally overloads the system with a large input to disrupt its primary function, is a common tactic. Such actions can potentially repurpose the AI for unintended outcomes, from revealing sensitive data to getting the system to produce irrelevant or even harmful content. For instance, in one of our tests at Tricentis we presented a Copilot system with the entire text of “War and Peace,” leading to a system crash. This also enabled us to explore the boundaries of the AI’s prompt handling capabilities, including attempts to sidetrack the AI into performing completely unrelated tasks, such as singing Taylor Swift lyrics. Familiarity with industry-standard techniques for compromising AI capabilities is also necessary. Testing could involve assessing whether a bot designed for product recommendations could be manipulated to execute unauthorized actions, highlighting the importance of understanding these attack vectors.
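A prompt-overflow check like our “War and Peace” test can be approximated in code. The client, the character limit, and the error convention below are all assumptions for illustration; the point being tested is that an oversized input should produce a controlled refusal rather than a crash:

```python
MAX_INPUT_CHARS = 10_000  # hypothetical limit for the system under test

def guarded_handle(client, prompt: str) -> str:
    """Wrapper under test: a robust system should refuse or truncate
    oversized prompts instead of crashing or misbehaving."""
    if len(prompt) > MAX_INPUT_CHARS:
        return "ERROR: input exceeds limit"
    return client(prompt)

def overflow_test(client) -> bool:
    """Red-team check: does an oversized input produce a controlled
    failure? An unhandled exception stands in for a system crash."""
    huge_prompt = "word " * 1_000_000  # ~5 MB of text, think 'War and Peace'
    try:
        result = guarded_handle(client, huge_prompt)
    except Exception:
        return False  # system crashed: the harm is exercised
    return result.startswith("ERROR")
```

In a real exercise the guard lives inside the product, and the red team probes whether it actually holds.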


Hijacking

Hijacking involves taking control of an AI system or its components to use them for purposes unintended by the creators or owners. This could include:

  • Misusing tools for unintended means: If an AI has access to the internet or other external tools, hijacking might involve tricking it into downloading large files, accessing prohibited sites, or even running harmful scripts. An example would be convincing the AI to use its web browsing capability to download illegal content.
  • Repurposing the AI with new instructions: This form of hijacking changes what the AI is doing or how it’s supposed to function; for instance, when an AI designed for educational purposes is manipulated into promoting a specific product or ideology.
  • Using AI to execute remote code: This could be a situation where the AI is used as a vector to execute malicious code on another system; for example, sending commands that cause the AI to interact with another system in a way that initiates unauthorized actions or downloads.
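One way to test for hijacking in tool-using agents is to audit the tool calls the AI attempted during a session against an allowlist from its charter. The tool names below are hypothetical:

```python
ALLOWED_TOOLS = {"search_docs", "create_ticket"}  # illustrative allowlist

def audit_tool_calls(requested_calls):
    """Red-team helper: given the tool calls an AI agent attempted
    during a session, return any that fall outside its allowlist.
    A non-empty result is evidence of a hijacking harm."""
    return [call for call in requested_calls if call["tool"] not in ALLOWED_TOOLS]
```

The audit itself is deterministic, which makes it a good candidate for automation even when the agent's behavior is not.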

Making legal commitments

This technique assesses the AI’s propensity to make unauthorized commitments or convey false information about company policies, discounts, or services. It involves evaluating the AI’s responses for potential legal implications or misrepresentations that could bind the company or damage its reputation.

Societal harms

Here you can use the AI to break the law, such as by reproducing copyrighted work, engaging in hate speech, or otherwise breaking local regulations. This could involve regionally restricted speech; for example, some jurisdictions make it an offense to say certain words or to comment on certain topics. Another example is engaging in behavior that is not appropriate for a representative of the company, like insulting or offending customers, or making prejudicial comments.


Tone and user experience

This type of technique examines how the AI’s tone and style of interaction might affect user experience. For instance, you can test for aggressive disagreement with users, a disrespectful tone, or the implication that the user is stupid.

Malware resistance

Assessing the generative AI system’s potential to inadvertently execute malicious actions, such as downloading a virus, means examining whether the LLM can interact with external systems and access other systems or sensitive information. The aim is to uncover vulnerabilities that could be exploited to compromise security. When the LLM uses other tools, always question what tools it has access to and what it could potentially do. By simulating attacks, testers can evaluate the AI’s safeguards against executing harmful code and ensure it does not become a vector for cyber threats.

API and system access

This evaluates the generative AI system’s access to external tools and APIs to identify risks associated with unauthorized data manipulation or deletion. By simulating scenarios where the AI misuses its permissions or is prompted to interact with external systems in unintended ways, testers can assess both the feasibility of these actions and their potential impact, including how easily the AI could generate or facilitate unauthorized access.

What about other forms of testing for generative AI?

In addition to the specific harm-identification techniques previously mentioned, there are broader categories of testing that play a critical role in evaluating generative AI systems for performance, security, and reliability:

  • Load/performance testing: This form of testing evaluates how well a generative AI system performs under varying degrees of demand. It measures the system’s responsiveness, stability, and scalability when handling a large number of requests simultaneously. Performance testing is essential for ensuring that the AI system can maintain its efficiency and accuracy under real-world conditions, where user demands can be unpredictable and fluctuate significantly.
  • Adversarial testing: Adversarial testing involves crafting inputs that are designed to trick or confuse the AI into making errors. This differs from red teaming, which is a broader security assessment method involving simulated attacks to identify vulnerabilities. Adversarial testing specifically targets the model’s algorithms to uncover weaknesses in its learning and decision-making processes, thereby improving its resilience against malicious inputs designed to cause the model to fail.

What part of generative AI testing is “automated”?

Fully automating the testing of generative AI systems is an evolving area. The complexity, unpredictability, and highly nuanced output of generative AI systems make automation difficult but not entirely out of reach. Microsoft recently released its Python Risk Identification tool for generative AI (PyRIT), an open access automation framework to empower professionals to build red team foundations for their applications. But what exactly is “automated” in this process? Despite the challenges, researchers are actively working on methods to automate aspects of generative AI testing, like generating prompts, automating evaluation metrics, or anomaly detection. In the Microsoft PyRIT model, scoring mechanisms are built for each response to a given prompt, helping to quickly scan for outputs that deviate from established patterns.
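A scoring mechanism in this spirit (a toy sketch, not PyRIT’s actual API) might flag any response that contains disallowed content or misses expected content, so that humans only review the deviations:

```python
def score_response(response: str, deny_patterns, expect_patterns) -> dict:
    """Toy automated scorer: flag responses containing disallowed
    substrings or missing expected ones, so a human reviewer only
    inspects the outputs that deviate from established patterns."""
    text = response.lower()
    flags = [p for p in deny_patterns if p in text]
    missing = [p for p in expect_patterns if p not in text]
    return {"flags": flags, "missing": missing,
            "needs_review": bool(flags or missing)}
```

Substring matching is deliberately crude here; real scorers use classifiers or pattern libraries, but the division of labor is the same: automation triages, humans judge.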

What about using AI to test AI?

You could theoretically use AI to test another AI. But be warned: there be dragons. Using LLMs to evaluate AI-generated content can introduce challenges like inaccurate test results or false positives. For example, when testing a chatbot for copyright issues, simply asking it if it handles copyrighted data and accepting its “no” as a complete check is misleading and ineffective. Another example is using AI to distract another chatbot by generating various prompts. In experiments we’ve conducted at Tricentis, many of these prompts end up being nonsensical, which doesn’t effectively test the chatbot’s responses. Even when the AI thinks it has succeeded in distracting the chatbot, human review often reveals it hasn’t.

A critical question therefore arises: “Who tests the AI that is testing the AI?” Very likely, it will be a human. As you can see, this ends up forming a never-ending circle. This process is not as autonomous as it sounds. Human judgment, review, and validation will always be required.

When you rely on AI to test AI, you also have to be very wary of false positives, where the AI incorrectly assesses that there’s no issue when in fact there is one. For instance, an AI might dismiss a 25% discount offered to a customer as not being a legal commitment due to the absence of specific legal jargon, even though the commitment is binding.

Our recommendation is to simplify the prompt given to the testing AI. A common mistake is giving it too much context, which doesn’t improve accuracy but instead pits two AIs against each other with roughly coin-flip odds of agreement. Instead, remove unnecessary context and focus on strict comparisons to validate whether an answer captures all essential information. The best use of AI in testing is generating varied ways to pose questions and validating the accuracy of answers in imprecise situations.
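A strict comparison of this kind can be as simple as checking that every essential fact appears in the answer, with no second AI in the loop at all. A hypothetical sketch:

```python
def validate_answer(answer: str, essential_facts: list[str]) -> bool:
    """Strict comparison: instead of asking a second AI 'is this answer
    good?', check directly that every essential fact appears verbatim
    in the answer. Deterministic, cheap, and easy to audit."""
    text = answer.lower()
    return all(fact.lower() in text for fact in essential_facts)
```

Where this kind of literal check is too rigid, that is precisely the point to bring in human review rather than a second model.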

Final thoughts

To ensure AI systems are safe, responsible, and secure, it’s crucial to embrace thorough testing practices, especially with the unpredictable nature of generative AI. Here are some practical tips:

  1. Red teaming: Utilize red teaming to rigorously challenge AI by simulating real-world attacks and unexpected scenarios. This helps uncover vulnerabilities that routine tests might miss.
  2. Benchmarking: Establish clear benchmarks early on in your AI development process. These benchmarks act as a checklist that ensures your AI meets necessary standards before going live.
  3. Human review: Don’t forget the importance of human oversight. Even the most advanced AI cannot replace the nuanced judgment of experienced testers who can interpret and respond to the results in ways AI might not fully grasp.
  4. Make it fun: Keep testing fun and dynamic by regularly updating your techniques to keep pace with AI advancements, and always be prepared to adapt your strategies as new challenges emerge.


These steps will not only enhance the reliability of your AI systems, but also safeguard them against potential misuses in the ever-evolving landscape of technology.

Want to learn more about testing generative AI? Sign up for our ShiftSync community, or check out our webinar Introducing Tricentis Copilot: AI assistants for test automation at scale.


