

You ask an AI tool a complex question like, “Which test cases should I prioritize for this release?” and it returns an answer that sounds confident but offers no reasoning you can trace.
Often the problem is not what you asked; it is how the AI was asked to think. Chain-of-thought (CoT) prompting is a technique that fixes that.
This guide breaks down what CoT is, how it works, and where it shines, with real examples built for engineers and QA professionals who work with code and testing every day.
What is chain-of-thought prompting?
TL;DR: A technique where AI explains its reasoning step by step to improve clarity, transparency, and decision quality.
Chain-of-thought (CoT) prompting is a technique where you ask an AI to reason through a problem step by step before giving a final answer, rather than jumping straight to a conclusion.
Think of CoT like the difference between asking a student to show their work on a math test versus writing down the final answer. Or asking a junior engineer to “walk me through your thinking” before they submit a PR.
When the student or the junior engineer has to articulate each step, they catch their own mistakes. They slow down and arrive at better answers. The same thing happens with language models.
Here is what Jason Wei, a research scientist at Google, had to say about the chain-of-thought prompting technique: “The idea is that a model-generated chain of thought would mimic an intuitive thought process when working through a multi-step reasoning problem.”
How does chain-of-thought prompting actually work?
There are two main ways to trigger chain-of-thought reasoning: show the model examples of step-by-step reasoning, or simply tell it to reason step by step.
Different approaches to CoT prompting
TL;DR: CoT can be applied in multiple ways—zero-shot (simple instruction), few-shot (guided examples), self-consistency (multiple reasoning paths), and Auto-CoT (model-generated reasoning)—each suited for different complexity levels.
Zero-shot CoT
Zero-shot CoT is the simplest version of chain-of-thought prompting. No examples needed. Just add an instruction like, “Let’s think through this step by step.” One extra sentence can meaningfully change the quality of the output.
Example of prompting without zero-shot CoT
Prompt
We have 847 test cases. Our CI pipeline takes 4.2 hours to run them all. We need to cut that to under 90 minutes. How many tests should we run per build?
AI Response
You should run about 300 test cases per build.
That answer might be right, but there is no logic to check. You cannot tell if the AI reasoned it out or just guessed.
Example of prompting with zero-shot CoT
Prompt
We have 847 test cases. Our CI pipeline takes 4.2 hours to run them all. We need to cut that to under 90 minutes. How many tests should we run per build? Let’s go through this step by step.
AI Response
- 4.2 hours = 252 minutes for 847 tests.
- Average time per test = 252 / 847 = 0.297 minutes.
- Target = 90 minutes. So 90 / 0.297 = 303 tests per build.
- That is 36% of the suite per build.
Final answer: Run approximately 300 tests, selecting the highest-priority 36% based on change impact, failure history, or risk.
Now you can audit the reasoning, spot any errors, and trust the answer.
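The arithmetic is also simple enough to sanity check yourself. Here is a minimal sketch, assuming every test takes the same average time, which real suites rarely do:

```python
# Back-of-the-envelope check of the model's reasoning.
# Assumes a uniform average test duration (an approximation).
total_tests = 847
suite_minutes = 4.2 * 60                 # 252 minutes for the full suite
target_minutes = 90

minutes_per_test = suite_minutes / total_tests             # ~0.297 minutes per test
tests_per_build = int(target_minutes / minutes_per_test)   # ~302 tests

print(tests_per_build, f"({tests_per_build / total_tests:.0%} of the suite)")
```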
Few-shot CoT
This technique gives the model one or more examples of how to reason before asking your actual question. It works especially well in specialized domains like software testing, where the model benefits from seeing how your team thinks.
Example of prompting with few-shot CoT
Prompt
Here is how I want you to analyze test coverage gaps:
Example
Feature: User Login
Changed Files: auth_controller.rb, session_handler.rb
Existing Tests: test_login_success, test_login_failure
- Identify which functions in changed files are covered.
- Flag changed functions with no coverage.
- Recommend new tests.
Result: session_handler.rb has a new refresh_token() function not covered by any test.
Recommend: test_refresh_token_expiry, test_refresh_token_invalid
Now apply this same process to:
Feature: Password Reset
Changed Files: password_controller.rb, email_service.rb
Existing Tests: test_reset_email_sent, test_reset_link_expired
Now the model knows exactly the depth and format of analysis you want.
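If you run this kind of coverage analysis for every feature, it can be worth templating the prompt so the worked example stays consistent. A minimal sketch in Python; the function and field names are illustrative, not a fixed format:

```python
# Reusable few-shot CoT prompt builder. The example block is the User Login
# analysis from above; the helper itself is just string assembly.
COVERAGE_EXAMPLE = """\
Feature: User Login
Changed Files: auth_controller.rb, session_handler.rb
Existing Tests: test_login_success, test_login_failure
- Identify which functions in changed files are covered.
- Flag changed functions with no coverage.
- Recommend new tests.
Result: session_handler.rb has a new refresh_token() function not covered by any test.
Recommend: test_refresh_token_expiry, test_refresh_token_invalid
"""

def build_coverage_prompt(feature, changed_files, existing_tests):
    """Prepend the worked example so the model mirrors its depth and format."""
    return (
        "Here is how I want you to analyze test coverage gaps:\n\n"
        f"{COVERAGE_EXAMPLE}\n"
        "Now apply this same process to:\n"
        f"Feature: {feature}\n"
        f"Changed Files: {', '.join(changed_files)}\n"
        f"Existing Tests: {', '.join(existing_tests)}\n"
    )

prompt = build_coverage_prompt(
    "Password Reset",
    ["password_controller.rb", "email_service.rb"],
    ["test_reset_email_sent", "test_reset_link_expired"],
)
```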
Self-consistency
With self-consistency, instead of accepting the model’s first response, you run the same prompt multiple times and take the most common final answer. Though it takes more time, it gives more reliable results when it really matters.
Example of prompting with self-consistency
You send the same prompt three times:
Prompt
A login test keeps failing after a recent deployment. What is the most likely cause? Think step by step.
Final AI Response 1st time:
Expired authentication token.
Final AI Response 2nd time:
Expired authentication token.
Final AI Response 3rd time:
Database connection timeout.
Here you see that two out of the three answers agree on an expired token. That is your answer. You let the majority reasoning win rather than trusting a single chain of thought.
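The sampling-and-voting loop is easy to automate. A minimal sketch, where ask_model is a placeholder for whichever LLM client your team uses:

```python
from collections import Counter

def ask_model(prompt: str) -> str:
    """Placeholder: swap in your actual LLM client call."""
    raise NotImplementedError

def self_consistent_answer(prompt: str, samples: int = 3) -> str:
    """Run the same CoT prompt several times and keep the majority answer.
    In practice you would extract only the final answer from each response
    before voting, not the full reasoning chain."""
    answers = [ask_model(prompt) for _ in range(samples)]
    winner, _votes = Counter(answers).most_common(1)[0]
    return winner

# Usage:
# self_consistent_answer(
#     "A login test keeps failing after a recent deployment. "
#     "What is the most likely cause? Think step by step."
# )
```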
Automatic CoT (Auto-CoT)
The automatic CoT technique is when the model generates its own reasoning examples rather than you writing them manually. It’s very useful when working at scale across many different test types.
Example of prompting with automatic CoT
Prompt
Show me one example of how you would decide if a bug is critical or minor. Then use that same reasoning to classify this bug: “The app crashes when a user tries to upload a profile picture.”
AI Response
If a bug blocks a core user action, it is critical. If it only affects a cosmetic detail, it is minor. Uploading a profile picture is a core user action, and the app crashes completely when attempting it.
Classification: Critical.
We can see that the model created its own reasoning rule, then applied it immediately. You did not write a single example yourself.
Those first couple of steps are exactly what Auto-CoT does: the model teaches itself how to think about the problem, then uses that thinking to answer it.
Chain-of-thought prompting: Practical examples
TL;DR: Real-world use cases like debugging flaky tests, prioritizing test cases, evaluating release risks, and generating edge cases show how CoT leads to more structured and actionable decisions.
1. Root cause analysis of a flaky test
Prompt
Our test for checkout fails 30% of the time with a StaleElementReferenceException on the payment step. The page uses React. Diagnose this step by step.
AI Response
- A StaleElementReferenceException means the test grabbed an element reference, but the DOM refreshed before it could use it.
- React re-renders the payment form after the cart API call completes, which invalidates any reference captured before that re-render.
- The 30% rate points to inconsistent API timing. Sometimes the re-render finishes before the test gets there, sometimes it does not.
- Use an explicit wait for the payment button to become enabled rather than a fixed sleep.
Recommendation
Instead of making the test wait a fixed number of seconds, tell it to wait until the payment button is actually ready to be clicked. Once it is active, you know React has finished re-rendering, and it is safe to proceed.
The test is not broken; it is just moving faster than the page sometimes. CoT traces the failure to one specific cause, which leads directly to the fix. Without CoT, the AI might have just said “add a wait” and left you guessing where and why.
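In Selenium, that fix is an explicit wait. A minimal sketch, assuming an existing driver session and a hypothetical pay-button element ID:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait until the payment button is actually clickable instead of sleeping
# for a fixed number of seconds. Once it is clickable, React has finished
# re-rendering and the element reference is safe to use.
wait = WebDriverWait(driver, timeout=10)
pay_button = wait.until(EC.element_to_be_clickable((By.ID, "pay-button")))
pay_button.click()
```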
2. Test case prioritization for a release
Prompt
We are releasing a new pricing module tomorrow. Changes were made to billing_service.py, pricing_calculator.py, and currency_converter.py. Our regression suite has 1,200 tests, and we have 3 hours. Think step by step about which test cases to prioritize.
AI Response
- Start with direct unit tests for the three changed files.
- Run integration tests covering billing and pricing flows.
- Add end-to-end checkout scenarios.
- Fill the remaining time with tests that have failed recently in adjacent modules.
Priority order
- Unit tests for changed files
- Integration tests for billing and pricing flows
- End-to-end checkout scenarios
- Recent historical failures in adjacent modules
In a 3-hour window, focus on the top two tiers first.
Three hours may or may not be enough to run 1,200 tests, depending on how long each test takes and how much you can parallelize. But CoT prompting forces the AI to think like a senior engineer: start closest to the change, work outward by risk, then stop when the time runs out.
This way, the team gets a clear, defensible order to follow, not a random list.
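That ordering is also easy to turn into a simple selection rule: sort by tier, then fill the time budget. A rough sketch with illustrative tier names and fields, not tied to any particular test runner:

```python
# Greedy tiered selection: start closest to the change, work outward by risk,
# stop when the time budget runs out.
TIER_ORDER = {
    "unit_changed_files": 0,
    "integration_billing_pricing": 1,
    "e2e_checkout": 2,
    "recent_failures_adjacent": 3,
}

def plan_run(tests, budget_minutes):
    ordered = sorted(tests, key=lambda t: TIER_ORDER[t["tier"]])
    plan, used = [], 0.0
    for test in ordered:
        if used + test["minutes"] > budget_minutes:
            break  # everything after this point is lower priority anyway
        plan.append(test["name"])
        used += test["minutes"]
    return plan

# plan_run([{"name": "test_percentage_discount", "tier": "unit_changed_files",
#            "minutes": 0.5}, ...], budget_minutes=180)
```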
3. Deciding whether a bug warrants blocking a release
Prompt
A bug was found in staging: the coupon discount does not apply correctly when a cart has more than 10 items and uses a percentage-based discount code.
Affected coupon codes represent 8% of all coupon usage. We are releasing in 6 hours. Think step by step whether to block the release.
AI Response
- 8% of coupon usage is affected.
- Customers are charged full price when they should not be. This is a revenue and trust issue.
- Carts with more than 10 items are a subset of all orders, limiting total exposure.
- Affected customers can be identified post-release and issued refunds.
- Assess the complexity of the fix. Is this a one-line fix, or does it require changes across billing logic?
Recommendation
If the fix is low-risk and deployable within 2 hours, block and fix. If complex, release with a mitigation: disable percentage-based coupons temporarily and alert customer support.
A yes or no question about blocking a release carries real business weight.
Instead of giving a gut-feeling answer, CoT walks the AI through every angle that matters: how many users are affected, what it actually costs them, whether it can be fixed after the fact, and how risky the fix itself is.
The recommendation at the end is not just an opinion. It is a conclusion backed by five clear checkpoints that any engineer or manager can read, challenge, or build on.
4. Writing edge-case test scenarios
Prompt
I am writing tests for a password reset feature. The flow is: user enters email, system sends reset link, user clicks link, user sets new password. Think step by step about edge cases I should test.
AI Response
- Email input: invalid formats, emails not in the system, SQL injection attempts.
- Reset link: expired links, links used more than once, tampered links.
- New password: too short, matches old password, special characters that break encoding.
- Concurrency: two reset requests for the same email in quick succession.
- Security: rate limiting on reset requests, account enumeration via error messages.
That is a dozen distinct test scenarios from a single flow.
CoT forces the AI to think through the feature category by category rather than stopping at the obvious cases. That is how one simple flow produces a dozen or more test scenarios instead of three.
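Most of those scenarios translate directly into test cases. Here are a couple of them sketched in pytest, with hypothetical helpers (create_reset_link, reset_password) standing in for your application code:

```python
import pytest

from app.auth import create_reset_link, reset_password  # hypothetical helpers

def test_expired_reset_link_is_rejected():
    link = create_reset_link("user@example.com", ttl_seconds=-1)  # already expired
    response = reset_password(link, new_password="CorrectHorse9!")
    assert response.status_code == 410  # adjust to whatever your API returns

@pytest.mark.parametrize("bad_password", ["", "short", "a" * 300])
def test_invalid_new_passwords_are_rejected(bad_password):
    link = create_reset_link("user@example.com")
    response = reset_password(link, new_password=bad_password)
    assert response.status_code == 422
```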
CoT prompting vs. few-shot prompting vs. prompt chaining
TL;DR: CoT focuses on reasoning depth, few-shot on guiding format through examples, and prompt chaining on breaking tasks into sequential steps—often combined for stronger workflows.
Chain-of-thought prompting
Chain-of-thought prompting makes the model think through a problem step by step. It can be combined with few-shot prompting (showing examples of good reasoning) or used on its own (zero-shot CoT).
Few-shot prompting
Few-shot prompting gives the model a few (as the name implies) examples of the output format you want. It does not require step-by-step reasoning; it just shows the model what a good answer looks like.
Prompt chaining
Prompt chaining breaks a complex task into multiple sequential prompts, where each output feeds the next. It’s a workflow design pattern, not a reasoning technique.
In practice, the most powerful workflows combine all three: step by step (CoT), structured examples (few-shot), and a pipeline of connected steps (chaining).
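Here is a rough sketch of what chaining two CoT prompts looks like, where the first output feeds the second; a few-shot example could be prepended to either prompt in the same way. As before, ask_model is a placeholder for your LLM client:

```python
def ask_model(prompt: str) -> str:
    """Placeholder: swap in your actual LLM client call."""
    raise NotImplementedError

failure_log = "...raw CI output pasted here..."

# Step 1 (CoT): diagnose the failure.
diagnosis = ask_model(
    f"Here is a CI failure log:\n{failure_log}\n"
    "Think step by step: what kind of failure is this and what is the likely cause?"
)

# Step 2 (chaining): the diagnosis becomes the input to the next prompt.
next_action = ask_model(
    f"Given this diagnosis:\n{diagnosis}\n"
    "Think step by step: should we rerun the job, quarantine the test, or open a ticket?"
)
```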
Limitations of chain-of-thought prompting
TL;DR: CoT does not guarantee correctness, can increase token usage and latency, may still produce hallucinations, and heavily depends on the quality and clarity of the prompt.
CoT prompting is powerful, but it has limits worth keeping in mind.
1. Confident-sounding wrong reasoning
A model can reason step by step and still build an incorrect conclusion on top of a wrong assumption in an earlier step. Always review the reasoning, not just the answer.
2. More tokens and more cost
Step-by-step reasoning is longer than a direct answer. On high-volume automated workflows, that adds up to cost and latency. So, on simple tasks, this overhead is unnecessary. You don’t need a model to reason step by step to tell you what color a red apple is.
3. Does not fix hallucinations
If the model lacks accurate knowledge about what you asked, CoT will not supply it; the model will just reason more elaborately toward a wrong conclusion.
4. Quality of the reasoning depends on the quality of the prompt
Vague prompts produce vague reasoning chains. Just telling the model “think step by step” is not enough if the question is missing the right context or details.
Best practices for CoT prompting
TL;DR: Use clear context, define how reasoning should happen, avoid unnecessary use for simple tasks, and leverage examples for consistency in repeated workflows.
There are some specific techniques that make CoT prompting much more effective in real-world use, and not just in theory:
1. Give the model enough context
Include error messages, file names, test results, and business rules, not just a summary. Do not paraphrase the problem; give the raw material.
2. Be explicit about what “step by step” means in your domain
For test prioritization, name the dimensions to reason across: risk, change impact, failure history, etc.
3. Do not use CoT for simple tasks
Adding reasoning steps where they are not needed wastes tokens and can introduce errors.
4. Use few-shot examples for repeatable tasks
If you run daily release readiness checks or sprint retrospectives with AI, show the model one or two examples of the reasoning format you want before each prompt.
How agentic AI applies to chain-of-thought prompting in testing
TL;DR: In agentic systems, CoT acts as the reasoning backbone, helping AI agents make step-by-step decisions in workflows like CI/CD, test execution, and failure handling.
If you are familiar with both agentic AI and chain-of-thought prompting, the connection between them is worth thinking through carefully, because this is where things get practical for engineering teams.
An agentic AI system is one that takes a sequence of actions on its own, using tools like APIs or other systems, to accomplish a goal over multiple steps.
CoT prompting and agentic AI are not the same thing, but they work well together. In agentic testing workflows, chain-of-thought reasoning is often what gives the agent its decision-making backbone.
Consider an AI agent integrated into your CI/CD pipeline. When a build fails, it needs to decide what to do.
Without structured reasoning, it might just pick the first plausible action: restart the build, flag the failure, or create a ticket.
But with chain-of-thought reasoning embedded in its decision logic, it works through the failure systematically: What type of failure is this? Which tests failed? Are they in recently changed code? Have these tests failed before?
Each question is a step in a reasoning chain. The agent uses the answer to each one to inform the next action it takes.
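In code, that reasoning chain often looks like nothing more exotic than a sequence of checks, each one narrowing the decision. An illustrative sketch, with made-up fields and thresholds rather than any real product API:

```python
def decide_next_action(failure):
    # Step 1: what type of failure is this?
    if failure["type"] == "infrastructure":
        return "restart the build"
    # Step 2: are the failing tests in recently changed code?
    if failure["touches_changed_files"]:
        return "open a ticket and notify the committer"
    # Step 3: have these tests failed intermittently before?
    if failure["historical_flake_rate"] > 0.2:
        return "rerun once and quarantine if it fails again"
    # Otherwise, escalate to a person.
    return "flag for human review"
```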
This is how you get AI agents that make defensible decisions rather than just fast ones. It is the same mechanism that lets the AI agents your team already relies on reason through problems instead of just reacting to them.
How Tricentis supports AI-enabled quality practices
TL;DR: Tricentis integrates AI into testing workflows, enabling teams to generate, prioritize, and manage tests more intelligently while maintaining visibility and control.
Knowing how to prompt AI well is only half the job. The other half is having the right setup to use it consistently across your whole testing workflow, not just once in a while.
That is what Tricentis is built for. It gives teams a structured way to bring AI into their quality process, whether that is generating tests, picking the right tests to run before a release, or keeping things moving in complex delivery pipelines.
The goal is simple: AI that helps your team move faster and make smarter calls, while still being able to see exactly what is happening and why. You can explore how Tricentis helps teams do this with Tricentis AI-Powered Solutions.
Conclusion
TL;DR: Chain-of-thought prompting improves how AI approaches problems, leading to more reliable, explainable, and actionable outputs with minimal changes to prompts.
The way you ask an AI a question determines the quality of the answer you get back. Chain-of-thought prompting closes the gap between guessing and reasoning, and that matters when wrong calls have real consequences.
The technique is not complicated. Zero-shot CoT costs you one sentence. Few-shot CoT costs you one example. The return is output you can actually audit, trust, and act on.
Tricentis gives teams the infrastructure to apply that reasoning consistently across every stage of the testing workflow, not just in one-off prompts. That is where real value compounds.
This post was written by Theophilus Onyejiaku. Theophilus has over five years of experience as a data scientist and machine learning engineer, with expertise in data science, machine learning, computer vision, deep learning, object detection, and model development and deployment. He has written more than 660 articles on these topics, as well as Python programming, data analytics, and more.