Six Months of Agentic Testing: A Production Report
My team’s biggest bottleneck wasn’t shipping new features. It was the soul-crushing cycle of writing unit tests and chasing regression bugs. Six months ago, we stopped treating AI as a glorified autocomplete and started building a dynamic agentic workflow using LangGraph and Pytest. The results have held up: we’ve cut time spent on boilerplate testing by 40%, freeing us to focus on high-level architecture rather than assert statements.
Most devs treat AI like a static oracle. You feed it code, it spits out a test, and you manually fix the errors when it hallucinates. An agentic workflow closes that loop. It writes the test, runs it, analyzes the traceback, and iterates until the code is green. No manual intervention required.
The 5-Minute Setup: Building an AI Tester
You only need Python and a handful of libraries to get started. We use LangGraph to orchestrate the state and Pytest to handle the execution. For the brain, a strong reasoning model such as GPT-4o or Claude 3.5 Sonnet works well.
pip install langgraph langchain-openai pytest pytest-json-report
This minimal script defines a state that tracks our source code, generated tests, and the failure logs:
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    code: str          # source under test
    tests: str         # generated test suite
    test_results: str  # failure logs from the last run
    iterations: int    # number of write attempts so far

# The Writer: Generates the initial test suite
def coder_node(state: AgentState):
    # Prompt the LLM to write tests based on the source code
    # (placeholder return value; a real node would call the model here)
    return {"tests": "def test_logic(): ...", "iterations": state["iterations"] + 1}

# The Executor: Runs the tests and captures errors
def executor_node(state: AgentState):
    # Trigger pytest.main() and capture the output
    # (placeholder return value; a real node would run pytest here)
    return {"test_results": "failed: AssertionError..."}

# Define the flow
workflow = StateGraph(AgentState)
workflow.add_node("writer", coder_node)
workflow.add_node("tester", executor_node)
workflow.set_entry_point("writer")
workflow.add_edge("writer", "tester")
workflow.add_edge("tester", END)
app = workflow.compile()
The Self-Correction Loop: How It Heals
The workflow truly shines when you introduce conditional routing. In production, writing a test isn’t enough; the agent must prove the test is valid. I’ve divided our system into three distinct roles:
- The Architect: Scans the source code to identify critical edge cases.
- The Developer: Generates the Pytest files and necessary mocks.
- The Debugger: Interprets stack traces to fix either the test logic or the source code itself.
Using add_conditional_edges, the graph inspects the Pytest exit code. A non-zero code (1 means at least one test failed) routes the state back to the Debugger; an exit code of 0 means every test passed, and the process finishes. This isn’t just a script; it’s a self-healing loop that saves roughly 15 hours of developer time every week.
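Here is a minimal sketch of that routing, not our exact production graph. It assumes the tester node stores pytest's return code in an extra `exit_code` field, which I've added to the state for illustration. The labels `"retry"` and `"done"` and the `build_workflow` helper are also hypothetical names.

```python
from typing import TypedDict


class AgentState(TypedDict):
    code: str
    tests: str
    test_results: str
    iterations: int
    exit_code: int  # assumed extra field, set by the tester node


def route_after_tests(state: AgentState) -> str:
    """Return the label of the next step based on pytest's exit code."""
    # pytest exits with 0 only when every collected test passed.
    return "done" if state["exit_code"] == 0 else "retry"


def build_workflow(writer, tester, debugger):
    """Wire the loop; the three arguments are node functions you supply."""
    from langgraph.graph import StateGraph, END  # imported lazily on purpose

    workflow = StateGraph(AgentState)
    workflow.add_node("writer", writer)
    workflow.add_node("tester", tester)
    workflow.add_node("debugger", debugger)
    workflow.set_entry_point("writer")
    workflow.add_edge("writer", "tester")
    # Map the router's labels to node names (or END).
    workflow.add_conditional_edges(
        "tester", route_after_tests, {"retry": "debugger", "done": END}
    )
    # After a fix attempt, run the tests again.
    workflow.add_edge("debugger", "tester")
    return workflow.compile()
```

Keeping the router a plain function means you can unit test the branching logic without building a graph. In practice it should also check an iteration cap so the loop cannot run indefinitely.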
Security Tip: Sandboxing the Agent
Early on, I realized that giving an AI the power to run pytest locally is a security nightmare. Generated code could delete files, whether by accident or because the model reasoned its way into it. To solve this, we moved the execution to a Dockerized environment. The LangGraph node ships the code to a temporary container, runs the tests, and returns a JSON report. It’s clean, isolated, and safe.
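A sketch of how a node might call out to Docker is below. It assumes a prebuilt image with pytest and pytest-json-report already installed; the name `agent-pytest:latest` is a placeholder. The resource limits are illustrative choices, not our production values.

```python
import subprocess
from pathlib import Path


def build_docker_command(workdir: Path, image: str = "agent-pytest:latest") -> list:
    """Build a `docker run` invocation that executes pytest in a throwaway container."""
    return [
        "docker", "run",
        "--rm",                # delete the container when it exits
        "--network", "none",   # no network access for generated code
        "--memory", "512m",    # illustrative memory cap
        "-v", f"{workdir.resolve()}:/work",  # mount only the temp work dir
        "-w", "/work",
        image,                 # hypothetical image name
        "python", "-m", "pytest", "-q",
        "--json-report", "--json-report-file=report.json",
    ]


def run_in_sandbox(workdir: Path, timeout: int = 120) -> int:
    """Run the container and return pytest's exit code (requires Docker on the host)."""
    proc = subprocess.run(
        build_docker_command(workdir),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode
```

After the run, the node can read `report.json` from the mounted directory. Because only that directory is mounted, the generated code cannot reach the rest of the host filesystem.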
Beyond the Basics: Context and Mocking
Real-world code is messy. It relies on Postgres connections, Redis caches, and third-party APIs. To make the agent effective, we implemented “Context Injection.”
Before the agent starts typing, a preprocessing node extracts existing conftest.py fixtures and function signatures. We feed these into the system prompt so the AI knows exactly what mocks are available. For instance, if you have a db_session fixture, the agent should use it rather than trying to instantiate a new database driver.
# Feeding the agent context
system_msg = f"""
You are a Lead QA.
Available fixtures: {active_fixtures}
Constraints: Use 'unittest.mock' for all external calls.
"""
Adding this context reduced hallucinated imports by 80% and improved our first-pass success rate significantly.
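The fixture-extraction step can be sketched with the standard `ast` module. The `list_fixtures` helper is a hypothetical name, and it only looks at top-level functions decorated with `@pytest.fixture` or `@fixture`. A real preprocessing node would likely cover more cases.

```python
import ast


def _is_fixture_decorator(dec: ast.expr) -> bool:
    """Match @pytest.fixture, @fixture, and their called forms like @pytest.fixture(scope=...)."""
    target = dec.func if isinstance(dec, ast.Call) else dec
    if isinstance(target, ast.Attribute):
        return target.attr == "fixture"
    if isinstance(target, ast.Name):
        return target.id == "fixture"
    return False


def list_fixtures(conftest_source: str) -> list:
    """Return 'name(args)' for each top-level fixture function in the source."""
    found = []
    for node in ast.parse(conftest_source).body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if any(_is_fixture_decorator(d) for d in node.decorator_list):
            args = ", ".join(a.arg for a in node.args.args)
            found.append(f"{node.name}({args})")
    return found


sample_conftest = '''
import pytest

@pytest.fixture
def db_session():
    yield object()

@pytest.fixture(scope="session")
def redis_client(db_session):
    return {}

def helper():
    return 1
'''

active_fixtures = list_fixtures(sample_conftest)
print(active_fixtures)  # ['db_session()', 'redis_client(db_session)']
```

Parsing the file rather than importing it avoids running any conftest.py code during preprocessing.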
Lessons from the Trenches
If you’re integrating this into your CI/CD pipeline, keep these four rules in mind:
- Cap the Iterations: Agents can get stuck in a “fix-break-fix” loop. We cap our AgentState at 5 iterations. If it’s still failing by then, it flags the PR for human intervention.
- Structure Your Logs: Don’t make the LLM parse raw terminal text. Use the pytest-json-report plugin to provide structured data. It’s much cheaper and more accurate for the model to process.
- Minimize the Payload: Sending a 2,000-line file for every small test fix is a waste of tokens. We use a “Snippet Extractor” to send only the relevant function and the specific lines highlighted in the traceback.
- Keep a Human in the Loop: We use LangGraph’s interrupt_before feature. The AI proposes the fix, but a human must click “Approve” before it merges.
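The first three rules can be sketched in plain Python. All helper names here are hypothetical. The report fields (`tests`, `outcome`, `call.crash`) follow my understanding of the pytest-json-report format, so check them against the plugin's documentation. The extractor is a simplified stand-in for our “Snippet Extractor”, not its actual implementation.

```python
import ast

MAX_ITERATIONS = 5  # the cap described in the first rule


def should_escalate(iterations: int, exit_code: int) -> bool:
    """True when the run is still failing after the iteration budget is spent."""
    return exit_code != 0 and iterations >= MAX_ITERATIONS


def summarize_report(report: dict) -> list:
    """Reduce a pytest-json-report dict to one short record per failing test."""
    failures = []
    for test in report.get("tests", []):
        if test.get("outcome") != "failed":
            continue
        crash = test.get("call", {}).get("crash", {})
        failures.append({
            "nodeid": test.get("nodeid"),
            "lineno": crash.get("lineno"),
            "message": crash.get("message"),
        })
    return failures


def extract_function(source: str, lineno: int) -> str:
    """Return the source of the innermost function that contains `lineno`."""
    best = None
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if node.lineno <= lineno <= node.end_lineno:
                # Prefer the latest-starting match, i.e. the innermost function.
                if best is None or node.lineno > best.lineno:
                    best = node
    if best is None:
        return ""
    lines = source.splitlines()
    return "\n".join(lines[best.lineno - 1 : best.end_lineno])
```

A debugger prompt built from `summarize_report` and `extract_function` carries only the failing test IDs, the crash messages, and the one function involved, not the full file and raw terminal output.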
Agentic workflows aren’t here to replace us. They are here to kill the repetitive tasks that cause burnout. My current setup handles the grunt work while I sleep, turning unit testing from a chore into a background process.

