How to Build Self-Reviewing AI Agents with LangGraph

Jake McCluskey

You build self-reviewing AI agents with LangGraph by creating a graph that routes outputs through four nodes: a generator that creates initial content, a critic that evaluates quality against defined standards, an improver that refines based on feedback, and a conditional router that decides whether to stop or iterate again. This architecture lets your agents catch errors, refine outputs, and improve quality without human intervention between iterations, typically reducing manual review time by 60-70% for tasks like code generation and technical writing.

The pattern works because you're building a feedback loop into your agent's workflow. Instead of hoping a single LLM call produces perfect output, you're creating a system where the AI generates something, evaluates its own work, and makes improvements before presenting results.
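
To make that concrete, here is the same loop written as a plain Python function with direct LLM calls, before any graph machinery. It's a minimal sketch: the prompts and the APPROVED convention are illustrative, and the LangGraph version built later adds state management, routing, and observability on top of this basic shape.

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4")

def self_review(task: str, max_iterations: int = 3) -> str:
    # First draft
    output = llm.invoke(f"Complete this task: {task}").content
    for _ in range(max_iterations):
        # Critique the current draft against the task
        critique = llm.invoke(
            f"Critique this output for the task '{task}'. "
            f"Reply APPROVED if it meets the requirements.\n\n{output}"
        ).content
        if "APPROVED" in critique.upper():
            break
        # Revise using the critique
        output = llm.invoke(
            f"Task: {task}\n\nPrevious output:\n{output}\n\n"
            f"Critique:\n{critique}\n\nRewrite the output to address the critique."
        ).content
    return output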

What Are Self-Reviewing AI Agents?

Self-reviewing AI agents are systems that generate output, critique their own work against predefined standards, and iteratively improve until they meet quality thresholds. Unlike single-pass agents that produce one result and stop, these agents loop through generation and evaluation cycles.

The architecture relies on LangGraph's state machine capabilities to manage workflow between nodes. Each node performs a specific function: generation, critique, improvement, or decision-making. The graph maintains state across iterations, tracking changes and critique feedback so the improver node knows exactly what to fix.

In production systems, self-reviewing agents typically run 2-4 iterations before reaching acceptable quality thresholds. You set maximum iteration limits to prevent runaway costs. Most well-designed agents converge on good solutions within three passes.

Why Self-Reviewing Agents Produce Better Outputs Than Single-Pass Generation

Single-pass generation fails about 40% of the time on complex tasks like multi-file code generation or technical documentation. The LLM might miss edge cases, introduce logical inconsistencies, or fail to follow specific formatting requirements. Self-reviewing agents catch these issues before they reach users.

The critique step acts as a quality gate. You define explicit criteria like "code must include error handling," "documentation must reference all function parameters," or "output must be under 500 words." The critic node evaluates against these standards and provides specific feedback, not vague suggestions.

This pattern mirrors how humans actually work. You write a draft, review it, spot problems, and revise. The difference is that AI agents can iterate much faster, running complete review cycles in seconds rather than hours. For code debugging specifically, agents that review their own output reduce bug rates by roughly 65% compared to single-pass generation.

Cost is the main tradeoff. Running 3-4 iterations means 3-4x the token usage. But if that prevents shipping broken code or publishing inaccurate content, the economics work out strongly in favor of self-review for high-stakes tasks.

How to Build a Self-Reviewing Agent with LangGraph

Start by installing LangGraph and setting up your environment. You'll need Python 3.9 or later, an OpenAI API key (or Anthropic if you prefer Claude), and the LangGraph library.

pip install langgraph langchain-openai python-dotenv

The core architecture wires four responsibilities into a graph: generation, critique, improvement, and routing. In the implementation below, a generate node handles both generation and improvement, a critique node evaluates the output, and a should_continue router decides whether to loop back or stop. Here's how to build it step by step.

Step 1: Define Your State Schema

LangGraph uses a state object that passes between nodes. For a self-reviewing agent, your state needs to track the current output, critique feedback, iteration count, and completion status.

from typing import TypedDict, List

class AgentState(TypedDict):
    task: str                    # the original task description
    current_output: str          # the latest generated output
    critique: str                # most recent critique feedback
    iteration: int               # number of critique passes completed
    max_iterations: int          # hard cap to prevent runaway loops
    is_complete: bool            # set True once the critic approves
    revision_history: List[str]  # every generated version, useful for debugging

This schema gives you everything needed to manage the review loop. The revision_history list is particularly useful for debugging why an agent made specific changes across iterations.
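
For example, once a run finishes you can diff consecutive revisions with Python's standard difflib to see exactly what changed in each pass. This is a small sketch; result refers to the final state that app.invoke returns in Step 6.

import difflib

for i in range(1, len(result["revision_history"])):
    diff = difflib.unified_diff(
        result["revision_history"][i - 1].splitlines(),
        result["revision_history"][i].splitlines(),
        lineterm="",
    )
    print(f"--- Changes introduced in iteration {i} ---")
    print("\n".join(diff))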

Step 2: Create the Generator Node

The generator produces initial output or creates improved versions based on critique feedback. On the first iteration, it works from the original task. On subsequent iterations, it incorporates critique to fix issues.

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4", temperature=0.7)

def generate_node(state: AgentState) -> AgentState:
    if state["iteration"] == 0:
        # First pass: work directly from the original task
        prompt = f"Generate output for this task: {state['task']}"
    else:
        # Later passes: include the previous output and the critique so the LLM knows what to fix
        prompt = f"""Previous output: {state['current_output']}

Critique: {state['critique']}

Improve the output based on the critique above."""

    response = llm.invoke(prompt)

    state["current_output"] = response.content
    state["revision_history"].append(response.content)

    return state

Notice the conditional logic. First runs get a clean prompt. Subsequent runs include both the previous output and the critique, giving the LLM context for improvement.

Step 3: Build the Critique Node

The critic evaluates output against your quality standards. This is where you encode domain-specific requirements. For code, you might check for error handling, type hints, and documentation. For writing, you'd verify clarity, accuracy, and adherence to style guides.

def critique_node(state: AgentState) -> AgentState:
    critique_prompt = f"""Evaluate this output against these criteria:
1. Completeness: Does it fully address the task?
2. Accuracy: Are there logical errors or inconsistencies?
3. Quality: Does it meet professional standards?

Output to evaluate: {state['current_output']}

Provide specific, actionable critique. If the output meets all criteria, respond with "APPROVED"."""
    
    response = llm.invoke(critique_prompt)
    critique_text = response.content
    
    state["critique"] = critique_text
    state["is_complete"] = "APPROVED" in critique_text.upper()
    state["iteration"] += 1
    
    return state

The "APPROVED" keyword acts as your quality gate. When the critic determines output meets standards, it signals completion. You can make this more sophisticated with structured output or scoring systems, but simple keyword detection works surprisingly well.

Step 4: Add the Decision Router

The router determines whether to continue iterating or stop. It checks two conditions: whether the critic approved the output, and whether you've hit the maximum iteration limit.

def should_continue(state: AgentState) -> str:
    if state["is_complete"]:
        return "end"
    if state["iteration"] >= state["max_iterations"]:
        return "end"
    return "improve"

This function returns edge names that LangGraph uses for routing. If the output is approved or you've hit the limit, route to "end". Otherwise, send it back through the improvement cycle.

Step 5: Assemble the Graph

Now you connect the nodes into a workflow. LangGraph's StateGraph handles the orchestration.

from langgraph.graph import StateGraph, END

workflow = StateGraph(AgentState)

workflow.add_node("generate", generate_node)
workflow.add_node("critique", critique_node)

workflow.set_entry_point("generate")

workflow.add_edge("generate", "critique")

workflow.add_conditional_edges(
    "critique",
    should_continue,
    {
        "improve": "generate",
        "end": END
    }
)

app = workflow.compile()

The conditional edges are where the magic happens. After critique, the graph either loops back to generate (for improvement) or exits to END.

Step 6: Run Your Agent

Execute the graph with an initial state. You'll see it iterate through generate-critique cycles until it reaches approval or hits the iteration limit.

initial_state = {
    "task": "Write a Python function that validates email addresses using regex",
    "current_output": "",
    "critique": "",
    "iteration": 0,
    "max_iterations": 4,
    "is_complete": False,
    "revision_history": []
}

result = app.invoke(initial_state)

print("Final output:", result["current_output"])
print("Iterations used:", result["iteration"])
print("Approved:", result["is_complete"])

For production use, you'll want to add logging, error handling, and cost tracking. But this core structure handles the self-review loop effectively.
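
As a starting point for cost tracking, you can wrap the run in LangChain's OpenAI callback and log token usage per task. This sketch assumes the langchain-community package is installed; the log format and thresholds are up to you.

import logging
from langchain_community.callbacks import get_openai_callback

logging.basicConfig(level=logging.INFO)

with get_openai_callback() as cb:
    result = app.invoke(initial_state)

logging.info(
    "iterations=%s total_tokens=%s cost_usd=%.4f approved=%s",
    result["iteration"], cb.total_tokens, cb.total_cost, result["is_complete"],
)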

LangGraph Self-Critique Agent Tutorial: Real-World Use Cases

Code debugging agents are the most common application. You give the agent a buggy function, and it generates a fix, critiques the fix for correctness and edge cases, then improves until tests pass. Development teams report that self-reviewing code agents reduce debugging time by approximately 50% compared to single-pass generation.

Writing editors work similarly. The agent generates content, critiques it for clarity and accuracy, then refines. This pattern works well for technical documentation, where precision matters more than speed. One documentation team reduced review cycles from 3 days to 4 hours using self-reviewing agents for API reference generation.

QA verification systems use self-review to validate test coverage. The agent generates test cases, critiques them for completeness, and adds missing scenarios. Research validation agents fact-check their own outputs, comparing claims against source documents and flagging inconsistencies.

The pattern shines when you have clear quality criteria and when errors are expensive. If you're generating marketing copy where "good enough" works, single-pass generation is fine. If you're generating SQL queries that touch production databases, self-review is worth the extra tokens.

How to Make AI Agents Debug Themselves with Validation Loops

Debugging agents need access to execution environments. You can't just critique code by reading it. You need to run it and check results. This requires integrating code execution into your LangGraph workflow.

Add an execution node between generate and critique. This node runs the code in a sandboxed environment, captures output and errors, then passes results to the critic. The critic evaluates both code quality and execution results.

import os
import subprocess
import tempfile

# Note: this node assumes AgentState has been extended with two extra fields,
# execution_output: str and execution_error: str.
def execute_code_node(state: AgentState) -> AgentState:
    with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
        f.write(state["current_output"])
        temp_path = f.name

    try:
        result = subprocess.run(
            ['python', temp_path],
            capture_output=True,
            text=True,
            timeout=5
        )
        state["execution_output"] = result.stdout
        state["execution_error"] = result.stderr
    except subprocess.TimeoutExpired:
        state["execution_output"] = ""
        state["execution_error"] = "Execution timeout"
    finally:
        os.unlink(temp_path)  # clean up the temporary script

    return state

Your critique node then evaluates both the code structure and execution results. If tests fail, the critique includes specific error messages. The improver uses these concrete failures to fix bugs.

This approach works exceptionally well for fixing runtime errors, where the agent can see exactly what broke and why. For production AI coding agents, combining static analysis with execution feedback produces significantly better results than either approach alone.
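
Wiring the execution node into the earlier graph takes only a few extra edges. This is a sketch of the revised assembly; it assumes AgentState has been extended with the execution_output and execution_error fields noted in the comment above.

workflow = StateGraph(AgentState)

workflow.add_node("generate", generate_node)
workflow.add_node("execute", execute_code_node)
workflow.add_node("critique", critique_node)

workflow.set_entry_point("generate")
workflow.add_edge("generate", "execute")
workflow.add_edge("execute", "critique")

workflow.add_conditional_edges(
    "critique",
    should_continue,
    {
        "improve": "generate",
        "end": END
    }
)

app = workflow.compile()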

Set strict timeout limits (2-5 seconds) to prevent infinite loops or resource exhaustion. Use containerization or virtual environments for additional isolation if you're running untrusted code.

Setting Stopping Criteria and Iteration Limits

Infinite loops are the biggest risk with self-reviewing agents. If your critique criteria are impossible to meet or contradictory, the agent will iterate forever, burning tokens and time.

Always set a hard maximum iteration limit. For most tasks, 4-5 iterations is reasonable. If the agent can't produce acceptable output in five tries, there's likely a problem with your task description or critique criteria, not the agent's capabilities.

Use multiple stopping conditions. Check for approval keywords, measure improvement between iterations, and track token usage. If the agent makes the same mistake twice or if changes between iterations are minimal, stop early.

import difflib

def calculate_similarity(a: str, b: str) -> float:
    # Ratio of matching characters between two strings, from 0.0 to 1.0
    return difflib.SequenceMatcher(None, a, b).ratio()

def should_continue_advanced(state: AgentState) -> str:
    if state["is_complete"]:
        return "end"

    if state["iteration"] >= state["max_iterations"]:
        return "end"

    # Once there are at least two revisions, stop if the latest one barely changed
    if state["iteration"] > 1:
        current = state["current_output"]
        previous = state["revision_history"][-2]

        similarity = calculate_similarity(current, previous)
        if similarity > 0.95:
            return "end"

    return "improve"

This extended router stops if outputs stop changing significantly, preventing wasteful iterations that make tiny adjustments without meaningful improvement.

Monitor your critique prompts carefully. If you're seeing consistent failures to approve or agents hitting max iterations frequently, your standards might be unrealistic or poorly specified. Adjust criteria based on actual agent performance, not theoretical ideals.

When to Use Self-Review Patterns vs Single-Pass Generation

Self-review makes sense when errors are costly and quality thresholds are high. Code generation, legal document drafting, financial analysis, and technical writing all benefit from iterative refinement. The extra cost (typically 3-4x token usage) pays for itself in reduced error rates and review time.

Single-pass generation works better for creative tasks, brainstorming, or situations where speed matters more than perfection. If you're generating blog post ideas or drafting marketing emails, iteration adds cost without proportional value. Similarly, when you need real-time responses in chat applications, multi-iteration workflows create unacceptable latency.

Consider hybrid approaches for medium-stakes tasks. Run single-pass generation for initial drafts, then trigger self-review only when automated checks detect potential issues. This gives you speed when things work and quality when they don't.
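
One way to implement that hybrid is to gate the review loop behind a cheap automated check. In this sketch, passes_basic_checks is a hypothetical placeholder for whatever fast validation fits your task (a linter, a regex, a schema check), llm is the model defined earlier, and app is the compiled review graph from Step 5.

def passes_basic_checks(draft: str) -> bool:
    # Hypothetical cheap check; swap in a linter, regex, or schema validation
    return "def " in draft and "TODO" not in draft

def generate_with_fallback(task: str) -> str:
    # Fast path: a single LLM call
    draft = llm.invoke(f"Generate output for this task: {task}").content
    if passes_basic_checks(draft):
        return draft

    # Slow path: full generate-critique-improve loop
    result = app.invoke({
        "task": task,
        "current_output": "",
        "critique": "",
        "iteration": 0,
        "max_iterations": 4,
        "is_complete": False,
        "revision_history": [],
    })
    return result["current_output"]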

Cost management matters. At current pricing (roughly $0.03 per 1K tokens for GPT-4), a four-iteration self-review agent processing 2K tokens per iteration costs about $0.24 per task. That's negligible for code generation that saves an hour of debugging, but significant if you're processing thousands of low-value tasks daily.

The decision depends on your error tolerance and economics. For tasks where mistakes create customer-facing problems or require expensive human review, self-review agents deliver clear ROI. For high-volume, low-stakes generation, the simpler approach wins.

Building self-reviewing AI agents with LangGraph transforms how you approach AI reliability. By structuring your workflows as generate-critique-improve loops rather than single-pass calls, you create systems that catch their own mistakes and iteratively improve toward quality standards. Start with the four-node architecture outlined here, test it on a real task from your domain, and adjust critique criteria based on actual performance. The pattern works best when you have clear quality standards and when the cost of errors exceeds the cost of additional iterations, which describes most production AI applications where output quality actually matters.
