How to Build AI Agents That Critique and Improve Work

Jake McCluskey

The Reflexion pattern is a three-component AI architecture where one model generates content, a second component scores that output against a defined rubric, and an orchestrator manages the loop until quality standards are met. Instead of accepting whatever your AI produces on the first try, you're building a system that automatically rewrites, re-evaluates, and refines until it hits your threshold. This guide shows you exactly how to implement this pattern, from defining scoring rubrics to managing iteration costs, with working examples in cold email generation and beyond.

What Is the Reflexion Pattern in AI Agents

The Reflexion pattern breaks AI content generation into three distinct roles. The generator creates initial output based on your prompt. The critic evaluates that output against specific criteria, typically scoring from 0 to 10. The orchestrator manages the loop, feeding critique back to the generator and deciding when to stop.

This isn't running the same prompt twice. You're creating a feedback mechanism where each iteration uses the previous attempt plus its scored weaknesses. In practice, this means your second draft addresses specific shortcomings the critic identified, not random variations.

Research teams implementing this pattern report quality improvements of roughly 35% to 60% compared to single-shot generation, measured by human preference scores. The key difference: structured critique creates directional improvement, not just more attempts.

Why Building AI Agents with Quality Control Feedback Matters

Single-shot AI outputs are inconsistent. You've probably noticed that GPT-4 or Claude can produce brilliant work one moment and generic nonsense the next. The Reflexion pattern solves this by making quality control automatic rather than manual.

Without this approach, you're stuck in a loop yourself: generate, read, adjust prompt, generate again, repeat until frustrated. That works for one-off tasks but breaks completely at scale. If you're generating 50 cold emails daily or producing code snippets for documentation, manual review becomes your bottleneck.

The economic case is straightforward. A typical implementation might cost 3x to 5x more in tokens than single-shot generation, but it eliminates the human review step entirely. For a business sending 200 personalized emails weekly, that's 4 to 6 hours of staff time saved, easily justifying the extra API costs of $15 to $30 monthly.

How to Make AI Rewrite Its Own Output Until It's Good

Implementation starts with defining your rubric. You need measurable criteria that the critic can score consistently. Vague goals like "make it better" produce vague results.

Define Your Scoring Rubric

For cold emails, a practical rubric includes five dimensions: personalization (does it reference specific details about the recipient?), clarity (is the value proposition obvious in 10 seconds?), brevity (under 150 words?), tone (professional but not stiff?), and call-to-action strength (single clear next step?). Each dimension gets scored 0 to 10.

Your threshold determines when to stop iterating. Setting it at 8/10 average across all dimensions typically produces professional-grade output. Pushing to 9/10 often triggers diminishing returns where the AI starts over-optimizing and losing naturalness. And honestly, most teams skip testing this threshold properly.
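One way to make the rubric and threshold concrete is to keep them as plain data. This is a minimal sketch, not a fixed format: the names `RUBRIC`, `average_score`, and `meets_threshold` are made up for illustration, and the criterion descriptions simply restate the five dimensions above.

```python
# Illustrative rubric: criterion name -> what the critic should look for.
# All names here are assumptions for the sketch, not part of any library.
RUBRIC = {
    "personalization": "References specific details about the recipient",
    "clarity": "Value proposition is obvious in 10 seconds",
    "brevity": "Under 150 words",
    "tone": "Professional but not stiff",
    "cta_strength": "Single clear next step",
}


def average_score(scores: dict) -> float:
    """Mean of the per-criterion scores; every rubric criterion must be present."""
    missing = set(RUBRIC) - set(scores)
    if missing:
        raise ValueError(f"critic omitted criteria: {sorted(missing)}")
    return sum(scores[name] for name in RUBRIC) / len(RUBRIC)


def meets_threshold(scores: dict, threshold: float = 8.0) -> bool:
    """True when the average across all rubric dimensions reaches the threshold."""
    return average_score(scores) >= threshold
```

Failing loudly on a missing criterion keeps a sloppy critic reply from quietly inflating the average.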

Build the Three Components

Here's a minimal implementation in Python using OpenAI's API:


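import json

from openai import OpenAI  # requires the openai package, version 1.0 or later

client = OpenAI()  # reads the OPENAI_API_KEY environment variable


def generate_email(context, previous_attempt=None, critique=None):
    """Generator: write a first draft, or revise the previous one using the critique."""
    prompt = f"Write a cold email for: {context}"
    if previous_attempt:
        prompt += (
            f"\n\nPrevious attempt:\n{previous_attempt}"
            f"\n\nCritique:\n{json.dumps(critique, indent=2)}"
            "\n\nRewrite to address these issues."
        )

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,  # some variety between drafts
    )
    return response.choices[0].message.content


def critique_email(email, rubric):
    """Critic: score the email against the rubric and return the parsed JSON."""
    prompt = f"""Score this email on each criterion (0-10):
{rubric}

Email:
{email}

Return only JSON in this shape, with one entry per criterion:
{{"scores": {{"<criterion>": <number>}}, "feedback": {{"<criterion>": "<specific feedback>"}}}}"""

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,  # lower temperature for more consistent scoring
    )
    # Raises json.JSONDecodeError if the model adds text around the JSON.
    return json.loads(response.choices[0].message.content)


def reflexion_loop(context, rubric, threshold=8.0, max_iterations=5):
    """Orchestrator: generate, critique, and revise until the threshold or the cap is hit."""
    attempt = None
    critique_result = None

    for i in range(max_iterations):
        attempt = generate_email(context, attempt, critique_result)
        critique_result = critique_email(attempt, rubric)

        scores = critique_result["scores"]
        avg_score = sum(scores.values()) / len(scores)

        if avg_score >= threshold:
            return attempt, i + 1, critique_result

    # Threshold never reached: return the last attempt after max_iterations rounds.
    return attempt, max_iterations, critique_result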

This basic structure handles 80% of use cases. The orchestrator runs up to five iterations, stopping early if the average score hits your threshold. You'll want to add error handling and logging for production use.
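As one example of the error handling mentioned above, the critic's reply is the most fragile step: models sometimes wrap JSON in extra text or skip a criterion. The sketch below shows one possible approach. `parse_critique` and `critique_with_retry` are hypothetical helper names, and `call_critic` stands in for whatever function actually calls your model.

```python
import json
import logging

logger = logging.getLogger("reflexion")


def parse_critique(raw: str, criteria: list) -> dict:
    """Parse the critic's reply; raise ValueError if it is not usable."""
    # Models sometimes wrap JSON in prose or code fences, so keep only the outermost {...}.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in critic reply")
    data = json.loads(raw[start:end + 1])  # JSONDecodeError is a ValueError subclass
    scores = data.get("scores")
    if not isinstance(scores, dict):
        raise ValueError("missing 'scores' object")
    for name in criteria:
        value = scores.get(name)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= 10:
            raise ValueError(f"bad or missing score for {name!r}: {value!r}")
    return data


def critique_with_retry(call_critic, criteria, retries=2):
    """call_critic is any zero-argument function returning the critic's raw text."""
    for attempt in range(1, retries + 2):
        try:
            return parse_critique(call_critic(), criteria)
        except ValueError as exc:
            logger.warning("critique attempt %d failed: %s", attempt, exc)
    raise RuntimeError("critic never returned valid JSON")
```

Retrying only on parse failures keeps the retry budget separate from the iteration budget, so a malformed reply does not count as a wasted revision round.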

Choose Your Model Strategy

You have two options: use the same model for generation and critique, or use different models. Same-model setups (GPT-4 for both) are simpler and cheaper. Different-model setups (GPT-4 for generation, Claude Sonnet for critique) can provide more objective evaluation since the critic isn't reviewing its own work.

Testing shows that using Claude 3.5 Sonnet as the critic while GPT-4 generates produces roughly 15% more substantial revisions per iteration. The models have different biases, so the critic catches issues the generator might overlook. For more on choosing between Claude models, see Claude Sonnet 3.5 vs Opus 3 for Coding.
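If you want to switch between same-model and different-model setups without rewriting the loop, one option is to pass the generator and critic in as plain functions. This is a sketch under that assumption. `run_loop`, `Generator`, and `Critic` are illustrative names, and each callable would wrap whichever provider's client you actually use.

```python
from typing import Callable, Optional

# A generator takes (context, previous_draft, previous_review) and returns a new draft.
Generator = Callable[[str, Optional[str], Optional[dict]], str]
# A critic takes a draft and returns {"scores": {...}, "feedback": {...}}.
Critic = Callable[[str], dict]


def run_loop(context: str, generate: Generator, critique: Critic,
             threshold: float = 8.0, max_iterations: int = 5):
    """Provider-agnostic loop: returns (draft, iterations_used, last_review)."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    draft, review = None, None
    for iteration in range(1, max_iterations + 1):
        draft = generate(context, draft, review)
        review = critique(draft)
        scores = review["scores"]
        if sum(scores.values()) / len(scores) >= threshold:
            break
    return draft, iteration, review
```

With this shape, swapping the critic to a different model is a one-argument change, and you can test the orchestration logic with stub functions before spending any tokens.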

AI Agent Self-Critique Loop Tutorial: Cold Email Example

Let's walk through a complete example. You're a sales development rep targeting marketing directors at Series B startups. Your product is an analytics dashboard.

Your context input: "Recipient: Sarah Chen, Marketing Director at Flowstate (Series B, 80 employees, recently raised $15M). They use HubSpot but struggle with attribution. Our product: multi-touch attribution dashboard that integrates with their existing stack."

Your rubric:


Personalization (0-10): References specific company details, role, or pain points
Clarity (0-10): Value proposition understandable in first 20 words
Brevity (0-10): 100-150 words total (10 = perfect length, -2 points per 25 words over)
Tone (0-10): Professional but conversational, avoids jargon
CTA strength (0-10): Single specific action, low friction
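Some rubric lines don't need a model at all. The brevity rule above is mechanical, so you could score it in code and reserve the critic for judgment calls. A minimal sketch, with two assumptions the rubric leaves open: every started block of 25 extra words costs 2 points, and emails under 100 words are not penalized. `brevity_score` is a hypothetical helper name.

```python
import math


def brevity_score(email: str, max_words: int = 150) -> int:
    """Score 10 at or under max_words, minus 2 per started block of 25 words over."""
    words = len(email.split())
    over = max(0, words - max_words)
    penalty = 2 * math.ceil(over / 25)  # assumption: a partial block counts as a full one
    return max(0, 10 - penalty)
```

Under these assumptions, the 200-word first draft in this example would get a brevity score of 6, and the 145-word rewrite would get 10.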

First iteration produces a generic 200-word email scoring 5.2/10. The critic flags: too long, buries the value prop in paragraph two, weak personalization, vague CTA ("let's chat sometime").

Second iteration addresses these: 145 words, leads with attribution pain point, references their recent funding, specific CTA (15-minute call next Tuesday or Wednesday). Score: 7.8/10. Still flagged for slightly stiff tone.

Third iteration loosens the language, adds a brief relevant insight about Series B marketing challenges. Score: 8.4/10. Loop terminates, email ships.

This took three API calls for generation and three for critique. At current GPT-4 pricing, that's roughly $0.08 per email. Compared to 10 minutes of your time manually refining, the ROI is obvious.

Agentic AI Generator Critic Pattern Explained: Beyond Email

The same architecture works across content types. You just swap the rubric and adjust thresholds.

Code Generation with Quality Control

For code, your rubric might score: correctness (passes test cases), readability (clear variable names, comments where needed), efficiency (avoids obvious performance issues), and style consistency (matches project conventions). Set your threshold at 9/10 because code errors are expensive.

This pattern pairs well with self-verifying coding agent loops where the critic can actually run the code and check outputs, not just evaluate style.
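To make "the critic can actually run the code" concrete, here is a rough sketch of a correctness scorer that executes a generated function against test cases. `correctness_score` is a hypothetical name and the 0 to 10 scaling is an assumption. It uses `exec`, so treat it as illustration only and run untrusted model output inside a proper sandbox.

```python
def correctness_score(source: str, func_name: str, cases: list) -> float:
    """Run generated code against (args, expected) pairs and return a 0-10 score."""
    if not cases:
        raise ValueError("at least one test case is required")
    namespace: dict = {}
    try:
        exec(source, namespace)  # WARNING: only execute untrusted code in a sandbox
        func = namespace[func_name]
    except Exception:
        return 0.0  # code that does not load or define the function scores zero
    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed case
    return 10.0 * passed / len(cases)
```

The failing cases make good critique text: feeding the generator the exact inputs that broke is usually more useful than a bare number.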

Content Creation for Marketing

Blog introductions, product descriptions, and ad copy all benefit from iterative refinement. Your rubric here focuses on: hook strength (does it grab attention in 5 seconds?), keyword integration (natural inclusion without stuffing), readability (Flesch score above 60), and brand voice match (scored against example content).
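The readability criterion is another one you can compute rather than ask a model to estimate. The sketch below uses the standard Flesch Reading Ease formula, 206.835 - 1.015 × (words per sentence) - 84.6 × (syllables per word). The syllable counter is a crude vowel-group heuristic and an assumption of this example, so treat scores as approximate.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, dropping a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(1, count)


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease; higher means easier to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence")
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))
```

A critic step could then compare the result to the rubric's cutoff of 60 and pass the number back as feedback.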

Marketing teams report that Reflexion-generated content requires roughly 40% less editing time than single-shot AI outputs. The critic catches the obvious issues before human review.

Data Analysis Reports

When generating insights from datasets, the critic evaluates: statistical accuracy (claims supported by data), clarity of visualizations (appropriate chart types), actionability (specific recommendations), and executive summary quality (key points in 3 sentences). For more on generating accurate analytical reports, check out how to make Claude AI generate accurate reports.

Managing Iteration Costs and Diminishing Returns

Every loop costs tokens. You need to know when to stop before you're spending $2 in API calls to improve a $0.10 output.

Track your quality improvement per iteration. Most implementations see the biggest jump from iteration 1 to 2 (average improvement: 2.1 points). Iteration 2 to 3 typically adds 0.8 points. By iteration 4, you're often seeing 0.3 points or less. That's your signal to cap iterations at 3 for most use cases.
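You can also let the loop detect the flattening itself instead of relying on a fixed cap. A minimal sketch: `should_stop` is a hypothetical helper, and the default `min_gain` of 0.3 is borrowed from the figure above as a starting point rather than a tuned value.

```python
def should_stop(history: list, threshold: float = 8.0, min_gain: float = 0.3) -> bool:
    """history holds the average score of each iteration so far, oldest first."""
    if not history:
        return False
    if history[-1] >= threshold:
        return True  # quality bar reached
    if len(history) >= 2 and history[-1] - history[-2] <= min_gain:
        return True  # improvement has flattened out
    return False
```

Call it after each critique with the running list of average scores, and break out of the loop when it returns `True`.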

Set hard maximums. Even with a threshold of 8/10, include a max_iterations parameter (the code above defaults to 5; 3 is often enough given the drop-off just described). Some prompts are genuinely difficult, and you don't want a runaway loop burning through your API budget. If you hit max iterations without reaching threshold, return the best attempt with a flag for human review.
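Returning the best attempt means remembering every draft's score, because a later iteration can score lower than an earlier one. Here is one way that could look. `LoopResult` and `pick_result` are illustrative names, not part of the implementation shown earlier.

```python
from dataclasses import dataclass


@dataclass
class LoopResult:
    text: str
    score: float
    iterations: int
    needs_human_review: bool


def pick_result(attempts: list, threshold: float = 8.0) -> LoopResult:
    """attempts is [(draft, average_score), ...] in the order they were produced."""
    if not attempts:
        raise ValueError("no attempts recorded")
    # max() keeps the earliest draft when two drafts tie on score.
    best_text, best_score = max(attempts, key=lambda pair: pair[1])
    return LoopResult(
        text=best_text,
        score=best_score,
        iterations=len(attempts),
        needs_human_review=best_score < threshold,
    )
```

Downstream code can then route anything with `needs_human_review` set to a queue instead of sending it automatically.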

Consider using cheaper models for early iterations. GPT-3.5-turbo for iterations 1 and 2, then GPT-4 for final polish. This cuts costs by roughly 60% while maintaining output quality within 5% of all-GPT-4 implementations.
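The model schedule can be a small function the orchestrator consults on each pass. A sketch, assuming 1-based iteration numbers: `model_for_iteration` is a hypothetical helper, and the model identifiers are the ones named in this post, so swap in whatever your provider currently offers.

```python
def model_for_iteration(iteration: int, cheap: str = "gpt-3.5-turbo",
                        strong: str = "gpt-4", cheap_rounds: int = 2) -> str:
    """Return the model name to use for a 1-based iteration number."""
    if iteration < 1:
        raise ValueError("iteration numbers start at 1")
    return cheap if iteration <= cheap_rounds else strong
```

Keeping the schedule in one place makes it easy to A/B test different cutover points against your cost and quality numbers.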

Monitor your cost per successful output. For cold emails, you should be under $0.15 per final email. For code generation, under $0.50 per function. If you're exceeding these, your rubric might be too strict or your generator prompt needs improvement.
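To monitor cost per successful output, you need to add up token spend across every call, including the loops that never reach threshold. A rough sketch follows. `CostTracker` is a made-up class, and the per-token prices are placeholders you would replace with your provider's current price list. Token counts are typically reported on each API response (for example, the `usage` field on OpenAI chat completions).

```python
from dataclasses import dataclass


@dataclass
class CostTracker:
    input_price_per_1k: float   # USD per 1,000 prompt tokens (placeholder value)
    output_price_per_1k: float  # USD per 1,000 completion tokens (placeholder value)
    total_cost: float = 0.0
    successes: int = 0

    def record_call(self, prompt_tokens: int, completion_tokens: int) -> float:
        """Add one API call's cost to the running total and return it."""
        cost = (prompt_tokens / 1000 * self.input_price_per_1k
                + completion_tokens / 1000 * self.output_price_per_1k)
        self.total_cost += cost
        return cost

    def record_success(self) -> None:
        """Call once for each output that reached the threshold."""
        self.successes += 1

    def cost_per_success(self) -> float:
        """Total spend divided by successful outputs (infinite if there are none yet)."""
        if self.successes == 0:
            return float("inf")
        return self.total_cost / self.successes
```

Because failed loops still add to `total_cost`, this number rises when your rubric is too strict, which is exactly the signal described above.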

When to Use Reflexion vs. Other Patterns

The Reflexion pattern isn't always the right choice. It shines when output quality variation is high and quality requirements are specific. It's overkill when single-shot generation already works 90% of the time.

Use Reflexion for: client-facing content, code that ships to production, analysis informing major decisions, or any scenario where the cost of poor quality exceeds $5. Skip it for: internal drafts, brainstorming, exploratory analysis, or high-volume low-stakes tasks where speed matters more than perfection.

For complex tasks requiring multiple specialized agents, consider whether you need full multi-agent orchestration systems instead of just generator-critic loops.

The Reflexion pattern gives you a systematic way to raise AI output quality without adding human review time. By splitting generation from evaluation and running structured feedback loops, you're building AI systems that actually improve their work instead of just producing more of it. Start with a simple three-component architecture, define clear rubrics for your use case, and track your quality improvements per iteration to find the sweet spot between cost and output quality. The pattern works because it mirrors how humans actually write: draft, critique, revise, repeat until good enough.
