How to Build a Self-Debugging AI Coding Agent in Python

Jake McCluskey

You can build an autonomous AI coding agent that writes, executes, debugs, and rewrites code by creating a loop that sends code to a language model, runs it in a sandboxed environment, captures errors, and feeds those errors back to the model with instructions to fix them. This process repeats until the code runs successfully or hits a retry limit. The core components you need are an LLM API (OpenAI, Anthropic, or a local model), a safe code execution environment (subprocess with timeouts or Docker), error parsing logic, and retry handling that validates success.

What Makes an AI Agent Self-Debugging Instead of Just a Code Generator?

A standard code generator produces code based on your prompt and stops there. You get the output, try to run it, hit an error, and manually go back to the LLM with context about what broke. A self-debugging agent automates this entire feedback cycle.

The difference is the closed loop. The agent generates code, executes it automatically, reads any error messages or stack traces, analyzes what went wrong, and rewrites the code without you lifting a finger. This continues until the code works or reaches a maximum retry limit, typically between two and six attempts based on token costs and diminishing returns.

This approach transforms LLMs from suggestion engines into autonomous problem solvers. Instead of getting 70% working code that you debug yourself, you get an agent that iterates toward 100% working code. The success rate for simple to moderate tasks typically reaches 85-90% within three retry cycles when you're using models like GPT-4 or Claude 3.5 Sonnet.

Why Autonomous Debugging Matters for Real Development Work

Developers spend roughly 35-50% of their coding time debugging rather than writing new features. When you use AI-generated code without a self-debugging loop, you're just shifting the problem: instead of writing bugs yourself, you're debugging someone else's bugs. The AI's bugs.

A self-debugging agent changes the economics. For tasks like data analysis scripts, API integrations, or automation workflows, the agent can go from prompt to working output in minutes instead of the back-and-forth that eats up hours. This matters most for repetitive tasks where you need working code fast but don't want to invest deep debugging time.

Small business owners and non-technical users benefit even more. You can describe what you need in plain language and get working automation without knowing how to read Python tracebacks. The agent handles the technical troubleshooting that would normally require a developer.

There's a practical limit, though. Complex applications with multiple dependencies, database connections, or intricate business logic still need human oversight. The agent works best for self-contained scripts under 200 lines where errors are clear and fixable through code changes alone.

How Does the Write-Execute-Debug-Rewrite Loop Actually Work?

The core pattern is simple: generate code, run it, check the result, and decide whether to retry with error context or return success. This is a variation of the ReAct pattern (Reasoning + Acting) adapted specifically for code execution.

Here's the step-by-step flow. First, you send your task description to the LLM with a system prompt that instructs it to write Python code. Second, you extract the code from the response, usually wrapped in markdown code fences. Third, you execute that code in a controlled environment and capture both stdout and stderr.

Fourth, you check if execution succeeded (exit code 0, no exceptions). If it worked, you're done. If it failed, you capture the full error message and traceback. Fifth, you send a follow-up prompt to the LLM that includes the original task, the code it wrote, the error message, and instructions to fix the problem. Sixth, repeat from step two until you get working code or hit your retry limit.

The magic happens in step five. You're giving the model concrete feedback about what broke, which is far more effective than vague instructions. Error messages are goldmines of debugging information, and modern LLMs are trained to interpret them.
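
As a rough illustration of step five, the follow-up prompt can be assembled from the three pieces the model needs: the task, the failing code, and the error. The helper name build_fix_prompt and the exact wording are assumptions made for this sketch, not a required format.

```python
def build_fix_prompt(task: str, code: str, error: str) -> str:
    """Assemble a follow-up prompt from the task, the failing code, and the error.

    Hypothetical helper: the name and the wording are illustrative only.
    """
    return (
        f"Task: {task}\n\n"
        f"The following code was written for this task:\n\n{code}\n\n"
        f"Running it produced this error:\n\n{error}\n\n"
        "Fix the specific problem and return the complete corrected code."
    )


if __name__ == "__main__":
    # Example: a missing closing parenthesis reported by the interpreter.
    print(build_fix_prompt(
        task="Print the sum of 1 through 10",
        code="print(sum(range(1, 11))",
        error="SyntaxError: '(' was never closed",
    ))
```

A stateless prompt like this suits setups where you don't keep the full conversation history. If you do keep the history, the error message alone is usually enough.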

Setting Up Your Python Environment and Dependencies

You need three categories of tools: an LLM API client, a code execution method, and error handling utilities. For the LLM, install either OpenAI's library (pip install openai) or Anthropic's (pip install anthropic). Both work well, though Claude tends to produce slightly more reliable code for data tasks in my experience.

For code execution, Python's built-in subprocess module is enough for basic isolation: you run generated code as a separate process with a timeout. It does not restrict file system or network access, so treat it as a guard against crashes and hangs rather than a true sandbox. For stronger isolation, use Docker containers, though that adds complexity most projects don't need initially.
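
If you do reach for Docker, the sketch below shows one way to build a locked-down docker run command for a single script. The image tag, the limit values, the /sandbox mount path, and the function name are all assumptions chosen for illustration, so adjust them to your setup.

```python
from pathlib import Path


def build_docker_command(script_path: str, image: str = "python:3.12-slim") -> list:
    """Build a `docker run` command that executes one script in a restricted container.

    Hypothetical helper: image, limits, and mount path are illustrative defaults.
    """
    script = Path(script_path).resolve()  # Docker bind mounts need an absolute path
    return [
        "docker", "run",
        "--rm",                  # remove the container when it exits
        "--network", "none",     # no network access
        "--memory", "256m",      # cap memory
        "--cpus", "1",           # cap CPU
        "--pids-limit", "64",    # cap the number of processes
        "--read-only",           # read-only root file system
        "-v", f"{script}:/sandbox/script.py:ro",  # mount the script read-only
        image,
        "python", "/sandbox/script.py",
    ]
```

The resulting list can be passed to subprocess.run with capture_output=True, text=True, and a timeout, just like a plain Python command. Note that the subprocess timeout stops the docker client process, so the container itself may need separate cleanup.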

You'll also want re for parsing code blocks and potentially ast for validating Python syntax before execution. No exotic dependencies required. The entire agent can run with standard library tools plus your LLM client of choice.
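
A syntax pre-check with ast can catch obviously broken code before you spend time on a subprocess run. This is a minimal sketch; the function name check_syntax and the message format are assumptions.

```python
import ast
from typing import Optional


def check_syntax(code: str) -> Optional[str]:
    """Return None if the code parses, otherwise a short description of the syntax error."""
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return f"SyntaxError on line {exc.lineno}: {exc.msg}"
    return None


if __name__ == "__main__":
    print(check_syntax("total = sum([1, 2, 3])"))  # valid code parses cleanly
    print(check_syntax("print('hello'"))           # missing closing parenthesis
```

A non-None result can be sent straight back to the model as error feedback, skipping execution for that attempt. Parsing only checks syntax, so runtime errors such as a misspelled variable name still need the execution step.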

Building the Code Execution Sandbox

Never use exec() or eval() directly on AI-generated code. Even with good prompting, the model might produce code that runs infinite loops, consumes excessive memory, or attempts file system operations you don't want.

Instead, write generated code to a temporary file and execute it as a subprocess with a strict timeout. Here's a safe execution function:


import os
import subprocess
import sys
import tempfile

def execute_code_safely(code, timeout=10):
    """Execute Python code in a subprocess with timeout and capture output."""
    # Write the generated code to a temporary file so it runs in its own process.
    with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
        f.write(code)
        temp_file = f.name
    
    try:
        # sys.executable runs the same interpreter as the agent, which avoids
        # failures on systems where `python` is not on PATH.
        result = subprocess.run(
            [sys.executable, temp_file],
            capture_output=True,
            text=True,
            timeout=timeout
        )
        return {
            'success': result.returncode == 0,
            'stdout': result.stdout,
            'stderr': result.stderr,
            'exit_code': result.returncode
        }
    except subprocess.TimeoutExpired:
        # The child process is killed when the timeout expires.
        return {
            'success': False,
            'stdout': '',
            'stderr': f'Execution timed out after {timeout} seconds',
            'exit_code': -1
        }
    finally:
        # Always remove the temporary file, even if execution failed.
        os.unlink(temp_file)

This function writes code to a temporary file, runs it with a 10-second timeout, captures all output, and cleans up afterward. The timeout prevents runaway processes. You can adjust it based on your expected task complexity.
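
A timeout alone does not cap memory or CPU use. On Linux, one way to add those caps is the standard library's resource module, applied in the child process through the preexec_fn argument of subprocess.run. The sketch below is illustrative only: the function names and default limits are assumptions, and it will not work on Windows, where the resource module is unavailable.

```python
import resource
import subprocess
import sys


def _limit_setter(max_memory_bytes, max_cpu_seconds):
    """Build a function that lowers the child's soft resource limits (POSIX only)."""
    def apply_limits():
        for kind, value in (
            (resource.RLIMIT_AS, max_memory_bytes),   # address space (memory)
            (resource.RLIMIT_CPU, max_cpu_seconds),   # CPU time in seconds
        ):
            current_soft, hard = resource.getrlimit(kind)
            # Never loosen a limit that is already stricter than the requested one.
            for existing in (current_soft, hard):
                if existing != resource.RLIM_INFINITY:
                    value = min(value, existing)
            resource.setrlimit(kind, (value, hard))
    return apply_limits


def run_with_limits(script_path, timeout=10,
                    max_memory_bytes=1024 ** 3, max_cpu_seconds=5):
    """Run a script with a wall-clock timeout plus memory and CPU caps.

    Hypothetical helper: the 1 GiB and 5 second defaults are illustrative.
    """
    return subprocess.run(
        [sys.executable, script_path],
        capture_output=True,
        text=True,
        timeout=timeout,
        preexec_fn=_limit_setter(max_memory_bytes, max_cpu_seconds),
    )
```

A script that exceeds the memory cap typically fails with MemoryError, and one that exceeds the CPU cap is terminated by a signal. Either way the return code is non-zero, so the retry loop can treat it like any other failure.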

Crafting Prompts That Produce Debuggable Code

Your initial system prompt shapes how the agent behaves across all retry attempts. You want to encourage the LLM to write clear, simple code that's easy to debug when it fails.

Here's an effective system prompt template:


SYSTEM_PROMPT = """You are a Python coding assistant that writes complete, executable code.

Rules:
- Write only valid Python 3 code
- Include all necessary imports
- Add brief comments explaining key steps
- Keep code simple and readable
- Do not use external files or network requests unless specifically requested
- Wrap your code in triple backticks with 'python' language tag

When you receive error messages, analyze them carefully and fix the specific issue.
Do not rewrite the entire approach unless the error indicates a fundamental problem."""

The last instruction is critical. Without it, models sometimes abandon working logic and try completely different approaches when they hit minor syntax errors. You want incremental fixes, not full rewrites.

Implementing the Retry Loop with Error Feedback

The retry loop is where everything comes together. You need to track attempt counts, build error context messages, and know when to give up. Here's a complete implementation:


import re
from openai import OpenAI

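# With no api_key argument, the client reads the OPENAI_API_KEY environment variable,
# which keeps the key out of source control.
client = OpenAI()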

def extract_code_from_response(response_text):
    """Extract Python code from markdown code blocks."""
    pattern = r'```python\n(.*?)```'
    matches = re.findall(pattern, response_text, re.DOTALL)
    return matches[0] if matches else response_text.strip()

def self_debugging_agent(task_description, max_attempts=5):
    """Run a self-debugging coding agent that iterates until success."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Task: {task_description}"}
    ]
    
    for attempt in range(1, max_attempts + 1):
        print(f"\nAttempt {attempt}/{max_attempts}")
        
        response = client.chat.completions.create(
            model="gpt-4",
            messages=messages,
            temperature=0.2
        )
        
        code = extract_code_from_response(response.choices[0].message.content)
        print(f"Generated code:\n{code}\n")
        
        execution_result = execute_code_safely(code)
        
        if execution_result['success']:
            print("Success! Code executed without errors.")
            print(f"Output:\n{execution_result['stdout']}")
            return {
                'success': True,
                'code': code,
                'output': execution_result['stdout'],
                'attempts': attempt
            }
        
        error_message = execution_result['stderr']
        print(f"Error:\n{error_message}\n")
        
        messages.append({"role": "assistant", "content": response.choices[0].message.content})
        messages.append({
            "role": "user",
            "content": f"The code produced this error:\n\n{error_message}\n\nPlease fix the error and provide corrected code."
        })
    
    return {
        'success': False,
        'error': 'Max attempts reached without successful execution',
        'attempts': max_attempts
    }

This function maintains conversation history so the model has full context across attempts. Temperature is set low (0.2) to get consistent, predictable fixes rather than creative experimentation. Each failed attempt adds both the assistant's code and your error feedback to the message history.
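
Because every failed attempt is appended to the history, long tracebacks get re-sent on each retry. One optional refinement, not part of the implementation above, is to trim stderr before adding it to the messages. The helper name trim_error and the 20-line default are assumptions for this sketch.

```python
def trim_error(stderr: str, max_lines: int = 20) -> str:
    """Keep only the last `max_lines` lines of an error message.

    Python prints the exception type and message at the end of a traceback,
    so the tail is usually the most useful part.
    """
    lines = stderr.strip().splitlines()
    if len(lines) <= max_lines:
        return "\n".join(lines)
    omitted = len(lines) - max_lines
    return f"[... {omitted} earlier line(s) omitted ...]\n" + "\n".join(lines[-max_lines:])
```

In the loop, this would wrap execution_result['stderr'] before it is placed into the follow-up message.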

How Do You Validate Success Beyond Just Exit Codes?

A zero exit code means the code didn't crash, but that doesn't guarantee it did what you wanted. For a self-debugging agent to be truly useful, you need output validation.

The simplest approach is to check for expected output patterns. If you asked the agent to calculate something, verify the result appears in stdout. If you wanted it to create a file, check that the file exists and isn't empty. Add validation logic after the execution step but before declaring success.

Here's an enhanced validation example for a data analysis task:


def validate_output(task_description, execution_result):
    """Validate that output matches task requirements."""
    if not execution_result['success']:
        return False
    
    output = execution_result['stdout'].strip()
    
    if 'calculate' in task_description.lower() or 'compute' in task_description.lower():
        return bool(re.search(r'\d+', output))
    
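    if 'create' in task_description.lower() and 'file' in task_description.lower():
        # 'file_created' is not set by execute_code_safely; the caller must add it,
        # for example after checking os.path.exists() on the expected path.
        return len(output) > 0 or execution_result.get('file_created', False)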
    
    return len(output) > 0

This validation catches cases where code runs but produces empty output or, for calculation tasks, output with no number in it. You can extend it with task-specific checks. For API integrations, verify the response structure. For data processing, check that output has the expected number of rows or columns.

Roughly 15-20% of "successful" executions fail validation on first pass, which is why this step matters. The agent can then retry with feedback like "Code ran but produced no output" rather than falsely reporting success.
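
One way to turn a failed check into retry feedback is to have the validator return a message alongside the pass/fail flag. The sketch below assumes the result dictionary shape returned by execute_code_safely; the function name check_result and the expected_pattern parameter are hypothetical.

```python
import re
from typing import Optional, Tuple


def check_result(execution_result: dict,
                 expected_pattern: Optional[str] = None) -> Tuple[bool, str]:
    """Return (passed, feedback); feedback is the message to send back on failure."""
    if not execution_result["success"]:
        return False, f"The code produced this error:\n\n{execution_result['stderr']}"

    output = execution_result["stdout"].strip()
    if not output:
        return False, "Code ran but produced no output. Print the result to stdout."
    if expected_pattern and not re.search(expected_pattern, output):
        return False, (
            "Code ran, but the output did not match the expected pattern "
            f"{expected_pattern!r}. Output was:\n\n{output}"
        )
    return True, ""
```

In the retry loop, a failed check would append the feedback string as the next user message in place of the raw stderr, so validation failures and crashes go through the same path.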

What Are the Real-World Use Cases and Limitations?

Self-debugging agents excel at self-contained tasks with clear success criteria. Data analysis scripts are perfect: "Read this CSV, calculate monthly averages, and output a summary." The agent can write pandas code, hit an import error, add the import, hit a column name typo, fix it, and deliver working results.

API integrations work well too. "Call this weather API and format the response as JSON" gives the agent enough structure to debug authentication issues, parse errors, and formatting problems. You can chain this with AI agents that automate repetitive tasks to build more complex workflows.

File manipulation and automation tasks are also strong candidates. Converting formats, renaming batches of files, or scraping structured data from HTML all fit the pattern. The agent can see file not found errors, permission issues, or parsing failures and correct them.

The limitations show up with stateful applications, databases, and complex dependencies. If your code needs to connect to a production database, the agent can't safely retry destructive operations. If it requires API keys or credentials, you need secure injection methods. If it depends on system libraries or compiled extensions, installation errors are hard to fix programmatically.

You also hit walls with vague requirements. "Build a web scraper" is too open-ended. The agent will make assumptions, and when those assumptions are wrong, error messages won't provide enough guidance. Specific tasks like "Scrape product names and prices from this HTML structure" work far better.

For developers building agentic AI infrastructure for production use, you'll want additional safeguards: logging all attempts, setting cost limits (each retry consumes tokens), and implementing human-in-the-loop approval for certain operations. The agent should assist and accelerate, not run completely unsupervised in critical systems.
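
As a starting point for the logging and cost-limit safeguards, a small tracker can record each attempt and signal when a token budget is spent. Everything here is a hypothetical sketch: the class name, its fields, and the budget value are assumptions. With the OpenAI client used earlier, the per-call token count would come from response.usage.total_tokens.

```python
import json
import time
from dataclasses import dataclass, field


@dataclass
class RunBudget:
    """Track attempts and token use for one task (illustrative helper)."""
    max_tokens: int
    used_tokens: int = 0
    records: list = field(default_factory=list)

    def record(self, attempt: int, tokens: int, success: bool, error: str = "") -> None:
        """Log one attempt and add its token use to the running total."""
        self.used_tokens += tokens
        self.records.append({
            "attempt": attempt,
            "tokens": tokens,
            "success": success,
            "error": error,
            "timestamp": time.time(),
        })

    def exhausted(self) -> bool:
        """True once the task has used up its token budget."""
        return self.used_tokens >= self.max_tokens

    def to_jsonl(self) -> str:
        """Serialize the attempt log as JSON Lines for later review."""
        return "\n".join(json.dumps(r) for r in self.records)


if __name__ == "__main__":
    budget = RunBudget(max_tokens=5000)
    budget.record(attempt=1, tokens=3000, success=False, error="NameError")
    budget.record(attempt=2, tokens=2500, success=True)
    print(budget.exhausted())
    print(budget.to_jsonl())
```

Checking exhausted() at the top of each loop iteration gives you a second stopping condition alongside the retry limit.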

When Should You Choose This Over Standard AI Coding Assistants?

Standard AI coding assistants like GitHub Copilot and Cursor are better for interactive development where you're writing code yourself and want suggestions. They integrate into your editor, understand your codebase context, and help you write faster. But they don't execute code or handle errors autonomously.

A self-debugging agent makes sense when you need working output, not just code suggestions. If you're a business analyst who needs a one-off data transformation, you don't want to learn Python error messages. You want to describe the task and get a working script. Simple as that.

The agent approach also wins for batch operations. If you need to generate 50 different data processing scripts for different departments, the autonomous loop means you can queue tasks and walk away. Each one iterates to success without your involvement. This is particularly relevant for teams exploring how to implement AI tools in small business settings where technical resources are limited.

Cost is a consideration. A self-debugging agent uses more tokens per task because of retry attempts. A simple task might use 2,000 tokens with a standard assistant but 8,000 tokens across three retry cycles with an agent. For high-volume use, this adds up. For occasional automation needs, the time savings outweigh the token costs.
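
Token use grows faster than the retry count because the full message history is re-sent on every attempt. The back-of-the-envelope estimate below uses made-up message sizes (hypothetical numbers, not measurements) purely to show the shape of that growth.

```python
def estimate_total_tokens(base_prompt: int, response: int,
                          error_feedback: int, attempts: int) -> int:
    """Estimate total tokens when the whole history is re-sent on each attempt.

    All sizes are in tokens and are assumed to be constant per message,
    which is a simplification.
    """
    total = 0
    history = base_prompt  # system prompt plus task description
    for _ in range(attempts):
        total += history + response           # input re-sent plus new output
        history += response + error_feedback  # history grows after each failure
    return total


if __name__ == "__main__":
    # Hypothetical sizes: 300-token prompt, 500-token replies, 200-token errors.
    print(estimate_total_tokens(300, 500, 200, attempts=1))  # 800
    print(estimate_total_tokens(300, 500, 200, attempts=3))  # 4500
```

With these assumed sizes, three attempts cost more than five times as much as one, not three times.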

The choice depends on your role and workflow. Developers actively writing code benefit more from assistants. Non-technical users needing working automation benefit more from agents. If you're somewhere in between, you'll likely use both for different tasks.

Look, building a self-debugging AI coding agent isn't about replacing human developers or eliminating all manual work. It's about automating the tedious cycle of running code, reading errors, and making small fixes until something works. For the right tasks, this pattern turns hours of frustration into minutes of autonomous iteration. Start with simple, self-contained scripts to learn the pattern, then gradually expand to more complex use cases as you understand the limitations and strengths. The code examples above give you everything you need to build a working prototype today.
