AutoResearch: Autonomous Paper-Writing Agent with Claude
White Paper

Jake McCluskey

Source post: @datasciencebrain Telegram (crosspost from Hasan Toor on X)
Scraped claim: "Takes a research idea and outputs a full academic paper" with genuine citations, experiments, and conference-ready LaTeX, no human intervention.
Stack built here: Claude + LangGraph + arXiv API + Tavily search + LaTeX

The core idea

A single-prompt LLM call can give you a blog post. A research paper needs:

  1. Literature review grounded in real citations
  2. A genuine research question / hypothesis
  3. Experiments or analysis: actual runs, not made-up numbers
  4. Structured LaTeX output: abstract, intro, methods, results, conclusion
  5. Iterative refinement: the first draft is always rough

This is an agent orchestration problem, not a prompting problem. LangGraph is the right tool. Claude is the right brain (strong at structured writing, code, and reasoning over long contexts).

Architecture

   topic
     │
     ▼
 [planner]────────▶ research_questions, sections_plan
     │
     ▼
 [literature_searcher]──▶ arXiv + web → papers + bibtex
     │
     ▼
 [experiment_designer]──▶ executable Python
     │
     ▼
 [experiment_runner] ───▶ results.json + plots
     │
     ▼
 [writer]──────────▶ LaTeX draft per section
     │
     ▼
 [reviewer]────────▶ critique
     │
     ├─ needs_work ──▶ [writer] (revise)
     └─ approved ───▶ [compiler]──▶ paper.pdf
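
The control flow in the diagram can be traced without any framework. The sketch below is a plain-Python stand-in, not the LangGraph implementation: trace_pipeline and its approve_on_review parameter are hypothetical names, used only to count how many node executions a run takes under a 2-revision cap.

```python
from typing import List, Optional

# Node names mirror the diagram; this is a framework-free illustration only.
LINEAR_NODES = ["planner", "literature_searcher", "experiment_designer", "experiment_runner"]

def trace_pipeline(approve_on_review: Optional[int], max_revisions: int = 2) -> List[str]:
    """Return the order in which nodes would run.

    approve_on_review: 1-based index of the review that approves the draft,
    or None if the reviewer never approves (hypothetical parameter).
    """
    trace = list(LINEAR_NODES)
    revisions = 0
    reviews = 0
    while True:
        trace.append("writer")
        trace.append("reviewer")
        reviews += 1
        approved = approve_on_review is not None and reviews >= approve_on_review
        if approved or revisions >= max_revisions:
            break  # either the draft passed or the revision cap was reached
        revisions += 1
    trace.append("compiler")
    return trace

if __name__ == "__main__":
    print(len(trace_pipeline(approve_on_review=1)))     # approved on the first review
    print(len(trace_pipeline(approve_on_review=None)))  # never approved, cap applies
```

With the cap at 2, the worst case is three writer/reviewer passes. The wired graph adds one counter node per revision on top of this, which should still sit comfortably under the recursion_limit of 25 used in the run call.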

Complete implementation

1. Install

pip install langgraph anthropic arxiv tavily-python matplotlib numpy scikit-learn
# LaTeX: install TeXLive or MiKTeX locally for pdflatex

2. State + setup

import os, subprocess, json
from typing import List, Dict
from typing_extensions import TypedDict
from langgraph.graph import StateGraph, START, END
import anthropic, arxiv
from tavily import TavilyClient

claude = anthropic.Anthropic()
tavily = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])
MODEL = "claude-opus-4-7"

class PaperState(TypedDict):
    topic: str
    research_questions: List[str]
    outline: Dict[str, str]      # section -> what it should cover
    papers: List[Dict]           # {title, authors, year, abstract, bibtex}
    experiment_code: str
    experiment_results: Dict
    draft: Dict[str, str]        # section -> latex
    critique: str
    revision_count: int
    final_tex: str
    pdf_path: str

def ask_claude(system: str, user: str, max_tokens: int = 4096) -> str:
    resp = claude.messages.create(
        model=MODEL, max_tokens=max_tokens,
        system=system,
        messages=[{"role": "user", "content": user}],
    )
    return resp.content[0].text

3. Planner: decompose topic into research questions and outline

def planner(state: PaperState) -> Dict:
    out = ask_claude(
        system="You are a research planner. Output ONLY valid JSON.",
        user=(
            f"Topic: {state['topic']}\n\n"
            "Produce:\n"
            '1. "research_questions": 2-4 specific, testable questions\n'
            '2. "outline": {abstract, introduction, related_work, methods, experiments, results, discussion, conclusion}, '
            "one sentence each describing what that section should cover\n\n"
            "Return JSON only."
        ),
    )
    parsed = json.loads(out)
    return {
        "research_questions": parsed["research_questions"],
        "outline": parsed["outline"],
        "revision_count": 0,
    }
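
json.loads raises if the model wraps its JSON in a Markdown fence or adds a sentence around it, even when told to return JSON only. A minimal sketch of a more forgiving parser follows. extract_json is a hypothetical helper, not part of the original stack, and it assumes the reply contains one top-level JSON object.

```python
import json
from typing import Any, Dict

def extract_json(text: str) -> Dict[str, Any]:
    """Parse the first JSON object found anywhere in a model reply."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever text follows it (closing fences, prose).
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # This brace did not open valid JSON; try the next one.
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")
```

Under these assumptions, the planner and reviewer could call extract_json(out) where they currently call json.loads(out).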

4. Literature searcher: arXiv + web, with real citations

def literature_searcher(state: PaperState) -> Dict:
    papers = []
    seen_ids = set()         # the same paper can match several questions
    client = arxiv.Client()  # Client.results() replaces the deprecated Search.results()

    # arXiv: academic source
    for q in state["research_questions"]:
        search = arxiv.Search(query=q, max_results=5, sort_by=arxiv.SortCriterion.Relevance)
        for p in client.results(search):
            arxiv_id = p.entry_id.split("/")[-1]
            if arxiv_id in seen_ids:
                continue  # duplicate keys would produce duplicate \bibitem entries
            seen_ids.add(arxiv_id)
            authors = [a.name for a in p.authors]
            papers.append({
                "title": p.title,
                "authors": authors,
                "year": p.published.year,
                "abstract": p.summary[:500],
                "arxiv_id": arxiv_id,
                "bibtex": (
                    f"@article{{{arxiv_id},\n"
                    f"  title={{{p.title}}},\n"
                    f"  author={{{' and '.join(authors)}}},\n"
                    f"  year={{{p.published.year}}},\n"
                    f"  eprint={{{arxiv_id}}},\n"
                    f"  archivePrefix={{arXiv}}\n}}"
                ),
            })

    # Tavily: recent web context (blogs, industry reports)
    for q in state["research_questions"]:
        for r in tavily.search(q, max_results=3)["results"]:
            papers.append({"title": r["title"], "url": r["url"], "abstract": r["content"][:400]})

    return {"papers": papers}

5. Experiment designer + runner: real code, real numbers

def experiment_designer(state: PaperState) -> Dict:
    code = ask_claude(
        system=(
            "You are a research scientist. Write a Python script that runs a real, small "
            "experiment answering the research questions. Use numpy, matplotlib, or sklearn. "
            "Save numerical results to `results.json` and plots to `fig_*.png`. "
            "Keep runtime under 60 seconds. Output ONLY the Python code, no markdown."
        ),
        user=f"Research questions: {state['research_questions']}\nTopic: {state['topic']}",
    )
    # Strip markdown fences if present
    code = code.replace("```python", "").replace("```", "").strip()
    return {"experiment_code": code}

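def experiment_runner(state: PaperState) -> Dict:
    with open("experiment.py", "w") as f:
        f.write(state["experiment_code"])

    # Remove any results.json left by an earlier run so stale numbers are never reused
    if os.path.exists("results.json"):
        os.remove("results.json")

    try:
        subprocess.run(["python", "experiment.py"], check=True, timeout=120,
                       capture_output=True, text=True)
        with open("results.json") as f:
            results = json.load(f)
    except subprocess.CalledProcessError as e:
        # str(e) omits the traceback; keep the tail of stderr so the writer can report the cause
        results = {"error": e.stderr[-2000:], "fallback": "experiment failed, will note in paper"}
    except Exception as e:
        results = {"error": str(e), "fallback": "experiment failed, will note in paper"}

    return {"experiment_results": results}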
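
One way to keep runs from interfering with each other is to execute the generated script in a fresh temporary directory. The sketch below assumes that layout. run_in_tempdir is a hypothetical helper, and a temporary directory is not a security sandbox: the generated code still runs with the caller's permissions.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Any, Dict

def run_in_tempdir(code: str, timeout: int = 120) -> Dict[str, Any]:
    """Run generated code in an empty directory and load its results.json."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "experiment.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],  # same interpreter as the agent
                cwd=workdir, timeout=timeout,
                capture_output=True, text=True,
            )
        except subprocess.TimeoutExpired:
            return {"error": f"timed out after {timeout}s"}
        if proc.returncode != 0:
            # Keep only the tail of stderr, where the traceback ends.
            return {"error": proc.stderr[-2000:]}
        results_file = Path(workdir) / "results.json"
        if not results_file.exists():
            return {"error": "script finished but wrote no results.json"}
        return json.loads(results_file.read_text(encoding="utf-8"))
```

Plots written to the temporary directory are deleted when the block exits, so any fig_*.png files the paper needs would have to be copied out first.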

6. Writer: LaTeX per section

LATEX_SYSTEM = (
    "You are an academic paper writer. Output only LaTeX body content "
    "(no preamble, no \\begin{document}). Use \\cite{arxiv_id} for citations. "
    "Be precise, hedged, and quantitative. No marketing language."
)

def writer(state: PaperState) -> Dict:
    draft = {}
    context = (
        f"Topic: {state['topic']}\n"
        f"Research questions: {state['research_questions']}\n"
        f"Experiment results: {json.dumps(state['experiment_results'])}\n"
        f"Available citations:\n"
        + "\n".join(f"- {p['title']} ({p.get('arxiv_id','web')})" for p in state['papers'][:20])
    )
    if state.get("critique"):
        context += f"\n\nPREVIOUS CRITIQUE TO ADDRESS:\n{state['critique']}"

    for section, desc in state["outline"].items():
        section_tex = ask_claude(
            system=LATEX_SYSTEM,
            user=f"{context}\n\nWrite the '{section}' section. Target: {desc}. 150-400 words.",
            max_tokens=2048,
        )
        draft[section] = section_tex

    return {"draft": draft}
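
The citation list passed to the writer labels web results with the placeholder key web, which a model could end up citing even though no such bibliography entry exists. A sketch of one alternative, assuming the record shape built by literature_searcher: show a cite key only for entries that will appear in the bibliography. format_citation_list is a hypothetical helper.

```python
from typing import Dict, List

def format_citation_list(papers: List[Dict], limit: int = 20) -> str:
    """Render retrieved sources for the writer prompt.

    Only arXiv records (those carrying an 'arxiv_id') get a cite key;
    web results are listed as background that must not be cited by key.
    """
    citable, background = [], []
    for p in papers:
        if "arxiv_id" in p:
            citable.append(f"- \\cite{{{p['arxiv_id']}}}: {p['title']}")
        else:
            background.append(f"- {p['title']} ({p.get('url', 'no url')})")
    lines = ["Citable papers (use these keys only):"] + citable[:limit]
    if background:
        lines += ["Background web sources (do not \\cite):"] + background[:limit]
    return "\n".join(lines)
```

The returned string could replace the joined list in the writer's context, keeping the "Available citations" idea while making the allowed keys explicit.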

7. Reviewer: critique loop (self-correction)

def reviewer(state: PaperState) -> Dict:
    full = "\n\n".join(f"=== {k} ===\n{v}" for k, v in state["draft"].items())
    critique = ask_claude(
        system=(
            "You are a tough peer reviewer. Identify: unsupported claims, missing citations, "
            "logical gaps, overclaims, missing limitations. Return JSON: "
            '{"approved": bool, "issues": [str], "suggestions": [str]}'
        ),
        user=full,
    )
    parsed = json.loads(critique)
    return {"critique": json.dumps(parsed)}

def should_revise(state: PaperState) -> str:
    critique = json.loads(state["critique"])
    if critique["approved"] or state["revision_count"] >= 2:
        return "compile"
    return "revise"

def increment_revision(state: PaperState) -> Dict:
    return {"revision_count": state["revision_count"] + 1}
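
should_revise trusts critique["approved"] as given, so a reply such as "approved": "false" (a non-empty string) would count as approval. A minimal fail-closed sketch, assuming the critique is stored as a JSON string as in reviewer; is_approved is a hypothetical helper.

```python
import json

def is_approved(critique_json: str) -> bool:
    """Return True only when the critique parses and 'approved' is the boolean true."""
    try:
        critique = json.loads(critique_json)
    except (json.JSONDecodeError, TypeError):
        return False  # unparseable critique: treat as not approved
    return isinstance(critique, dict) and critique.get("approved") is True
```

Anything other than a literal true, including a missing key, is treated as "needs work", and the revision cap still guarantees the loop ends.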

8. Compiler: assemble + pdflatex

LATEX_PREAMBLE = r"""
\documentclass[11pt]{article}
\usepackage[margin=1in]{geometry}
\usepackage{graphicx,amsmath,cite,hyperref}
\title{%s}
\author{AutoResearch Agent (Claude)}
\date{\today}
\begin{document}
\maketitle
"""

def compiler(state: PaperState) -> Dict:
    body = LATEX_PREAMBLE % state["topic"]
    for section in ["abstract", "introduction", "related_work", "methods",
                    "experiments", "results", "discussion", "conclusion"]:
        if section in state["draft"]:
            body += f"\n\\section{{{section.replace('_',' ').title()}}}\n{state['draft'][section]}\n"

    # Bibliography
    body += "\n\\begin{thebibliography}{99}\n"
    for p in state["papers"]:
        if "bibtex" in p:
            body += f"\\bibitem{{{p['arxiv_id']}}} {p['authors'][0]} et al. ({p['year']}). \\emph{{{p['title']}}}.\n"
    body += "\\end{thebibliography}\n\\end{document}\n"

    with open("paper.tex", "w", encoding="utf-8") as f:
        f.write(body)

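    # Two passes: the first writes paper.aux, the second resolves \cite references
    for _ in range(2):
        subprocess.run(["pdflatex", "-interaction=nonstopmode", "paper.tex"], capture_output=True)
    # pdflatex failures do not raise here, so only report a path if the PDF exists
    return {"final_tex": body, "pdf_path": "paper.pdf" if os.path.exists("paper.pdf") else ""}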
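
Topics and paper titles often contain characters that LaTeX treats specially (&, %, _, #), and inserting them raw into \title{} or a \bibitem can break compilation. A minimal escaping sketch follows. escape_latex is a hypothetical helper, meant for plain-text fields only, not for the model-written section bodies, which are already LaTeX.

```python
import re

# Characters LaTeX treats specially in text mode, mapped to safe replacements.
_LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}
_PATTERN = re.compile("|".join(re.escape(ch) for ch in _LATEX_SPECIALS))

def escape_latex(text: str) -> str:
    """Escape LaTeX special characters in one pass, so output is never re-escaped."""
    return _PATTERN.sub(lambda m: _LATEX_SPECIALS[m.group(0)], text)
```

For example, the title could be built as LATEX_PREAMBLE % escape_latex(state["topic"]), and the same call could wrap the title and author name in each \bibitem line.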

9. Wire the graph

g = StateGraph(PaperState)
g.add_node("plan", planner)
g.add_node("search_lit", literature_searcher)
g.add_node("design_exp", experiment_designer)
g.add_node("run_exp", experiment_runner)
g.add_node("write", writer)
g.add_node("review", reviewer)
g.add_node("revise_counter", increment_revision)
g.add_node("compile", compiler)

g.add_edge(START, "plan")
g.add_edge("plan", "search_lit")
g.add_edge("search_lit", "design_exp")
g.add_edge("design_exp", "run_exp")
g.add_edge("run_exp", "write")
g.add_edge("write", "review")
g.add_conditional_edges("review", should_revise, {"revise": "revise_counter", "compile": "compile"})
g.add_edge("revise_counter", "write")  # loop back, writer uses state['critique']
g.add_edge("compile", END)

app = g.compile()

# Run
result = app.invoke(
    {"topic": "Are smaller LLMs with chain-of-thought competitive with larger LLMs on math word problems?"},
    {"recursion_limit": 25},
)
print("Paper compiled:", result["pdf_path"])

Safety notes and limitations

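  • Hallucinated citations are the #1 failure mode. The arXiv search here returns real papers, and the writer prompt lists them as the available citations. Never let the writer invent citation keys: the prompt alone does not enforce this, so check every \cite key against the retrieved set before compiling.
  • Experiments can fail silently. The experiment_runner catches errors and passes them to the writer in experiment_results, and the writer must acknowledge them. Don't let the reviewer pass a paper with fabricated results.
  • Reviewer verdicts are unreliable. An LLM reviewer can be biased toward approval or never approve at all. The revision_count >= 2 cap in should_revise avoids infinite loops; when it forces a compile, flag the unresolved issues as limitations.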
  • This is not peer review. Outputs are research drafts, useful for accelerating writing, not replacing scholarly work.
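
The first note above can be turned into a check that runs before compilation. The sketch below assumes the draft and papers shapes from PaperState; find_unknown_citations is a hypothetical helper, and its regex covers common \cite variants (such as \citep and optional arguments) rather than every LaTeX citation command.

```python
import re
from typing import Dict, List, Set

# Matches \cite, \citep, \citet, starred forms, and optional [..] arguments.
CITE_RE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")

def find_unknown_citations(draft: Dict[str, str], papers: List[Dict]) -> Dict[str, Set[str]]:
    """Map each section to the cite keys it uses that were never retrieved."""
    allowed = {p["arxiv_id"] for p in papers if "arxiv_id" in p}
    unknown: Dict[str, Set[str]] = {}
    for section, tex in draft.items():
        used = {
            key.strip()
            for group in CITE_RE.findall(tex)
            for key in group.split(",")  # \cite{a, b} carries several keys
            if key.strip()
        }
        missing = used - allowed
        if missing:
            unknown[section] = missing
    return unknown
```

A non-empty result could be appended to the critique so the writer removes or replaces the offending keys on the next revision.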

Why Claude is the right choice for this

  1. Long context (200K). The reviewer can see the entire paper draft at once.
  2. Structured output. JSON for plans, LaTeX for sections, both reliable.
  3. Code generation. The experiment designer needs runnable Python and Claude is strong here.
  4. Hedged writing. Claude's default style suits academic voice (vs. ChatGPT's promotional tendency).

Resume angle

"Built an autonomous research agent with LangGraph + Claude: planner decomposes a topic, arXiv-grounded literature searcher retrieves real citations, code-gen agent designs and runs experiments, writer produces LaTeX per section, reviewer enforces self-critique loops, pdflatex compiles the output. End-to-end: idea to conference-format PDF in roughly 5 minutes."

Common questions

What models and tools are used in the AutoResearch agent stack?

The stack uses Claude Opus 4.7 as the core LLM, LangGraph for agent orchestration, the arXiv API for retrieving academic papers, Tavily for web search, and pdflatex for compiling the final PDF. Python libraries including anthropic, arxiv, tavily-python, matplotlib, and numpy support the implementation.

How does AutoResearch prevent hallucinated citations in the generated paper?

AutoResearch retrieves real papers from arXiv, with actual arXiv IDs and BibTeX entries, before writing begins. The writer prompt lists only this retrieved set as available citations, and citation keys are derived from the arXiv IDs in the papers list. The constraint lives in the prompt: the code as written does not verify cite keys, so a check against the retrieved corpus is still needed before compiling.

How many revision cycles does the agent perform before finalizing the paper?

The reviewer node runs after the initial draft and checks for unsupported claims, missing citations, and logical gaps. If the reviewer does not approve the draft, the agent increments a revision counter and loops back to the writer with critique context. The loop terminates after 2 revisions or when the reviewer approves, whichever comes first.

Does the agent run real experiments or use synthetic data?

The agent generates and executes real Python code that runs a small experiment, typically using numpy, matplotlib, or sklearn. The experiment runner saves numerical results to a JSON file and plots to PNG files, which the writer then references. If the experiment fails, the error is captured and the writer must acknowledge the failure in the paper text.

Why is Claude chosen over other LLMs for this research agent?

Claude offers a 200K token context window that lets the reviewer see the entire draft at once, reliable structured output for JSON plans and LaTeX sections, strong code generation for executable Python experiments, and a naturally hedged writing style that fits academic tone better than more promotional alternatives.
