White Paper

AutoResearch: Autonomous Paper-Writing Agent with Claude

Jake McCluskey

Source post: @datasciencebrain Telegram (crosspost from Hasan Toor on X)
Scraped claim: "Takes a research idea and outputs a full academic paper" with genuine citations, experiments, and conference-ready LaTeX, no human intervention.
Stack built here: Claude + LangGraph + arXiv API + Tavily search + LaTeX

The core idea

A single-prompt LLM call can give you a blog post. A research paper needs:

  1. Literature review grounded in real citations
  2. A genuine research question / hypothesis
  3. Experiments or analysis backed by actual runs, not made-up numbers
  4. Structured LaTeX output: abstract, intro, methods, results, conclusion
  5. Iterative refinement, because the first draft is always rough

This is an agent orchestration problem, not a prompting problem. LangGraph is the right tool. Claude is the right brain (strong at structured writing, code, and reasoning over long contexts).

Architecture

   topic
     │
     ▼
 [planner]────────▶ research_questions, sections_plan
     │
     ▼
 [literature_searcher]──▶ arXiv + web → papers + bibtex
     │
     ▼
 [experiment_designer]──▶ executable Python
     │
     ▼
 [experiment_runner] ───▶ results.json + plots
     │
     ▼
 [writer]──────────▶ LaTeX draft per section
     │
     ▼
 [reviewer]────────▶ critique
     │
     ├─ needs_work ──▶ [writer] (revise)
     └─ approved ───▶ [compiler]──▶ paper.pdf

Complete implementation

1. Install

pip install langgraph anthropic arxiv tavily-python matplotlib numpy scikit-learn
# LaTeX: install TeXLive or MiKTeX locally for pdflatex
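
Both API clients read keys from the environment; set them before running (the key values here are placeholders):

export ANTHROPIC_API_KEY="sk-ant-..."   # picked up implicitly by anthropic.Anthropic()
export TAVILY_API_KEY="tvly-..."        # read explicitly in the setup code below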

2. State + setup

import os, sys, subprocess, json
from typing import List, Dict
from typing_extensions import TypedDict
from langgraph.graph import StateGraph, START, END
import anthropic, arxiv
from tavily import TavilyClient

claude = anthropic.Anthropic()
tavily = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])
MODEL = "claude-opus-4-20250514"  # any current Claude model ID from the Anthropic docs works

class PaperState(TypedDict):
    topic: str
    research_questions: List[str]
    outline: Dict[str, str]      # section -> what it should cover
    papers: List[Dict]           # {title, authors, year, abstract, bibtex}
    experiment_code: str
    experiment_results: Dict
    draft: Dict[str, str]        # section -> latex
    critique: str
    revision_count: int
    final_tex: str
    pdf_path: str

def ask_claude(system: str, user: str, max_tokens: int = 4096) -> str:
    resp = claude.messages.create(
        model=MODEL, max_tokens=max_tokens,
        system=system,
        messages=[{"role": "user", "content": user}],
    )
    return resp.content[0].text
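
Both the planner and the reviewer parse Claude's reply as JSON. Despite the "JSON only" system prompts, models occasionally wrap output in markdown fences, so a small defensive parser is worth having. This helper is an addition to the original stack; the planner and reviewer below call it in place of bare json.loads:

def parse_json(text: str) -> dict:
    """Parse model output as JSON, tolerating ```json fences the model sometimes adds."""
    cleaned = text.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.split("\n", 1)[1]    # drop the opening fence line
        cleaned = cleaned.rsplit("```", 1)[0]  # drop the closing fence
    return json.loads(cleaned)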

3. Planner: decompose topic into research questions and outline

def planner(state: PaperState) -> Dict:
    out = ask_claude(
        system="You are a research planner. Output ONLY valid JSON.",
        user=(
            f"Topic: {state['topic']}\n\n"
            "Produce:\n"
            '1. "research_questions": 2-4 specific, testable questions\n'
            '2. "outline": {abstract, introduction, related_work, methods, experiments, results, discussion, conclusion} — '
            "one sentence each describing what that section should cover\n\n"
            "Return JSON only."
        ),
    )
    parsed = parse_json(out)  # fence-tolerant helper from step 2
    return {
        "research_questions": parsed["research_questions"],
        "outline": parsed["outline"],
        "revision_count": 0,
    }
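
For the math-word-problem topic used in the run example at the end, the planner's JSON has this shape (illustrative, not a real run):

{
  "research_questions": [
    "Does chain-of-thought prompting close the accuracy gap between small and large models on math word problems?",
    "How does that gap change with problem difficulty?"
  ],
  "outline": {
    "abstract": "Summarize the question, method, and headline result.",
    "introduction": "Motivate small-model reasoning and state contributions.",
    "...": "..."
  }
}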

4. Literature searcher: arXiv + web, with real citations

def literature_searcher(state: PaperState) -> Dict:
    papers = []

    # arXiv: academic source (query through a Client; Search.results() is deprecated)
    client = arxiv.Client()
    for q in state["research_questions"]:
        search = arxiv.Search(query=q, max_results=5, sort_by=arxiv.SortCriterion.Relevance)
        for p in client.results(search):
            papers.append({
                "title": p.title,
                "authors": [a.name for a in p.authors],
                "year": p.published.year,
                "abstract": p.summary[:500],
                "arxiv_id": p.entry_id.split("/")[-1],
                "bibtex": (
                    f"@article{{{p.entry_id.split('/')[-1]},\n"
                    f"  title={{{p.title}}},\n"
                    f"  author={{{' and '.join(a.name for a in p.authors)}}},\n"
                    f"  year={{{p.published.year}}},\n"
                    f"  eprint={{{p.entry_id.split('/')[-1]}}},\n"
                    f"  archivePrefix={{arXiv}}\n}}"
                ),
            })

    # Tavily: recent web context (blogs, industry reports)
    for q in state["research_questions"]:
        for r in tavily.search(q, max_results=3)["results"]:
            papers.append({"title": r["title"], "url": r["url"], "abstract": r["content"][:400]})

    return {"papers": papers}

5. Experiment designer + runner: real code, real numbers

def experiment_designer(state: PaperState) -> Dict:
    code = ask_claude(
        system=(
            "You are a research scientist. Write a Python script that runs a real, small "
            "experiment answering the research questions. Use numpy, matplotlib, or sklearn. "
            "Save numerical results to `results.json` and plots to `fig_*.png`. "
            "Keep runtime under 60 seconds. Output ONLY the Python code, no markdown."
        ),
        user=f"Research questions: {state['research_questions']}\nTopic: {state['topic']}",
    )
    # Strip markdown fences if present
    code = code.replace("```python", "").replace("```", "").strip()
    return {"experiment_code": code}

def experiment_runner(state: PaperState) -> Dict:
    with open("experiment.py", "w") as f:
        f.write(state["experiment_code"])

    try:
        subprocess.run(["python", "experiment.py"], check=True, timeout=120, capture_output=True)
        with open("results.json") as f:
            results = json.load(f)
    except Exception as e:
        results = {"error": str(e), "fallback": "experiment failed, will note in paper"}

    return {"experiment_results": results}

6. Writer: LaTeX per section

LATEX_SYSTEM = (
    "You are an academic paper writer. Output only LaTeX body content "
    "(no preamble, no \\begin{document}). Use \\cite{arxiv_id} for citations. "
    "Be precise, hedged, and quantitative. No marketing language."
)

def writer(state: PaperState) -> Dict:
    draft = {}
    context = (
        f"Topic: {state['topic']}\n"
        f"Research questions: {state['research_questions']}\n"
        f"Experiment results: {json.dumps(state['experiment_results'])}\n"
        f"Available citations:\n"
        + "\n".join(f"- {p['title']} ({p.get('arxiv_id','web')})" for p in state['papers'][:20])
    )
    if state.get("critique"):
        context += f"\n\nPREVIOUS CRITIQUE TO ADDRESS:\n{state['critique']}"

    for section, desc in state["outline"].items():
        section_tex = ask_claude(
            system=LATEX_SYSTEM,
            user=f"{context}\n\nWrite the '{section}' section. Target: {desc}. 150-400 words.",
            max_tokens=2048,
        )
        draft[section] = section_tex

    return {"draft": draft}

7. Reviewer: critique loop (self-correction)

def reviewer(state: PaperState) -> Dict:
    full = "\n\n".join(f"=== {k} ===\n{v}" for k, v in state["draft"].items())
    critique = ask_claude(
        system=(
            "You are a tough peer reviewer. Identify: unsupported claims, missing citations, "
            "logical gaps, overclaims, missing limitations. Return JSON: "
            '{"approved": bool, "issues": [str], "suggestions": [str]}'
        ),
        user=full,
    )
    parsed = parse_json(critique)  # fence-tolerant helper from step 2
    return {"critique": json.dumps(parsed)}

def should_revise(state: PaperState) -> str:
    critique = json.loads(state["critique"])
    if critique["approved"] or state["revision_count"] >= 2:
        return "compile"
    return "revise"

def increment_revision(state: PaperState) -> Dict:
    return {"revision_count": state["revision_count"] + 1}

8. Compiler: assemble + pdflatex

LATEX_PREAMBLE = r"""
\documentclass[11pt]{article}
\usepackage[margin=1in]{geometry}
\usepackage{graphicx,amsmath,cite,hyperref}
\title{%s}
\author{AutoResearch Agent (Claude)}
\date{\today}
\begin{document}
\maketitle
"""

def compiler(state: PaperState) -> Dict:
    body = LATEX_PREAMBLE % state["topic"]
    for section in ["abstract", "introduction", "related_work", "methods",
                    "experiments", "results", "discussion", "conclusion"]:
        if section in state["draft"]:
            body += f"\n\\section{{{section.replace('_',' ').title()}}}\n{state['draft'][section]}\n"

    # Bibliography
    body += "\n\\begin{thebibliography}{99}\n"
    for p in state["papers"]:
        if "bibtex" in p:
            body += f"\\bibitem{{{p['arxiv_id']}}} {p['authors'][0]} et al. ({p['year']}). \\emph{{{p['title']}}}.\n"
    body += "\\end{thebibliography}\n\\end{document}\n"

    with open("paper.tex", "w", encoding="utf-8") as f:
        f.write(body)

    subprocess.run(["pdflatex", "-interaction=nonstopmode", "paper.tex"], capture_output=True)
    return {"final_tex": body, "pdf_path": "paper.pdf"}

9. Wire the graph

g = StateGraph(PaperState)
g.add_node("plan", planner)
g.add_node("search_lit", literature_searcher)
g.add_node("design_exp", experiment_designer)
g.add_node("run_exp", experiment_runner)
g.add_node("write", writer)
g.add_node("review", reviewer)
g.add_node("revise_counter", increment_revision)
g.add_node("compile", compiler)

g.add_edge(START, "plan")
g.add_edge("plan", "search_lit")
g.add_edge("search_lit", "design_exp")
g.add_edge("design_exp", "run_exp")
g.add_edge("run_exp", "write")
g.add_edge("write", "review")
g.add_conditional_edges("review", should_revise, {"revise": "revise_counter", "compile": "compile"})
g.add_edge("revise_counter", "write")  # loop back — writer uses state['critique']
g.add_edge("compile", END)

app = g.compile()

# Run
result = app.invoke(
    {"topic": "Are smaller LLMs with chain-of-thought competitive with larger LLMs on math word problems?"},
    {"recursion_limit": 25},
)
print("Paper compiled:", result["pdf_path"])

Safety notes and limitations

  • Hallucinated citations are the #1 failure mode. The arXiv search here returns real papers. Never let the writer invent citation keys. Always constrain to the retrieved set (the writer prompt does this), and verify mechanically, as sketched after this list.
  • Experiments can fail silently. The experiment_runner catches errors and the writer must acknowledge them. Don't let the reviewer pass a paper with fabricated results.
  • Generated code runs with your local privileges. Screen it and sandbox it (see the sketches in step 5) before pointing the agent at arbitrary topics.
  • Reviewer calibration cuts both ways. A lenient reviewer approves weak drafts; a harsh one may never approve. The hard revision_count >= 2 cap guarantees termination, ideally with unresolved critiques flagged as limitations.
  • This is not peer review. Outputs are research drafts, useful for accelerating writing, not replacing scholarly work.
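
The citation constraint in the first bullet can also be checked mechanically. A post-write validator (an addition, not in the source post) extracts every \cite key from the draft and flags any that don't map to a retrieved paper; the reviewer or compiler can reject a draft that fails it:

import re

def uncited_keys(draft: Dict[str, str], papers: List[Dict]) -> set:
    """Return \\cite keys in the draft that don't match a retrieved arXiv paper."""
    valid = {p["arxiv_id"] for p in papers if "arxiv_id" in p}
    used = set()
    for tex in draft.values():
        for group in re.findall(r"\\cite\{([^}]+)\}", tex):
            used.update(k.strip() for k in group.split(","))  # \cite{a,b} takes key lists
    return used - valid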

Why Claude is the right choice for this

  1. Long context (200K). The reviewer can see the entire paper draft at once.
  2. Structured output. JSON for plans, LaTeX for sections, both reliable.
  3. Code generation. The experiment designer needs runnable Python and Claude is strong here.
  4. Hedged writing. Claude's default style suits academic voice (vs. ChatGPT's promotional tendency).

Resume angle

"Built an autonomous research agent with LangGraph + Claude: planner decomposes a topic, arXiv-grounded literature searcher retrieves real citations, code-gen agent designs and runs experiments, writer produces LaTeX per section, reviewer enforces self-critique loops, pdflatex compiles the output. End-to-end: idea to conference-format PDF in roughly 5 minutes."