Scaling Agent Systems: The First Predictive Law

Source: Towards a Science of Scaling Agent Systems (arXiv:2512.08296) (Cornell, 2025)
Series: The 10 Agent Whitepapers Every Builder Should Read
TL;DR
Cornell's scaling-agents paper is the first public predictive framework for multi-agent LLM systems. Across 260 configurations spanning six agentic benchmarks, five canonical architectures (Single, Independent, Centralized, Decentralized, Hybrid), and three LLM families, the authors derive a model that predicts agent-system performance from coordination metrics with R-squared = 0.373 (0.413 with a task-grounded capability term) and correctly picks the best architecture for 87% of held-out configurations. Performance swings from +80.8% to -70.0% depending on architecture-task fit. The biggest finding: adding more agents isn't monotonically better. On tool-heavy tasks and in uncoordinated architectures, more agents actually degrade performance.
If you're choosing between "one big agent" and "orchestrator with N workers," this is the paper that tells you which one to pick before spending a month finding out the hard way.
1. What it is
1.1. The experiment
Yubin Kim et al. built a controlled-evaluation platform and ran 260 system configurations across:
Six agentic benchmarks covering financial reasoning, sequential planning, browsing, and code (examples include Finance-Agent, BrowseComp-Plus, PlanCraft, Workbench).
Five canonical architectures:
| Architecture | Description |
|---|---|
| Single | One agent, no peers. The baseline. |
| Independent | Multiple agents acting in parallel with no coordination. |
| Centralized | An orchestrator routes work to specialists. |
| Decentralized | Agents pass work peer-to-peer. |
| Hybrid | A mix: e.g., an orchestrator plus peer exchange between specialists. |
Three LLM families, so the law isn't one-model-specific.
1.2. Coordination metrics
The authors quantify how well a multi-agent system actually coordinates using four measurable quantities:
- Efficiency: compute spent per unit of useful work
- Overhead: compute spent on coordination/handoff (not task progress)
- Error amplification: the degree to which one agent's mistake propagates through peers
- Redundancy: the fraction of work duplicated across agents
These feed a predictive model, together with a task-grounded capability term (how strong each underlying model is on the specific task); a rough sketch of the model's form follows the numbers below. The model achieves:
- R-squared = 0.373 cross-validated (coordination metrics alone)
- R-squared = 0.413 with the task-grounded capability term
- 87% correct architecture selection on held-out configs
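In rough form, the prediction is a weighted combination of the four coordination metrics plus the capability term. The exact functional form and fitted weights are the paper's and aren't reproduced here; schematically:
score ≈ b + w1·efficiency + w2·overhead + w3·error_amplification + w4·redundancy + w5·capability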
1.3. The three empirical patterns
The paper surfaces three durable findings that should change how you design multi-agent systems.
1. Diminishing returns past the single-agent threshold. Once the single-agent score on a task is high enough, multi-agent systems don't help; they just add coordination overhead.
2. Tool-intensive tasks penalize multi-agent designs. When a task requires frequent tool calls, handoffs across agents break tool state continuity. A single agent doing all the calls wins.
3. Decentralized architectures propagate errors more. Without a centralized verifier, one agent's mistake cascades through peers. Centralized or hybrid architectures with an orchestrator-as-verifier catch errors earlier.
1.4. The delta range: +80.8% to -70.0%
Same task, same models, different architecture: performance swings by about 150 points. Architecture selection is not a detail. It is the choice.
2. Why it matters
2.1. We finally have a principled architecture choice
For two years, teams have picked "single agent vs. multi-agent" based on vibes, Twitter threads, or whatever CrewAI/AutoGen template was convenient. This paper gives you a decision framework grounded in measurements.
2.2. It explains why "just add more agents" keeps failing
Almost every team that bolts on a second agent expecting improvements gets disappointed. This paper says why: the architecture-task fit matters more than the number of agents. A centralized duo can beat a decentralized swarm.
2.3. It gives you scaling laws you can actually use
The classical LLM scaling laws (Chinchilla, etc.) tell you how to spend training compute. This is the first equivalent framework for inference-time multi-agent compute. You can now reason about "should I spend my compute on thinking longer or on more agents?" with some empirical grounding.
2.4. It formalizes what practitioners were already learning the hard way
- Tool-heavy task → single agent. (Now: a measurable reason.)
- Lots of easy parallel subtasks → independent agents. (Now: quantified returns.)
- Complex sequential plan → centralized orchestrator with verifier. (Now: lowest error amplification.)
3. How to do it
3.1. Pick architecture before you pick prompts
Before implementing, answer:
- Can subtasks be parallelized, and are they independent?
- Does the task require many tool calls with persistent state?
- Is there a verification step that benefits from a central "judge"?
- Is the single-agent score already high?
Decision rubric (distilled from the paper's findings; a code sketch follows it):
If single-agent ≥ 0.8 on this task:
    → Stay single. Save coordination overhead.
Elif task has many independent subtasks and tools are per-subtask:
    → Independent (parallel, no coordination).
Elif task needs verification or plan-following:
    → Centralized (orchestrator + specialists).
Elif task involves peer collaboration with little shared state:
    → Decentralized.
Else:
    → Hybrid (orchestrator for plan, peers for exchange).
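That rubric is simple enough to encode directly. A minimal sketch in Python: the 0.8 threshold is lifted from the rubric above, while TaskProfile and its field names are illustrative assumptions, not the paper's terminology.
# Illustrative encoding of the rubric above. The 0.8 threshold comes from the
# rubric; the profile fields are assumed names, not the paper's terminology.
from dataclasses import dataclass

@dataclass
class TaskProfile:
    single_agent_score: float     # baseline score on this task (0-1)
    independent_subtasks: bool    # parallelizable, with tools scoped per subtask
    needs_verification: bool      # benefits from a central judge / plan-follower
    peer_collaboration: bool      # agents exchange work with little shared state

def pick_architecture(task: TaskProfile) -> str:
    if task.single_agent_score >= 0.8:
        return "single"           # already good enough; skip coordination overhead
    if task.independent_subtasks:
        return "independent"      # parallel fan-out, no coordination
    if task.needs_verification:
        return "centralized"      # orchestrator + specialists, verifier catches errors
    if task.peer_collaboration:
        return "decentralized"    # peer-to-peer handoffs
    return "hybrid"               # orchestrator for the plan, peers for exchange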
3.2. Instrument the coordination metrics in your system
The paper's four metrics are tractable to log. Do it.
# Per-task trace:
{
    "task_id": "...",
    "tokens_per_step": {...},           # efficiency
    "handoff_tokens": ...,              # overhead
    "post_handoff_error_rate": ...,     # error amplification proxy
    "duplicate_step_count": ...,        # redundancy
    "final_score": ...,
}
Fit the paper's predictive model to your own traces. After 200 to 300 runs, you can predict which architecture wins for a new task class before running it.
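A minimal sketch of that fitting step, assuming one JSON file per run in the trace schema above. The paper's functional form isn't reproduced here; this is an ordinary least-squares stand-in using scikit-learn, and coordination_features is a hypothetical helper.
# Sketch only: derive the four coordination metrics from each trace and fit a
# simple linear model; swap in the paper's functional form if you implement it.
import json
from sklearn.linear_model import LinearRegression

def coordination_features(trace):
    # Turn one logged trace (schema above) into the four coordination metrics.
    steps = list(trace["tokens_per_step"].values())
    return [
        sum(steps) / max(len(steps), 1),   # efficiency proxy: mean tokens per step
        trace["handoff_tokens"],           # overhead
        trace["post_handoff_error_rate"],  # error amplification proxy
        trace["duplicate_step_count"],     # redundancy
    ]

def fit_score_model(trace_paths):
    X, y = [], []
    for path in trace_paths:
        with open(path) as f:
            trace = json.load(f)
        X.append(coordination_features(trace))
        y.append(trace["final_score"])
    return LinearRegression().fit(X, y)   # stand-in for the paper's model

# model = fit_score_model(glob.glob("traces/*.json"))
# model.predict([coordination_features(candidate_trace)])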
3.3. Example: "Code review agent", which architecture?
Task: review a PR for style, logic, security.
Analysis:
- Three subtasks that are parallelizable, so Independent is viable
- Each subtask uses its own tools (style = linter, logic = tests, security = scanner), so the work is tool-heavy per subtask, not across subtasks
- The final verdict is a synthesis, so it benefits from a verifier
Recommendation per the paper's framework: Hybrid. Three independent reviewers fan out, orchestrator fans in and synthesizes.
# Hybrid: fan-out + fan-in. `orch` and the three reviewer agents are assumed
# agent clients exposing async plan/review/synthesize methods.
import asyncio

async def review_pr(pr):
    plan = await orch.plan(pr)            # orchestrator decides what each reviewer checks
    reviews = await asyncio.gather(       # fan-out: the three reviewers run in parallel
        style_agent.review(pr),
        logic_agent.review(pr),
        security_agent.review(pr),
    )
    return await orch.synthesize(pr, reviews)   # fan-in: orchestrator verifies and merges
That beats a decentralized pattern where the three reviewers chat with each other (error amplification) and beats a single-agent pattern that serializes the three checks (efficiency loss).
3.4. Example: "Browse-and-book flight", which architecture?
Task: find a flight, hold a seat, pay.
Analysis:
- Tool-intensive (browser, payments, calendar), so the multi-agent tool penalty applies
- Sequential, not parallel, so Independent doesn't help
- State lives in the browser session; a handoff breaks it
Recommendation: Single. Or Single plus human-in-the-loop (see Magentic-UI). Adding agents here makes it worse.
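If you keep it single-agent, the whole flow is one loop that owns the browser session and every tool call, so nothing is lost to handoffs. A minimal sketch; llm and browser here are hypothetical placeholders, not any particular library.
# Single agent: one loop, one session, no handoffs to break tool state.
# llm and browser are hypothetical interfaces used only for illustration.
def book_flight(request, llm, browser, max_steps=20):
    history = []
    for _ in range(max_steps):
        action = llm.next_action(request, history)   # e.g. search, hold_seat, pay, done
        if action.name == "done":
            return action.result
        result = browser.run(action)                 # same browser session for every call
        history.append((action, result))
    raise RuntimeError("step budget exhausted before booking completed")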
3.5. Example: "Research report on 10 companies", which architecture?
Task: produce a report comparing 10 public companies across finance, news, tech.
Analysis:
- 10 companies = 10 independent subtasks
- Each research task is self-contained
- Final synthesis benefits from a verifier
Recommendation: Hybrid. 10 independent research agents plus one synthesizer. Big win over single-agent serial execution.
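Structurally this is the same fan-out/fan-in as the code-review example, just with a dynamic fan-out over the company list. A minimal sketch, with research_agent and orch as assumed agent clients.
# Hybrid for the report: N independent researchers fan out, one synthesizer fans in.
# research_agent and orch are assumed async agent clients, not a specific framework.
import asyncio

async def company_report(companies):
    briefs = await asyncio.gather(
        *(research_agent.research(c) for c in companies)   # 10 independent subtasks
    )
    return await orch.synthesize(briefs)                   # central synthesis + verification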
3.6. Stop blindly cloning CrewAI templates
CrewAI, AutoGen, and LangGraph all ship "X agents in a fancy shape" templates. The paper's finding is that shape matters. Pick the shape that matches your task structure, not the one that looks coolest in your demo.
4. Applied scaling-law workflow
┌───────────────────────────────────────────────┐
│ 1. Describe task                              │
│ 2. Score single-agent baseline                │
│ 3. Classify: tool-heavy? parallelizable?      │
│    needs-verifier? shared-state?              │
│ 4. Apply decision rubric → pick architecture  │
│ 5. Implement with logging for 4 coord metrics │
│ 6. After N runs, fit predictive model         │
│ 7. Reuse model to pick architectures for      │
│    new-but-similar tasks                      │
└───────────────────────────────────────────────┘
5. Key takeaways
- Architecture selection swings performance by 150 points. It is not a detail; it is the decision.
- More agents does not equal better. Often worse, especially for tool-heavy tasks.
- Centralized verifiers catch error propagation. Prefer Centralized or Hybrid over Decentralized for anything that can go wrong.
- Measure four metrics: efficiency, overhead, error amplification, redundancy. Fit the model to your own traces.
- The 87% held-out accuracy is the big number. With just coordination metrics plus task capability, you can predict the right architecture most of the time.
This paper is the one to re-read whenever you find yourself about to spin up "one more agent." It'll save you a sprint.