ARE and Gaia2: Benchmarking Agents in Async Worlds
White Paper

Jake McCluskey

Source: ARE: Scaling Up Agent Environments and Evaluations (arXiv:2509.17158) (Meta Superintelligence Labs, 2025)
Series: The 10 Agent Whitepapers Every Builder Should Read

TL;DR

Meta retired GAIA and replaced it with Gaia2, a benchmark built inside ARE (Meta Agents Research Environments). Gaia2 runs 1,120 verifiable scenarios inside a simulated mobile phone (email, messaging, calendar, apps). Its critical innovation is that it runs asynchronously: events happen whether the agent acts or not, other agents show up, and time passes. This surfaces failure modes that static benchmarks miss. The key empirical finding is that no system dominates across the intelligence spectrum: stronger reasoning often comes at the cost of efficiency, and budget scaling plateaus.

If you're building agents for the real world (support, ops, scheduling, personal assistants), Gaia2 is the closest public benchmark to your actual problem.

1. What it is

1.1. ARE, the platform

ARE (Meta Agents Research Environments) is Meta's framework for building agent environments at scale. Three things it does:

  1. Scalable environment creation: authoring tools for new scenarios.
  2. Application integration: plug synthetic or real apps into the environment (calendar, email, web).
  3. Execution of agentic orchestrations: a runner that drives agents through scenarios and scores them.

It's open-source and available on Meta's GitHub (facebookresearch/meta-agents-research-environments). If you want to build your own agent benchmark, ARE is the substrate.

1.2. Gaia2, the benchmark

Gaia2 is the flagship benchmark built in ARE. Key facts:

  • 1,120 scenarios (800 base scenarios plus 320 augmented scenarios derived from Gaia2-mini)
  • Mobile environment: simulates a smartphone with email, messaging, calendar, contacts, apps
  • Verifiable: each scenario has a programmatic success criterion
  • Annotated: scenarios carry metadata about required capabilities
  • Async: time passes; other actors act; the world changes independently of the agent

1.3. What "async" actually means

Most agent benchmarks (classical GAIA, WebArena, SWE-bench) are static: the environment is frozen, and the agent can take as long as it wants. Gaia2 is different:

  • A meeting invitation might expire while the agent is thinking
  • A second agent (simulated user or simulated coworker) might send a new message mid-task
  • Reminders fire; notifications arrive; calendars mutate
  • Temporal constraints ("reply within 10 minutes") are real

This surfaces new failure modes:

  • Agents that take too long to reason miss windows
  • Agents that don't re-check state after each action act on stale info
  • Agents that can't handle mid-task interruptions lock up
  • Agents that don't collaborate with other agents fail multi-party tasks
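
To make the "world moves without you" idea concrete, here is a minimal Python sketch of a simulated clock with scheduled events. It is a toy model for illustration only: SimWorld, Event, and run_agent are hypothetical names and are not part of ARE.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Event:
    at: float                                   # simulated seconds since scenario start
    description: str = field(compare=False)     # ordering uses the timestamp only


class SimWorld:
    """Toy async world: events fire on the simulated clock, not on the agent's turn."""

    def __init__(self, events):
        self.now = 0.0
        self._pending = list(events)
        heapq.heapify(self._pending)
        self.log = []                           # events that have already happened

    def advance(self, seconds):
        """Move the clock forward and fire every event that came due meanwhile."""
        self.now += seconds
        while self._pending and self._pending[0].at <= self.now:
            self.log.append(heapq.heappop(self._pending).description)


def run_agent(world, think_seconds, deadline):
    """The agent 'thinks' for think_seconds, then tries to accept an invitation."""
    world.advance(think_seconds)                # the world keeps moving while it reasons
    return "accepted" if world.now <= deadline else "missed window"


def make_world():
    return SimWorld([Event(30, "coworker sends follow-up"),
                     Event(60, "invitation expires")])


fast = run_agent(make_world(), think_seconds=20, deadline=60)
slow_world = make_world()
slow = run_agent(slow_world, think_seconds=90, deadline=60)
print(fast, "|", slow, "|", slow_world.log)
```

The slow agent does not fail because its answer is wrong. It fails because two events fired while it was still reasoning, which is the class of failure a frozen environment cannot produce.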

1.4. What scenarios cover

Gaia2 tests:

  • Search and execution: find information, take action on it
  • Ambiguity and noise handling: under-specified instructions, conflicting info
  • Dynamic environment adaptation: plans change mid-execution
  • Agent-to-agent collaboration: teams of agents coordinating
  • Temporal constraints: deadlines, reminders, windows
  • Irreversible actions: delete, send, publish, with verification

1.5. What the empirical results say

Meta's released analysis surfaces three durable findings:

  1. No system dominates. The Pareto frontier of "accuracy vs. cost" is populated; different models win different slices.
  2. Stronger reasoning costs efficiency. Deep-thinking models that score higher on correctness-heavy scenarios use more tokens and time, which then makes them miss temporal-constraint scenarios.
  3. Budget scaling curves plateau. Spending 10x more compute buys diminishing returns, typically a single-digit percentage improvement past a threshold.
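
One way to read "no system dominates" on your own runs is to compute the accuracy-versus-cost Pareto frontier. The sketch below assumes you have a per-model average cost and pass rate; the model names and numbers are made up for illustration and are not results from the paper.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    cost_usd: float      # average cost per scenario (hypothetical)
    accuracy: float      # pass rate in [0, 1] (hypothetical)


def pareto_frontier(results):
    """Keep results that no other result beats on both cost and accuracy."""
    frontier = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; on cost ties, see the more accurate first.
    for r in sorted(results, key=lambda r: (r.cost_usd, -r.accuracy)):
        if r.accuracy > best_accuracy:       # strictly better than everything cheaper
            frontier.append(r)
            best_accuracy = r.accuracy
    return frontier


# Illustrative numbers only.
results = [
    Result("model-a", 0.10, 0.20),
    Result("model-b", 0.50, 0.35),
    Result("model-c", 0.60, 0.30),   # costs more than model-b and scores lower
    Result("model-d", 2.00, 0.42),
]
frontier_names = [r.name for r in pareto_frontier(results)]
print(frontier_names)
```

Any model that is off the frontier is strictly worse than some alternative for every budget, so it can be dropped from consideration. Choosing among the frontier models is then a product decision about cost and latency, not a leaderboard lookup.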

2. Why it matters

2.1. GAIA was retired for a reason

Classical GAIA (2023) was a static knowledge-and-tools benchmark. It peaked. Top agents saturated it. Continuing to quote "X% on GAIA" stopped being informative. Gaia2 replaces it with a harder target that hasn't been saturated and won't be soon, because async behavior is genuinely hard.

2.2. It tests the skill gap that matters most in production

Every agent team that ships something real hits the same wall: benchmark scores don't predict production behavior. The usual reason is that benchmarks are static while production is async. Gaia2 is the first benchmark that closes that gap at scale.

2.3. It exposes the reasoning-vs-latency tradeoff

The "stronger reasoning hurts efficiency" finding has direct product implications:

  • You can't pick "the smartest model" and call it done.
  • Router systems (like GPT-5's) become essential: fast path for latency-critical scenarios, deep path for correctness-critical scenarios.
  • Multi-agent architectures where a fast agent triages and a smart agent handles hard cases will look better on Gaia2 than a single-model baseline.
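
A deadline-aware router can be very small. The following sketch assumes you have measured rough latencies for a fast and a deep model path; the constants, the Task type, and the route function are hypothetical and only illustrate the fast-path/deep-path split described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    prompt: str
    seconds_until_deadline: Optional[float]   # None means no temporal constraint


# Hypothetical latency estimates; measure these for your own models.
FAST_LATENCY_S = 2.0
DEEP_LATENCY_S = 45.0


def route(task: Task, safety_margin: float = 1.5) -> str:
    """Pick the deep path only when its estimated latency fits inside the deadline."""
    if task.seconds_until_deadline is None:
        return "deep"                          # no clock pressure: favor correctness
    if DEEP_LATENCY_S * safety_margin <= task.seconds_until_deadline:
        return "deep"                          # enough headroom for deep reasoning
    return "fast"                              # latency-critical: answer quickly


print(route(Task("summarize the thread", None)))
print(route(Task("reply within 10 minutes", 600)))
print(route(Task("reply within 30 seconds", 30)))
```

The safety margin is there because latency estimates are averages; a router that only just fits the deadline on paper will miss it in the tail.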

2.4. It's open. You can run it.

Because ARE is open source, your team can:

  • Run Gaia2 in CI against your agent before shipping
  • Author custom scenarios that mirror your actual product (e.g., "the CRM-integrated support agent scenario")
  • Use the async primitives to build your own private benchmarks

3. How to do it

3.1. Run Gaia2 on your own agent

git clone https://github.com/facebookresearch/meta-agents-research-environments
cd meta-agents-research-environments
pip install -e .
huggingface-cli download meta-agents-research-environments/gaia2 --repo-type dataset

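Minimal runner script (a sketch: treat the meta_are module, class, and result-field names below as placeholders and check the repository's documentation for the current API):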

from meta_are import Environment, Runner
from meta_are.benchmarks import gaia2

env = Environment.load("gaia2")  # mobile + apps
agent = YourAgent()                # whatever wrapper you want

results = Runner(env).run(agent, scenarios=gaia2.all_scenarios())
print(f"Pass rate: {results.pass_rate:.1%}")
print(f"Median latency per scenario: {results.median_latency_ms} ms")
print(f"Missed-deadline rate: {results.deadline_miss_rate:.1%}")

Plug in any agent that implements the simple step(observation) -> action interface. The framework handles async mechanics (clock advancement, event injection).
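
As an illustration of what a step(observation) -> action agent can look like, here is a self-contained Python sketch. The observation and action shapes (plain dicts with "unread", "type", "to", "body") are assumptions made for this example, not ARE's actual data model.

```python
from typing import Any, Dict, List, Protocol


class Agent(Protocol):
    """Anything with a step(observation) -> action method satisfies this protocol."""

    def step(self, observation: Dict[str, Any]) -> Dict[str, Any]: ...


class EchoAgent:
    """Replies to the newest unread message; otherwise waits for the world to change."""

    def step(self, observation: Dict[str, Any]) -> Dict[str, Any]:
        unread = observation.get("unread", [])
        if unread:
            newest = unread[-1]
            return {"type": "reply", "to": newest["from"],
                    "body": f"Got it: {newest['body']}"}
        return {"type": "wait"}


def drive(agent: Agent, observations: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Feed a fixed sequence of observations to the agent and collect its actions."""
    return [agent.step(obs) for obs in observations]


actions = drive(EchoAgent(), [
    {"unread": []},
    {"unread": [{"from": "manager", "body": "status?"}]},
])
print(actions)
```

Keeping the agent behind a single method like this makes it easy to swap models or wrappers without touching the harness that feeds observations.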

3.2. Write a custom async scenario

Scenarios are declarative. Schema:

scenario_id: support_escalation_001
description: >
  User reports billing issue. A manager requests status update 5 minutes in.
  Agent must continue investigation AND provide manager update without losing thread.

environment:
  apps: [email, slack, ticketing]
  initial_state:
    email:
      inbox: [{from: "[email protected]", subj: "Charged twice", body: "..."}]
    slack:
      channels: ["#support-internal"]

events:
  - at: "+00:05:00"
    type: slack_message
    channel: "#support-internal"
    from: manager
    body: "What's the status on that billing ticket?"

success_criteria:
  - agent.sent_email_containing("refund")
  - agent.replied_in_slack_within("00:01:00")
  - ticket.status == "resolved"
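
To make the timing fields concrete, here is a small Python sketch of how an offset such as "+00:05:00" and a window such as "00:01:00" could be interpreted. The helpers parse_offset and replied_within are hypothetical and only illustrate the semantics; the framework's own verifier may work differently.

```python
from datetime import timedelta


def parse_offset(text: str) -> timedelta:
    """Parse '+HH:MM:SS' or 'HH:MM:SS' into a timedelta from scenario start."""
    hours, minutes, seconds = (int(part) for part in text.lstrip("+").split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def replied_within(event_at: str, reply_at: timedelta, window: str) -> bool:
    """True if the reply landed at or after the event and no later than event + window."""
    start = parse_offset(event_at)
    return start <= reply_at <= start + parse_offset(window)


# The manager's message arrives five minutes in; the agent has one minute to answer.
on_time = replied_within("+00:05:00", timedelta(minutes=5, seconds=40), "00:01:00")
too_late = replied_within("+00:05:00", timedelta(minutes=6, seconds=30), "00:01:00")
print(on_time, too_late)
```

Measuring the window from the event time rather than from scenario start is the design choice that matters: the agent is graded on how quickly it reacts to the interruption, not on how long the whole task takes.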

3.3. The key engineering patterns Gaia2 rewards

Meta's results from running Gaia2 against many architectures highlight several patterns that score well.

Re-observe after every action. Don't cache state across steps. The world changes.

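# Bad: observe once, then execute a plan that may have gone stale
state = observe()
plan = make_plan(state)
for step in plan:
    execute(step)

# Good: re-observe before every step and replan when the world has changed
# (plan is assumed to be a list of steps)
plan = make_plan(observe())
while plan:
    current_state = observe()                    # re-check
    if not plan_still_valid(plan, current_state):
        plan = replan(current_state)             # adapt, then re-check the new plan
        continue
    execute(plan.pop(0))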

Budget time, not tokens. Gaia2 scores you on wall-clock as much as correctness. Set explicit timeouts.

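import asyncio

async def bounded_think(task, deadline_ms):
    """Run deep_think under a wall-clock budget; fall back to a cheap answer on timeout."""
    try:
        return await asyncio.wait_for(
            deep_think(task), timeout=deadline_ms / 1000   # wait_for takes seconds
        )
    except asyncio.TimeoutError:
        return await fast_fallback(task)   # cheap approximation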

Handle mid-task events gracefully. Use a listener pattern that can interrupt the agent to inject new info.
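
One possible shape for such a listener, sketched with asyncio: race the agent's current work against the arrival of a new event and cancel the work if the event wins. The function names and the queue-based inbox are assumptions for this example, not an ARE API.

```python
import asyncio


async def interruptible(work, inbox: asyncio.Queue):
    """Run `work` until it finishes or a new event arrives, whichever comes first."""
    work_task = asyncio.ensure_future(work)
    event_task = asyncio.ensure_future(inbox.get())
    done, _ = await asyncio.wait({work_task, event_task},
                                 return_when=asyncio.FIRST_COMPLETED)
    if event_task in done:
        work_task.cancel()                  # stop reasoning on stale assumptions
        try:
            await work_task
        except asyncio.CancelledError:
            pass
        return ("interrupted", event_task.result())
    event_task.cancel()                     # nothing arrived; stop listening
    return ("finished", work_task.result())


async def demo_interrupt():
    inbox = asyncio.Queue()

    async def slow_plan():
        await asyncio.sleep(0.2)            # stand-in for a long reasoning step
        return "plan ready"

    async def manager_ping():
        await asyncio.sleep(0.02)
        await inbox.put("manager: status?")

    ping = asyncio.ensure_future(manager_ping())
    first = await interruptible(slow_plan(), inbox)    # the ping arrives mid-plan
    await ping
    second = await interruptible(slow_plan(), inbox)   # no event this time
    return first, second


first, second = asyncio.run(demo_interrupt())
print(first, second)
```

After an interruption the caller decides what to do with the event: answer it and resume, or replan from a fresh observation. Cancelling outright is the simplest policy; checkpointing partial work is a reasonable refinement.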

Split work across collaborating agents when the scenario allows. For example, a dedicated "monitor" agent can tail events while a "worker" agent drives the plan.
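
A minimal sketch of that split, using two asyncio tasks that share a list of notes. This is a toy illustration under stated assumptions: monitor, worker, and the None sentinel are hypothetical, and real agents would exchange richer messages.

```python
import asyncio


async def monitor(events: asyncio.Queue, notes: list):
    """Tail the event stream and record anything the worker should know about."""
    while True:
        event = await events.get()
        if event is None:                 # sentinel: the scenario is over
            return
        notes.append(event)


async def worker(plan, notes: list):
    """Execute plan steps, draining the monitor's notes before each step."""
    transcript = []
    for step in plan:
        while notes:                      # react to anything that arrived meanwhile
            transcript.append(f"handled: {notes.pop(0)}")
        transcript.append(f"did: {step}")
        await asyncio.sleep(0.05)         # stand-in for real work
    return transcript


async def demo_team():
    events, notes = asyncio.Queue(), []
    monitor_task = asyncio.ensure_future(monitor(events, notes))

    async def world():
        await asyncio.sleep(0.02)         # the event lands during the first step
        await events.put("manager ping")

    world_task = asyncio.ensure_future(world())
    transcript = await worker(["investigate", "refund"], notes)
    await world_task
    await events.put(None)                # shut the monitor down cleanly
    await monitor_task
    return transcript


transcript = asyncio.run(demo_team())
print(transcript)
```

In this sketch the worker only checks for notes between steps, so an event that arrives after the last step goes unhandled. A real system would want a final drain or an interrupt path as well.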

3.4. What to measure, not just pass/fail

Gaia2's richer metric surface is itself a lesson. Track these on your own benchmarks:

  • Deadline-miss rate: did the agent respond in time?
  • Stale-state-action rate: did the agent act on info that had changed?
  • Token-per-success: how efficient is each resolution?
  • Interrupt-handling accuracy: when mid-task events arrived, did the agent adapt?
  • Multi-agent coordination: if there was a peer, did the pair outperform either alone?
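
Several of these metrics fall out of simple per-run records. The sketch below assumes you log, for each scenario run, whether it passed, how many tokens it used, and how many of its actions were taken on stale state; the Run fields and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Run:
    passed: bool
    tokens: int
    had_deadline: bool
    missed_deadline: bool
    actions: int
    stale_actions: int      # actions taken on state that had already changed


def summarize(runs):
    """Aggregate per-run records into the rates discussed above."""
    deadline_runs = [r for r in runs if r.had_deadline]
    successes = sum(r.passed for r in runs)
    total_actions = sum(r.actions for r in runs)
    return {
        "pass_rate": successes / len(runs),
        "deadline_miss_rate": (sum(r.missed_deadline for r in deadline_runs)
                               / len(deadline_runs)) if deadline_runs else 0.0,
        "stale_state_action_rate": (sum(r.stale_actions for r in runs)
                                    / total_actions) if total_actions else 0.0,
        # Total tokens (including failed runs) divided by the number of successes.
        "tokens_per_success": (sum(r.tokens for r in runs) / successes
                               if successes else float("inf")),
    }


# Illustrative records only.
runs = [
    Run(passed=True,  tokens=1000, had_deadline=True,  missed_deadline=False, actions=5,  stale_actions=0),
    Run(passed=False, tokens=3000, had_deadline=True,  missed_deadline=True,  actions=10, stale_actions=2),
    Run(passed=True,  tokens=2000, had_deadline=False, missed_deadline=False, actions=5,  stale_actions=0),
    Run(passed=False, tokens=2000, had_deadline=True,  missed_deadline=True,  actions=20, stale_actions=2),
]
summary = summarize(runs)
print(summary)
```

Charging failed runs to tokens-per-success is deliberate: an agent that burns tokens on scenarios it ends up failing is less efficient than its successful runs alone would suggest.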

4. When to use Gaia2 vs. alternatives

If you're building                            | Benchmark to use
Code-edit agent                               | SWE-bench Verified
Research/deep-question agent                  | Original GAIA (legacy) or Gaia2 research slice
Production assistant / ops / scheduling agent | Gaia2
Web-navigation agent                          | WebArena, VisualWebArena, Gaia2 web slice
Multi-agent coordination                      | Gaia2 collaboration slice
Long-context retrieval agent                  | LOFT, Needle-in-Haystack suites

For most teams shipping agents into real products, Gaia2 should be your primary eval, with domain-specific slices layered on top.

5. Key takeaways

  • Async is not static. Production agents face clocks, interruptions, and peer agents. Your benchmark must too.
  • No single model wins everything. Pareto frontiers, not leaderboards.
  • Smarter models miss deadlines. Design routing or fallback explicitly.
  • Re-observe between actions. State caching is the hidden source of most async failures.
  • Use ARE to build your own scenarios. Authoring the ones that match your product is the highest-ROI eval work.

Frequently asked questions

What is the difference between Gaia2 and the original GAIA benchmark?

Gaia2 runs asynchronously in a simulated mobile phone environment where time passes and events happen whether the agent acts or not, while the original GAIA was a static knowledge-and-tools benchmark where the environment was frozen. Gaia2 includes 1,120 scenarios and tests failure modes like missed deadlines, stale state handling, and mid-task interruptions that static benchmarks cannot surface. Meta retired the original GAIA because top agents had saturated it and benchmark scores stopped being informative.

How does asynchronous execution in Gaia2 affect agent performance?

Asynchronous execution surfaces failure modes where agents that take too long to reason miss time windows, agents that do not re-check state after each action operate on stale information, and agents that cannot handle mid-task interruptions lock up. Meta's empirical analysis found that stronger reasoning models score higher on correctness but use more tokens and time, which causes them to miss temporal-constraint scenarios. This creates a fundamental tradeoff between reasoning depth and efficiency.

Can I run Gaia2 benchmarks on my own agent system?

Yes, Gaia2 is open source and available through Meta's ARE framework on GitHub. You clone the repository, install the package, download the Gaia2 dataset from Hugging Face, and plug in any agent that implements a simple step function interface. The framework handles the async mechanics like clock advancement and event injection, and you can also author custom scenarios that mirror your actual product requirements.

What engineering patterns does Gaia2 reward in agent design?

Gaia2 rewards agents that re-observe state after every action rather than caching, set explicit time budgets and timeouts instead of optimizing only for tokens, handle mid-task events gracefully using listener patterns that can interrupt execution, and split work across collaborating agents when scenarios allow. Meta's data shows these patterns score better because they adapt to the changing environment and meet temporal constraints.

What metrics should I track beyond simple pass or fail rates when benchmarking agents?

Track deadline-miss rate to measure whether agents respond in time, stale-state-action rate to detect acting on outdated information, token-per-success for efficiency, interrupt-handling accuracy when mid-task events arrive, and multi-agent coordination success when peer agents are involved. Gaia2's richer metric surface provides a more complete picture of production readiness than binary pass-fail scoring.
