ARE and Gaia2: Benchmarking Agents in Async Worlds

Source: ARE: Scaling Up Agent Environments and Evaluations (arXiv:2509.17158) (Meta Superintelligence Labs, 2025)
Series: The 10 Agent Whitepapers Every Builder Should Read
TL;DR
Meta retired GAIA and replaced it with Gaia2, a benchmark built inside their ARE (Meta Agents Research Environments) platform. Gaia2 runs 1,120 verifiable scenarios inside a simulated mobile phone (email, messaging, calendar, apps) and here's the critical innovation: it runs asynchronously. Events happen whether the agent acts or not. Other agents show up. Time passes. This surfaces failure modes that every static benchmark misses. Key empirical finding: no system dominates across the intelligence spectrum. Stronger reasoning often comes at the cost of efficiency, and budget scaling plateaus.
If you're building agents for the real world (support, ops, scheduling, personal assistants), Gaia2 is the closest public benchmark to your actual problem.
1. What it is
1.1. ARE, the platform
ARE (Meta Agents Research Environments) is Meta's framework for building agent environments at scale. Three things it does:
- Scalable environment creation: authoring tools for new scenarios.
- Application integration: plug synthetic or real apps into the environment (calendar, email, web).
- Agentic orchestration and execution: a runner that drives agents through scenarios and scores them.
It's open-source and available on Meta's GitHub (facebookresearch/meta-agents-research-environments). If you want to build your own agent benchmark, ARE is the substrate.
1.2. Gaia2, the benchmark
Gaia2 is the flagship benchmark built in ARE. Key facts:
- 1,120 scenarios (800 base plus 320 augmentation from Gaia2-mini)
- Mobile environment: simulates a smartphone with email, messaging, calendar, contacts, apps
- Verifiable: each scenario has a programmatic success criterion
- Annotated: scenarios carry metadata about required capabilities
- Async: time passes; other actors act; the world changes independently of the agent
1.3. What "async" actually means
Most agent benchmarks (classical GAIA, WebArena, SWE-bench) are static: the environment is frozen, and the agent can take as long as it wants. Gaia2 is different:
- A meeting invitation might expire while the agent is thinking
- A second agent (simulated user or simulated coworker) might send a new message mid-task
- Reminders fire; notifications arrive; calendars mutate
- Temporal constraints ("reply within 10 minutes") are real
This surfaces new failure modes (a toy sketch after the list makes the mechanics concrete):
- Agents that take too long to reason miss windows
- Agents that don't re-check state after each action act on stale info
- Agents that can't handle mid-task interruptions lock up
- Agents that don't collaborate with other agents fail multi-party tasks
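To make these failure modes concrete, here's a toy sketch of the async mechanic: a clock that advances while the agent deliberates and fires scheduled events regardless. This is not ARE's actual API; every name here is illustrative.

class AsyncWorld:
    # Toy async environment (illustrative only, not ARE's API): scheduled
    # events fire as the clock advances, whether or not the agent has acted.
    def __init__(self, scheduled_events):
        self.now = 0.0
        self.pending = sorted(scheduled_events)  # list of (fire_time_s, description)
        self.inbox = []

    def advance(self, seconds):
        # Time spent reasoning is not free: the clock moves forward, and any
        # event whose time has come fires, mutating the world state.
        self.now += seconds
        while self.pending and self.pending[0][0] <= self.now:
            _, event = self.pending.pop(0)
            self.inbox.append(event)
        return {"time_s": self.now, "inbox": list(self.inbox)}

world = AsyncWorld([(300, "manager asks for a status update")])
print(world.advance(320))  # the message arrived while the agent was still "thinking"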
1.4. What scenarios cover
Gaia2 tests:
- Search and execution: find information, take action on it
- Ambiguity and noise handling: under-specified instructions, conflicting info
- Dynamic environment adaptation: plans change mid-execution
- Agent-to-agent collaboration: teams of agents coordinating
- Temporal constraints: deadlines, reminders, windows
- Irreversible actions: delete, send, publish, with verification
1.5. What the empirical results say
Meta's released analysis surfaces three durable findings:
- No system dominates. The Pareto frontier of "accuracy vs. cost" is populated; different models win different slices.
- Stronger reasoning costs efficiency. Deep-thinking models score higher on correctness-heavy scenarios but burn more tokens and wall-clock time, which causes them to fail temporal-constraint scenarios.
- Budget scaling curves plateau. Spending 10x more compute buys diminishing returns, typically a single-digit percentage improvement past a threshold.
2. Why it matters
2.1. GAIA was retired for a reason
Classical GAIA (2023) was a static knowledge-and-tools benchmark. It peaked. Top agents saturated it. Continuing to quote "X% on GAIA" stopped being informative. Gaia2 replaces it with a harder target that hasn't been saturated and won't be soon, because async behavior is genuinely hard.
2.2. It tests the skill gap that matters most in production
Every agent team that ships something real hits the same wall: benchmark scores don't predict production behavior. The usual reason: benchmarks are static, but production is async. Gaia2 is the first benchmark that closes that gap at scale.
2.3. It exposes the reasoning-vs-latency tradeoff
The "stronger reasoning hurts efficiency" finding has direct product implications:
- You can't pick "the smartest model" and call it done.
- Router systems (like GPT-5's) become essential: a fast path for latency-critical scenarios, a deep path for correctness-critical scenarios (a minimal routing sketch follows this list).
- Multi-agent architectures where a fast agent triages and a smart agent handles hard cases will look better on Gaia2 than a single-model baseline.
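Here's a minimal sketch of that routing idea. Task, FAST_MODEL, DEEP_MODEL, and the threshold are all assumptions for illustration, not any vendor's actual router; the point is that deadline-awareness has to be an explicit input to model selection.

from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    deadline_s: float | None  # None means no hard deadline

FAST_MODEL = "small-fast-model"      # hypothetical endpoints; swap in your own clients
DEEP_MODEL = "large-reasoning-model"

def route(task: Task) -> str:
    # Tight window: a fast, good-enough answer beats a brilliant late one.
    if task.deadline_s is not None and task.deadline_s < 60:
        return FAST_MODEL
    # Otherwise spend the tokens on deep reasoning.
    return DEEP_MODEL

print(route(Task("reply to the manager's status request", deadline_s=45)))      # -> small-fast-model
print(route(Task("plan next week's travel and book the flights", deadline_s=None)))  # -> large-reasoning-model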
2.4. It's open. You can run it.
Because ARE is open source, your team can:
- Run Gaia2 in CI against your agent before shipping
- Author custom scenarios that mirror your actual product (e.g., "the CRM-integrated support agent scenario")
- Use the async primitives to build your own private benchmarks
3. How to do it
3.1. Run Gaia2 on your own agent
git clone https://github.com/facebookresearch/meta-agents-research-environments
cd meta-agents-research-environments
pip install -e .
huggingface-cli download meta-agents-research-environments/gaia2 --repo-type dataset
Minimal runner script:
from meta_are import Environment, Runner
from meta_are.benchmarks import gaia2
env = Environment.load("gaia2") # mobile + apps
agent = YourAgent() # whatever wrapper you want
results = Runner(env).run(agent, scenarios=gaia2.all_scenarios())
print(f"Pass rate: {results.pass_rate:.1%}")
print(f"Median latency per scenario: {results.median_latency_ms} ms")
print(f"Missed-deadline rate: {results.deadline_miss_rate:.1%}")
Plug in any agent that implements the simple step(observation) -> action interface. The framework handles async mechanics (clock advancement, event injection).
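For reference, here's what a minimal wrapper could look like. It's a sketch under the same illustrative API as the runner snippet above; call_llm and parse_action are placeholders for your own model client and tool-call parser, not ARE functions.

class YourAgent:
    # Minimal step(observation) -> action wrapper (illustrative).
    # The runner calls step() once per turn; memory and tool parsing are yours.
    def __init__(self, system_prompt="You are an assistant operating a simulated phone."):
        self.system_prompt = system_prompt
        self.history = []  # keep the dialogue so the agent has memory across steps

    def step(self, observation):
        self.history.append({"role": "user", "content": str(observation)})
        reply = call_llm(self.system_prompt, self.history)  # placeholder LLM call
        self.history.append({"role": "assistant", "content": reply})
        return parse_action(reply)  # placeholder: map model text to a tool call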
3.2. Write a custom async scenario
Scenarios are declarative. Here's an example:
scenario_id: support_escalation_001
description: >
  User reports a billing issue. A manager requests a status update 5 minutes in.
  The agent must continue the investigation AND provide the manager update without losing the thread.
environment:
  apps: [email, slack, ticketing]
  initial_state:
    email:
      inbox: [{from: [email protected], subj: "Charged twice", body: "..."}]
    slack:
      channels: ["#support-internal"]
events:
  - at: "+00:05:00"
    type: slack_message
    channel: "#support-internal"
    from: manager
    body: "What's the status on that billing ticket?"
success_criteria:
  - agent.sent_email_containing("refund")
  - agent.replied_in_slack_within("00:01:00")
  - ticket.status == "resolved"
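To run it, you'd point the same (illustrative) runner from section 3.1 at your scenario file. The load_scenarios helper below is an assumption about how custom scenarios get picked up, not confirmed ARE API.

from meta_are import Environment, Runner
from meta_are.scenarios import load_scenarios  # assumed helper, not confirmed API

env = Environment.load("mobile")  # same simulated phone + apps as Gaia2
custom = load_scenarios("scenarios/support_escalation_001.yaml")
results = Runner(env).run(YourAgent(), scenarios=custom)
print(f"Pass rate: {results.pass_rate:.1%}")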
3.3. The key engineering patterns Gaia2 rewards
Meta's data from running many architectures against Gaia2 highlights patterns that score well.
Re-observe after every action. Don't cache state across steps. The world changes.
# Bad: plan once against a snapshot of the world, then execute blindly
state = observe()
plan = make_plan(state)
for step in plan:
    execute(step)

# Good: re-observe before each step and replan if the world has changed
state = observe()
plan = make_plan(state)
while plan:
    current_state = observe()             # re-check the world
    if not plan_still_valid(plan, current_state):
        plan = replan(current_state)      # adapt to what changed
    step, plan = plan[0], plan[1:]
    execute(step)
Budget time, not tokens. Gaia2 scores you on wall-clock time as much as on correctness. Set explicit timeouts.
import asyncio

async def bounded_think(task, deadline_ms):
    # Spend at most deadline_ms on deep reasoning; if the clock runs out,
    # return a cheaper answer rather than blowing the scenario's deadline.
    try:
        return await asyncio.wait_for(
            deep_think(task), timeout=deadline_ms / 1000
        )
    except asyncio.TimeoutError:
        return await fast_fallback(task)  # cheap approximation
Handle mid-task events gracefully. Use a listener pattern that can interrupt the agent to inject new info.
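A sketch of that pattern with asyncio: the environment (or a background watcher) pushes events onto a queue, and the agent drains the queue between plan steps instead of blocking until the whole plan finishes. The plan steps and the injected event are made up for the example.

import asyncio

async def worker(queue):
    # Drive the plan, but handle any injected events before each step.
    for step in ["read_ticket", "check_billing_history", "draft_refund_email"]:
        while not queue.empty():
            event = queue.get_nowait()
            print(f"interrupt: {event!r} -> respond, then resume the plan")
        print(f"executing: {step}")
        await asyncio.sleep(0.1)  # stand-in for a real tool call

async def listener(queue):
    # Stand-in for the environment injecting a mid-task event.
    await asyncio.sleep(0.15)
    await queue.put("manager asks for a status update")

async def main():
    queue = asyncio.Queue()
    await asyncio.gather(worker(queue), listener(queue))

asyncio.run(main())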
Split work across collaborating agents when the scenario allows: a dedicated "monitor" agent tails events while a "worker" agent drives the plan.
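A sketch of that split: a cheap monitor coroutine tails the raw event stream and forwards only what matters, while the worker keeps driving the plan and checks for alerts between steps. The keyword filter stands in for whatever cheap classifier or small model you'd actually use.

import asyncio

async def monitor_agent(events, alerts):
    # Cheap "monitor": watch everything, forward only what the worker must react to.
    while True:
        event = await events.get()
        if "manager" in event or "deadline" in event:  # stand-in for a cheap classifier
            await alerts.put(event)

async def worker_agent(alerts):
    # "Worker": drives the plan, consulting the alert channel between steps.
    for step in ["investigate", "draft_reply", "resolve_ticket"]:
        if not alerts.empty():
            print("reacting to:", alerts.get_nowait())
        print("doing:", step)
        await asyncio.sleep(0.1)

async def main():
    events, alerts = asyncio.Queue(), asyncio.Queue()
    monitor = asyncio.create_task(monitor_agent(events, alerts))
    await events.put("newsletter arrived")               # monitor ignores this
    await events.put("manager: status update please?")   # monitor forwards this
    await worker_agent(alerts)
    monitor.cancel()

asyncio.run(main())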
3.4. What to measure, not just pass/fail
Gaia2's richer metric surface is itself a lesson. Track these on your own benchmarks (a small sketch for computing them follows the list):
- Deadline-miss rate: did the agent respond in time?
- Stale-state-action rate: did the agent act on info that had changed?
- Token-per-success: how efficient is each resolution?
- Interrupt-handling accuracy: when mid-task events arrived, did the agent adapt?
- Multi-agent coordination: if there was a peer, did the pair outperform either alone?
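A minimal sketch of rolling those up from per-scenario logs. The ScenarioResult fields are assumptions about what your own harness records, not Gaia2's result schema.

from dataclasses import dataclass

@dataclass
class ScenarioResult:
    passed: bool
    deadline_missed: bool
    acted_on_stale_state: bool
    tokens_used: int

def summarize(results):
    n = len(results)
    successes = [r for r in results if r.passed]
    return {
        "pass_rate": len(successes) / n,
        "deadline_miss_rate": sum(r.deadline_missed for r in results) / n,
        "stale_state_action_rate": sum(r.acted_on_stale_state for r in results) / n,
        # efficiency: total spend divided by the number of resolved scenarios
        "tokens_per_success": sum(r.tokens_used for r in results) / max(len(successes), 1),
    }

print(summarize([ScenarioResult(True, False, False, 12_000),
                 ScenarioResult(False, True, True, 30_000)]))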
4. When to use Gaia2 vs. alternatives
| If you're building | Benchmark to use |
|---|---|
| Code-edit agent | SWE-bench Verified |
| Research/deep-question agent | Original GAIA (legacy) or Gaia2 research slice |
| Production assistant / ops / scheduling agent | Gaia2 |
| Web-navigation agent | WebArena, VisualWebArena, Gaia2 web slice |
| Multi-agent coordination | Gaia2 collaboration slice |
| Long-context retrieval agent | LOFT, Needle-in-Haystack suites |
For most teams shipping agents into real products, Gaia2 should be your primary eval, with domain-specific slices layered on top.
5. Key takeaways
- Async is not static. Production agents face clocks, interruptions, and peer agents. Your benchmark must too.
- No single model wins everything. Pareto frontiers, not leaderboards.
- Smarter models miss deadlines. Design routing or fallback explicitly.
- Re-observe between actions. State caching is the hidden source of most async failures.
- Use ARE to build your own scenarios. Authoring the ones that match your product is the highest-ROI eval work.