ARE and Gaia2: Benchmarking Agents in Async Worlds

Source: ARE: Scaling Up Agent Environments and Evaluations (arXiv:2509.17158) (Meta Superintelligence Labs, 2025)
Series: The 10 Agent Whitepapers Every Builder Should Read
TL;DR
Meta retired GAIA and replaced it with Gaia2, a benchmark built inside their ARE (Meta Agents Research Environments) platform. Gaia2 runs 1,120 verifiable scenarios inside a simulated mobile phone (email, messaging, calendar, apps) and here's the critical innovation: it runs asynchronously. Events happen whether the agent acts or not. Other agents show up. Time passes. This surfaces failure modes that every static benchmark misses. Key empirical finding: no system dominates across the intelligence spectrum. Stronger reasoning often comes at the cost of efficiency, and budget scaling plateaus.
If you're building agents for the real world (support, ops, scheduling, personal assistants), Gaia2 is the closest public benchmark to your actual problem.
1. What it is
1.1. ARE, the platform
ARE (Meta Agents Research Environments) is Meta's framework for building agent environments at scale. Three things it does:
- Scalable environment creation: authoring tools for new scenarios.
- Application integration: plug synthetic or real apps into the environment (calendar, email, web).
- Agentic orchestration and execution: a runner that drives agents through scenarios and scores them.
It's open-source and available on Meta's GitHub (facebookresearch/meta-agents-research-environments). If you want to build your own agent benchmark, ARE is the substrate.
1.2. Gaia2, the benchmark
Gaia2 is the flagship benchmark built in ARE. Key facts:
- 1,120 scenarios (800 base plus 320 augmentation from Gaia2-mini)
- Mobile environment: simulates a smartphone with email, messaging, calendar, contacts, apps
- Verifiable: each scenario has a programmatic success criterion
- Annotated: scenarios carry metadata about required capabilities
- Async: time passes; other actors act; the world changes independently of the agent
1.3. What "async" actually means
Most agent benchmarks (classical GAIA, WebArena, SWE-bench) are static: the environment is frozen, and the agent can take as long as it wants. Gaia2 is different:
- A meeting invitation might expire while the agent is thinking
- A second agent (simulated user or simulated coworker) might send a new message mid-task
- Reminders fire; notifications arrive; calendars mutate
- Temporal constraints ("reply within 10 minutes") are real
This surfaces new failure modes (a toy sketch after the list makes the mechanics concrete):
- Agents that take too long to reason miss windows
- Agents that don't re-check state after each action act on stale info
- Agents that can't handle mid-task interruptions lock up
- Agents that don't collaborate with other agents fail multi-party tasks
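To make these failure modes concrete, here's a toy sketch of the async mechanic: a clock that advances while the agent deliberates and fires scheduled events regardless. This is not ARE's actual API; every name here is illustrative.

class AsyncWorld:
    # Toy async environment (illustrative only, not ARE's API): scheduled
    # events fire as the clock advances, whether or not the agent has acted.
    def __init__(self, scheduled_events):
        self.now = 0.0
        self.pending = sorted(scheduled_events)  # list of (fire_time_s, description)
        self.inbox = []

    def advance(self, seconds):
        # Time spent reasoning is not free: the clock moves forward, and any
        # event whose time has come fires, mutating the world state.
        self.now += seconds
        while self.pending and self.pending[0][0] <= self.now:
            _, event = self.pending.pop(0)
            self.inbox.append(event)
        return {"time_s": self.now, "inbox": list(self.inbox)}

world = AsyncWorld([(300, "manager asks for a status update")])
print(world.advance(320))  # the message arrived while the agent was still "thinking"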
1.4. What scenarios cover
Gaia2 tests:
- Search and execution: find information, take action on it
- Ambiguity and noise handling: under-specified instructions, conflicting info
- Dynamic environment adaptation: plans change mid-execution
- Agent-to-agent collaboration: teams of agents coordinating
- Temporal constraints: deadlines, reminders, windows
- Irreversible actions: delete, send, publish, with verification
1.5. What the empirical results say
Meta's released analysis surfaces three durable findings:
- No system dominates. The Pareto frontier of "accuracy vs. cost" is populated; different models win different slices.
- Stronger reasoning costs efficiency. Deep-thinking models score higher on correctness-heavy scenarios but burn more tokens and wall-clock time, which causes them to fail temporal-constraint scenarios.
- Budget scaling curves plateau. Spending 10x more compute buys diminishing returns, typically a single-digit percentage improvement past a threshold.
2. Why it matters
2.1. GAIA was retired for a reason
Classical GAIA (2023) was a static knowledge-and-tools benchmark. It peaked. Top agents saturated it. Continuing to quote "X% on GAIA" stopped being informative. Gaia2 replaces it with a harder target that hasn't been saturated and won't be soon, because async behavior is genuinely hard.
2.2. It tests the skill gap that matters most in production
Every agent team that ships something real hits the same wall: benchmark scores don't predict production behavior. The usual reason: benchmarks are static, but production is async. Gaia2 is the first benchmark that closes that gap at scale.
2.3. It exposes the reasoning-vs-latency tradeoff
The "stronger reasoning hurts efficiency" finding has direct product implications:
- You can't pick "the smartest model" and call it done.
- Router systems (like GPT-5's) become essential: a fast path for latency-critical scenarios, a deep path for correctness-critical scenarios (a minimal routing sketch follows this list).
- Multi-agent architectures where a fast agent triages and a smart agent handles hard cases will look better on Gaia2 than a single-model baseline.
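Here's a minimal sketch of that routing idea. Task, FAST_MODEL, DEEP_MODEL, and the threshold are all assumptions for illustration, not any vendor's actual router; the point is that deadline-awareness has to be an explicit input to model selection.

from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    deadline_s: float | None  # None means no hard deadline

FAST_MODEL = "small-fast-model"      # hypothetical endpoints; swap in your own clients
DEEP_MODEL = "large-reasoning-model"

def route(task: Task) -> str:
    # Tight window: a fast, good-enough answer beats a brilliant late one.
    if task.deadline_s is not None and task.deadline_s < 60:
        return FAST_MODEL
    # Otherwise spend the tokens on deep reasoning.
    return DEEP_MODEL

print(route(Task("reply to the manager's status request", deadline_s=45)))      # -> small-fast-model
print(route(Task("plan next week's travel and book the flights", deadline_s=None)))  # -> large-reasoning-model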
2.4. It's open. You can run it.
Because ARE is open source, your team can:
- Run Gaia2 in CI against your agent before shipping
- Author custom scenarios that mirror your actual product (e.g., "the CRM-integrated support agent scenario")
- Use the async primitives to build your own private benchmarks
3. How to do it
3.1. Run Gaia2 on your own agent
git clone https://github.com/facebookresearch/meta-agents-research-environments
cd meta-agents-research-environments
pip install -e .
huggingface-cli download meta-agents-research-environments/gaia2 --repo-type dataset
Minimal runner script:
from meta_are import Environment, Runner
from meta_are.benchmarks import gaia2
env = Environment.load("gaia2") # mobile + apps
agent = YourAgent() # whatever wrapper you want
results = Runner(env).run(agent, scenarios=gaia2.all_scenarios())
print(f"Pass rate: {results.pass_rate:.1%}")
print(f"Median latency per scenario: {results.median_latency_ms} ms")
print(f"Missed-deadline rate: {results.deadline_miss_rate:.1%}")
Plug in any agent that implements the simple step(observation) -> action interface. The framework handles async mechanics (clock advancement, event injection).
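For reference, here's what a minimal wrapper could look like. It's a sketch under the same illustrative API as the runner snippet above; call_llm and parse_action are placeholders for your own model client and tool-call parser, not ARE functions.

class YourAgent:
    # Minimal step(observation) -> action wrapper (illustrative).
    # The runner calls step() once per turn; memory and tool parsing are yours.
    def __init__(self, system_prompt="You are an assistant operating a simulated phone."):
        self.system_prompt = system_prompt
        self.history = []  # keep the dialogue so the agent has memory across steps

    def step(self, observation):
        self.history.append({"role": "user", "content": str(observation)})
        reply = call_llm(self.system_prompt, self.history)  # placeholder LLM call
        self.history.append({"role": "assistant", "content": reply})
        return parse_action(reply)  # placeholder: map model text to a tool call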
3.2. Write a custom async scenario
Scenarios are declarative. Here's an example:
scenario_id: support_escalation_001
description: >
  User reports a billing issue. A manager requests a status update 5 minutes in.
  The agent must continue the investigation AND provide the manager update without losing the thread.
environment:
  apps: [email, slack, ticketing]
  initial_state:
    email:
      inbox: [{from: [email protected], subj: "Charged twice", body: "..."}]
    slack:
      channels: ["#support-internal"]
events:
  - at: "+00:05:00"
    type: slack_message
    channel: "#support-internal"
    from: manager
    body: "What's the status on that billing ticket?"
success_criteria:
  - agent.sent_email_containing("refund")
  - agent.replied_in_slack_within("00:01:00")
  - ticket.status == "resolved"
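To run it, you'd point the same (illustrative) runner from section 3.1 at your scenario file. The load_scenarios helper below is an assumption about how custom scenarios get picked up, not confirmed ARE API.

from meta_are import Environment, Runner
from meta_are.scenarios import load_scenarios  # assumed helper, not confirmed API

env = Environment.load("mobile")  # same simulated phone + apps as Gaia2
custom = load_scenarios("scenarios/support_escalation_001.yaml")
results = Runner(env).run(YourAgent(), scenarios=custom)
print(f"Pass rate: {results.pass_rate:.1%}")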
3.3. The key engineering patterns Gaia2 rewards
Meta's data from running many architectures against Gaia2 highlights patterns that score well.
Re-observe after every action. Don't cache state across steps. The world changes.
# Bad: plan once against a snapshot of the world, then execute blindly
state = observe()
plan = make_plan(state)
for step in plan:
    execute(step)

# Good: re-observe before each step and replan if the world has changed
state = observe()
plan = make_plan(state)
while plan:
    current_state = observe()             # re-check the world
    if not plan_still_valid(plan, current_state):
        plan = replan(current_state)      # adapt to what changed
    step, plan = plan[0], plan[1:]
    execute(step)
Budget time, not tokens. Gaia2 scores you on wall-clock time as much as on correctness. Set explicit timeouts.
import asyncio

async def bounded_think(task, deadline_ms):
    # Spend at most deadline_ms on deep reasoning; if the clock runs out,
    # return a cheaper answer rather than blowing the scenario's deadline.
    try:
        return await asyncio.wait_for(
            deep_think(task), timeout=deadline_ms / 1000
        )
    except asyncio.TimeoutError:
        return await fast_fallback(task)  # cheap approximation
Handle mid-task events gracefully. Use a listener pattern that can interrupt the agent to inject new info.
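A sketch of that pattern with asyncio: the environment (or a background watcher) pushes events onto a queue, and the agent drains the queue between plan steps instead of blocking until the whole plan finishes. The plan steps and the injected event are made up for the example.

import asyncio

async def worker(queue):
    # Drive the plan, but handle any injected events before each step.
    for step in ["read_ticket", "check_billing_history", "draft_refund_email"]:
        while not queue.empty():
            event = queue.get_nowait()
            print(f"interrupt: {event!r} -> respond, then resume the plan")
        print(f"executing: {step}")
        await asyncio.sleep(0.1)  # stand-in for a real tool call

async def listener(queue):
    # Stand-in for the environment injecting a mid-task event.
    await asyncio.sleep(0.15)
    await queue.put("manager asks for a status update")

async def main():
    queue = asyncio.Queue()
    await asyncio.gather(worker(queue), listener(queue))

asyncio.run(main())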
Split work across collaborating agents when the scenario allows: a dedicated "monitor" agent tails events while a "worker" agent drives the plan.
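A sketch of that split: a cheap monitor coroutine tails the raw event stream and forwards only what matters, while the worker keeps driving the plan and checks for alerts between steps. The keyword filter stands in for whatever cheap classifier or small model you'd actually use.

import asyncio

async def monitor_agent(events, alerts):
    # Cheap "monitor": watch everything, forward only what the worker must react to.
    while True:
        event = await events.get()
        if "manager" in event or "deadline" in event:  # stand-in for a cheap classifier
            await alerts.put(event)

async def worker_agent(alerts):
    # "Worker": drives the plan, consulting the alert channel between steps.
    for step in ["investigate", "draft_reply", "resolve_ticket"]:
        if not alerts.empty():
            print("reacting to:", alerts.get_nowait())
        print("doing:", step)
        await asyncio.sleep(0.1)

async def main():
    events, alerts = asyncio.Queue(), asyncio.Queue()
    monitor = asyncio.create_task(monitor_agent(events, alerts))
    await events.put("newsletter arrived")               # monitor ignores this
    await events.put("manager: status update please?")   # monitor forwards this
    await worker_agent(alerts)
    monitor.cancel()

asyncio.run(main())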
3.4. What to measure, not just pass/fail
Gaia2's richer metric surface is itself a lesson. Track these on your own benchmarks (a small sketch for computing them follows the list):
- Deadline-miss rate: did the agent respond in time?
- Stale-state-action rate: did the agent act on info that had changed?
- Token-per-success: how efficient is each resolution?
- Interrupt-handling accuracy: when mid-task events arrived, did the agent adapt?
- Multi-agent coordination: if there was a peer, did the pair outperform either alone?
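A minimal sketch of rolling those up from per-scenario logs. The ScenarioResult fields are assumptions about what your own harness records, not Gaia2's result schema.

from dataclasses import dataclass

@dataclass
class ScenarioResult:
    passed: bool
    deadline_missed: bool
    acted_on_stale_state: bool
    tokens_used: int

def summarize(results):
    n = len(results)
    successes = [r for r in results if r.passed]
    return {
        "pass_rate": len(successes) / n,
        "deadline_miss_rate": sum(r.deadline_missed for r in results) / n,
        "stale_state_action_rate": sum(r.acted_on_stale_state for r in results) / n,
        # efficiency: total spend divided by the number of resolved scenarios
        "tokens_per_success": sum(r.tokens_used for r in results) / max(len(successes), 1),
    }

print(summarize([ScenarioResult(True, False, False, 12_000),
                 ScenarioResult(False, True, True, 30_000)]))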
4. When to use Gaia2 vs. alternatives
| If you're building | Benchmark to use |
|---|---|
| Code-edit agent | SWE-bench Verified |
| Research/deep-question agent | Original GAIA (legacy) or Gaia2 research slice |
| Production assistant / ops / scheduling agent | Gaia2 |
| Web-navigation agent | WebArena, VisualWebArena, Gaia2 web slice |
| Multi-agent coordination | Gaia2 collaboration slice |
| Long-context retrieval agent | LOFT, Needle-in-Haystack suites |
For most teams shipping agents into real products, Gaia2 should be your primary eval, with domain-specific slices layered on top.
5. Key takeaways
- Async is not static. Production agents face clocks, interruptions, and peer agents. Your benchmark must too.
- No single model wins everything. Pareto frontiers, not leaderboards.
- Smarter models miss deadlines. Design routing or fallback explicitly.
- Re-observe between actions. State caching is the hidden source of most async failures.
- Use ARE to build your own scenarios. Authoring the ones that match your product is the highest-ROI eval work.