Magentic-UI: Human-Centered Browser Agents That Ship
White Paper

Jake McCluskey

Source: Magentic-UI, Microsoft Research (Microsoft, 2025)
Series: The 10 Agent Whitepapers Every Builder Should Read

TL;DR

Magentic-UI is Microsoft's open-source (MIT) human-centered web agent: a browser-driving agent that treats the human not as a user to automate around but as a collaborator who steers it. Built on AutoGen's Magentic-One multi-agent framework, it uses four specialized agents (Orchestrator, WebSurfer, Coder, FileSurfer) and adds four human-in-the-loop primitives: co-planning, co-tasking, action guards, and plan learning. The result: on GAIA, autonomous Magentic-UI hits 30.3%; with a simulated user offering side-information, it hits 51.9%. That is a 71% relative improvement ((51.9 - 30.3) / 30.3) from a human in the loop who is pinged on only about 10% of tasks.

The philosophical claim: transparent, efficiently-steerable agents outperform fully autonomous ones at the current frontier. If you're building web automation, this is the UX pattern that works today.

1. What it is

1.1. The multi-agent architecture

Magentic-UI inherits its structure from Magentic-One, AutoGen's production multi-agent pattern. Four specialists:

Agent | Role
Orchestrator | Lead. Plans with user, decides when to ask for feedback, delegates to the others, replans on failure.
WebSurfer | LLM with a browser. Clicks, types, scrolls, navigates pages.
Coder | LLM with a Docker-sandboxed shell. Runs Python and bash.
FileSurfer | Reads/converts documents via MarkItDown. Returns plain text for the Orchestrator.

The Orchestrator owns the task plan, chooses the next step, and picks the specialist for each step. If a step fails, it can replan rather than halt.
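
The plan, delegate, replan loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Magentic-UI's actual code: `Step`, `run_plan`, the `replan` callback, and the `max_replans` budget are all hypothetical names, and each specialist is modeled as a plain function that returns a success flag and some output.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    description: str
    agent: str  # name of the specialist that should run this step


# Hypothetical specialist interface: description in, (success, output) out.
Specialist = Callable[[str], Tuple[bool, str]]


def run_plan(
    plan: List[Step],
    specialists: Dict[str, Specialist],
    replan: Callable[[List[Step], Step], List[Step]],
    max_replans: int = 2,
) -> List[str]:
    """Run steps in order; when one fails, ask `replan` for new remaining steps."""
    log: List[str] = []
    queue = list(plan)
    replans = 0
    while queue:
        step = queue.pop(0)
        ok, _output = specialists[step.agent](step.description)
        log.append(f"{step.agent}: {step.description} -> {'ok' if ok else 'FAILED'}")
        if not ok:
            if replans >= max_replans:
                log.append("giving up: replan budget exhausted")
                break
            replans += 1
            # The orchestrator replaces the remaining steps instead of halting.
            queue = replan(queue, step)
    return log
```

A `replan` callback can be as simple as re-queuing the failed step ahead of the remaining ones; a real orchestrator would presumably ask the model for a revised plan instead.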

1.2. The four human-steering primitives

This is the novel contribution: efficient human collaboration surfaces that don't drown the user in confirmations.

1.2.1. Co-planning

Before execution, the Orchestrator proposes a step-by-step plan and shows it to the user. The user can:

  • Edit steps directly in a plan editor
  • Delete steps
  • Regenerate specific steps
  • Add steps via freeform text

Nothing runs until the user accepts the plan, either explicitly or implicitly by starting the run. This front-loads the cost of alignment.
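
One way to model this gate is a plan object whose acceptance flag is cleared by every change. The sketch below is hypothetical (`Plan` and `execute` are illustrative names, not Magentic-UI classes), and it leaves out step regeneration, which would need a model call.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Plan:
    steps: List[str]
    accepted: bool = False

    def edit(self, index: int, text: str) -> None:
        self.steps[index] = text
        self.accepted = False  # any change needs fresh acceptance

    def delete(self, index: int) -> None:
        del self.steps[index]
        self.accepted = False

    def add(self, text: str, index: Optional[int] = None) -> None:
        position = len(self.steps) if index is None else index
        self.steps.insert(position, text)
        self.accepted = False

    def accept(self) -> None:
        self.accepted = True


def execute(plan: Plan) -> List[str]:
    """Refuse to run anything until the user has accepted the current plan."""
    if not plan.accepted:
        raise PermissionError("plan has not been accepted by the user")
    return [f"running: {step}" for step in plan.steps]
```

Resetting `accepted` on every mutation is the design choice that matters here: the plan that runs is always the plan the user last saw.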

1.2.2. Co-tasking

During execution, the agent shows what it's about to do, in real time:

"About to click the blue 'Submit' button on checkout page."

The user can take over the browser at any point, do something manually, then hand control back. This is the "driving school dual pedals" model: the agent drives, but the human can intervene instantly.

1.2.3. Action guards

Before potentially irreversible actions, the agent requests approval with Approve/Reject buttons. What counts as irreversible is configurable:

  • Closing a tab
  • Submitting a form
  • Clicking a button with side effects (pay, send, delete)
  • Navigating to a domain not on the allowlist

Users set the policy once; the agent enforces it every run.
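
A configurable guard of this kind might look like the sketch below. It is an assumption-laden illustration, not Magentic-UI's implementation: `Action`, `GuardPolicy`, and `guarded_execute` are hypothetical names, and matching risky buttons by label words is a deliberately naive heuristic.

```python
from dataclasses import dataclass
from typing import Callable
from urllib.parse import urlparse


@dataclass(frozen=True)
class Action:
    kind: str         # e.g. "click", "submit_form", "close_tab", "navigate"
    target: str = ""  # button label for clicks, URL for navigation


@dataclass
class GuardPolicy:
    guarded_kinds: frozenset = frozenset({"close_tab", "submit_form"})
    risky_labels: frozenset = frozenset({"pay", "send", "delete"})
    allowed_domains: frozenset = frozenset()

    def needs_approval(self, action: Action) -> bool:
        if action.kind in self.guarded_kinds:
            return True
        if action.kind == "click":
            # Flag buttons whose label contains a risky word.
            return any(w in self.risky_labels for w in action.target.lower().split())
        if action.kind == "navigate":
            host = urlparse(action.target).hostname or ""
            return host not in self.allowed_domains
        return False


def guarded_execute(
    action: Action,
    policy: GuardPolicy,
    ask: Callable[[Action], bool],
    perform: Callable[[Action], str],
) -> str:
    """Run `perform` only if the policy allows it or the user approves."""
    if policy.needs_approval(action) and not ask(action):
        return "rejected"
    return perform(action)
```

In a UI, `ask` would render the Approve/Reject buttons; here it is any callable that returns True or False.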

1.2.4. Plan learning

After a successful task, the user can request: "Save this as a reusable plan." The agent distills the trace into a step-by-step template stored in a gallery. Recall a saved plan later and the agent executes it 3x faster than generating a new plan, using AutoGen's Task-Centric Memory.

1.3. Transparency features

  • Real-time progress display: every intermediate step is shown
  • Pre-execution plan visibility: nothing is hidden
  • Action transparency: the agent narrates what it's seeing on each page
  • Pause/interrupt: the user can stop mid-run and inject new instructions
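
The pause/interrupt behavior can be approximated with a stop flag and an instruction queue shared between the UI and the agent loop. This is a minimal sketch with hypothetical names (`RunControl`, `run_steps`); "doing" a step is reduced to recording it.

```python
import queue
import threading
from typing import Callable, List, Optional


class RunControl:
    """State shared between the UI and the agent loop."""

    def __init__(self) -> None:
        self.stop = threading.Event()  # set by the stop button
        self.injected: "queue.Queue[str]" = queue.Queue()  # new user instructions


def run_steps(
    steps: List[str],
    control: RunControl,
    on_step: Optional[Callable[[str, RunControl], None]] = None,
) -> List[str]:
    pending = list(steps)
    done: List[str] = []
    while not control.stop.is_set():  # checked before every step
        fresh: List[str] = []
        while True:
            try:
                fresh.append(control.injected.get_nowait())
            except queue.Empty:
                break
        pending = fresh + pending  # user instructions run next, in order
        if not pending:
            break
        step = pending.pop(0)
        done.append(step)  # stand-in for actually performing the step
        if on_step:
            on_step(step, control)  # e.g. stream progress to the UI
    return done
```

Checking the flag between steps, not inside them, keeps the loop simple; a production agent would also need a way to cancel a step that is already in flight.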

1.4. Safety mechanisms

Magentic-UI bakes in defenses you'd otherwise bolt on:

  • Allow-listing: users pre-approve sites; going to a non-allowed site requires explicit approval
  • Docker sandbox: browser runs in a container with no credentials; the Coder agent runs in a separate container
  • Anytime interruption: stop button is always live
  • Red-team-tested: evaluated against cross-site prompt injection and phishing scenarios; in the documented cases the agent refuses, pauses for the user, or fails safely because the sandbox contains the action

1.5. Benchmark results (GAIA validation, 162 tasks)

Configuration | Completion rate | Help requests
Magentic-UI autonomous | 30.3% | 0
Magentic-UI plus simulated user (side-info) | 51.9% | 10% of tasks, 1.1 requests on average
Magentic-UI plus simulated user (o4-mini) | 42.6% | 4.3% of tasks

The takeaway is the ratio: help requested on about 10% of tasks buys a 71% relative improvement in completion rate (30.3% to 51.9%). Small amounts of human input, placed well, produce big jumps.
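
For readers who want to check the arithmetic, the relative gain is the difference in completion rate divided by the autonomous baseline. The snippet below only restates the table's numbers; `relative_gain` is an illustrative helper.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100


autonomous = 30.3      # completion rates in percent, taken from the table
side_info_user = 51.9
o4_mini_user = 42.6

print(f"side-info user: +{side_info_user - autonomous:.1f} points, "
      f"{relative_gain(autonomous, side_info_user):.0f}% relative")
print(f"o4-mini user:   +{o4_mini_user - autonomous:.1f} points, "
      f"{relative_gain(autonomous, o4_mini_user):.0f}% relative")
```

The first line reproduces the 71% figure (21.6 points absolute); by the same formula the o4-mini simulated user gives roughly a 41% relative gain.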

2. Why it matters

2.1. Fully autonomous is the wrong target for 2025 web agents

At the current frontier, autonomous web agents hit a wall around 30% on GAIA (30.3% for autonomous Magentic-UI). Much of the remaining 70% is less a model-quality problem than a context and authority problem:

  • The agent doesn't know which of several reasonable paths the user wants
  • The agent can't authenticate (correctly refuses to enter passwords)
  • The agent encounters ambiguity it can't resolve from the task text alone

A human in the loop resolves all of these cheaply: help on about 10% of tasks yields a 71% relative boost.

2.2. The steering UX is the product

Magentic-UI's novelty is not the agent; it's the human interface to the agent. Co-planning, co-tasking, action guards, and plan learning are UX primitives that become the actual product. Teams that build great steering UX ship working agents; teams that don't ship demos.

2.3. Saved plans are a memory primitive

The "3x faster recall" finding points at a simple optimization: turn successful runs into reusable templates. It is the same pattern as cached execution plans in databases. For any agent with repeated task shapes (onboarding, data entry, report generation), this is low effort and high payoff.

2.4. Multi-agent plus sandboxing plus allow-list is a working safety stack

Magentic-UI is the clearest public example of a defense-in-depth agent safety stack:

  • Least privilege per agent (the browser agent cannot run code; the coder cannot navigate)
  • Docker isolation (no host-level access)
  • Allow-list for network destinations
  • Action guards for irreversible operations
  • Human pause/interrupt at any time

Ship your browser agent without this stack at your peril.
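
The least-privilege layer of this stack can be expressed as a capability map checked on every tool call. The sketch below is hypothetical: the agent and tool names are illustrative, and Magentic-UI enforces the separation through its own agent and container design, not through this code.

```python
from typing import Callable, Dict, FrozenSet

# Hypothetical capability map mirroring the split described above:
# the browser agent cannot run code, and the coder cannot navigate.
CAPABILITIES: Dict[str, FrozenSet[str]] = {
    "web_surfer": frozenset({"navigate", "click", "type", "scroll"}),
    "coder": frozenset({"run_python", "run_bash"}),
    "file_surfer": frozenset({"read_file", "convert_file"}),
}


class CapabilityError(PermissionError):
    """Raised when an agent requests a tool outside its capability set."""


def dispatch(agent: str, tool: str, tools: Dict[str, Callable[..., str]], *args) -> str:
    """Call `tool` on behalf of `agent` only if the capability map allows it."""
    if tool not in CAPABILITIES.get(agent, frozenset()):
        raise CapabilityError(f"{agent} is not allowed to call {tool}")
    return tools[tool](*args)
```

Unknown agents get an empty capability set, so the default is deny.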

3. How to do it

3.1. Run Magentic-UI

It's MIT-licensed and on GitHub.

git clone https://github.com/microsoft/Magentic-UI
cd Magentic-UI
pip install -e .
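# set your model key
export OPENAI_API_KEY=...   # or ANTHROPIC_API_KEY / AZURE_OPENAI_API_KEY
# Docker must be running: the browser and the Coder execute in containers
magentic-ui serve --port 8080   # confirm the exact command and flags in the repository README
# open http://localhost:8080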

You can also run it through Azure AI Foundry Labs with Foundry-hosted models.

3.2. The co-planning flow in practice

User types: "Book me a United flight DEN to JFK on April 25, morning, under $500, aisle seat."

The Orchestrator responds with a plan, before doing anything:

1. Navigate to united.com
2. Enter route DEN → JFK, date 2026-04-25
3. Filter to morning departures (before noon)
4. Select cheapest option under $500
5. Confirm aisle seat selection
6. Stop at payment page, PAYMENT REQUIRES YOUR INPUT

The user edits step 4 to "cheapest non-basic-economy under $500" and accepts, and the run starts. Action guards trigger at step 5 (irreversible-ish: pricing can change). Payment is never attempted by the agent.

3.3. Wire up allow-listing for a customer-facing deployment

# magentic-ui.config.yaml
allow_list:
  - united.com
  - delta.com
  - kayak.com
  - "*.google.com"   # quoted: an unquoted leading * is parsed as a YAML alias
deny_on_unknown: true
action_guard_rules:
  - pattern: "click .button--pay"
    require_approval: true
  - pattern: "click .button--submit"
    require_approval: true
  - pattern: "input[type=password]"
    require_approval: true  # will also refuse by default

Every agent your team ships should have a config like this before it touches production.
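
How such an allow-list might be checked at navigation time is sketched below. This is an assumption about matching semantics, not Magentic-UI's actual matcher: bare entries match the exact host only, and `*.` entries match subdomains on a label boundary. `is_allowed` and `ALLOW_LIST` are hypothetical names.

```python
from typing import Iterable
from urllib.parse import urlparse

ALLOW_LIST = ["united.com", "delta.com", "kayak.com", "*.google.com"]


def is_allowed(url: str, allow_list: Iterable[str] = ALLOW_LIST) -> bool:
    """Return True only if the URL's host matches an allow-list entry."""
    host = (urlparse(url).hostname or "").lower()
    if not host:
        return False  # unparseable input is treated as unknown, so denied
    for pattern in allow_list:
        if pattern.startswith("*."):
            # "*.google.com" matches "mail.google.com" but not "evilgoogle.com":
            # the suffix check includes the leading dot.
            if host.endswith(pattern[1:]):
                return True
        elif host == pattern:
            return True
    return False
```

Matching on the parsed hostname, not on the raw URL string, avoids look-alike tricks such as `united.com.evil.io`. Under these semantics `www.united.com` would need its own entry or a `*.united.com` pattern.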

3.4. Turn a successful run into a saved plan

  1. Run a task to completion (e.g., "Onboard new hire John Smith in Okta, AWS, and GitHub")
  2. Click "Save as reusable plan"
  3. The agent extracts the generalized template ("Onboard new hire {name} in {services}")
  4. Next time you make a similar request, the agent recalls the template and executes 3x faster

Internally, saved plans are stored via AutoGen's Task-Centric Memory abstraction. Retrieval is by task similarity, not keyword match.

3.5. Build your own Magentic-UI-style UX on Claude

You don't have to use Magentic-UI literally. The pattern ports to the Claude Agent SDK:

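# Sketch only: "browser_click" and looks_irreversible() are hypothetical
# stand-ins for your own browser tool and risk policy.
from claude_agent_sdk import ClaudeAgentOptions, ClaudeSDKClient, HookMatcher

RISKY_WORDS = ("pay", "submit", "delete", "send")

def looks_irreversible(args):
    # Naive heuristic: flag selectors that mention a risky verb.
    return any(word in args.get("selector", "").lower() for word in RISKY_WORDS)

async def pre_tool_hook(input_data, tool_use_id, context):
    tool, args = input_data["tool_name"], input_data["tool_input"]
    if tool == "browser_click" and looks_irreversible(args):
        ok = input(f"Approve click {args.get('selector')}? [y/N] ")
        if ok.strip().lower() != "y":
            return {"hookSpecificOutput": {
                "hookEventName": "PreToolUse",
                "permissionDecision": "deny",
                "permissionDecisionReason": "Denied by user",
            }}
    return {}  # empty result: let the tool call proceed

def plan_preview(plan_text):
    print("Proposed plan:\n", plan_text)
    return input("Accept? [y/edit/n] ")

options = ClaudeAgentOptions(
    model="claude-opus-4-7",
    permission_mode="default",  # standard permission prompts
    hooks={"PreToolUse": [HookMatcher(matcher=None, hooks=[pre_tool_hook])]},  # None: all tools
)
# Pass these to ClaudeSDKClient(options=options) to start a session.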

Add a front-end that renders the plan, streams the action log, and surfaces the approval prompts. You've rebuilt the Magentic-UI steering pattern in about 200 lines of UI code.

3.6. Benchmarks to target

If you're building web agents, aim for:

  • GAIA (legacy): at least 35% autonomous is competitive
  • Gaia2: the successor; treat as your primary benchmark
  • WebArena: static web-navigation with ground-truth actions
  • Real customer tasks: the only benchmark that actually matters

Gate your rollouts on all four.

4. Design patterns to steal

4.1. Plan-first, execute-second

Never start execution on a multi-step task without surfacing the plan. It's cheap (one extra model call) and catches alignment errors before they become side effects.

4.2. Narrate actions

"About to click X" or "about to send Y." It feels verbose, and in practice users read the first few narrations and then zone out until something important comes up. But when they do look, they catch mistakes.

4.3. Allow-list, don't deny-list

Allow-list the domains you're willing to act on. You can't enumerate every malicious site; you can enumerate every legitimate one.

4.4. Sandbox per capability

One container for the browser. A separate container for code execution. A separate container for file I/O. Compromise of one doesn't propagate.

4.5. Pause is a first-class button

Every agent-running UI should have a giant stop button that's impossible to miss. Users will trust the agent more when they know they can always stop it.

4.6. Post-run reflection

"Would you like to save this as a reusable plan?" Offer it after every successful run. You're building a library of templates for free.

5. Key takeaways

  • "Human in the loop" is the highest-leverage design decision in 2025 web agents. 10% help, 71% boost.
  • UX primitives beat better models. Co-planning, action guards, plan learning.
  • Multi-agent plus sandboxing plus allow-list is the working safety stack. Copy it.
  • Saved plans are essentially free memoization. Turn every win into a template.
  • Don't ship autonomous web agents. Ship steerable ones.

Common questions

What completion rate improvement does Magentic-UI achieve when humans provide help compared to fully autonomous operation?

Magentic-UI achieves a 71% relative improvement in completion rate when humans provide input. The autonomous version completes 30.3% of GAIA tasks, while the version with simulated human side-information reaches 51.9%. This improvement comes from humans being pinged on only about 10% of tasks, with an average of 1.1 help requests per task where help is needed.

What are the four human-in-the-loop primitives in Magentic-UI?

The four primitives are co-planning, co-tasking, action guards, and plan learning. Co-planning lets users edit and approve the step-by-step plan before execution starts. Co-tasking allows users to take over the browser at any point during execution. Action guards require approval before potentially irreversible actions like submitting forms or closing tabs. Plan learning saves successful task traces as reusable templates that execute 3 times faster on recall.

How does Magentic-UI implement safety for browser automation?

Magentic-UI uses a defense-in-depth safety stack with five layers. It runs the browser in a Docker container with no credentials. It enforces domain allow-listing where any non-approved site requires explicit user approval. It provides action guards for irreversible operations. It maintains agent separation where the browser agent cannot run code and the coder agent cannot navigate. Users can interrupt execution at any time with an always-available stop button.

What is the architectural difference between the four specialized agents in Magentic-UI?

The Orchestrator acts as lead, creates plans with the user, decides when to request feedback, delegates to specialists, and replans on failures. WebSurfer is an LLM with browser access that clicks, types, scrolls, and navigates pages. Coder is an LLM with a Docker-sandboxed shell that runs Python and bash. FileSurfer reads and converts documents via MarkItDown and returns plain text to the Orchestrator. Each agent has least-privilege access limited to its specific capability.

How much faster does Magentic-UI execute when using a saved plan versus generating a new plan?

Magentic-UI executes saved plans 3 times faster than generating a new plan from scratch. After a successful task completion, users can save the execution trace as a reusable template stored in a gallery. The agent later recalls these templates using AutoGen's Task-Centric Memory, which retrieves by task similarity rather than keyword matching.
