Magentic-UI: Human-Centered Browser Agents That Ship

Source: Magentic-UI, Microsoft Research (Microsoft, 2025)
Series: The 10 Agent Whitepapers Every Builder Should Read
TL;DR
Magentic-UI is Microsoft's open-source (MIT) human-centered web agent: a browser-driving agent that treats the human not as a user to be automated around, but as a collaborator who steers it. Built on AutoGen's Magentic-One multi-agent framework, it uses four specialized agents (Orchestrator, WebSurfer, Coder, FileSurfer) and adds four human-in-the-loop primitives: co-planning, co-tasking, action guards, and plan learning. The result: on GAIA, autonomous Magentic-UI hits 30.3%; with a simulated user offering side information, 51.9%. That's a 71% relative improvement from a human in the loop that's only pinged on about 10% of tasks.
The philosophical claim: transparent, efficiently-steerable agents outperform fully autonomous ones at the current frontier. If you're building web automation, this is the UX pattern that works today.
1. What it is
1.1. The multi-agent architecture
Magentic-UI inherits its structure from Magentic-One, AutoGen's production multi-agent pattern. Four specialists:
| Agent | Role |
|---|---|
| Orchestrator | Lead. Plans with user, decides when to ask for feedback, delegates to the others, replans on failure. |
| WebSurfer | LLM with a browser. Clicks, types, scrolls, navigates pages. |
| Coder | LLM with a Docker-sandboxed shell. Runs Python and bash. |
| FileSurfer | Reads/converts documents via MarkItDown. Returns plain text for the Orchestrator. |
The Orchestrator owns the task plan, chooses the next step, and picks the specialist for each step. If a step fails, it can replan rather than halt.
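The delegation loop can be sketched in a few lines of Python. This is a toy model: the agent names mirror the table above, but the classes and signatures are illustrative, not Magentic-UI's actual API.

```python
from dataclasses import dataclass

@dataclass
class Step:
    description: str
    agent: str  # "websurfer" | "coder" | "filesurfer" (illustrative labels)
    done: bool = False

class Orchestrator:
    """Toy orchestrator: owns the plan, delegates steps, retries on failure."""

    def __init__(self, agents):
        # agents: name -> callable(step) -> (ok: bool, result: str)
        self.agents = agents

    def run(self, plan, max_replans=2):
        results = []
        for step in plan:
            ok, result = self.agents[step.agent](step)
            if not ok and max_replans > 0:
                # On failure, replan rather than halt. Simplified here to
                # a single retry of the same step.
                max_replans -= 1
                ok, result = self.agents[step.agent](step)
            step.done = ok
            results.append(result)
        return results
```

The real Orchestrator replans the remaining steps rather than blindly retrying, but the control flow (own the plan, pick a specialist per step, recover instead of halting) is the same shape.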
1.2. The four human-steering primitives
This is the novel contribution: efficient human-collaboration surfaces that don't drown the user in confirmations.
1.2.1. Co-planning
Before execution, the Orchestrator proposes a step-by-step plan and shows it to the user. The user can:
- Edit steps directly in a plan editor
- Delete steps
- Regenerate specific steps
- Add steps via freeform text
Nothing runs until the user accepts or implicitly accepts by kicking off. This front-loads alignment cost.
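The accept-gate semantics reduce to a small invariant: the plan is an editable object, and execution refuses to start until the user accepts. A minimal sketch (class and method names are hypothetical, not Magentic-UI's API):

```python
class Plan:
    """Editable step list that cannot execute before user acceptance."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.accepted = False

    def edit(self, i, text):
        self.steps[i] = text

    def delete(self, i):
        del self.steps[i]

    def add(self, text):
        self.steps.append(text)

    def accept(self):
        self.accepted = True

    def run(self, execute):
        # The gate: nothing runs until the user accepts.
        if not self.accepted:
            raise RuntimeError("plan not accepted by user")
        return [execute(step) for step in self.steps]
```

Front-loading the alignment cost lives entirely in that one `if`: every edit happens while the plan is still inert.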
1.2.2. Co-tasking
During execution, the agent shows what it's about to do, in real time:
"About to click the blue 'Submit' button on checkout page."
The user can take over the browser at any point, do something manually, then hand control back. This is the "driving school dual pedals" model. The agent drives, the human can intervene instantly.
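The dual-pedals model reduces to a shared control flag the agent checks before each action. A toy threading sketch (names are illustrative, not the actual implementation):

```python
import threading

class Controls:
    """Toy 'dual pedals': the human can seize control at any time; the
    agent checks who holds control before every action."""

    def __init__(self):
        self._human_has_control = threading.Event()

    def human_takes_over(self):
        self._human_has_control.set()

    def human_hands_back(self):
        self._human_has_control.clear()

    def agent_may_act(self):
        return not self._human_has_control.is_set()
```

Because the flag is checked per action rather than per task, handoff is instant in both directions.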
1.2.3. Action guards
Before potentially irreversible actions, the agent requests approval with Approve/Reject buttons. What counts as irreversible is configurable:
- Closing a tab
- Submitting a form
- Clicking a button with side effects (pay, send, delete)
- Navigating to a domain not on the allowlist
Users set the policy once; the agent enforces it every run.
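Conceptually, the guard is a predicate evaluated before every action against a once-configured policy. A hedged sketch using glob patterns (the pattern syntax and action names are assumptions for illustration, not Magentic-UI's config format):

```python
import fnmatch

# Illustrative policy: which actions count as irreversible.
IRREVERSIBLE_PATTERNS = [
    "close_tab",
    "submit_form",
    "click:*pay*",
    "click:*send*",
    "click:*delete*",
]

def needs_approval(action, domain=None, allowlist=("united.com",)):
    """Return True if the agent must show Approve/Reject before acting."""
    # Navigating off the allowlist always requires approval.
    if domain is not None and domain not in allowlist:
        return True
    return any(fnmatch.fnmatch(action, p) for p in IRREVERSIBLE_PATTERNS)
```

The point is that the user writes the policy once; the agent evaluates `needs_approval` on every single action, every run.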
1.2.4. Plan learning
After a successful task, the user can request: "Save this as a reusable plan." The agent distills the trace into a step-by-step template stored in a gallery. Recall a saved plan later and the agent executes it 3x faster than generating a new plan, using AutoGen's Task-Centric Memory.
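The distill-and-recall step can be illustrated with simple string templating. This is a sketch of the behavior described, not AutoGen's Task-Centric Memory implementation:

```python
def distill(trace, params):
    """Replace concrete values in a successful trace with {placeholders}."""
    template = []
    for step in trace:
        for name, value in params.items():
            step = step.replace(value, "{" + name + "}")
        template.append(step)
    return template

def instantiate(template, **kwargs):
    """Fill a saved template with new values for a fresh run."""
    return [step.format(**kwargs) for step in template]
```

The speedup comes from skipping plan generation entirely: recall substitutes new parameters into a proven step sequence instead of reasoning one out from scratch.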
1.3. Transparency features
- Real-time progress display: every intermediate step is shown
- Pre-execution plan visibility: nothing is hidden
- Action transparency: the agent narrates what it's seeing on each page
- Pause/interrupt: the user can stop mid-run and inject new instructions
1.4. Safety mechanisms
Magentic-UI bakes in defenses you'd otherwise bolt on:
- Allow-listing: users pre-approve sites; going to a non-allowed site requires explicit approval
- Docker sandbox: browser runs in a container with no credentials; the Coder agent runs in a separate container
- Anytime interruption: stop button is always live
- Red-team-tested: evaluated against cross-site prompt injection and phishing scenarios; the documented behaviors are to refuse, to pause for the user, or to fail gracefully thanks to sandboxing
1.5. Benchmark results (GAIA validation, 162 tasks)
| Configuration | Completion rate | Help requests |
|---|---|---|
| Magentic-UI autonomous | 30.3% | 0 |
| Magentic-UI plus simulated user (side-info) | 51.9% | 10% of tasks, 1.1 requests on average |
| Magentic-UI plus simulated user (o4-mini) | 42.6% | 4.3% of tasks |
The takeaway is the ratio: help requests on 10% of tasks buy a 71% relative jump in completion. Small amounts of human input, placed well, produce big gains.
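Note that the 71% figure is relative, not absolute, improvement; you can verify it directly from the table:

```python
autonomous, with_user = 30.3, 51.9  # GAIA completion rates from the table
relative_gain = (with_user - autonomous) / autonomous
assert round(relative_gain * 100) == 71  # ~71% relative improvement
```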
2. Why it matters
2.1. Fully autonomous is the wrong target for 2025 web agents
For the current frontier, autonomous web agents hit a wall around 30% on GAIA. The remaining 70% is not a model-quality problem. It's a context and authority problem:
- The agent doesn't know which of several reasonable paths the user wants
- The agent can't authenticate (correctly refuses to enter passwords)
- The agent encounters ambiguity it can't resolve from the task text alone
A human in the loop resolves all of these cheaply. 10% help rate, 71% boost.
2.2. The steering UX is the product
Magentic-UI's novelty is not the agent, it's the human interface to the agent. Co-planning, action guards, plan learning are UX primitives that become the actual product. Teams that build great steering UX ship working agents; teams that don't ship demos.
2.3. Saved plans are a memory primitive
The "3x faster recall" finding points at a simple optimization: turn successful runs into reusable templates. Same pattern as cached execution plans in databases. For any agent with repeated task shapes (onboarding, data-entry, report-generation), this is low-effort-high-payoff.
2.4. Multi-agent plus sandboxing plus allow-list is a working safety stack
Magentic-UI is the clearest public example of a defense-in-depth agent safety stack:
- Least privilege per agent (the browser agent cannot run code; the coder cannot navigate)
- Docker isolation (no host-level access)
- Allow-list for network destinations
- Action guards for irreversible operations
- Human pause/interrupt at any time
Ship your browser agent without this stack at your peril.
3. How to do it
3.1. Run Magentic-UI
It's MIT-licensed and on GitHub.
```shell
git clone https://github.com/microsoft/Magentic-UI
cd Magentic-UI
pip install -e .
# set your model key
export OPENAI_API_KEY=...  # or ANTHROPIC_API_KEY / AZURE_OPENAI_API_KEY
magentic-ui serve --port 8080
# open http://localhost:8080
```
You can also run it through Azure AI Foundry Labs with Foundry-hosted models.
3.2. The co-planning flow in practice
User types: "Book me a United flight DEN to JFK on April 25, morning, under $500, aisle seat."
The Orchestrator responds with a plan, before doing anything:
1. Navigate to united.com
2. Enter route DEN → JFK, date 2026-04-25
3. Filter to morning departures (before noon)
4. Select cheapest option under $500
5. Confirm aisle seat selection
6. Stop at payment page — PAYMENT REQUIRES YOUR INPUT
The user edits step 4 to "cheapest non-basic-economy under $500," accepts, run starts. Action guards trigger at step 5 (irreversible-ish: pricing can change). Payment is never attempted by the agent.
3.3. Wire up allow-listing for a customer-facing deployment
```yaml
# magentic-ui.config.yaml
allow_list:
  - united.com
  - delta.com
  - kayak.com
  - "*.google.com"   # wildcards must be quoted in YAML
deny_on_unknown: true
action_guard_rules:
  - pattern: "click .button--pay"
    require_approval: true
  - pattern: "click .button--submit"
    require_approval: true
  - pattern: "input[type=password]"
    require_approval: true  # the agent also refuses password entry by default
```
Every agent your team ships should have a config like this before it touches production.
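At runtime, enforcement boils down to matching the requested domain against the allow-list before navigating. A sketch assuming the config shape above (the field names mirror the example file, but this loader is hypothetical):

```python
import fnmatch

# Config shape mirrors the YAML example; in practice you'd load it
# from the file with a YAML parser.
config = {
    "allow_list": ["united.com", "delta.com", "kayak.com", "*.google.com"],
    "deny_on_unknown": True,
}

def navigation_allowed(domain, cfg=config):
    """True if the agent may navigate to `domain` without asking the user."""
    if any(fnmatch.fnmatch(domain, pattern) for pattern in cfg["allow_list"]):
        return True
    # Unknown domain: deny (i.e., escalate to the user) if configured to.
    return not cfg["deny_on_unknown"]
```

With `deny_on_unknown: true`, a `False` here means "pause and ask," not "silently proceed."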
3.4. Turn a successful run into a saved plan
- Run a task to completion (e.g., "Onboard new hire John Smith in Okta, AWS, and GitHub")
- Click "Save as reusable plan"
- The agent extracts a generalized template ("Onboard new hire {name} in {services}")
- Next time you ask, the agent recalls the template and executes 3x faster
Internally, saved plans are stored via AutoGen's Task-Centric Memory abstraction. Retrieval by task similarity, not keyword match.
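Similarity-based recall can be approximated with bag-of-words cosine over task descriptions. This illustrates the retrieval idea only; it is not the Task-Centric Memory implementation:

```python
from collections import Counter
import math

def cosine(a, b):
    """Bag-of-words cosine similarity between two task descriptions."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def recall(task, gallery):
    """gallery: {saved_task_description: plan_template}. Returns the plan
    whose saved description is most similar to the new task."""
    best = max(gallery, key=lambda saved: cosine(task, saved))
    return gallery[best]
```

A production system would use embeddings instead of word overlap, but the contract is the same: nearest saved task wins, even when the wording differs.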
3.5. Build your own Magentic-UI-style UX on Claude
You don't have to use Magentic-UI literally. The pattern ports to the Claude Agent SDK:
```python
from claude_agent_sdk import ClaudeSDKClient, ClaudeAgentOptions

def pre_tool_hook(ctx):
    # Action guard: intercept irreversible browser clicks for approval.
    tool, args = ctx["tool_name"], ctx["args"]
    if tool == "browser_click" and looks_irreversible(args):
        ok = input(f"Approve click {args['selector']}? [y/N] ")
        return None if ok.lower() == "y" else {"block": "Denied"}
    return None

def plan_preview(plan_text):
    # Co-planning: surface the plan before any execution starts.
    print("Proposed plan:\n", plan_text)
    return input("Accept? [y/edit/n] ")

options = ClaudeAgentOptions(
    model="claude-opus-4-7",
    permission_mode="default",  # ask before non-edit tools
    hooks={"PreToolUse": [{"matcher": "*", "hooks": [pre_tool_hook]}]},
)
```
Add a front-end that renders the plan, streams the action log, and surfaces the approval prompts. You've rebuilt the Magentic-UI steering pattern in about 200 lines of UI code.
3.6. Benchmarks to target
If you're building web agents, aim for:
- GAIA (legacy): at least 35% autonomous is competitive
- Gaia2: the successor; treat as your primary benchmark
- WebArena: reproducible, self-hosted web environments with programmatic task checks
- Real customer tasks: the only benchmark that actually matters
Gate your rollouts on all of them.
4. Design patterns to steal
4.1. Plan-first, execute-second
Never start execution on a multi-step task without surfacing the plan. It's cheap (one extra model call) and catches alignment errors before they become side effects.
4.2. Narrate actions
"About to click X" or "about to send Y." Feels verbose, but in practice users read the first few and zone out until something important. When they do look, they catch mistakes.
4.3. Allow-list, don't deny-list
Allow-list the domains you're willing to act on. You can't enumerate every malicious site; you can enumerate every legitimate one.
4.4. Sandbox per capability
One container for the browser. A separate container for code execution. A separate container for file I/O. Compromise of one doesn't propagate.
4.5. Pause is a first-class button
Every agent-running UI should have a giant stop button that's impossible to miss. Users will trust the agent more when they know they can always stop it.
4.6. Post-run reflection
"Would you like to save this as a reusable plan?" Offer it after every successful run. You're building a library of templates for free.
5. Key takeaways
- "Human in the loop" is the highest-leverage design decision in 2025 web agents. 10% help, 71% boost.
- UX primitives beat better models. Co-planning, action guards, plan learning.
- Multi-agent plus sandboxing plus allow-list is the working safety stack. Copy it.
- Saved plans are essentially free memoization. Turn every win into a template.
- Don't ship autonomous web agents. Ship steerable ones.