
Computer Use & Browser Agents
Vision-driven OS agents and hybrid browser automation. Anthropic computer use, OpenAI Operator, Stagehand, and the decision rules for when each beats plain deterministic scraping.
What you will learn
The frontier of agentic AI in 2026 is agents that act through interfaces designed for humans — not APIs. Computer-use agents see screenshots and emit clicks. Browser agents drive Chromium with mouse, keyboard, and DOM access. The capability has moved from "novelty demo" to "production-shippable for narrow tasks" in roughly 18 months. This chapter covers the trajectory, the API surface, and the hybrid patterns that beat pure-vision agents on real tasks.
What "Computer Use" Means
A computer-use agent receives screenshots of a virtual display and emits keyboard / mouse / scroll actions. The model must (a) recognize what's on screen, (b) decide what to click or type, (c) handle the next screenshot it gets back. There is no DOM, no API — just pixels in, intents out.
Anthropic Computer Use API
Anthropic exposes computer use as a built-in tool. You provide a virtual display (Anthropic ships a reference Docker image), the model returns computer tool calls with actions, your code executes them and sends back the next screenshot.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # Computer use is a beta feature, so the call goes through client.beta.messages
    response = client.beta.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        tools=[{
            "type": "computer_20250124",
            "name": "computer",
            "display_width_px": 1280,
            "display_height_px": 800,
        }],
        messages=[{
            "role": "user",
            "content": "Open the airline website and find a flight to Tokyo for next Friday.",
        }],
        betas=["computer-use-2025-01-24"],
    )
    # For each tool_use block: execute the action, take a screenshot, send it back as a tool_result
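The "execute action, take screenshot, send back" step can be sketched as two small helpers. This is a minimal sketch, not Anthropic's reference implementation: the `handlers` dictionary and its keys (`click`, `type`, `key`) are hypothetical stand-ins for whatever drives your virtual display, and only a few of the tool's actions are handled.

```python
import base64
from typing import Any, Callable


def execute_action(action: dict[str, Any], handlers: dict[str, Callable[..., None]]) -> None:
    """Dispatch one `computer` tool call to a local handler (handler names are assumptions)."""
    kind = action["action"]
    if kind == "screenshot":
        return  # nothing to execute: a fresh screenshot is sent back after every step
    if kind == "left_click":
        x, y = action["coordinate"]
        handlers["click"](x, y)
    elif kind == "type":
        handlers["type"](action["text"])
    elif kind == "key":
        handlers["key"](action["text"])
    else:
        raise ValueError(f"unhandled action: {kind}")


def tool_result_message(tool_use_id: str, png_bytes: bytes) -> dict[str, Any]:
    """Wrap a PNG screenshot as the user-turn tool_result for the given tool_use id."""
    return {
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": [{
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": base64.b64encode(png_bytes).decode("ascii"),
                },
            }],
        }],
    }
```

In a full loop you would append the assistant turn and this message to `messages`, call the API again, and stop once the response contains no further `tool_use` blocks.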
OpenAI Operator / CUA Model
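OpenAI's Operator (January 2025) is a browser-only agent with no full-OS control. It is built on the Computer-Using Agent (CUA) model, which developers reach through the computer_use_preview built-in tool in the Responses API. Strengths: tight integration with OpenAI's tracing and guardrails. Limitations: Operator itself is web-only and cannot drive desktop apps or arbitrary OS tasks.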
The Hybrid Alternative — Stagehand & Browser Use
Pure-vision computer use is general but expensive. For most browser tasks, the right tool is a hybrid: deterministic Playwright/CDP automation where you can, LLM resolution where you must. Two production frameworks dominate:
Stagehand
Stagehand (from Browserbase) wraps Playwright with four LLM-aware primitives:
- act("click the submit button"): natural-language action.
- extract({schema}): Zod-validated structured extraction from the page.
- observe(): list interactable elements with descriptions.
- agent("complete the checkout"): full agent loop with the above as tools.
    import { Stagehand } from "@browserbasehq/stagehand";
    import { z } from "zod";

    const sh = new Stagehand({ env: "LOCAL" });
    await sh.init();

    // Deterministic where you can: plain Playwright navigation
    await sh.page.goto("https://example.com/products");

    // LLM resolution where you must: natural-language action
    await sh.page.act("click the 'Show prices in USD' toggle");

    // Schema-validated extraction (Zod checks the LLM output)
    const products = await sh.page.extract({
      instruction: "all products with name and price",
      schema: z.object({
        products: z.array(z.object({ name: z.string(), price: z.number() })),
      }),
    });

    await sh.close();
Browser Use
Browser Use takes a different tack: a Python framework that gives the LLM a high-level browser API (open URL, click element by index, type, screenshot), and ships with a managed "Browser Use Cloud" for stealth-browser execution at scale. Popular for scraping, RPA, and "give Claude a Chrome" use cases.
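The "deterministic where you can, LLM resolution where you must" idea behind these frameworks can be sketched independently of either one. The snippet below is an illustration under stated assumptions, not Stagehand's or Browser Use's actual mechanism: `page` is any object with a Playwright-style `click(selector)` method, and `resolve_selector` is a hypothetical callable that asks an LLM to turn a description into a selector.

```python
from typing import Callable, Protocol


class Page(Protocol):
    """Anything with a Playwright-style click method."""

    def click(self, selector: str) -> None: ...


def click_with_fallback(
    page: Page,
    selector: str,
    description: str,
    resolve_selector: Callable[[str], str],
) -> str:
    """Try the known selector first; call the (expensive) LLM resolver only on failure."""
    try:
        page.click(selector)
        return "deterministic"
    except Exception:
        # With Playwright you would catch its TimeoutError here rather than
        # every exception; the broad catch keeps this sketch library-free.
        new_selector = resolve_selector(description)
        page.click(new_selector)
        return "llm"
```

The return value records which path was taken, which makes it easy to measure how often a flow still needs the LLM.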
When Each Approach Wins
| Approach | Best for | Cost / latency | Reliability |
|---|---|---|---|
| Computer use (Anthropic / Operator) | Desktop apps, novel UIs, screen-only software, end-to-end multi-app flows | High (vision tokens, slow) | ~60% on OSWorld; flaky on novel layouts |
| Stagehand (hybrid) | Structured web extraction, e-commerce, form fills, anything with a stable DOM | Medium (LLM only when needed) | High — deterministic core + LLM resolution |
| Browser Use | "Give the agent a browser" general tasks; scraping at scale | Medium-high | Better than pure vision; weaker than pure deterministic |
| Plain Playwright + LLM extract | Known sites, known flows, you control the selectors | Lowest | Highest — same as any deterministic test |
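The table can be read as a rough decision rule. The function below is one possible encoding, offered as a simplification: the `Task` fields and the order of the checks are assumptions made for illustration, and real tasks often mix approaches.

```python
from dataclasses import dataclass


@dataclass
class Task:
    needs_desktop: bool  # desktop apps or screen-only software are involved
    known_flow: bool     # known site, known flow, selectors you control
    stable_dom: bool     # the page structure is stable enough to lean on


def choose_approach(task: Task) -> str:
    """Pick the cheapest approach from the table that can still do the job."""
    if task.needs_desktop:
        return "Computer use"
    if task.known_flow:
        return "Plain Playwright + LLM extract"
    if task.stable_dom:
        return "Stagehand (hybrid)"
    return "Browser Use"
```

For example, `choose_approach(Task(needs_desktop=False, known_flow=False, stable_dom=True))` selects the Stagehand row.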
Production Concerns Specific to Computer Use
Browser sandboxing
An agent driving a real Chrome on your machine is a security disaster waiting to happen. Production stacks use disposable browsers in containers (Browserbase, Hyperbrowser, anchor.so). Each session is ephemeral, network egress is controlled, and cookies and credentials are not persisted beyond the session.
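Two of these properties, controlled egress and no state surviving the session, can be illustrated in a few lines. This is only a policy-level sketch with hypothetical helper names; real enforcement belongs at the container or network layer (firewall rules, an egress proxy), not in application code.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator
from urllib.parse import urlparse


def egress_allowed(url: str, allowed_hosts: set[str]) -> bool:
    """True if the URL's host is an allowed host or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in allowed_hosts)


@contextmanager
def ephemeral_profile() -> Iterator[Path]:
    """Yield a throwaway browser profile directory, deleted when the session ends."""
    with tempfile.TemporaryDirectory(prefix="agent-profile-") as path:
        yield Path(path)
```

A session would launch its browser with the yielded directory as the profile path and check `egress_allowed` before each navigation.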
CAPTCHA & anti-bot
Most production sites detect headless / agent-driven browsers. Mitigation strategies:
- Stealth browsers — patched Chromium that defeats common fingerprinting (Browserbase, Browser Use Cloud, anchor).
- Residential proxies — rotating IPs that look like consumer traffic.
- CAPTCHA-handoff — when blocked, surface to a human; don't try to defeat anti-abuse measures.
- Cookie injection — for authenticated tasks, inject a logged-in session rather than driving the login flow.
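The CAPTCHA-handoff item above can be sketched as a check that runs after every page load. The marker strings and the `HandoffQueue` class are assumptions for illustration; a production system would use stronger signals (HTTP status codes, known challenge URLs) and a real review queue.

```python
from dataclasses import dataclass, field

# Assumed, non-exhaustive text markers of an anti-bot challenge page.
BLOCK_MARKERS = ("captcha", "verify you are human", "unusual traffic")


def looks_blocked(page_text: str) -> bool:
    """Heuristic: does the visible page text look like a challenge page?"""
    text = page_text.lower()
    return any(marker in text for marker in BLOCK_MARKERS)


@dataclass
class HandoffQueue:
    """Hypothetical in-memory queue of URLs waiting for a human."""

    pending: list[str] = field(default_factory=list)

    def escalate(self, url: str) -> None:
        self.pending.append(url)


def step_or_handoff(url: str, page_text: str, queue: HandoffQueue) -> str:
    """Pause and escalate when blocked instead of trying to defeat the challenge."""
    if looks_blocked(page_text):
        queue.escalate(url)
        return "handoff"
    return "continue"
```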
Cost economics
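Computer use is the most token-expensive agent class. A single screenshot to Sonnet 4.5 is roughly 1500-2500 vision tokens, so a 30-step task spends roughly 45k-75k input tokens on screenshots alone, and more if earlier screenshots are resent with the conversation history on each turn. Mitigations: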
- Resize screenshots — 1280×800 is plenty for most pages; 1920×1080 doubles the cost.
- Cache navigation — for repeat tasks, save successful trajectories and replay deterministically.
- Hybrid where possible — drop to Playwright for the parts of the flow you've stabilized.
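A back-of-envelope estimator makes the resolution trade-off concrete. It assumes the width × height / 750 approximation from Anthropic's vision documentation and ignores text tokens, tool-definition overhead, and any server-side resizing of large images, so treat the results as rough lower bounds rather than billing figures.

```python
def screenshot_tokens(width: int, height: int) -> int:
    """Approximate vision tokens for one screenshot (assumed: pixels / 750)."""
    return round(width * height / 750)


def task_input_tokens(width: int, height: int, steps: int, resend_history: bool = False) -> int:
    """Screenshot tokens for a whole task.

    With resend_history=True, turn k carries all k screenshots taken so far,
    so the total is per_shot * (1 + 2 + ... + steps).
    """
    per_shot = screenshot_tokens(width, height)
    if resend_history:
        return per_shot * steps * (steps + 1) // 2
    return per_shot * steps
```

Under this approximation a 1280×800 screenshot costs about 1,365 tokens and a 1920×1080 one about 2,765, which matches the "doubles the cost" rule of thumb. The `resend_history` case shows why pruning old screenshots matters: the total grows quadratically with the number of steps.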
Trust and human approval
An agent that books flights or sends payments must have explicit per-action approval gates. Anthropic and OpenAI both ship "operator-style" approval modes; LangGraph supports human-in-the-loop checkpoints. The pattern: the agent proposes the action, the UI shows the user what is about to happen, and the user confirms before execution.
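The propose, show, confirm pattern can be sketched as a thin wrapper around action execution. This is a framework-independent illustration, not any vendor's approval API: `RISKY_KINDS`, `ProposedAction`, and the `confirm` callback are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable

# Assumed set of action kinds that always need a human decision.
RISKY_KINDS = {"payment", "booking", "send_message"}


@dataclass
class ProposedAction:
    kind: str     # e.g. "payment" or "navigate"
    summary: str  # human-readable description shown in the UI


def run_with_approval(
    action: ProposedAction,
    execute: Callable[[ProposedAction], None],
    confirm: Callable[[str], bool],
) -> bool:
    """Execute the action only if it is low-risk or the user confirms it."""
    if action.kind in RISKY_KINDS and not confirm(action.summary):
        return False  # user declined: nothing is executed
    execute(action)
    return True
```

In a real UI, `confirm` would render the summary and block until the user clicks approve or reject; low-risk actions such as navigation pass through without a prompt.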
Stagehand's act() + extract() pattern handles "find the field labeled Date of Birth" without per-form selector engineering, while keeping the deterministic Playwright base for navigation. Pure computer use would be ~10× the cost; plain Playwright would require maintaining hundreds of selector sets; Browser Use sits in between but lacks Stagehand's structured-extraction strength on form data.
- Anthropic: Introducing Computer Use (anthropic.com)
- OpenAI: Introducing Operator (openai.com)
- OSWorld: Computer Use Benchmark (os-world.github.io)
- Stagehand: Hybrid Browser Automation (browserbase.com)
- Browser Use: Make websites accessible to AI agents (browser-use.com)