
Computer Use & Browser Agents

6 min · Advanced · 1,406 words
Vision-driven OS agents and hybrid browser automation. Anthropic computer use, OpenAI Operator, Stagehand, and the decision rules for when each beats plain deterministic scraping.

What you will learn

01 What "Computer Use" Means
02 Anthropic Computer Use API
03 OpenAI Operator / CUA Model
04 The Hybrid Alternative: Stagehand & Browser Use
05 When Each Approach Wins
06 Production Concerns Specific to Computer Use

The frontier of agentic AI in 2026 is agents that act through interfaces designed for humans — not APIs. Computer-use agents see screenshots and emit clicks. Browser agents drive Chromium with mouse, keyboard, and DOM access. The capability has moved from "novelty demo" to "production-shippable for narrow tasks" in roughly 18 months. This chapter covers the trajectory, the API surface, and the hybrid patterns that beat pure-vision agents on real tasks.

🔑
The 18-month capability curve
On the OSWorld benchmark, Claude 3.5 Sonnet (Oct 2024) scored 14.9%, Sonnet 4 (May 2025) scored 42.2%, and Sonnet 4.5 (Sep 2025) scored 61.4%, against a human baseline of 72.36%. That is roughly a 4× capability jump in 11 months, among the steepest curves in the field, and it means computer-use agents now work for a meaningful set of real tasks.

What "Computer Use" Means

A computer-use agent receives screenshots of a virtual display and emits keyboard / mouse / scroll actions. The model must (a) recognize what's on screen, (b) decide what to click or type, (c) handle the next screenshot it gets back. There is no DOM, no API — just pixels in, intents out.

[Figure: Computer use, vision-only loop. The virtual display (browser / OS / app) sends a screenshot (PNG, 1280×720) to the LLM (vision · reasoning); the LLM returns an action such as click(x=93,y=177), type("hello"), key("Tab"), or scroll(0, 200). Loop: take screenshot → model emits action → execute → repeat. ~1-3 actions/second in practice.]
Computer use is the most general agent interface: anything a human can do with a screen, the model can in principle do. It's also the slowest and most expensive — every step costs a vision-quality image input.
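
The loop in the figure can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: `FakeDisplay`, `Action`, and `scripted_model` are hypothetical stand-ins, and a real agent would capture PNG screenshots and call a vision model where the scripted policy sits.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    kind: str          # "click", "type", "key", "scroll", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


class FakeDisplay:
    """Hypothetical stand-in for a virtual display; records executed actions."""

    def __init__(self) -> None:
        self.log: list[Action] = []

    def screenshot(self) -> bytes:
        # A real display returns PNG bytes; here the "pixels" encode the step count.
        return f"frame-{len(self.log)}".encode()

    def execute(self, action: Action) -> None:
        self.log.append(action)


def run_loop(display: FakeDisplay,
             model: Callable[[bytes], Action],
             max_steps: int = 20) -> int:
    """Screenshot -> model -> execute, until the model says "done" or the budget runs out."""
    for step in range(max_steps):
        action = model(display.screenshot())
        if action.kind == "done":
            return step
        display.execute(action)
    return max_steps


def scripted_model(frame: bytes) -> Action:
    # Hypothetical policy: a real agent would send the frame to a vision LLM.
    script = {
        b"frame-0": Action("click", x=93, y=177),
        b"frame-1": Action("type", text="hello"),
        b"frame-2": Action("key", text="Tab"),
    }
    return script.get(frame, Action("done"))


display = FakeDisplay()
steps_taken = run_loop(display, scripted_model)
```

The `max_steps` budget matters in practice: without it, a model that never decides it is done keeps paying for screenshots indefinitely.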

Anthropic Computer Use API

Anthropic exposes computer use as a built-in tool. You provide a virtual display (Anthropic ships a reference Docker image), the model returns computer tool calls with actions, your code executes them and sends back the next screenshot.

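Python · Anthropic computer use
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Beta tools and the `betas` argument go through client.beta.messages
response = client.beta.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    tools=[{
        "type": "computer_20250124",
        "name": "computer",
        "display_width_px": 1280,
        "display_height_px": 800,
    }],
    messages=[{"role": "user",
               "content": "Open the airline website and find a flight to Tokyo for next Friday."}],
    betas=["computer-use-2025-01-24"],
)
# For each tool_use block in response.content: execute the action, take a
# screenshot, and send it back as a tool_result in the next user message.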
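
The "execute, screenshot, send back" step can be sketched without touching the network. The snippet below assumes each tool_use block has already been converted to a plain dict (for example via the SDK object's `model_dump()`), and `execute` is a hypothetical callback that performs the action on your display and returns PNG bytes. Only the message shape is meant to mirror the API; everything else is illustrative.

```python
import base64
from typing import Any, Callable


def describe_action(tool_input: dict[str, Any]) -> str:
    """Turn a computer-tool input into a readable log line."""
    action = tool_input.get("action", "unknown")
    if "coordinate" in tool_input:
        x, y = tool_input["coordinate"]
        return f"{action} at ({x}, {y})"
    if "text" in tool_input:
        return f"{action} {tool_input['text']!r}"
    return action


def build_tool_result(tool_use_id: str, screenshot_png: bytes) -> dict[str, Any]:
    """Wrap a fresh screenshot as the user-turn tool_result for one tool_use block."""
    return {
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": [{
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": base64.b64encode(screenshot_png).decode("ascii"),
                },
            }],
        }],
    }


def handle_tool_use(block: dict[str, Any],
                    execute: Callable[[dict[str, Any]], bytes]) -> dict[str, Any]:
    """Run one tool_use block and return the message to append to the conversation."""
    screenshot = execute(block["input"])  # hypothetical executor, returns PNG bytes
    return build_tool_result(block["id"], screenshot)


# Example with a stubbed executor instead of a real display.
example_block = {"type": "tool_use", "id": "toolu_example", "name": "computer",
                 "input": {"action": "left_click", "coordinate": [93, 177]}}
reply = handle_tool_use(example_block, execute=lambda _input: b"\x89PNG-stub")
summary = describe_action(example_block["input"])
```

Appending `reply` to the message list and calling the API again continues the loop; it ends when the model responds without any tool_use block.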

OpenAI Operator / CUA Model

OpenAI's Operator (Jan 2025) is a browser-only agent with no full-OS control. It runs on the Computer-Using Agent (CUA) model, which developers can call through the computer_use_preview built-in tool in the Responses API. Strengths: tight integration with OpenAI's tracing and guardrails. Limitations: Operator is web-only; it cannot drive desktop apps or arbitrary OS tasks.

The Hybrid Alternative — Stagehand & Browser Use

Pure-vision computer use is general but expensive. For most browser tasks, the right tool is a hybrid: deterministic Playwright/CDP automation where you can, LLM resolution where you must. Two production frameworks dominate:

Stagehand

Stagehand (from Browserbase) wraps Playwright with four LLM-aware primitives:

  • act("click the submit button") — natural-language action.
  • extract({schema}) — Zod-validated structured extraction from the page.
  • observe() — list interactable elements with descriptions.
  • agent("complete the checkout") — full agent loop with the above as tools.
TypeScript · Stagehand hybrid scrape
import { Stagehand } from "@browserbasehq/stagehand";
import { z } from "zod";

const sh = new Stagehand({ env: "LOCAL" });
await sh.init();
await sh.page.goto("https://example.com/products");

// Natural-language action: act() asks the LLM to resolve the element, so no selector is needed
await sh.page.act("click the 'Show prices in USD' toggle");

// LLM-validated extraction
const products = await sh.page.extract({
  instruction: "all products with name and price",
  schema: z.object({
    products: z.array(z.object({ name: z.string(), price: z.number() })),
  }),
});

Browser Use

Browser Use takes a different tack: a Python framework that gives the LLM a high-level browser API (open URL, click element by index, type, screenshot), and ships with a managed "Browser Use Cloud" for stealth-browser execution at scale. Popular for scraping, RPA, and "give Claude a Chrome" use cases.
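
The "click element by index" idea can be illustrated independently of the framework. The sketch below is not Browser Use's code or API; `Element`, `render_for_llm`, and `resolve_click` are hypothetical names that show why indexing helps: the model picks a small integer from a short listing instead of guessing pixel coordinates or writing selectors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    tag: str
    label: str
    selector: str  # what the automation layer would actually use


def render_for_llm(elements: list[Element]) -> str:
    """Produce the numbered listing the model sees instead of raw HTML."""
    return "\n".join(f"[{i}] <{el.tag}> {el.label}" for i, el in enumerate(elements))


def resolve_click(elements: list[Element], index: int) -> str:
    """Map the model's 'click element N' back to a concrete selector."""
    if not 0 <= index < len(elements):
        raise IndexError(f"model chose element {index}, but only {len(elements)} exist")
    return elements[index].selector


page_elements = [
    Element("input", "Search products", "#search"),
    Element("button", "Submit", "form button[type=submit]"),
    Element("a", "Cart (2)", "a.cart-link"),
]
listing = render_for_llm(page_elements)
target = resolve_click(page_elements, 1)
```

Rejecting out-of-range indices explicitly gives the agent loop a clean error to feed back to the model rather than a silent misclick.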

When Each Approach Wins

| Approach | Best for | Cost / latency | Reliability |
| --- | --- | --- | --- |
| Computer use (Anthropic / Operator) | Desktop apps, novel UIs, screen-only software, end-to-end multi-app flows | High (vision tokens, slow) | ~60% on OSWorld; flaky on novel layouts |
| Stagehand (hybrid) | Structured web extraction, e-commerce, form fills, anything with a stable DOM | Medium (LLM only when needed) | High: deterministic core + LLM resolution |
| Browser Use | "Give the agent a browser" general tasks; scraping at scale | Medium-high | Better than pure vision; weaker than pure deterministic |
| Plain Playwright + LLM extract | Known sites, known flows, you control the selectors | Lowest | Highest: same as any deterministic test |

The decision rule
Use the most deterministic approach the task allows. Plain Playwright when you control the site. Stagehand when the DOM is stable but you want the LLM to resolve "the submit button" without selector engineering. Browser Use when the DOM is unstable but the target is still a browser. Pure computer use only when the target is a desktop app or a UI with no usable DOM. As a rough rule of thumb, each step up in agent generality costs about 5× more and fails about 2× as often.
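
The rule reads naturally as a short cascade of checks. The function below is one possible encoding; the three boolean inputs and their names are assumptions made for illustration, and real projects will have fuzzier answers than true/false.

```python
def choose_approach(has_usable_dom: bool,
                    controls_selectors: bool,
                    stable_dom: bool) -> str:
    """Pick the most deterministic approach the task allows."""
    if not has_usable_dom:
        # Desktop apps and screen-only software leave nothing but pixels.
        return "computer use"
    if controls_selectors:
        # Known site, known flow: hand-written selectors are cheapest and most reliable.
        return "plain Playwright"
    if stable_dom:
        # Let the LLM resolve elements, keep Playwright for everything else.
        return "Stagehand"
    # Still a browser, but the DOM shifts too much to lean on.
    return "Browser Use"


desktop_app = choose_approach(has_usable_dom=False, controls_selectors=False, stable_dom=False)
own_site = choose_approach(has_usable_dom=True, controls_selectors=True, stable_dom=True)
third_party_stable = choose_approach(has_usable_dom=True, controls_selectors=False, stable_dom=True)
third_party_shifting = choose_approach(has_usable_dom=True, controls_selectors=False, stable_dom=False)
```

The order of the checks is the point: each branch is only reached after the cheaper, more deterministic option has been ruled out.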

Production Concerns Specific to Computer Use

Browser sandboxing

An agent driving a real Chrome on your machine is a security disaster waiting to happen. Production stacks use disposable browsers in containers (Browserbase, Hyperbrowser, anchor.so): each session is ephemeral, network egress is controlled, and no cookies or credentials persist beyond the session.
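
Two of those properties, ephemeral storage and controlled egress, can be illustrated with the standard library alone. This is a sketch of the idea and not a sandbox: real isolation comes from the container and its network policy, and `ephemeral_profile` and `egress_allowed` are hypothetical helpers.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from urllib.parse import urlparse


@contextmanager
def ephemeral_profile():
    """Yield a throwaway browser profile directory and delete it afterwards."""
    profile = Path(tempfile.mkdtemp(prefix="agent-profile-"))
    try:
        yield profile
    finally:
        # Cookies and credentials written during the session vanish with the directory.
        shutil.rmtree(profile, ignore_errors=True)


def egress_allowed(url: str, allowed_hosts: set[str]) -> bool:
    """Allow a request only if its host is an allowed host or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == h or host.endswith("." + h) for h in allowed_hosts)


with ephemeral_profile() as profile_dir:
    (profile_dir / "Cookies").write_text("session=abc")
    existed_during_session = profile_dir.exists()
exists_after_session = profile_dir.exists()
```

The subdomain check compares against `"." + h` on purpose, so that a lookalike such as `evil-example.com` does not pass an allowlist containing `example.com`.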

CAPTCHA & anti-bot

Most production sites detect headless / agent-driven browsers. Mitigation strategies:

  • Stealth browsers — patched Chromium that defeats common fingerprinting (Browserbase, Browser Use Cloud, anchor).
  • Residential proxies — rotating IPs that look like consumer traffic.
  • CAPTCHA-handoff — when blocked, surface to a human; don't try to defeat anti-abuse measures.
  • Cookie injection — for authenticated tasks, inject a logged-in session rather than driving the login flow.
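
The CAPTCHA-handoff item can be sketched as a check that pauses the agent and queues the session for a person instead of retrying. The marker strings and the `NeedsHuman` exception are hypothetical; real detection depends on the site and the anti-bot vendor.

```python
# Hypothetical markers; tune these per site rather than trusting this list.
BLOCK_MARKERS = ("captcha", "verify you are human", "unusual traffic")


class NeedsHuman(Exception):
    """Raised when the agent should stop and hand the session to a person."""


def check_for_block(page_text: str) -> None:
    lowered = page_text.lower()
    for marker in BLOCK_MARKERS:
        if marker in lowered:
            raise NeedsHuman(f"blocked: page mentions {marker!r}")


def run_step(page_text: str, handoff_queue: list[str]) -> str:
    """Return "continue" for normal pages, or queue a handoff and return "paused"."""
    try:
        check_for_block(page_text)
    except NeedsHuman as exc:
        handoff_queue.append(str(exc))
        return "paused"
    return "continue"


queue: list[str] = []
normal = run_step("Welcome back! Your orders are below.", queue)
blocked = run_step("Please complete the CAPTCHA to continue.", queue)
```

Pausing rather than retrying keeps the agent from hammering a site that has already flagged it, which tends to make the block worse.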

Cost economics

Computer use is the most token-expensive agent class. A single screenshot to Sonnet 4.5 is roughly 1500-2500 vision tokens; a 30-step task therefore consumes 50k+ tokens just in inputs. Mitigations:

  • Resize screenshots — 1280×800 is plenty for most pages; 1920×1080 doubles the cost.
  • Cache navigation — for repeat tasks, save successful trajectories and replay deterministically.
  • Hybrid where possible — drop to Playwright for the parts of the flow you've stabilized.
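
The arithmetic behind these figures is easy to check. The sketch below uses the per-screenshot range quoted above and assumes one screenshot per step, each billed once, with image cost scaling with pixel count; those simplifications are assumptions, and actual billing depends on the provider's image-token rules and on how much history is resent each turn.

```python
TOKENS_PER_SCREENSHOT = (1500, 2500)  # range quoted in the text
STEPS = 30

# Screenshot input tokens for the whole task, low and high estimates.
low_estimate = TOKENS_PER_SCREENSHOT[0] * STEPS
high_estimate = TOKENS_PER_SCREENSHOT[1] * STEPS


def pixel_ratio(larger: tuple[int, int], smaller: tuple[int, int]) -> float:
    """How many times more pixels the larger resolution has."""
    return (larger[0] * larger[1]) / (smaller[0] * smaller[1])


# 1920x1080 versus 1280x800: a bit more than twice the pixels.
full_hd_vs_resized = pixel_ratio((1920, 1080), (1280, 800))
```

Under these assumptions the 30-step task lands between 45,000 and 75,000 screenshot tokens, which brackets the 50k+ figure, and the Full HD frame carries about 2.03× the pixels of the resized one.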

Trust and human approval

An agent that books flights or sends payments must have explicit per-action approval gates. Anthropic and OpenAI both ship "operator-style" approval modes; LangGraph supports human-in-the-loop checkpoints. The pattern: the agent proposes the action, the UI shows the user what is about to happen, and the user confirms before execution.
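
The propose, show, confirm pattern can be sketched as a thin gate in front of the executor. The set of irreversible action kinds and the `approve` callback are hypothetical; in a real product `approve` would render a confirmation dialog and wait for the user.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical classification; each product decides what counts as irreversible.
IRREVERSIBLE_KINDS = {"purchase", "send_payment", "send_message", "delete"}


@dataclass(frozen=True)
class ProposedAction:
    kind: str
    description: str  # what the UI shows the user before they confirm


def execute_with_gate(action: ProposedAction,
                      approve: Callable[[ProposedAction], bool],
                      execute: Callable[[ProposedAction], None]) -> str:
    """Run reversible actions directly; require approval for irreversible ones."""
    if action.kind in IRREVERSIBLE_KINDS and not approve(action):
        return "rejected"
    execute(action)
    return "executed"


executed: list[str] = []


def record(action: ProposedAction) -> None:
    executed.append(action.kind)


def deny_all(action: ProposedAction) -> bool:
    return False  # stand-in for a user who clicks "Cancel"


scroll_result = execute_with_gate(
    ProposedAction("scroll", "scroll the results list"), deny_all, record)
purchase_result = execute_with_gate(
    ProposedAction("purchase", "buy the selected flight"), deny_all, record)
```

Keeping the gate outside the model means a misread screenshot can at worst produce a bad proposal, not a completed transaction.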

⚠️
The "agent did the wrong thing on the user's behalf" risk
Computer-use agents see your screen and act on it. If the agent misreads a confirmation dialog, it can purchase the wrong item, send the wrong message, or approve a destructive operation. The fix isn't model improvement — it's a UX layer that requires explicit user approval for irreversible actions. Treat the agent like a junior analyst with a credit card: it can prepare any transaction, but a human signs.
Quick check
You're building an agent that fills out government forms on behalf of users (visa applications, tax extensions). The forms are on stable but old-fashioned government websites. Should you reach for pure computer use, Stagehand, Browser Use, or plain Playwright?
Answer
Stagehand is the best fit. The DOM is stable enough that you don't need full vision; the forms are too varied to maintain Playwright selectors for every field. Stagehand's act() + extract() pattern handles "find the field labeled Date of Birth" without per-form selector engineering, while keeping the deterministic Playwright base for navigation. Pure computer use would be ~10× the cost; plain Playwright would require maintaining hundreds of selector sets; Browser Use sits in between but lacks Stagehand's structured-extraction strength on form data.
🔑
Key takeaways
1) Computer-use capability quadrupled in 11 months; the trajectory says these agents will be table-stakes for many tasks within the next year.
2) Pure vision is general but slow and expensive; hybrid (Stagehand-style) wins for stable-DOM tasks.
3) Always sandbox the browser/OS: agent + admin credentials = catastrophe.
4) Anti-bot is real; use stealth infrastructure or hand off to humans rather than fighting it.
5) Build approval gates for irreversible actions. The agent prepares; the human signs.
