---
title: "Same brain, two pairs of eyes: comparing two AI-agent browser harnesses on the same model"
date: 2026-06-17T00:00:00.000-04:00
author: "Chris Betz"
url: https://www.cbetz.com/blog/same-brain-two-pairs-of-eyes
---

# Same brain, two pairs of eyes: comparing two AI-agent browser harnesses on the same model

_I fixed the model (Claude Opus 4.8) and varied only the agent harness, Claude computer-use against browser-use, on the same web tasks; neither wins everywhere, and the reason is the perception substrate (pixels vs. the DOM), not the model._

I built [taskproof](https://github.com/taskproof/taskproof), an open-source CI harness that tests whether AI agents can actually complete tasks on a website. Building it left me with a controlled experiment I had not seen run: take the **same model** (Claude Opus 4.8) and drive it through the **same web tasks** with two different _harnesses_. One is Claude’s native computer-use, which sees the page as **pixels** and clicks coordinates. The other is browser-use, which sees the **DOM** and acts on elements. Fix the model, vary only the eyes. Who is more efficient?

The answer: **neither wins everywhere, and the reason is the perception substrate, not the model.**

Five tasks, pass@3, both harnesses on Opus 4.8. Each cell shows per-run steps, per-run cost, and the pass@3 cell total (3 runs summed):

| Task | Claude computer-use (pixels) | browser-use (DOM) |
| --- | --- | --- |
| Find a book’s UPC | 4 steps · $0.13 · **$0.40** | 3 · $0.26 · $0.79 |
| Wait out a dynamic loader | 3 · $0.08 · **$0.25** | 3-4 · $0.30 · $0.91 |
| Wikipedia lookup | 5 · $0.18 · **$0.53** | 4 · $0.43 · $1.28 |
| Buy a t-shirt (saucedemo) | 17 · $1.34 · **$4.01** | 20-25 · $2.47 · $7.41 |
| Add to cart, dense product grid | 16 · $1.25 · $3.74¹ | 4-5 · $0.41 · **$1.22** |

Two non-obvious things fall out.

**1. Fewer steps does not mean cheaper.** browser-use often takes equal or _fewer_ steps but costs more per task, because its DOM-context steps carry more tokens. So on 4 of 5 tasks, Claude computer-use is cheaper _despite_ sometimes taking more steps.

**2. The crossover is dense, visually ambiguous element targeting.** On a packed product grid (scrapingcourse’s shop page), Claude’s vision balloons to **16 steps hunting the small add-to-cart button**; browser-use’s DOM targeting nails it in **4-5**, and is about 3x cheaper there. Pixels are great for forms and navigation; the DOM wins when the right element is small and surrounded by near-identical neighbors.
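To make the first point concrete, here is a minimal sketch that divides per-run cost by per-run steps for each cell in the table. The figures are copied from the table; the dictionary layout and function names are my own for illustration (not part of taskproof), and where the table gives a step range I use the midpoint, which is an assumption.

```python
# Per-run (steps, cost in USD) copied from the table above.
# Step ranges such as "3-4" are replaced by their midpoint; that is an assumption.
RUNS = {
    "Find a book's UPC": {"pixels": (4, 0.13), "dom": (3, 0.26)},
    "Wait out a dynamic loader": {"pixels": (3, 0.08), "dom": (3.5, 0.30)},
    "Wikipedia lookup": {"pixels": (5, 0.18), "dom": (4, 0.43)},
    "Buy a t-shirt (saucedemo)": {"pixels": (17, 1.34), "dom": (22.5, 2.47)},
    "Add to cart, dense product grid": {"pixels": (16, 1.25), "dom": (4.5, 0.41)},
}


def cost_per_step(steps, cost):
    """Average dollars spent per step in a single run."""
    return cost / steps


def cheaper_harness(task):
    """Return the harness key with the lower per-run cost for a task."""
    cells = RUNS[task]
    return min(cells, key=lambda harness: cells[harness][1])


if __name__ == "__main__":
    for task, cells in RUNS.items():
        pixels = cost_per_step(*cells["pixels"])
        dom = cost_per_step(*cells["dom"])
        print(
            f"{task}: pixels ${pixels:.3f}/step, DOM ${dom:.3f}/step, "
            f"cheaper per run: {cheaper_harness(task)}"
        )
```

On these figures the DOM harness pays more per step in every row, including the dense grid; it wins that task only because it needs far fewer steps.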

So if you are choosing a harness, or building one, the question is not “which is better.” It is “what does my task _look_ like.” That is exactly the kind of thing a measurement harness should tell you, and it is why I am building taskproof in the open.

## And it holds on a real site, not just sandboxes

I ran the same two-harness setup against a live production site I run ([evintersection.com](https://www.evintersection.com), a searchable EV catalog) on three real journeys: search a vehicle, find the best-trucks list, and compare two side by side, pass@3. All six cells pass, and the same pattern shows up. Claude is cheaper on the two simpler tasks (2 steps / $0.15 and 5 / $0.66, versus browser-use’s 3 / $0.80 and 4 / $1.10 per pass@3 cell), while the dense multi-select compare is the friction point for both: Claude grinds to its 14-step limit ($2.96), and browser-use is pricier per step ($2.71). You can open the rendered report yourself, on a real site you can visit rather than a fixture: [the EV-site report](https://raw.githack.com/taskproof/taskproof/main/examples/sample-report-ev/report.html). (Dogfood, not an independent audit; I own the site.)

## Caveats, because this is easy to overstate

- This is a **harness** comparison, not a model comparison; both run the _same_ Opus 4.8. (And there is no third model here yet; Gemini and OpenAI adapters are on the roadmap.)
- The dollar figures use two bases: per-run cost, and the pass@3 cell total (3 runs summed). I have labeled both; do not read a 3-run total as a single run.
- On the dense-grid task Claude is 2/3, not 3/3, but the one miss was a **budget cap**: that run hit its per-run cost cap before finishing, which is itself the efficiency story.
- browser-use’s **step count is more variable** (t-shirt: 20/22/25) than Claude’s (17/17/17), worth knowing if you care about predictability.
- **N=5**, on benchmark- and sandbox-shaped sites. This is a method pilot, not a verdict on the web.
- All of this is reproducible: the specs and the grading method are open. One documented limitation to call out rather than bury: the grader’s network capture for browser-use is same-origin only. It does not affect these results (the asserted cart call is same-origin), but a `network` assertion against a third-party origin would not compare across the two harnesses.

If you want to run this on your own site, the harness, the task specs, and the grading method are all open: [github.com/taskproof/taskproof](https://github.com/taskproof/taskproof). New runner adapters are the most-welcome contribution.
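If you do run it yourself, the step-variability caveat above is easy to check from per-run step counts. This is a small sketch using the t-shirt numbers quoted in the caveats; `step_spread` is a hypothetical helper of mine, not a taskproof function.

```python
from statistics import mean

# Per-run step counts for the t-shirt task, as quoted in the caveats above.
TSHIRT_STEPS = {
    "Claude computer-use": [17, 17, 17],
    "browser-use": [20, 22, 25],
}


def step_spread(steps):
    """Summarize how much the step count varies across the runs of one cell."""
    return {
        "min": min(steps),
        "max": max(steps),
        "mean": mean(steps),
        "range": max(steps) - min(steps),
    }


if __name__ == "__main__":
    for harness, steps in TSHIRT_STEPS.items():
        print(harness, step_spread(steps))
```

With only three runs per cell, the range is about as much as the data can support; a standard deviation would suggest more precision than three samples give.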

¹ The scrapingcourse Claude cell is 2/3 (one run was budget-capped before finishing); the other two passed at 16 steps and about $1.24 each.
