Same brain, two pairs of eyes: comparing two AI-agent browser harnesses on the same model

By Chris Betz
Cover image for Same brain, two pairs of eyes: comparing two AI-agent browser harnesses on the same model

I built taskproof, an open-source CI harness that tests whether AI agents can actually complete tasks on a website. Building it left me with a controlled experiment I had not seen run: take the same model (Claude Opus 4.8) and drive it through the same web tasks with two different harnesses. One is Claude’s native computer-use, which sees the page as pixels and clicks coordinates. The other is browser-use, which sees the DOM and acts on elements. Fix the model, vary only the eyes. Who is more efficient?

The answer: neither wins everywhere, and the reason is the perception substrate, not the model.

Five tasks, pass@3, both harnesses on Opus 4.8. Each cell shows per-run steps, per-run cost, and the pass@3 cell total (3 runs summed):

Task

Claude computer-use (pixels)

browser-use (DOM)

Find a book’s UPC

4 steps · $0.13 · $0.40

3 · $0.26 · $0.79

Wait out a dynamic loader

3 · $0.08 · $0.25

3-4 · $0.30 · $0.91

Wikipedia lookup

5 · $0.18 · $0.53

4 · $0.43 · $1.28

Buy a t-shirt (saucedemo)

17 · $1.34 · $4.01

20-25 · $2.47 · $7.41

Add to cart, dense product grid

16 · $1.25 · $3.74¹

4-5 · $0.41 · $1.22

Two non-obvious things fall out.

1. Fewer steps does not mean cheaper. browser-use often takes equal or fewer steps but costs more per task, because its DOM-context steps carry more tokens. So on 4 of 5 tasks, Claude computer-use is cheaper despite sometimes taking more steps.

2. The crossover is dense, visually-ambiguous element targeting. On a packed product grid (scrapingcourse’s shop page), Claude’s vision balloons to 16 steps hunting the small add-to-cart button; browser-use’s DOM targeting nails it in 4, and is about 3x cheaper there. Pixels are great for forms and navigation; the DOM wins when the right element is small and surrounded by near-identical neighbors.

So if you are choosing a harness, or building one, the question is not “which is better.” It is “what does my task look like.” That is exactly the kind of thing a measurement harness should tell you, and it is why I am building taskproof in the open.

And it holds on a real site, not just sandboxes

I ran the same two-harness setup against a live production site I run (evintersection.com, a searchable EV catalog) on three real journeys: search a vehicle, find the best-trucks list, and compare two side by side, pass@3. All six cells pass, and the same pattern shows up. Claude is cheaper on the two simpler tasks (2 steps / $0.15 and 5 / $0.66, versus browser-use’s 3 / $0.80 and 4 / $1.10 per pass@3 cell), while the dense multi-select compare is the friction point for both: Claude grinds to its 14-step limit ($2.96), and browser-use is pricier per step ($2.71). You can open the rendered report yourself, on a real site you can visit rather than a fixture: the EV-site report. (Dogfood, not an independent audit; I own the site.)

Caveats, because this is easy to overstate

  • This is a harness comparison, not a model comparison; both run the same Opus 4.8. (And there is no third model here yet; Gemini and OpenAI adapters are on the roadmap.)

  • The dollar figures use two bases: per-run cost, and the pass@3 cell total (3 runs summed). I have labeled both; do not read a 3-run total as a single run.

  • On the dense-grid task Claude is 2/3, not 3/3, but the one miss was a budget cap: that run hit its per-run cost cap before finishing, which is itself the efficiency story.

  • browser-use’s step count is more variable (t-shirt: 20/22/25) than Claude’s (17/17/17), worth knowing if you care about predictability.

  • N=5, on benchmark- and sandbox-shaped sites. This is a method pilot, not a verdict on the web.

  • All of this is reproducible: the specs and the grading method are open. One documented limitation to call out rather than bury: the grader’s network capture for browser-use is same-origin only. It does not affect these results (the asserted cart call is same-origin), but a network assertion against a third-party origin would not compare across the two harnesses.

If you want to run this on your own site, the harness, the task specs, and the grading method are all open: github.com/taskproof/taskproof. New runner adapters are the most-welcome contribution.

¹ The scrapingcourse Claude cell is 2/3 (one run was budget-capped before finishing); the other two passed at 16 steps and about $1.24 each.