taskproof

By Chris Betz, June 17, 2026

TypeScript
Python
Playwright
Claude
Anthropic SDK

Role: Solo build
Year: 2026
Type: Developer tool

Highlights

Pass@k grading with statistical thresholds instead of brittle binary gates — built for non-deterministic agents
Pluggable adapter interface: every agent harness emits the same run artifact, so Claude computer-use and browser-use compare directly in one matrix
Deterministic-first assertion engine (URL/DOM/network) with an optional LLM judge that can only veto a deterministic pass, never turn a failure into a pass
Self-contained interactive HTML reports with per-step traces, screenshots, cost breakdowns, and CI regression diffs
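As a rough illustration of the deterministic-first idea, the sketch below grades a single run. It is not taskproof's actual code: `RunArtifact`, its fields, and `grade` are hypothetical names chosen for this example. The point it shows is the ordering: the judge is consulted only after every deterministic assertion has passed, so it can veto a pass but cannot rescue a failure.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RunArtifact:
    """Hypothetical shape of what one agent run leaves behind."""
    final_url: str
    dom_text: str
    requests: List[str] = field(default_factory=list)


Assertion = Callable[[RunArtifact], bool]
Judge = Callable[[RunArtifact], bool]


def grade(run: RunArtifact,
          assertions: List[Assertion],
          judge: Optional[Judge] = None) -> bool:
    """Return True only if every deterministic check passes and the judge (if any) agrees."""
    # Deterministic checks come first; any failure is final.
    if not all(check(run) for check in assertions):
        return False
    # The judge is optional and can only downgrade a pass to a fail.
    if judge is None:
        return True
    return judge(run)


# Example deterministic assertions (URL, DOM, network).
reached_confirmation = lambda r: r.final_url.endswith("/order/confirmed")
shows_thanks = lambda r: "Thank you" in r.dom_text
posted_order = lambda r: any(u.endswith("/api/orders") for u in r.requests)
```

Because the judge never runs on a deterministic failure, a lenient or mistaken judge cannot inflate the pass rate.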

taskproof checks whether AI agents — not just humans — can actually use your website. You describe tasks in YAML as a natural-language goal plus deterministic success assertions, and taskproof drives multiple agent harnesses (Claude computer-use, browser-use) through them in parallel, grading with pass@k to tolerate non-determinism.
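For readers unfamiliar with pass@k, here is a minimal sketch of how such a grade might be computed. It uses the standard unbiased estimator (from n recorded runs with c passes, the probability that at least one of k sampled runs passes); whether taskproof uses this exact estimator is an assumption, and `meets_threshold` and its `threshold` parameter are hypothetical names for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failures means any sample of k runs contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def meets_threshold(n: int, c: int, k: int, threshold: float) -> bool:
    """Hypothetical gate: the task passes if its pass@k estimate reaches the threshold."""
    return pass_at_k(n, c, k) >= threshold
```

With 10 runs and 3 passes, pass@1 is 0.3 while pass@5 is roughly 0.92, which is why a threshold on pass@k tolerates flaky individual runs better than a single binary gate.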

It renders an interactive HTML report with per-step screenshots, cost breakdowns, and baseline diffs, so CI catches agent-usability regressions the way it catches broken tests.
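A baseline diff of this kind could be as simple as comparing per-task pass rates between a stored baseline and the current run. The sketch below assumes that representation; `find_regressions`, the `tolerance` default, and the choice to treat a task missing from the current run as a 0.0 pass rate are all illustrative assumptions rather than taskproof's documented behavior.

```python
from typing import Dict


def find_regressions(baseline: Dict[str, float],
                     current: Dict[str, float],
                     tolerance: float = 0.05) -> Dict[str, float]:
    """Return tasks whose pass rate dropped by more than `tolerance`, mapped to the size of the drop."""
    regressions: Dict[str, float] = {}
    for task, base_rate in baseline.items():
        # Assumption: a task absent from the current run counts as fully failing.
        drop = base_rate - current.get(task, 0.0)
        if drop > tolerance:
            regressions[task] = drop
    return regressions


baseline = {"checkout": 0.9, "login": 1.0}
current = {"checkout": 0.6, "login": 0.98}
regressed = find_regressions(baseline, current)
```

A CI job could then fail the build whenever the returned mapping is non-empty; the tolerance keeps ordinary run-to-run noise from tripping the gate.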
