taskproof

By Chris Betz
  • TypeScript
  • Python
  • Playwright
  • Claude
  • Anthropic SDK
Role: Solo build
Year: 2026
Type: Developer tool

Highlights

  • Pass@k grading with statistical thresholds instead of brittle binary gates — built for non-deterministic agents
  • Pluggable adapter interface: every agent harness emits the same run artifact, so Claude computer-use and browser-use compare directly in one matrix
  • Deterministic-first assertion engine (URL/DOM/network) with an optional LLM judge that can only downgrade a deterministic pass to a fail, never turn a deterministic failure into a pass
  • Self-contained interactive HTML reports with per-step traces, screenshots, cost breakdowns, and CI regression diffs
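
The judge rule in the highlights can be illustrated with a minimal Python sketch. The names here (`Verdict`, `combine_verdict`, `judge_approves`) are hypothetical and not taken from taskproof itself; the only thing the sketch assumes is the stated rule that deterministic assertions decide first and the optional judge can only veto.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Verdict:
    passed: bool
    reason: str


def combine_verdict(assertions_passed: bool, judge_approves: Optional[bool]) -> Verdict:
    """Combine a deterministic result with an optional LLM judge opinion.

    judge_approves is None when no judge is configured (hypothetical convention).
    """
    # A deterministic failure is final: the judge is never allowed to rescue it.
    if not assertions_passed:
        return Verdict(False, "deterministic assertions failed")
    # The judge may only veto a run that already passed deterministically.
    if judge_approves is False:
        return Verdict(False, "vetoed by LLM judge")
    return Verdict(True, "deterministic assertions passed")
```

Because the judge is consulted only after the deterministic checks pass, a noisy or overly generous judge cannot inflate the pass rate; it can only make grading stricter.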

taskproof checks whether AI agents — not just humans — can actually use your website. You describe tasks in YAML as a natural-language goal plus deterministic success assertions, and taskproof drives multiple agent harnesses (Claude computer-use, browser-use) through them in parallel, grading with pass@k to tolerate non-determinism.
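
To make the pass@k idea concrete, here is a short Python sketch using the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of runs and c the number that passed. Whether taskproof uses this exact estimator is an assumption; the function names and the threshold parameter are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled runs passes,
    given that c of n recorded runs passed."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Fewer than k failures exist, so any sample of k runs contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def meets_threshold(n: int, c: int, k: int, threshold: float) -> bool:
    """Gate on a pass@k threshold rather than on a single run's outcome."""
    return pass_at_k(n, c, k) >= threshold
```

For example, a task that passed 3 of 10 runs would fail a 0.9 threshold at k=1 but clear it at k=5, which is the kind of tolerance for non-determinism a single binary gate cannot express.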

It renders an interactive HTML report with per-step screenshots, cost breakdowns, and baseline diffs, so CI catches agent-usability regressions the way it catches broken tests.
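
A baseline diff of this kind could look roughly like the Python sketch below. It assumes per-task pass rates stored as plain dictionaries and a hypothetical `tolerance` parameter; taskproof's real report format and regression rules are not described here, so treat every name as illustrative.

```python
from typing import Dict, Tuple


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.1,
) -> Dict[str, Tuple[float, float]]:
    """Return tasks whose pass rate dropped by more than `tolerance`.

    Maps task name to (baseline_rate, current_rate).
    """
    regressions: Dict[str, Tuple[float, float]] = {}
    for task, base_rate in baseline.items():
        rate = current.get(task)
        # Tasks missing from the current run are skipped, not flagged.
        if rate is None:
            continue
        if base_rate - rate > tolerance:
            regressions[task] = (base_rate, rate)
    return regressions
```

A CI job could fail the build whenever the returned dictionary is non-empty, which mirrors how a broken unit test blocks a merge.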