taskproof

- Role: Solo build
- Year: 2026
- Type: Developer tool
Highlights
- Pass@k grading with statistical thresholds instead of brittle binary gates — built for non-deterministic agents
- Pluggable adapter interface: every agent harness emits the same run artifact, so Claude computer-use and browser-use compare directly in one matrix
- Deterministic-first assertion engine (URL/DOM/network) with an optional LLM judge that can veto a deterministic pass but can never turn a failed assertion into a pass
- Self-contained interactive HTML reports with per-step traces, screenshots, cost breakdowns, and CI regression diffs
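One way to read the judge rule in the highlights is as a one-way veto layered on top of the deterministic assertions. The sketch below is illustrative only: `Verdict` and `final_verdict` are hypothetical names, not taskproof's actual API, and the real engine may carry more detail per assertion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Verdict:
    passed: bool
    reason: str


def final_verdict(assertions_passed: bool, judge_passed: Optional[bool]) -> Verdict:
    """Combine deterministic assertions with an optional LLM judge.

    Hypothetical helper: judge_passed is None when no judge is configured.
    """
    # Deterministic assertions come first; the judge cannot rescue a failure.
    if not assertions_passed:
        return Verdict(False, "deterministic assertion failed")
    # The judge may only downgrade a deterministic pass.
    if judge_passed is False:
        return Verdict(False, "LLM judge vetoed a deterministic pass")
    return Verdict(True, "deterministic assertions passed")
```

Under this reading, a flaky or overly generous judge can never produce a false pass; the worst it can do is add a false failure.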
taskproof checks whether AI agents — not just humans — can actually use your website. You describe tasks in YAML as a natural-language goal plus deterministic success assertions, and taskproof drives multiple agent harnesses (Claude computer-use, browser-use) through them in parallel, grading with pass@k to tolerate non-determinism.
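As a rough illustration of pass@k grading, the sketch below uses the widely cited unbiased estimator 1 - C(n-c, k) / C(n, k) for n runs with c successes. Whether taskproof uses this exact estimator is an assumption, and `pass_at_k`, `meets_threshold`, and the `threshold` parameter are hypothetical names chosen for the example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled runs succeeds,
    given c successes observed in n total runs."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Fewer than k failures exist, so any sample of k runs contains a success.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def meets_threshold(n: int, c: int, k: int, threshold: float) -> bool:
    """Gate on the estimate rather than on a single binary run (assumed policy)."""
    return pass_at_k(n, c, k) >= threshold
```

For example, 3 successes in 10 runs gives a pass@1 of about 0.3 but a pass@5 of about 0.92, which is why a single-run gate would be too brittle for a non-deterministic agent.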
It renders an interactive HTML report with per-step screenshots, cost breakdowns, and baseline diffs, so CI catches agent-usability regressions the way it catches broken tests.
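A baseline diff of this kind could be as simple as comparing per-task, per-harness pass rates between a stored baseline and the current run. The sketch below assumes that shape; `find_regressions`, the `(task, harness)` key, and the `tolerance` default are hypothetical and do not describe taskproof's actual report format.

```python
def find_regressions(
    baseline: dict[tuple[str, str], float],
    current: dict[tuple[str, str], float],
    tolerance: float = 0.1,
) -> list[tuple[tuple[str, str], float, float]]:
    """Return (key, baseline_rate, current_rate) for every (task, harness)
    pair whose pass rate dropped by more than the tolerance."""
    regressions = []
    for key, base_rate in sorted(baseline.items()):
        cur_rate = current.get(key)
        if cur_rate is None:
            # Pair is missing from the current run; skipped in this sketch.
            continue
        if base_rate - cur_rate > tolerance:
            regressions.append((key, base_rate, cur_rate))
    return regressions
```

A CI job could fail the build whenever the returned list is non-empty, with the tolerance absorbing ordinary run-to-run noise.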