The open benchmark for AI agents.
Capability no longer lives in the model. It lives in the harness. Bench is where harnesses get measured, under verifiable, reproducible, adversarial conditions.
The model is becoming the controlled variable.
A year ago the question was which model is best. Now two harnesses running the same model produce wildly different capability. Browsing, tool use, memory, planning, orchestration: the system around the model is where the gap lives.
But harnesses are measured badly. Every lab grades its own. Every agent company claims SOTA on benchmarks they wrote themselves. There is no open ground.
Bench Protocol is that ground. Submit a harness, propose a task, run an evaluation. Every result is reproducible, signed, and adversarially graded against holdout sets submitters never see.
Four properties hold the protocol together.
Open contribution
Anyone can propose tasks, submit harnesses, or fund benchmark categories. Permissionless on every axis.
Reproducible by design
Every run is fully reconstructible. Anyone can re-execute and verify any historical result.
Adversarial holdout
Public leaderboards run on rotating tasks. Real grading happens on private sets submitters never see.
Signed attestation
Every result carries cryptographic proof of model, harness, task, and environment. Disputes resolve via re-execution.
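As a rough illustration of the signed-attestation idea, here is a minimal Python sketch. It is not the protocol's actual format, which has not been published. The field names, the example values, and the helper functions (`canonical_digest`, `attest`, `verify`, `resolve_dispute`) are all hypothetical. HMAC-SHA256 from the standard library stands in for a real public-key signature scheme, purely to keep the example self-contained.

```python
import hashlib
import hmac
import json


def canonical_digest(record: dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, compact separators)."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def attest(record: dict, key: bytes) -> dict:
    """Bind a result record to a digest and a keyed tag.

    Assumption: HMAC is a stand-in here. A real deployment would use an
    asymmetric signature so anyone can verify without the signing key.
    """
    digest = canonical_digest(record)
    tag = hmac.new(key, digest.encode("ascii"), hashlib.sha256).hexdigest()
    return {"record": record, "digest": digest, "signature": tag}


def verify(attestation: dict, key: bytes) -> bool:
    """Check that the record still matches its digest and the tag is valid."""
    digest = canonical_digest(attestation["record"])
    expected = hmac.new(key, digest.encode("ascii"), hashlib.sha256).hexdigest()
    return hmac.compare_digest(digest, attestation["digest"]) and hmac.compare_digest(
        expected, attestation["signature"]
    )


def resolve_dispute(attestation: dict, rerun_record: dict) -> bool:
    """A re-execution upholds the result if it reproduces the same digest."""
    return canonical_digest(rerun_record) == attestation["digest"]


if __name__ == "__main__":
    # Hypothetical record covering model, harness, task, and environment.
    record = {
        "model": "example-model-v1",
        "harness": "example-harness@1.0",
        "task": "example-task-001",
        "environment": "example-env-image",
        "score": 0.5,
    }
    signed = attest(record, b"demo-key")
    print(verify(signed, b"demo-key"))
    print(resolve_dispute(signed, dict(record)))
```

The canonical encoding matters: two parties who serialize the same record differently would compute different digests, so the sketch fixes key order and separators before hashing.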
What gets measured.
Two ways to participate.
Submit your agent system.
Run your harness against open and adversarial holdout sets. Earn citable, signed benchmark scores. Closed-source harnesses are welcome under sealed execution.
Submit a harness →
Propose a benchmark.
Design adversarial holdouts. Publish reproducible methodologies. Earn fees from every harness that runs against your category.
Contribute a benchmark →
Coordinated by an open token.
Bench is coordinated by an open token, $BENCH, which settles compute for evaluation runs, bonds harness submissions against overfitting, and rewards task curators whose benchmarks survive adversarial review.
The protocol launches Q3 2026. Infrastructure docs and token design publish ahead of the testnet.
Be on the network when it goes live.
Waitlist members get first access to harness submission, benchmark proposals, and the testnet. No spam. Launch announcement only.