v0.1 · pre-beta · Q3 2026

The open benchmark for AI agents.

Capability no longer lives in the model. It lives in the harness. Bench is where harnesses get measured, under verifiable, reproducible, adversarial conditions.

For harness submissions and benchmark proposals, see below.

01 / Thesis

The model is becoming the controlled variable.

A year ago, the question was which model was best. Now two harnesses running the same model can show wildly different capability. Browsing, tool use, memory, planning, orchestration: the system around the model is where the gap lives.

But harnesses are measured badly. Every lab grades its own. Every agent company claims SOTA on benchmarks it wrote itself. There is no open ground.

Bench Protocol is that ground. Submit a harness, propose a task, run an evaluation. Every result is reproducible, signed, and adversarially graded against holdout sets submitters never see.

02 / Architecture

Four properties hold the protocol together.

Open contribution

Anyone can propose tasks, submit harnesses, or fund benchmark categories. Permissionless on every axis.

Reproducible by design

Every run is fully reconstructible. Anyone can re-execute and verify any historical result.

Adversarial holdout

Public leaderboards run on rotating tasks. Real grading happens on private sets submitters never see.

Signed attestation

Every result carries cryptographic proof of model, harness, task, and environment. Disputes resolve via re-execution.
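
To make reproducibility and attestation concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the protocol's actual format: the manifest fields, the `attest` and `verify` helpers, and the demo key are all hypothetical, and HMAC-SHA256 stands in for whatever public-key signature scheme the protocol ends up using. The idea is that a result is bound to a digest of the exact model, harness, task, and environment, and a dispute is settled by re-running and comparing.

```python
import hashlib
import hmac
import json


def manifest_digest(manifest: dict) -> str:
    """SHA-256 over canonical JSON, so key order does not change the digest."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def attest(manifest: dict, score: float, key: bytes) -> dict:
    """Bind a score to the exact run that produced it.

    Assumption: HMAC is a stand-in here. A real deployment would use an
    asymmetric signature so verifiers never hold the signing key.
    """
    digest = manifest_digest(manifest)
    payload = f"{digest}:{score:.6f}".encode("utf-8")
    tag = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"manifest_digest": digest, "score": score, "tag": tag}


def verify(attestation: dict, manifest: dict, rerun_score: float,
           key: bytes, tol: float = 1e-6) -> bool:
    """Check the tag against the claimed manifest, then compare a re-execution."""
    expected = attest(manifest, attestation["score"], key)
    if not hmac.compare_digest(expected["tag"], attestation["tag"]):
        return False  # manifest or score was altered after attestation
    return abs(rerun_score - attestation["score"]) <= tol


# Hypothetical run description; every value below is a placeholder.
manifest = {
    "model": "example-model-v1",
    "harness": "example-harness@1.0.0",
    "task": "001/task-0042",
    "environment": {"image": "sha256:placeholder", "seed": 7},
}
key = b"demo-key-not-for-production"
attestation = attest(manifest, 0.73, key)
```

Calling `verify(attestation, manifest, rerun_score, key)` with the score from an independent re-execution returns `True` only when the tag matches the claimed manifest and the re-run reproduces the attested score within tolerance.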

03 / Benchmark categories — v1

What gets measured.

001

Long-horizon research

multi-step investigation across web, tools, and synthesis

247 tasks

Live

002

Agentic browsing

browsecomp-class evaluation under real web conditions

1,267 tasks

Live

003

Multi-step code execution

tool-using software engineering with verifiable output

389 tasks

Q3 / 2026

004

Tool-use planning

api selection, ordering, and recovery under uncertainty

156 tasks

Q3 / 2026

005

Multi-agent coordination

cooperation, conflict, and emergent strategy in fixed games

—

Planned

04 / Built for

Two ways to participate.

Harness builders

Submit your agent system.

Run your harness against open and adversarial holdout sets. Earn citable, signed benchmark scores. Closed-source harnesses are welcome under sealed execution.

Submit a harness →

Researchers and labs

Propose a benchmark.

Design adversarial holdouts. Publish reproducible methodologies. Earn fees from every harness that runs against your category.

Contribute a benchmark →

05 / Protocol

Coordinated by an open token.

The token, $BENCH, settles compute for evaluation runs, bonds harness submissions against overfitting, and rewards task curators whose benchmarks survive adversarial review.

The protocol launches in Q3 2026. Infrastructure docs and the token design will be published ahead of the testnet.

Token

Bench Protocol — $BENCH

Address

QfPtPHnPtxBbSNpYJPwoa9exy2E5V7n1PVtxJ5wS7ZU

Compute

Settles evaluation runs across the verifier network.

Stake

Bonds harness submissions. Slashed on holdout cheating.

Curate

Rewards task curators whose benchmarks survive review.
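
As a rough picture of how overfitting might be detected before any bond is touched, the sketch below compares a harness's pass rate on public tasks with its pass rate on a private holdout. Everything here is an assumption made for illustration: the function names, the `max_gap` threshold of 0.15, and the idea that a large gap triggers review are hypothetical, and the actual slashing rules have not been published.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of tasks passed; results must be non-empty."""
    if not results:
        raise ValueError("results must contain at least one task outcome")
    return sum(results) / len(results)


def holdout_gap(public: list[bool], holdout: list[bool]) -> float:
    """Public pass rate minus holdout pass rate.

    A large positive gap suggests the harness was tuned to the public tasks.
    """
    return pass_rate(public) - pass_rate(holdout)


def flag_for_review(public: list[bool], holdout: list[bool],
                    max_gap: float = 0.15) -> bool:
    """Hypothetical rule: flag a submission when the gap exceeds max_gap."""
    return holdout_gap(public, holdout) > max_gap


# Illustrative outcomes only, not real benchmark data.
public_results = [True] * 9 + [False]       # 9 of 10 public tasks passed
holdout_results = [True] * 5 + [False] * 5  # 5 of 10 holdout tasks passed
needs_review = flag_for_review(public_results, holdout_results)
```

A flag of this kind would only mark a submission for closer inspection; deciding what counts as cheating, and how much of a bond is at stake, is left to the protocol's own design.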

06 / Waitlist

Be on the network when it goes live.

Waitlist members get first access to harness submission, benchmark proposals, and the testnet. No spam. Launch announcement only.