Benchmark suites are JSON files that define reproducible test cases for the agent. Each suite contains a set of cases, each with a prompt, expected outcomes, resource limits, and pass criteria. By convention, suites live in `bench/`.
## Full schema
bench/example-suite.json
```json
{
  "name": "core-refactors",
  "description": "Tests for common refactoring tasks",
  "model": "claude-opus-4-6",
  "defaults": {
    "budget": 3.00,
    "timeout": 180
  },
  "cases": [
    {
      "id": "extract-auth",
      "prompt": "Extract auth logic into src/auth.ts",
      "expect": [
        "file:src/auth.ts exists",
        "file:src/auth.ts contains 'export function'",
        "test:npm test passes"
      ],
      "budget": 2.00,
      "timeout": 120
    },
    {
      "id": "add-tests",
      "prompt": "Add unit tests for src/budget.ts",
      "expect": [
        "file:test/budget.test.ts exists",
        "test:npm test passes"
      ]
    }
  ]
}
```

## Top-level fields
| Field | Type | Description |
|---|---|---|
| `name` | string | Suite identifier, used in results output |
| `description` | string | Human-readable description |
| `model` | string | Model to use for all cases (can be overridden per-case) |
| `defaults` | object | Default budget and timeout applied to cases that don't specify their own |
| `cases` | array | Array of test case objects |
## Case fields
| Field | Type | Description |
|---|---|---|
| `id` | string | Unique identifier for the case |
| `prompt` | string | The task to give the agent |
| `expect` | string[] | Array of expectation strings (see below) |
| `budget` | number | Max spend in USD for this case |
| `timeout` | number | Max duration in seconds |
| `model` | string | Override the suite-level model |
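The two tables above can be captured as types, along with the defaults/override precedence (per-case value, then suite `defaults`, then suite `model`). This is a minimal sketch: the interface and function names are assumptions for illustration, not the tool's internal code.

```typescript
// Hypothetical types mirroring the suite schema documented above.
interface SuiteDefaults {
  budget?: number;  // USD
  timeout?: number; // seconds
}

interface BenchCase {
  id: string;
  prompt: string;
  expect: string[];
  budget?: number;  // overrides defaults.budget
  timeout?: number; // overrides defaults.timeout
  model?: string;   // overrides the suite-level model
}

interface BenchSuite {
  name: string;
  description?: string;
  model: string;
  defaults?: SuiteDefaults;
  cases: BenchCase[];
}

// Fill in a case's missing fields from the suite-level settings.
function resolveCase(suite: BenchSuite, c: BenchCase): BenchCase {
  return {
    ...c,
    budget: c.budget ?? suite.defaults?.budget,
    timeout: c.timeout ?? suite.defaults?.timeout,
    model: c.model ?? suite.model,
  };
}
```

With the example suite, `extract-auth` keeps its own `budget: 2.00`, while `add-tests` inherits `3.00` and `180` from `defaults`.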
## Expectation types
Expectations are strings with a type prefix:
- `file:<path> exists` – Assert the file was created on disk
- `file:<path> contains '<string>'` – Assert the file contains a substring
- `file:<path> matches /<regex>/` – Assert the file content matches a regex
- `test:<command>` – Run a shell command; pass if the exit code is 0
- `diff:<path>` – Assert the file was modified from its pre-run state
- `no-file:<path>` – Assert the file does NOT exist (useful for deletion tasks)
All expectations in a case must pass for the case to pass. They're evaluated in order after the agent finishes.
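A parser for these expectation strings might look like the following. This is an illustrative sketch based only on the prefixes listed above; the type and function names are assumptions, and actual evaluation (filesystem checks, shell execution) is omitted.

```typescript
// Structured form of an expectation string (hypothetical representation).
type Expectation =
  | { kind: "file-exists"; path: string }
  | { kind: "file-contains"; path: string; needle: string }
  | { kind: "file-matches"; path: string; pattern: RegExp }
  | { kind: "test"; command: string }
  | { kind: "diff"; path: string }
  | { kind: "no-file"; path: string };

function parseExpectation(raw: string): Expectation {
  let m: RegExpMatchArray | null;
  // file:<path> forms, distinguished by their trailing clause
  if ((m = raw.match(/^file:(\S+) exists$/)))
    return { kind: "file-exists", path: m[1] };
  if ((m = raw.match(/^file:(\S+) contains '(.+)'$/)))
    return { kind: "file-contains", path: m[1], needle: m[2] };
  if ((m = raw.match(/^file:(\S+) matches \/(.+)\/$/)))
    return { kind: "file-matches", path: m[1], pattern: new RegExp(m[2]) };
  // bare-prefix forms: everything after the prefix is the payload
  if (raw.startsWith("test:")) return { kind: "test", command: raw.slice(5) };
  if (raw.startsWith("diff:")) return { kind: "diff", path: raw.slice(5) };
  if (raw.startsWith("no-file:")) return { kind: "no-file", path: raw.slice(8) };
  throw new Error(`Unrecognized expectation: ${raw}`);
}
```

Parsing up front lets a runner report a malformed expectation before spending any budget on the agent.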
## Running suites
Terminal
```bash
# Run a suite
$ x bench bench/example-suite.json

# Run with concurrency
$ x bench bench/example-suite.json --concurrent 4

# Save results to file
$ x bench bench/example-suite.json --output results.json

# Verbose mode: show agent output per case
$ x bench bench/example-suite.json --verbose
```
## Results format
Results are JSON with per-case metrics:
results.json
```json
{
  "suite": "core-refactors",
  "passed": 1,
  "failed": 1,
  "total": 2,
  "totalCost": 3.42,
  "cases": [
    {
      "id": "extract-auth",
      "pass": true,
      "cost": 1.87,
      "duration": 45,
      "tokens": { "input": 12400, "output": 3200 }
    }
  ]
}
```
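A results file in this shape is easy to post-process. A minimal sketch, assuming the field names shown above (the types and the `summarize` helper are illustrative, not part of the tool):

```typescript
// Hypothetical types matching the results format above.
interface CaseResult {
  id: string;
  pass: boolean;
  cost: number;     // USD
  duration: number; // seconds
  tokens: { input: number; output: number };
}

interface SuiteResults {
  suite: string;
  passed: number;
  failed: number;
  total: number;
  totalCost: number; // USD across all cases
  cases: CaseResult[];
}

// One-line summary suitable for a CI log.
function summarize(r: SuiteResults): string {
  const rate = r.total === 0 ? 0 : (r.passed / r.total) * 100;
  return `${r.suite}: ${r.passed}/${r.total} passed (${rate.toFixed(0)}%), $${r.totalCost.toFixed(2)} total`;
}
```

For the example above this yields `core-refactors: 1/2 passed (50%), $3.42 total`, which is convenient for gating a pipeline on pass rate or cost.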