Benchmark suites are JSON files that define reproducible test cases for the agent. Each case pairs a prompt with a set of expectations that determine pass/fail, plus optional budget and timeout limits. Suites live in bench/ by convention.

Full schema

bench/example-suite.json
{
  "name": "core-refactors",
  "description": "Tests for common refactoring tasks",
  "model": "claude-opus-4-6",
  "defaults": {
    "budget": 3.00,
    "timeout": 180
  },
  "cases": [
    {
      "id": "extract-auth",
      "prompt": "Extract auth logic into src/auth.ts",
      "expect": [
        "file:src/auth.ts exists",
        "file:src/auth.ts contains 'export function'",
        "test:npm test passes"
      ],
      "budget": 2.00,
      "timeout": 120
    },
    {
      "id": "add-tests",
      "prompt": "Add unit tests for src/budget.ts",
      "expect": [
        "file:test/budget.test.ts exists",
        "test:npm test passes"
      ]
    }
  ]
}

Top-level fields

Field         Type     Description
name          string   Suite identifier, used in results output
description   string   Human-readable description
model         string   Model to use for all cases (can be overridden per-case)
defaults      object   Default budget and timeout applied to cases that don't specify their own
cases         array    Array of test case objects
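Per the table above, defaults is a fallback: a case's own budget or timeout wins when present, and the same holds for the suite-level model. A minimal sketch of that resolution, assuming a simple shallow merge (the function name resolve_case is hypothetical, not part of the tool):

```python
def resolve_case(case: dict, suite: dict) -> dict:
    """Merge suite-level defaults into a case; case-level values win."""
    resolved = {**suite.get("defaults", {}), **case}
    # The suite model applies unless the case overrides it.
    resolved.setdefault("model", suite.get("model"))
    return resolved
```

With the example suite, `resolve_case` gives "extract-auth" its own budget of 2.00 while "add-tests" inherits 3.00 and the 180-second timeout from defaults.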

Case fields

Field     Type       Description
id        string     Unique identifier for the case
prompt    string     The task to give the agent
expect    string[]   Array of expectation strings (see below)
budget    number     Max spend in USD for this case
timeout   number     Max duration in seconds
model     string     Override the suite-level model

Expectation types

Expectations are strings with a type prefix. The example suite above uses two:

file:<path> exists              The file must exist when the run ends
file:<path> contains '<text>'   The file must contain the given text
test:<command> passes           The command must exit with status 0

All expectations in a case must pass for the case to pass. They're evaluated in order after the agent finishes.
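The tool's own evaluator is not shown here; as a rough illustration of how prefix-based expectation strings like those in the example suite could be checked (the function name and parsing details are assumptions):

```python
import subprocess
from pathlib import Path

def check_expectation(expect: str, cwd: str = ".") -> bool:
    """Evaluate one '<type>:<rest>' expectation string against a working dir."""
    kind, _, rest = expect.partition(":")
    if kind == "file":
        if " contains " in rest:
            path, _, quoted = rest.partition(" contains ")
            target = Path(cwd) / path
            # Strip the surrounding single quotes from the expected text.
            return target.exists() and quoted.strip("'") in target.read_text()
        if rest.endswith(" exists"):
            return (Path(cwd) / rest[: -len(" exists")]).exists()
    elif kind == "test":
        # "npm test passes" -> run "npm test" and require exit status 0.
        command = rest.removesuffix(" passes")
        return subprocess.run(command, shell=True, cwd=cwd).returncode == 0
    raise ValueError(f"unsupported expectation: {expect!r}")
```

A runner would call this once per string in a case's expect array, in order, and fail the case on the first False.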

Running suites

Terminal
# Run a suite
$ x bench bench/example-suite.json

# Run with concurrency
$ x bench bench/example-suite.json --concurrent 4

# Save results to file
$ x bench bench/example-suite.json --output results.json

# Verbose mode — show agent output per case
$ x bench bench/example-suite.json --verbose

Results format

Results are JSON with per-case metrics:

results.json
{
  "suite": "core-refactors",
  "passed": 1,
  "failed": 1,
  "total": 2,
  "totalCost": 3.42,
  "cases": [
    {
      "id": "extract-auth",
      "pass": true,
      "cost": 1.87,
      "duration": 45,
      "tokens": { "input": 12400, "output": 3200 }
    }
  ]
}
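Because results are plain JSON, they're easy to post-process. A small sketch that renders the format above as a one-line summary plus per-case costs (the function name summarize is hypothetical):

```python
def summarize(results: dict) -> str:
    """Render a pass-rate summary and per-case cost lines from a results dict."""
    lines = [
        f"{results['suite']}: {results['passed']}/{results['total']} passed, "
        f"${results['totalCost']:.2f} total"
    ]
    for case in results["cases"]:
        status = "PASS" if case["pass"] else "FAIL"
        lines.append(f"  {status} {case['id']}  ${case['cost']:.2f}  {case['duration']}s")
    return "\n".join(lines)
```

Usage: `print(summarize(json.load(open("results.json"))))`.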