Benchmark suites are JSON files that define reproducible test cases for the agent. Each suite contains a set of cases, each with a prompt, expected outcomes, resource limits, and pass criteria. By convention, suites live in `bench/`.
## Full schema
bench/example-suite.json
```json
{
  "name": "core-refactors",
  "description": "Tests for common refactoring tasks",
  "model": "claude-opus-4-6",
  "defaults": {
    "budget": 3.00,
    "timeout": 180
  },
  "cases": [
    {
      "id": "extract-auth",
      "prompt": "Extract auth logic into src/auth.ts",
      "expect": [
        "file:src/auth.ts exists",
        "file:src/auth.ts contains 'export function'",
        "test:npm test passes"
      ],
      "budget": 2.00,
      "timeout": 120
    },
    {
      "id": "add-tests",
      "prompt": "Add unit tests for src/budget.ts",
      "expect": [
        "file:test/budget.test.ts exists",
        "test:npm test passes"
      ]
    }
  ]
}
```

## Top-level fields
| Field | Type | Description |
|---|---|---|
| `name` | string | Suite identifier, used in results output |
| `description` | string | Human-readable description |
| `model` | string | Model to use for all cases (can be overridden per-case) |
| `defaults` | object | Default budget and timeout applied to cases that don't specify their own |
| `cases` | array | Array of test case objects |
## Case fields
| Field | Type | Description |
|---|---|---|
| `id` | string | Unique identifier for the case |
| `prompt` | string | The task to give the agent |
| `expect` | string[] | Array of expectation strings (see below) |
| `budget` | number | Max spend in USD for this case |
| `timeout` | number | Max duration in seconds |
| `model` | string | Override the suite-level model |
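The two tables above can be captured as types, along with the defaults/override precedence (per-case value, then suite `defaults`, then suite `model`). This is a minimal sketch: the interface and function names are assumptions for illustration, not the tool's internal code.

```typescript
// Hypothetical types mirroring the suite schema documented above.
interface SuiteDefaults {
  budget?: number;  // USD
  timeout?: number; // seconds
}

interface BenchCase {
  id: string;
  prompt: string;
  expect: string[];
  budget?: number;  // overrides defaults.budget
  timeout?: number; // overrides defaults.timeout
  model?: string;   // overrides the suite-level model
}

interface BenchSuite {
  name: string;
  description?: string;
  model: string;
  defaults?: SuiteDefaults;
  cases: BenchCase[];
}

// Fill in a case's missing fields from the suite-level settings.
function resolveCase(suite: BenchSuite, c: BenchCase): BenchCase {
  return {
    ...c,
    budget: c.budget ?? suite.defaults?.budget,
    timeout: c.timeout ?? suite.defaults?.timeout,
    model: c.model ?? suite.model,
  };
}
```

With the example suite, `extract-auth` keeps its own `budget: 2.00`, while `add-tests` inherits `3.00` and `180` from `defaults`.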
## Expectation types
Expectations are strings with a type prefix:
- `file:<path> exists` – Assert the file was created on disk
- `file:<path> contains '<string>'` – Assert the file contains a substring
- `file:<path> matches /<regex>/` – Assert the file content matches a regex
- `test:<command>` – Run a shell command; pass if the exit code is 0
- `diff:<path>` – Assert the file was modified from its pre-run state
- `no-file:<path>` – Assert the file does NOT exist (useful for deletion tasks)
All expectations in a case must pass for the case to pass. They're evaluated in order after the agent finishes.
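A parser for these expectation strings might look like the following. This is an illustrative sketch based only on the prefixes listed above; the type and function names are assumptions, and actual evaluation (filesystem checks, shell execution) is omitted.

```typescript
// Structured form of an expectation string (hypothetical representation).
type Expectation =
  | { kind: "file-exists"; path: string }
  | { kind: "file-contains"; path: string; needle: string }
  | { kind: "file-matches"; path: string; pattern: RegExp }
  | { kind: "test"; command: string }
  | { kind: "diff"; path: string }
  | { kind: "no-file"; path: string };

function parseExpectation(raw: string): Expectation {
  let m: RegExpMatchArray | null;
  // file:<path> forms, distinguished by their trailing clause
  if ((m = raw.match(/^file:(\S+) exists$/)))
    return { kind: "file-exists", path: m[1] };
  if ((m = raw.match(/^file:(\S+) contains '(.+)'$/)))
    return { kind: "file-contains", path: m[1], needle: m[2] };
  if ((m = raw.match(/^file:(\S+) matches \/(.+)\/$/)))
    return { kind: "file-matches", path: m[1], pattern: new RegExp(m[2]) };
  // bare-prefix forms: everything after the prefix is the payload
  if (raw.startsWith("test:")) return { kind: "test", command: raw.slice(5) };
  if (raw.startsWith("diff:")) return { kind: "diff", path: raw.slice(5) };
  if (raw.startsWith("no-file:")) return { kind: "no-file", path: raw.slice(8) };
  throw new Error(`Unrecognized expectation: ${raw}`);
}
```

Parsing up front lets a runner report a malformed expectation before spending any budget on the agent.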
## Running suites
Terminal
```bash
# Run a suite
$ x bench bench/example-suite.json

# Run with concurrency
$ x bench bench/example-suite.json --concurrent 4

# Save results to file
$ x bench bench/example-suite.json --output results.json

# Verbose mode: show agent output per case
$ x bench bench/example-suite.json --verbose
```
## Results format
Results are JSON with per-case metrics:
results.json
```json
{
  "suite": "core-refactors",
  "passed": 1,
  "failed": 1,
  "total": 2,
  "totalCost": 3.42,
  "cases": [
    {
      "id": "extract-auth",
      "pass": true,
      "cost": 1.87,
      "duration": 45,
      "tokens": { "input": 12400, "output": 3200 }
    }
  ]
}
```
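A results file in this shape is easy to post-process. A minimal sketch, assuming the field names shown above (the types and the `summarize` helper are illustrative, not part of the tool):

```typescript
// Hypothetical types matching the results format above.
interface CaseResult {
  id: string;
  pass: boolean;
  cost: number;     // USD
  duration: number; // seconds
  tokens: { input: number; output: number };
}

interface SuiteResults {
  suite: string;
  passed: number;
  failed: number;
  total: number;
  totalCost: number; // USD across all cases
  cases: CaseResult[];
}

// One-line summary suitable for a CI log.
function summarize(r: SuiteResults): string {
  const rate = r.total === 0 ? 0 : (r.passed / r.total) * 100;
  return `${r.suite}: ${r.passed}/${r.total} passed (${rate.toFixed(0)}%), $${r.totalCost.toFixed(2)} total`;
}
```

For the example above this yields `core-refactors: 1/2 passed (50%), $3.42 total`, which is convenient for gating a pipeline on pass rate or cost.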