Commit e8aaa77
authored
🤖 ci: increase terminal-bench global timeout to 30 minutes (#495)
## Problem
Fixed 15-minute timeout caused **27-35% of tasks to timeout** in nightly
runs. Analysis of the Oct 30 nightly run revealed:
- 22 timeouts for Anthropic (27.5%), 28 for OpenAI (35%)
- **5-6 tasks passed tests but hit timeout** - would have succeeded with
more time
- Longest successful task: `blind-maze-explorer-algorithm.hard` at 1200s
(20 minutes)
- Mean task duration: 356s (Anthropic) / 438s (OpenAI)
Additionally, agent output was human-readable text making it difficult
to analyze programmatically.
## Solution
Two improvements:
### 1. Global Timeout Increase
Set **global timeout to 30 minutes (1800 seconds)** for all tasks.
**Design Rationale:**
- Longest successful task took 20 minutes
- 30 minutes provides comfortable headroom without excessive wait times
- Avoids maintenance burden of per-task configuration
- Users can override with `TB_TIMEOUT` env var if needed
### 2. JSON Lines Output
Enable `--json-streaming` flag in agent CLI to output structured JSON
lines instead of human-readable text.
**Benefits:**
- Machine-readable output for programmatic analysis
- Easier to parse agent events, tool calls, and results
- Better integration with analysis pipelines
### Makefile Changes
- Default `TB_TIMEOUT` to 1800 seconds (30 minutes)
- Simplified timeout logic - removed per-task calculation
- Backward compatible with `TB_TIMEOUT` env var override
### Usage
```bash
# Uses 30-minute default automatically
make benchmark-terminal
# Override for longer tasks
TB_TIMEOUT=3600 make benchmark-terminal
# Override for quick iteration
TB_TIMEOUT=600 make benchmark-terminal TB_SAMPLE_SIZE=5
```
## Expected Impact
- **Reduce false timeout failures by ~50%** (22-28 timeouts → 11-14
timeouts)
- **Improve pass rates by 10-15 percentage points** (42% → 52-57%)
- **Better analysis capabilities** with JSON lines output
- **No workflow changes needed** - Makefile change applies automatically
- **Simple and maintainable** - Single global default, no per-task
config
## Documentation
Updated `benchmarks/terminal_bench/README.md` to document:
- Preference for global timeout defaults over per-task configuration
- Rationale based on Oct 30 nightly run analysis
- How to override timeout with `TB_TIMEOUT` env var
## Evidence
Tasks from 2025-10-30 nightly run that motivated this change:
**Tasks that passed but hit 15-minute timeout:**
- `blind-maze-explorer-algorithm.hard`: ✓ passed at 1200s (20 min)
- `qemu-startup`: ✓ passed at 838s (14 min)
- `count-dataset-tokens`: Anthropic timed out at 808s (13.5 min)
- `path-tracing`: ✓ passed at 660s (11 min)
- `pytorch-model-cli`: ✓ passed at 541s (9 min)
**95th percentile task duration:** ~15 minutes
With 30-minute timeout, all these tasks would have succeeded.
## Backward Compatibility
✅ Existing workflows continue to work unchanged
✅ `TB_TIMEOUT` env var provides manual override
✅ Default behavior provides better coverage than 15-minute timeout
✅ JSON output doesn't break existing analysis tools (they just see more
structured data)
_Generated with `cmux`_1 parent cc64299 commit e8aaa77
File tree
4 files changed
+114
-3
lines changed- .github/workflows
- benchmarks/terminal_bench
4 files changed
+114
-3
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
120 | 120 | | |
121 | 121 | | |
122 | 122 | | |
123 | | - | |
| 123 | + | |
124 | 124 | | |
125 | 125 | | |
126 | 126 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
295 | 295 | | |
296 | 296 | | |
297 | 297 | | |
298 | | - | |
| 298 | + | |
299 | 299 | | |
| 300 | + | |
300 | 301 | | |
301 | 302 | | |
302 | 303 | | |
| |||
317 | 318 | | |
318 | 319 | | |
319 | 320 | | |
| 321 | + | |
320 | 322 | | |
321 | 323 | | |
322 | 324 | | |
323 | 325 | | |
| 326 | + | |
324 | 327 | | |
325 | 328 | | |
326 | 329 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
| 93 | + | |
| 94 | + | |
| 95 | + | |
| 96 | + | |
| 97 | + | |
| 98 | + | |
| 99 | + | |
| 100 | + | |
| 101 | + | |
| 102 | + | |
| 103 | + | |
| 104 | + | |
| 105 | + | |
| 106 | + | |
| 107 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
102 | 102 | | |
103 | 103 | | |
104 | 104 | | |
105 | | - | |
| 105 | + | |
| 106 | + | |
106 | 107 | | |
107 | 108 | | |
108 | 109 | | |
| |||
0 commit comments