
Conversation

@ammar-agent (Collaborator) commented Nov 8, 2025

Implements hysteresis-based adaptive concurrency control for terminal-bench that automatically adjusts parallelism (1-16) based on system load average.

Problem

Terminal-bench runs with fixed concurrency, which means:

  • Too low = underutilizes hardware
  • Too high = overloads system
  • Optimal value varies by hardware

Users must manually tune concurrency for their system.

Solution

Adaptive concurrency using burst-and-resume pattern:

  1. Start with concurrency=1
  2. Run terminal-bench burst with current concurrency
  3. After burst completes, check 1-minute load average
  4. Hysteresis adjustment (see the sketch after this list):
    • Double concurrency when load < threshold (default: 1.0)
    • Halve concurrency when load > threshold
    • Clamp to [1, 16], bounds that cover typical hardware safely
  5. Update tb.lock with new concurrency
  6. Resume run (automatically skips completed tasks)
  7. Repeat until all tasks complete
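
The adjustment rule is small enough to sketch inline. A minimal illustration, assuming the bounds and default threshold described above (the function name and the behavior at load == threshold are illustrative, not lifted from adaptive_bench.py):

```python
import os

MIN_CONCURRENT, MAX_CONCURRENT = 1, 16

def adjust_concurrency(current: int, threshold: float = 1.0) -> int:
    """Double under the load threshold, halve above it, clamp to [1, 16]."""
    load_1min = os.getloadavg()[0]  # 1-minute load average (Unix only)
    proposed = current * 2 if load_1min < threshold else current // 2
    return max(MIN_CONCURRENT, min(MAX_CONCURRENT, proposed))
```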

Why Burst-and-Resume?

Research showed that terminal-bench uses a fixed-size ThreadPoolExecutor:

  • Cannot dynamically resize thread pool mid-run
  • Kill-and-restart would interrupt Docker containers mid-task
  • Resume only works after natural termination

Burst-and-resume leverages terminal-bench's native resume capability for clean checkpointing and task skipping.
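
For context, the constraint lives in the standard library itself: a ThreadPoolExecutor fixes its worker count at construction and exposes no public API to resize it afterwards.

```python
from concurrent.futures import ThreadPoolExecutor

# max_workers is set once at construction; there is no public resize
# method, so changing parallelism means letting this pool finish and
# starting a new one, which is exactly what burst-and-resume does.
pool = ThreadPoolExecutor(max_workers=4)
```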

Implementation

  • benchmarks/terminal_bench/adaptive_bench.py: Wrapper with burst-and-resume logic
  • benchmarks/terminal_bench/adaptive_bench_test.py: Unit tests
  • Makefile: benchmark-terminal now uses adaptive mode
  • Documentation in benchmarks/terminal_bench/README.md

Usage

# Simple - just works with 1-16 adaptive scaling
TB_SAMPLE_SIZE=8 make benchmark-terminal

# Tune load threshold (default: 1.0)
TB_LOAD_THRESHOLD=2.0 make benchmark-terminal

# Faster adjustments (default: 60s between bursts)
TB_CHECK_INTERVAL=30 make benchmark-terminal

# Full benchmark suite
make benchmark-terminal

Configuration

| Variable | Default | Description |
| --- | --- | --- |
| TB_LOAD_THRESHOLD | 1.0 | Load average threshold |
| TB_CHECK_INTERVAL | 60 | Seconds between bursts |
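
Both knobs arrive as environment variables. A sketch of how the wrapper might read them (names from the table above; the exact parsing in adaptive_bench.py may differ):

```python
import os

load_threshold = float(os.environ.get("TB_LOAD_THRESHOLD", "1.0"))
check_interval = int(os.environ.get("TB_CHECK_INTERVAL", "60"))
```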

Example Flow

🚀 Burst #1: Starting NEW run with concurrency=1
⏱️  Burst #1 completed in 180.3s
📊 Load: 0.5 (threshold: 1.0) → Concurrency: 1 → 2

🔄 Burst #2: Resuming run with concurrency=2
⏱️  Burst #2 completed in 120.7s
📊 Load: 0.7 (threshold: 1.0) → Concurrency: 2 → 4

🔄 Burst #3: Resuming run with concurrency=4
⏱️  Burst #3 completed in 90.5s
📈 Progress: 8/8 tasks (0 remaining)
✅ All tasks completed!

Tradeoffs

Advantages:

  • ✅ Automatically finds optimal concurrency for hardware
  • ✅ Prevents system overload
  • ✅ Clean Docker container lifecycle (no mid-task kills)
  • ✅ Uses terminal-bench native features (resume, tb.lock)
  • ✅ Bounded to [1, 16] for safety

Disadvantages:

  • ⚠️ Burst overhead ~2-5s (acceptable for 6+ minute avg task duration)
  • ⚠️ Modifies tb.lock (semi-internal format, but stable)
  • ⚠️ Adjustment latency = burst duration + check interval (e.g., a ~3-minute burst plus the default 60 s interval means ~4 minutes before an adjustment takes effect)

Testing

Tested locally with:

  • 3 tasks, verified burst-and-resume pattern works
  • Confirmed completion detection works correctly
  • Verified tb.lock updates between bursts

Changes

  • Commit 1: Initial implementation with burst-and-resume
  • Commit 2: Added workflow support
  • Commit 3: Simplified to hardcoded 1-16 bounds (removed max_concurrent parameter)
  • Commit 4: Fixed bug parsing results.json ('results' field not 'trials')

Generated with cmux

@ammar-agent (Collaborator, Author) commented:

🚀 Testing Adaptive Concurrency

Triggered workflow_dispatch with:

  • adaptive_mode=true
  • max_concurrent=8
  • sample_size=8
  • livestream=true

The workflow will test the burst-and-resume pattern by:

  1. Starting with concurrency=1
  2. Monitoring system load after each burst
  3. Doubling/halving concurrency based on load
  4. Resuming and skipping completed tasks

Watch the workflow here: https://github.com/coder/cmux/actions/workflows/terminal-bench.yml?query=branch%3Atb-hysteresis

Expected behavior:

  • Multiple bursts with increasing concurrency (1 → 2 → 4 → 8)
  • Load-based adjustments logged
  • All 8 tasks complete successfully
  • tb.lock updated with new concurrency between bursts

@chatgpt-codex-connector (bot) commented:

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on benchmarks/terminal_bench/adaptive_bench.py, lines 119 to 215
            # Subsequent bursts - resume existing run
            cmd = [
                "uvx",
                "terminal-bench",
                "runs",
                "resume",
                "--run-id",
                self.run_id,
                "--runs-dir",
                str(self.runs_dir),
            ]
            print(
                f"🔄 Burst #{self.burst_count}: Resuming run {self.run_id} "
                f"with concurrency={self.current_concurrent}"
            )

        print(f"   Command: {' '.join(cmd)}")
        burst_start = time.time()

        # Run terminal-bench
        result = subprocess.run(cmd, env=os.environ.copy())

        burst_duration = time.time() - burst_start

        # Capture run_id from first burst
        if self.burst_count == 1 and result.returncode == 0:
            # Find most recent run directory
            if self.runs_dir.exists():
                run_dirs = [
                    d
                    for d in self.runs_dir.iterdir()
                    if d.is_dir() and (d / "tb.lock").exists()
                ]
                if run_dirs:
                    # Sort by modification time and take most recent
                    self.run_id = sorted(run_dirs, key=lambda p: p.stat().st_mtime)[
                        -1
                    ].name
                    print(f"📝 Captured run_id: {self.run_id}")

        print(f"⏱️ Burst #{self.burst_count} completed in {burst_duration:.1f}s")

        # Update n_concurrent in tb.lock for next resume
        if self.run_id and result.returncode == 0:
            self._update_lock_concurrency()

        return result.returncode

    def _update_lock_concurrency(self):
        """Update n_concurrent_trials in tb.lock for next resume."""
        lock_path = self.runs_dir / self.run_id / "tb.lock"
        if not lock_path.exists():
            return

        try:
            with open(lock_path, "r") as f:
                lock_data = json.load(f)

            # Update concurrency in lock file
            if "run_config" in lock_data:
                lock_data["run_config"][
                    "n_concurrent_trials"
                ] = self.current_concurrent

            with open(lock_path, "w") as f:
                json.dump(lock_data, f, indent=2)

            print(f"   Updated tb.lock with concurrency={self.current_concurrent}")
        except Exception as e:
            print(f"⚠️ Could not update tb.lock: {e}")

    def run(self):
        """Main loop: run bursts with adaptive concurrency."""
        try:
            while True:
                # Run burst with current concurrency
                exit_code = self.run_burst()

                if exit_code != 0:
                    print(f"❌ Terminal-bench exited with code {exit_code}")
                    return exit_code

                # Check if we're done
                status = self.get_run_status()
                print(
                    f"📈 Progress: {status['completed']}/{status['total']} tasks "
                    f"({status['incomplete']} remaining)"
                )

                if status["incomplete"] == 0:
                    print("✅ All tasks completed!")
                    return 0

                # Wait before next burst and potentially adjust concurrency
                print(f"⏸️ Waiting {self.check_interval}s before next burst...")
                time.sleep(self.check_interval)
                self.adjust_concurrency()


P1: Apply new concurrency before next burst

In AdaptiveBench.run_burst the resume command (lines 119‑129) never passes the newly selected concurrency to terminal-bench, and the only place where n_concurrent_trials is updated is after the burst finishes (lines 161‑163). Because run() adjusts self.current_concurrent only after the burst completes (lines 201‑215), the lock file still contains the previous value when the next runs resume command launches, so the “updated” concurrency does not take effect until the following burst. If the system is overloaded, an additional full burst still runs at the old high concurrency before the reduction is applied. Consider writing the adjusted concurrency into tb.lock (or passing it via CLI) before invoking the next resume so the change takes effect immediately.


@ammar-agent (Collaborator, Author) commented:

Simplified Adaptive Concurrency

Updated implementation to remove configuration complexity:

Changes:

  • ✅ Hardcoded concurrency bounds to [1, 16] (removed --max-concurrent parameter)
  • ✅ Made benchmark-terminal an alias for adaptive mode (adaptive is now the only mode)
  • ✅ Simplified workflow inputs (removed adaptive_mode toggle, concurrency input)
  • ✅ Updated documentation to reflect streamlined interface

Rationale:

  • 1-16 range covers all hardware configurations appropriately
  • Single-core systems work fine at min=1
  • 16 is reasonable max parallelism for Docker containers
  • Removes unnecessary configuration decisions for users

Original workflow (run #19199022294) is still running with the old interface (adaptive_mode=true). Will check status once complete.

New interface:

# Simple - just works with 1-16 adaptive scaling
TB_SAMPLE_SIZE=8 make benchmark-terminal

# Tune if needed
TB_LOAD_THRESHOLD=2.0 make benchmark-terminal

Commit 1: Initial implementation with burst-and-resume

Implements adaptive concurrency control for terminal-bench using a
burst-and-resume pattern that automatically adjusts parallelism based on
system load average.

## Key Features

- **Hysteresis-based adjustment**: Double concurrency when load < threshold,
  halve when load > threshold
- **Burst-and-resume pattern**: Runs terminal-bench in bursts, using native
  resume capability to skip completed tasks between bursts
- **Clean container lifecycle**: No mid-task interruption, each burst
  completes naturally before adjusting
- **Configurable parameters**: Max concurrency, load threshold, check interval

## Implementation

- `benchmarks/terminal_bench/adaptive_bench.py`: Main wrapper implementing
  burst-and-resume logic with load monitoring
- `benchmarks/terminal_bench/adaptive_bench_test.py`: Unit tests for adaptive
  logic
- `Makefile`: New `benchmark-terminal-adaptive` target
- Documentation updates in `benchmarks/terminal_bench/README.md`

## Usage

```bash
# Start with concurrency=1, scale up to 16 based on load
TB_MAX_CONCURRENT=16 make benchmark-terminal-adaptive

# Conservative: max 8, higher load threshold
TB_MAX_CONCURRENT=8 TB_LOAD_THRESHOLD=2.0 make benchmark-terminal-adaptive

# Sample 5 tasks with adaptive concurrency
TB_SAMPLE_SIZE=5 TB_MAX_CONCURRENT=8 make benchmark-terminal-adaptive
```

## How It Works

1. Start with concurrency=1
2. Run terminal-bench burst with current concurrency
3. After burst completes, check 1-minute load average
4. Adjust concurrency: double if load < threshold, halve if load > threshold
5. Update tb.lock with new concurrency
6. Resume run (skips completed tasks automatically)
7. Repeat until all tasks complete

## Tradeoffs

- ✅ Automatically finds optimal concurrency for hardware
- ✅ Prevents system overload
- ✅ Uses terminal-bench native features (resume, tb.lock)
- ⚠️ Burst overhead ~2-5s (acceptable for 6+ minute avg task duration)
- ⚠️ Modifies tb.lock (semi-internal format, but stable)

## Design Rationale

Research showed that terminal-bench uses a fixed-size ThreadPoolExecutor that
cannot be resized mid-run. A kill-and-restart approach would interrupt Docker
containers mid-task. Burst-and-resume leverages terminal-bench's built-in
resume capability for clean checkpointing and task skipping.

_Generated with `cmux`_
Commit 2: Added workflow support

Add workflow_dispatch inputs for adaptive concurrency mode:
- adaptive_mode: Enable adaptive concurrency (default: false)
- max_concurrent: Max concurrency for adaptive mode (default: 16)
- load_threshold: Load threshold for adjustments (default: 1.0)

When adaptive_mode=true, runs benchmark-terminal-adaptive instead of
benchmark-terminal.

_Generated with `cmux`_
Commit 3: Simplified to hardcoded 1-16 bounds

Make adaptive concurrency the default and only mode for terminal-bench:

- Hardcode MIN_CONCURRENT=1, MAX_CONCURRENT=16 in adaptive_bench.py
- Remove --max-concurrent CLI argument (no longer needed)
- Make `benchmark-terminal` an alias for `benchmark-terminal-adaptive`
- Simplify workflow inputs (remove adaptive_mode toggle, concurrency input)
- Update documentation to reflect simplified interface

This removes unnecessary configuration complexity while providing sensible
bounds for all hardware configurations. The 1-16 range covers:
- Single-core systems (min=1)
- High-core systems (max=16 is reasonable parallelism for Docker containers)
- Load-based adjustment within these bounds

_Generated with `cmux`_
Commit 4: Fixed results.json parsing

Terminal-bench's results.json uses a 'results' field, not 'trials'. This caused
get_run_status() to always return completed=0, leading to infinite loops where
the script would keep resuming even after all tasks were done.

Tested locally with 3 tasks - script now correctly detects completion and exits.
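
A sketch of the corrected counting logic; everything about results.json beyond the top-level 'results' list is assumed here, not confirmed by this PR:

```python
import json
from pathlib import Path

def get_run_status(run_dir: Path, total_tasks: int) -> dict:
    """Count finished trials from results.json (sketch)."""
    results_path = run_dir / "results.json"
    completed = 0
    if results_path.exists():
        with open(results_path) as f:
            data = json.load(f)
        # The bug: reading a nonexistent "trials" key always yielded 0.
        completed = len(data.get("results", []))
    return {
        "completed": completed,
        "total": total_tasks,
        "incomplete": total_tasks - completed,
    }
```
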
Commit 5 (9e4144e): Update tb.lock before resume so the new concurrency takes effect immediately

Addresses Codex feedback: Previously, concurrency adjustments were written to
tb.lock AFTER a burst completed, but the next resume command would read the
tb.lock at the START of the burst. This created a 1-burst delay where the
old concurrency was used even after adjustment.

Now updates tb.lock BEFORE calling 'terminal-bench runs resume', ensuring
the new concurrency takes effect immediately. This is critical when the system
is overloaded - we need to reduce concurrency on the very next burst, not one
burst later.

Flow before fix:
  Burst N completes → adjust_concurrency() → write tb.lock
  Burst N+1 starts → resume reads OLD tb.lock value

Flow after fix:
  adjust_concurrency() completes
  Burst N+1 starts → write tb.lock → resume reads NEW tb.lock value
@ammar-agent (Collaborator, Author) commented:

Addressed Codex feedback (timing bug in commit 9e4144e)

The issue was that tb.lock was being updated AFTER a burst completed, but the next resume command reads tb.lock at the START of the burst. This created a 1-burst delay before adjusted concurrency would take effect.

Fixed by moving the update before resume:

  • Old: Burst completes → write tb.lock → (next burst) → resume reads old value
  • New: (next burst) → write tb.lock → resume reads new value immediately

This is critical when system is overloaded - we need to reduce concurrency on the very next burst, not one burst later.
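
In terms of the run_burst snippet quoted in the Codex review above, the reordering looks roughly like this (a sketch of the fix, not the exact committed diff; it reuses the attributes and imports from that snippet):

```python
    def run_burst(self):
        # After the fix: write the freshly adjusted concurrency into tb.lock
        # BEFORE invoking resume, so this burst (not the next one) uses it.
        if self.run_id:
            self._update_lock_concurrency()  # writes n_concurrent_trials

        cmd = [
            "uvx", "terminal-bench", "runs", "resume",
            "--run-id", self.run_id,
            "--runs-dir", str(self.runs_dir),
        ]
        result = subprocess.run(cmd, env=os.environ.copy())
        return result.returncode
```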

Also rebased on main and resolved conflicts in workflow + Makefile.
