# 🤖 feat: adaptive concurrency for terminal-bench with hysteresis #538
base: main
Conversation
🚀 Testing Adaptive Concurrency

Triggered workflow_dispatch with:

The workflow will test the burst-and-resume pattern by:

Watch the workflow here: https://github.com/coder/cmux/actions/workflows/terminal-bench.yml?query=branch%3Atb-hysteresis

Expected behavior:
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
        # Subsequent bursts - resume existing run
        cmd = [
            "uvx",
            "terminal-bench",
            "runs",
            "resume",
            "--run-id",
            self.run_id,
            "--runs-dir",
            str(self.runs_dir),
        ]
        print(
            f"🔄 Burst #{self.burst_count}: Resuming run {self.run_id} "
            f"with concurrency={self.current_concurrent}"
        )

        print(f"   Command: {' '.join(cmd)}")
        burst_start = time.time()

        # Run terminal-bench
        result = subprocess.run(cmd, env=os.environ.copy())

        burst_duration = time.time() - burst_start

        # Capture run_id from first burst
        if self.burst_count == 1 and result.returncode == 0:
            # Find most recent run directory
            if self.runs_dir.exists():
                run_dirs = [
                    d
                    for d in self.runs_dir.iterdir()
                    if d.is_dir() and (d / "tb.lock").exists()
                ]
                if run_dirs:
                    # Sort by modification time and take most recent
                    self.run_id = sorted(
                        run_dirs, key=lambda p: p.stat().st_mtime
                    )[-1].name
                    print(f"📝 Captured run_id: {self.run_id}")

        print(f"⏱️ Burst #{self.burst_count} completed in {burst_duration:.1f}s")

        # Update n_concurrent in tb.lock for next resume
        if self.run_id and result.returncode == 0:
            self._update_lock_concurrency()

        return result.returncode

    def _update_lock_concurrency(self):
        """Update n_concurrent_trials in tb.lock for next resume."""
        lock_path = self.runs_dir / self.run_id / "tb.lock"
        if not lock_path.exists():
            return

        try:
            with open(lock_path, "r") as f:
                lock_data = json.load(f)

            # Update concurrency in lock file
            if "run_config" in lock_data:
                lock_data["run_config"]["n_concurrent_trials"] = self.current_concurrent

            with open(lock_path, "w") as f:
                json.dump(lock_data, f, indent=2)

            print(f"   Updated tb.lock with concurrency={self.current_concurrent}")
        except Exception as e:
            print(f"⚠️ Could not update tb.lock: {e}")

    def run(self):
        """Main loop: run bursts with adaptive concurrency."""
        try:
            while True:
                # Run burst with current concurrency
                exit_code = self.run_burst()

                if exit_code != 0:
                    print(f"❌ Terminal-bench exited with code {exit_code}")
                    return exit_code

                # Check if we're done
                status = self.get_run_status()
                print(
                    f"📈 Progress: {status['completed']}/{status['total']} tasks "
                    f"({status['incomplete']} remaining)"
                )

                if status["incomplete"] == 0:
                    print("✅ All tasks completed!")
                    return 0

                # Wait before next burst and potentially adjust concurrency
                print(f"⏸️ Waiting {self.check_interval}s before next burst...")
                time.sleep(self.check_interval)
                self.adjust_concurrency()
```
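The `adjust_concurrency` step invoked at the end of `run()` is not part of this excerpt. As a hedged sketch, the hysteresis policy described in this PR (double below the load threshold, halve above it, clamped to the hardcoded 1-16 bounds) reduces to a small pure function; the helper name `next_concurrency` is mine, not the PR's:

```python
import os

MIN_CONCURRENT = 1   # hardcoded bounds per this PR
MAX_CONCURRENT = 16


def next_concurrency(current: int, load_1min: float, threshold: float) -> int:
    """Hysteresis: double when load is below threshold, halve when above,
    clamped to [MIN_CONCURRENT, MAX_CONCURRENT]. Sketch only - the actual
    method in adaptive_bench.py may differ."""
    if load_1min < threshold:
        return min(current * 2, MAX_CONCURRENT)
    if load_1min > threshold:
        return max(current // 2, MIN_CONCURRENT)
    return current

# The wrapper would feed it the 1-minute load average, e.g.:
# new = next_concurrency(self.current_concurrent, os.getloadavg()[0], self.load_threshold)
```

Doubling and halving (rather than stepping by 1) lets the wrapper converge on a workable concurrency in a logarithmic number of bursts, which matters when each burst runs multi-minute tasks.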
Apply new concurrency before next burst
In AdaptiveBench.run_burst the resume command (lines 119‑129) never passes the newly selected concurrency to terminal-bench, and the only place where n_concurrent_trials is updated is after the burst finishes (lines 161‑163). Because run() adjusts self.current_concurrent only after the burst completes (lines 201‑215), the lock file still contains the previous value when the next runs resume command launches, so the “updated” concurrency does not take effect until the following burst. If the system is overloaded, an additional full burst still runs at the old high concurrency before the reduction is applied. Consider writing the adjusted concurrency into tb.lock (or passing it via CLI) before invoking the next resume so the change takes effect immediately.
✅ Simplified Adaptive Concurrency

Updated implementation to remove configuration complexity.

Changes:

Rationale:

Original workflow (run #19199022294) is still running with the old interface (adaptive_mode=true). Will check status once complete.

New interface:

```bash
# Simple - just works with 1-16 adaptive scaling
TB_SAMPLE_SIZE=8 make benchmark-terminal

# Tune if needed
TB_LOAD_THRESHOLD=2.0 make benchmark-terminal
```
Implements adaptive concurrency control for terminal-bench using a burst-and-resume pattern that automatically adjusts parallelism based on system load average.

## Key Features

- **Hysteresis-based adjustment**: Double concurrency when load < threshold, halve when load > threshold
- **Burst-and-resume pattern**: Runs terminal-bench in bursts, using native resume capability to skip completed tasks between bursts
- **Clean container lifecycle**: No mid-task interruption; each burst completes naturally before adjusting
- **Configurable parameters**: Max concurrency, load threshold, check interval

## Implementation

- `benchmarks/terminal_bench/adaptive_bench.py`: Main wrapper implementing burst-and-resume logic with load monitoring
- `benchmarks/terminal_bench/adaptive_bench_test.py`: Unit tests for adaptive logic
- `Makefile`: New `benchmark-terminal-adaptive` target
- Documentation updates in `benchmarks/terminal_bench/README.md`

## Usage

```bash
# Start with concurrency=1, scale up to 16 based on load
TB_MAX_CONCURRENT=16 make benchmark-terminal-adaptive

# Conservative: max 8, higher load threshold
TB_MAX_CONCURRENT=8 TB_LOAD_THRESHOLD=2.0 make benchmark-terminal-adaptive

# Sample 5 tasks with adaptive concurrency
TB_SAMPLE_SIZE=5 TB_MAX_CONCURRENT=8 make benchmark-terminal-adaptive
```

## How It Works

1. Start with concurrency=1
2. Run terminal-bench burst with current concurrency
3. After burst completes, check 1-minute load average
4. Adjust concurrency: double if load < threshold, halve if load > threshold
5. Update tb.lock with new concurrency
6. Resume run (skips completed tasks automatically)
7. Repeat until all tasks complete

## Tradeoffs

- ✅ Automatically finds optimal concurrency for hardware
- ✅ Prevents system overload
- ✅ Uses terminal-bench native features (resume, tb.lock)
- ⚠️ Burst overhead ~2-5s (acceptable for 6+ minute avg task duration)
- ⚠️ Modifies tb.lock (semi-internal format, but stable)

## Design Rationale

Research showed terminal-bench uses a fixed-size ThreadPoolExecutor that cannot be resized mid-run. A kill-and-restart approach would interrupt Docker containers mid-task. Burst-and-resume leverages terminal-bench's built-in resume capability for clean checkpointing and task skipping.

_Generated with `cmux`_
Add workflow_dispatch inputs for adaptive concurrency mode:

- adaptive_mode: Enable adaptive concurrency (default: false)
- max_concurrent: Max concurrency for adaptive mode (default: 16)
- load_threshold: Load threshold for adjustments (default: 1.0)

When adaptive_mode=true, runs benchmark-terminal-adaptive instead of benchmark-terminal.

_Generated with `cmux`_
Make adaptive concurrency the default and only mode for terminal-bench:

- Hardcode MIN_CONCURRENT=1, MAX_CONCURRENT=16 in adaptive_bench.py
- Remove --max-concurrent CLI argument (no longer needed)
- Make `benchmark-terminal` an alias for `benchmark-terminal-adaptive`
- Simplify workflow inputs (remove adaptive_mode toggle, concurrency input)
- Update documentation to reflect simplified interface

This removes unnecessary configuration complexity while providing sensible bounds for all hardware configurations. The 1-16 range covers:

- Single-core systems (min=1)
- High-core systems (max=16 is reasonable parallelism for Docker containers)
- Load-based adjustment within these bounds

_Generated with `cmux`_
Terminal-bench's results.json uses a 'results' field, not 'trials'. This caused get_run_status() to always return completed=0, leading to infinite loops where the script would keep resuming even after all tasks were done.

Tested locally with 3 tasks - the script now correctly detects completion and exits.
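Given that fix, a minimal sketch of the status check might look like the following. Only the `results`-vs-`trials` field name comes from the commit; the helper signature and surrounding schema are assumptions for illustration:

```python
import json
from pathlib import Path


def get_run_status(runs_dir: Path, run_id: str, total: int) -> dict:
    """Count completed tasks from results.json. Terminal-bench stores them
    under 'results' (not 'trials'); the rest of this sketch is illustrative."""
    results_path = runs_dir / run_id / "results.json"
    completed = 0
    if results_path.exists():
        data = json.loads(results_path.read_text())
        completed = len(data.get("results", []))
    return {"completed": completed, "total": total, "incomplete": total - completed}
```

With the old `data.get("trials", [])` lookup, `completed` stayed 0 forever, so `incomplete` never reached 0 - hence the infinite resume loop.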
…diately

Addresses Codex feedback: Previously, concurrency adjustments were written to tb.lock AFTER a burst completed, but the next resume command would read the tb.lock at the START of the burst. This created a 1-burst delay where the old concurrency was used even after adjustment.

Now updates tb.lock BEFORE calling 'terminal-bench runs resume', ensuring the new concurrency takes effect immediately. This is critical when the system is overloaded - we need to reduce concurrency on the very next burst, not one burst later.

Flow before fix:
- Burst N completes → adjust_concurrency() → write tb.lock
- Burst N+1 starts → resume reads OLD tb.lock value

Flow after fix:
- adjust_concurrency() completes
- Burst N+1 starts → write tb.lock → resume reads NEW tb.lock value
Force-pushed da65a23 to 9e4144e
✅ Addressed Codex feedback (timing bug in commit 9e4144e)

The issue was that `tb.lock` was being updated AFTER a burst completed, but the next `resume` command reads `tb.lock` at the START of the burst. This created a 1-burst delay where adjusted concurrency wouldn't take effect. Fixed by moving the update before resume.

This is critical when the system is overloaded - we need to reduce concurrency on the very next burst, not one burst later.

Also rebased on main and resolved conflicts in workflow + Makefile.
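The before/after flow can be pinned down as a testable ordering invariant. The class below is a deliberately minimal model of the fix, not the real `AdaptiveBench` (all names are illustrative; the real `run_burst` builds and launches the resume command between these steps):

```python
class BurstOrderModel:
    """Models the fixed run_burst ordering: the tb.lock update happens
    BEFORE the resume command launches, so an adjusted concurrency
    applies to the very next burst rather than one burst later."""

    def __init__(self):
        self.events = []
        self.run_id = "run-1"

    def _update_lock_concurrency(self):
        self.events.append("write tb.lock")

    def _resume(self):
        self.events.append("resume")

    def run_burst(self):
        if self.run_id:  # moved ahead of resume by the fix
            self._update_lock_concurrency()
        self._resume()


model = BurstOrderModel()
model.run_burst()
assert model.events.index("write tb.lock") < model.events.index("resume")
```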
Implements hysteresis-based adaptive concurrency control for terminal-bench that automatically adjusts parallelism (1-16) based on system load average.
## Problem

Terminal-bench runs with fixed concurrency, which means:

Users must manually tune concurrency for their system.

## Solution

Adaptive concurrency using burst-and-resume pattern:

- Update `tb.lock` with new concurrency

## Why Burst-and-Resume?
Research showed terminal-bench uses a fixed-size `ThreadPoolExecutor`. Burst-and-resume leverages terminal-bench's native resume capability for clean checkpointing and task skipping.
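For context, the stdlib pool really cannot be resized after construction: `max_workers` is fixed when the executor is created and `ThreadPoolExecutor` exposes no public API to change it, which is what forces the burst-and-resume design. A quick check (reading the private `_max_workers` attribute purely for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)
# _max_workers is a CPython implementation detail, inspected here only to
# show the cap is fixed; the executor has no resize method.
print(pool._max_workers)
pool.shutdown()
```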
## Implementation

- `benchmarks/terminal_bench/adaptive_bench.py`: Wrapper with burst-and-resume logic
- `benchmarks/terminal_bench/adaptive_bench_test.py`: Unit tests
- `Makefile`: `benchmark-terminal` now uses adaptive mode
- `benchmarks/terminal_bench/README.md`: Documentation updates

## Usage

## Configuration

- `TB_LOAD_THRESHOLD`
- `TB_CHECK_INTERVAL`

## Example Flow
## Tradeoffs

Advantages:

Disadvantages:

- Modifies `tb.lock` (semi-internal format, but stable)

## Testing
Tested locally with:

## Changes

_Generated with `cmux`_