# 🤖 feat: adaptive concurrency for terminal-bench with hysteresis #538
base: main
Conversation
🚀 Testing Adaptive Concurrency

Triggered workflow_dispatch with:

The workflow will test the burst-and-resume pattern by:

Watch the workflow here: https://github.com/coder/cmux/actions/workflows/terminal-bench.yml?query=branch%3Atb-hysteresis

Expected behavior:
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
        # Subsequent bursts - resume existing run
        cmd = [
            "uvx",
            "terminal-bench",
            "runs",
            "resume",
            "--run-id",
            self.run_id,
            "--runs-dir",
            str(self.runs_dir),
        ]
        print(
            f"🔄 Burst #{self.burst_count}: Resuming run {self.run_id} "
            f"with concurrency={self.current_concurrent}"
        )

        print(f"   Command: {' '.join(cmd)}")
        burst_start = time.time()

        # Run terminal-bench
        result = subprocess.run(cmd, env=os.environ.copy())

        burst_duration = time.time() - burst_start

        # Capture run_id from first burst
        if self.burst_count == 1 and result.returncode == 0:
            # Find most recent run directory
            if self.runs_dir.exists():
                run_dirs = [
                    d
                    for d in self.runs_dir.iterdir()
                    if d.is_dir() and (d / "tb.lock").exists()
                ]
                if run_dirs:
                    # Sort by modification time and take most recent
                    self.run_id = sorted(
                        run_dirs, key=lambda p: p.stat().st_mtime
                    )[-1].name
                    print(f"📝 Captured run_id: {self.run_id}")

        print(f"⏱️ Burst #{self.burst_count} completed in {burst_duration:.1f}s")

        # Update n_concurrent in tb.lock for next resume
        if self.run_id and result.returncode == 0:
            self._update_lock_concurrency()

        return result.returncode

    def _update_lock_concurrency(self):
        """Update n_concurrent_trials in tb.lock for next resume."""
        lock_path = self.runs_dir / self.run_id / "tb.lock"
        if not lock_path.exists():
            return

        try:
            with open(lock_path, "r") as f:
                lock_data = json.load(f)

            # Update concurrency in lock file
            if "run_config" in lock_data:
                lock_data["run_config"]["n_concurrent_trials"] = self.current_concurrent

            with open(lock_path, "w") as f:
                json.dump(lock_data, f, indent=2)

            print(f"   Updated tb.lock with concurrency={self.current_concurrent}")
        except Exception as e:
            print(f"⚠️ Could not update tb.lock: {e}")

    def run(self):
        """Main loop: run bursts with adaptive concurrency."""
        try:
            while True:
                # Run burst with current concurrency
                exit_code = self.run_burst()

                if exit_code != 0:
                    print(f"❌ Terminal-bench exited with code {exit_code}")
                    return exit_code

                # Check if we're done
                status = self.get_run_status()
                print(
                    f"📈 Progress: {status['completed']}/{status['total']} tasks "
                    f"({status['incomplete']} remaining)"
                )

                if status["incomplete"] == 0:
                    print("✅ All tasks completed!")
                    return 0

                # Wait before next burst and potentially adjust concurrency
                print(f"⏸️ Waiting {self.check_interval}s before next burst...")
                time.sleep(self.check_interval)
                self.adjust_concurrency()
```
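The `adjust_concurrency` step invoked at the end of `run()` is not part of this excerpt. As a hedged sketch, the hysteresis policy described in this PR (double below the load threshold, halve above it, clamped to the hardcoded 1-16 bounds) reduces to a small pure function; the helper name `next_concurrency` is mine, not the PR's:

```python
import os

MIN_CONCURRENT = 1   # hardcoded bounds per this PR
MAX_CONCURRENT = 16


def next_concurrency(current: int, load_1min: float, threshold: float) -> int:
    """Hysteresis: double when load is below threshold, halve when above,
    clamped to [MIN_CONCURRENT, MAX_CONCURRENT]. Sketch only - the actual
    method in adaptive_bench.py may differ."""
    if load_1min < threshold:
        return min(current * 2, MAX_CONCURRENT)
    if load_1min > threshold:
        return max(current // 2, MIN_CONCURRENT)
    return current

# The wrapper would feed it the 1-minute load average, e.g.:
# new = next_concurrency(self.current_concurrent, os.getloadavg()[0], self.load_threshold)
```

Doubling and halving (rather than stepping by 1) lets the wrapper converge on a workable concurrency in a logarithmic number of bursts, which matters when each burst runs multi-minute tasks.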
Apply new concurrency before next burst
In AdaptiveBench.run_burst the resume command (lines 119‑129) never passes the newly selected concurrency to terminal-bench, and the only place where n_concurrent_trials is updated is after the burst finishes (lines 161‑163). Because run() adjusts self.current_concurrent only after the burst completes (lines 201‑215), the lock file still contains the previous value when the next runs resume command launches, so the “updated” concurrency does not take effect until the following burst. If the system is overloaded, an additional full burst still runs at the old high concurrency before the reduction is applied. Consider writing the adjusted concurrency into tb.lock (or passing it via CLI) before invoking the next resume so the change takes effect immediately.
✅ Simplified Adaptive Concurrency

Updated implementation to remove configuration complexity.

Changes:

Rationale:

Original workflow (run #19199022294) is still running with the old interface (adaptive_mode=true). Will check status once complete.

New interface:

```bash
# Simple - just works with 1-16 adaptive scaling
TB_SAMPLE_SIZE=8 make benchmark-terminal

# Tune if needed
TB_LOAD_THRESHOLD=2.0 make benchmark-terminal
```
Implements adaptive concurrency control for terminal-bench using a burst-and-resume pattern that automatically adjusts parallelism based on system load average.

## Key Features

- **Hysteresis-based adjustment**: Double concurrency when load < threshold, halve when load > threshold
- **Burst-and-resume pattern**: Runs terminal-bench in bursts, using native resume capability to skip completed tasks between bursts
- **Clean container lifecycle**: No mid-task interruption; each burst completes naturally before adjusting
- **Configurable parameters**: Max concurrency, load threshold, check interval

## Implementation

- `benchmarks/terminal_bench/adaptive_bench.py`: Main wrapper implementing burst-and-resume logic with load monitoring
- `benchmarks/terminal_bench/adaptive_bench_test.py`: Unit tests for adaptive logic
- `Makefile`: New `benchmark-terminal-adaptive` target
- Documentation updates in `benchmarks/terminal_bench/README.md`

## Usage

```bash
# Start with concurrency=1, scale up to 16 based on load
TB_MAX_CONCURRENT=16 make benchmark-terminal-adaptive

# Conservative: max 8, higher load threshold
TB_MAX_CONCURRENT=8 TB_LOAD_THRESHOLD=2.0 make benchmark-terminal-adaptive

# Sample 5 tasks with adaptive concurrency
TB_SAMPLE_SIZE=5 TB_MAX_CONCURRENT=8 make benchmark-terminal-adaptive
```

## How It Works

1. Start with concurrency=1
2. Run terminal-bench burst with current concurrency
3. After burst completes, check 1-minute load average
4. Adjust concurrency: double if load < threshold, halve if load > threshold
5. Update tb.lock with new concurrency
6. Resume run (skips completed tasks automatically)
7. Repeat until all tasks complete

## Tradeoffs

- ✅ Automatically finds optimal concurrency for hardware
- ✅ Prevents system overload
- ✅ Uses terminal-bench native features (resume, tb.lock)
- ⚠️ Burst overhead ~2-5s (acceptable for 6+ minute avg task duration)
- ⚠️ Modifies tb.lock (semi-internal format, but stable)

## Design Rationale

Research showed terminal-bench uses a fixed-size ThreadPoolExecutor that cannot be resized mid-run. A kill-and-restart approach would interrupt Docker containers mid-task. Burst-and-resume leverages terminal-bench's built-in resume capability for clean checkpointing and task skipping.

_Generated with `cmux`_
Add workflow_dispatch inputs for adaptive concurrency mode:

- adaptive_mode: Enable adaptive concurrency (default: false)
- max_concurrent: Max concurrency for adaptive mode (default: 16)
- load_threshold: Load threshold for adjustments (default: 1.0)

When adaptive_mode=true, runs benchmark-terminal-adaptive instead of benchmark-terminal.

_Generated with `cmux`_
Make adaptive concurrency the default and only mode for terminal-bench:

- Hardcode MIN_CONCURRENT=1, MAX_CONCURRENT=16 in adaptive_bench.py
- Remove --max-concurrent CLI argument (no longer needed)
- Make `benchmark-terminal` an alias for `benchmark-terminal-adaptive`
- Simplify workflow inputs (remove adaptive_mode toggle, concurrency input)
- Update documentation to reflect simplified interface

This removes unnecessary configuration complexity while providing sensible bounds for all hardware configurations. The 1-16 range covers:

- Single-core systems (min=1)
- High-core systems (max=16 is reasonable parallelism for Docker containers)
- Load-based adjustment within these bounds

_Generated with `cmux`_
Terminal-bench's results.json uses a 'results' field, not 'trials'. This caused get_run_status() to always return completed=0, leading to infinite loops where the script would keep resuming even after all tasks were done.

Tested locally with 3 tasks - the script now correctly detects completion and exits.
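Given that fix, a minimal sketch of the status check might look like the following. Only the `results`-vs-`trials` field name comes from the commit; the helper signature and surrounding schema are assumptions for illustration:

```python
import json
from pathlib import Path


def get_run_status(runs_dir: Path, run_id: str, total: int) -> dict:
    """Count completed tasks from results.json. Terminal-bench stores them
    under 'results' (not 'trials'); the rest of this sketch is illustrative."""
    results_path = runs_dir / run_id / "results.json"
    completed = 0
    if results_path.exists():
        data = json.loads(results_path.read_text())
        completed = len(data.get("results", []))
    return {"completed": completed, "total": total, "incomplete": total - completed}
```

With the old `data.get("trials", [])` lookup, `completed` stayed 0 forever, so `incomplete` never reached 0 - hence the infinite resume loop.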
…diately

Addresses Codex feedback: Previously, concurrency adjustments were written to tb.lock AFTER a burst completed, but the next resume command would read the tb.lock at the START of the burst. This created a 1-burst delay where the old concurrency was used even after adjustment.

Now updates tb.lock BEFORE calling 'terminal-bench runs resume', ensuring the new concurrency takes effect immediately. This is critical when the system is overloaded - we need to reduce concurrency on the very next burst, not one burst later.

Flow before fix:
- Burst N completes → adjust_concurrency() → write tb.lock
- Burst N+1 starts → resume reads OLD tb.lock value

Flow after fix:
- adjust_concurrency() completes
- Burst N+1 starts → write tb.lock → resume reads NEW tb.lock value
Force-pushed da65a23 to 9e4144e
✅ Addressed Codex feedback (timing bug in commit 9e4144e)

The issue was that `tb.lock` was being updated AFTER a burst completed, but the next `resume` command reads `tb.lock` at the START of the burst. This created a 1-burst delay where adjusted concurrency wouldn't take effect. Fixed by moving the update before resume.

This is critical when the system is overloaded - we need to reduce concurrency on the very next burst, not one burst later.

Also rebased on main and resolved conflicts in workflow + Makefile.
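The before/after flow can be pinned down as a testable ordering invariant. The class below is a deliberately minimal model of the fix, not the real `AdaptiveBench` (all names are illustrative; the real `run_burst` builds and launches the resume command between these steps):

```python
class BurstOrderModel:
    """Models the fixed run_burst ordering: the tb.lock update happens
    BEFORE the resume command launches, so an adjusted concurrency
    applies to the very next burst rather than one burst later."""

    def __init__(self):
        self.events = []
        self.run_id = "run-1"

    def _update_lock_concurrency(self):
        self.events.append("write tb.lock")

    def _resume(self):
        self.events.append("resume")

    def run_burst(self):
        if self.run_id:  # moved ahead of resume by the fix
            self._update_lock_concurrency()
        self._resume()


model = BurstOrderModel()
model.run_burst()
assert model.events.index("write tb.lock") < model.events.index("resume")
```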
Implements hysteresis-based adaptive concurrency control for terminal-bench that automatically adjusts parallelism (1-16) based on system load average.
## Problem

Terminal-bench runs with fixed concurrency, which means:

Users must manually tune concurrency for their system.

## Solution

Adaptive concurrency using burst-and-resume pattern:

- Update `tb.lock` with new concurrency

## Why Burst-and-Resume?
Research showed terminal-bench uses a fixed-size `ThreadPoolExecutor`. Burst-and-resume leverages terminal-bench's native resume capability for clean checkpointing and task skipping.
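For context, the stdlib pool really cannot be resized after construction: `max_workers` is fixed when the executor is created and `ThreadPoolExecutor` exposes no public API to change it, which is what forces the burst-and-resume design. A quick check (reading the private `_max_workers` attribute purely for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)
# _max_workers is a CPython implementation detail, inspected here only to
# show the cap is fixed; the executor has no resize method.
print(pool._max_workers)
pool.shutdown()
```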
## Implementation

- `benchmarks/terminal_bench/adaptive_bench.py`: Wrapper with burst-and-resume logic
- `benchmarks/terminal_bench/adaptive_bench_test.py`: Unit tests
- `Makefile`: `benchmark-terminal` now uses adaptive mode
- `benchmarks/terminal_bench/README.md`: Documentation updates

## Usage

## Configuration

- `TB_LOAD_THRESHOLD`
- `TB_CHECK_INTERVAL`

## Example Flow
## Tradeoffs

Advantages:

Disadvantages:

- Modifies `tb.lock` (semi-internal format, but stable)

## Testing
Tested locally with:

## Changes

_Generated with `cmux`_