### 5. Merge Results from Multiple Runs (Optional)
```bash
# Combine results from different machines or benchmark runs
uv run agents_mcp_usage/evaluations/mermaid_evals/scripts/merge_benchmark_results.py \
  -i results/run1.json results/run2.json \
  -o results/merged.json \
  --dedup keep-all \
  --report merge_report.json
```
## Evaluation Task & Test Cases
The system challenges LLM agents to:
The local dashboard (`merbench_ui.py`) automatically detects and loads these CSV files, which include:

- Performance metrics and scores
- Error messages and failure reasons
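For intuition only, here is a minimal sketch of what such auto-detection and loading could look like. The `results/` directory and the `score` and `error` column names are assumptions for illustration, not the dashboard's actual schema:

```python
from pathlib import Path

import pandas as pd

# Discover every CSV under results/ and stack them into one frame.
# The directory layout and column names are assumptions for illustration.
frames = [pd.read_csv(p) for p in Path("results").glob("*.csv")]
runs = pd.concat(frames, ignore_index=True)

print(runs["score"].describe())        # assumed performance-score column
print(runs["error"].dropna().head())   # assumed failure-reason column
```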
## Merging Benchmark Results
The `merge_benchmark_results.py` script combines multiple JSON benchmark result files generated by `preprocess_merbench_data.py` (a sketch of the core merge flow follows the list below). This is particularly useful when:
- Running benchmarks on different machines
- Combining results from different time periods
- Aggregating data from distributed benchmark runs
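To illustrate the merge-and-recompute idea, here is a minimal sketch. It assumes each result file holds raw run records under a `"runs"` key with `"model"` and `"score"` fields; these names are illustrative assumptions, not the script's actual schema:

```python
import json
from collections import defaultdict


def merge_results(paths):
    """Combine raw runs from several result files and recompute stats."""
    runs = []
    for path in paths:
        with open(path) as f:
            runs.extend(json.load(f)["runs"])  # "runs" key is assumed

    # Recompute aggregates from the combined raw records rather than
    # averaging pre-aggregated numbers.
    scores = defaultdict(list)
    for run in runs:
        scores[run["model"]].append(run["score"])  # assumed field names
    leaderboard = {model: sum(s) / len(s) for model, s in scores.items()}
    return {"runs": runs, "leaderboard": leaderboard}


merged = merge_results(["results/run1.json", "results/run2.json"])
with open("results/merged.json", "w") as f:
    json.dump(merged, f, indent=2)
```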
### Features
- **Multiple merge strategies** for handling duplicate test runs (see the sketch below)
- **Complete recalculation** of all statistics from merged raw data
- **Detailed merge reports** showing what was combined
- **Preservation of all data sections** (leaderboard, failure analysis, cost breakdown, etc.)
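As a rough sketch of how a dedup strategy might be applied: `keep-all` is the only strategy shown in the usage example above, so the `keep-first` variant and the `(model, test_case)` key below are assumptions for illustration, not the script's documented behaviour:

```python
def dedup_runs(runs, strategy="keep-all"):
    """Apply a duplicate-handling strategy to a list of raw run records."""
    if strategy == "keep-all":
        # Matches the --dedup keep-all flag above: retain every record.
        return list(runs)
    if strategy == "keep-first":
        # Assumed alternative strategy: keep the first record per key.
        seen, unique = set(), []
        for run in runs:
            key = (run.get("model"), run.get("test_case"))  # assumed key fields
            if key not in seen:
                seen.add(key)
                unique.append(run)
        return unique
    raise ValueError(f"Unknown dedup strategy: {strategy}")
```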