Commit 697c3fc

Merge pull request #16 from andrewginns/october-2025-updates
feat: October 2025 updates
2 parents f54372c + a4ecee7 commit 697c3fc

File tree

5 files changed (+1945, −1056 lines)


agents_mcp_usage/evaluations/mermaid_evals/README.md

Lines changed: 129 additions & 0 deletions
@@ -106,6 +106,16 @@ uv run agents_mcp_usage/evaluations/mermaid_evals/scripts/preprocess_merbench_da
   -o "agents_mcp_usage/evaluations/mermaid_evals/results/<timestamp>_processed.json"
 ```

+### 5. Merge Results from Multiple Runs (Optional)
+```bash
+# Combine results from different machines or benchmark runs
+uv run agents_mcp_usage/evaluations/mermaid_evals/scripts/merge_benchmark_results.py \
+  -i results/run1.json results/run2.json \
+  -o results/merged.json \
+  --dedup keep-all \
+  --report merge_report.json
+```
+
 ## Evaluation Task & Test Cases

 The system challenges LLM agents to:
@@ -239,6 +249,125 @@ The local dashboard (`merbench_ui.py`) automatically detects and loads these CSV
 - Performance metrics and scores
 - Error messages and failure reasons

+## Merging Benchmark Results
+
+The `merge_benchmark_results.py` script enables combining multiple JSON benchmark result files generated by `preprocess_merbench_data.py`. This is particularly useful when:
+- Running benchmarks on different machines
+- Combining results from different time periods
+- Aggregating data from distributed benchmark runs
+
+### Features
+- **Multiple merge strategies** for handling duplicate test runs
+- **Complete recalculation** of all statistics from merged raw data
+- **Detailed merge reports** showing what was combined
+- **Preservation of all data sections** (leaderboard, failure analysis, cost breakdown, etc.)
+
+### Usage Examples
+
+#### Basic Merge
+```bash
+python scripts/merge_benchmark_results.py -i file1.json file2.json -o merged.json
+```
+
+#### With Deduplication Strategy
+```bash
+python scripts/merge_benchmark_results.py \
+  -i file1.json file2.json \
+  -o merged.json \
+  --dedup keep-first
+```
+
+#### Generate Detailed Report
+```bash
+python scripts/merge_benchmark_results.py \
+  -i file1.json file2.json \
+  -o merged.json \
+  --report merge_report.json \
+  --verbose
+```
+
+### Deduplication Strategies
+
+The script offers four strategies for handling duplicate (Model, Case) combinations:
+
+1. **`keep-all`** (default)
+   - Keeps all records, no deduplication
+   - Use when runs were performed under different conditions
+   - Preserves complete data for analysis
+
+2. **`keep-first`**
+   - Keeps the first occurrence from the first file
+   - Use when preferring older/original results
+   - Maintains consistency with initial benchmarks
+
+3. **`keep-last`**
+   - Keeps the last occurrence from the last file
+   - Use when preferring newer/updated results
+   - Good for iterative improvements
+
+4. **`average`**
+   - Averages metrics for duplicate combinations
+   - Use when multiple runs should be combined statistically
+   - Provides a balanced view across multiple executions
+
+### Output Structure
+
+The merged JSON file maintains the same structure as individual result files:
+- `stats`: Aggregate statistics
+- `leaderboard`: Model performance rankings
+- `pareto_data`: Efficiency analysis data
+- `test_groups_data`: Performance by test difficulty
+- `failure_analysis_data`: Failure type counts
+- `cost_breakdown_data`: Cost analysis by model and test group
+- `raw_data`: Individual run records
+- `config`: Dashboard configuration
+
+### Merge Report
+
+When using `--report`, the script generates a detailed report containing:
+- Timestamp of merge operation
+- Deduplication strategy used
+- Details of each input file (runs, models)
+- Summary statistics (total runs, duplicates handled)
+- Complete list of merged models
+
+### Complete Workflow Example
+
+#### 1. Run benchmarks on Machine A:
+```bash
+# Run evaluation
+uv run agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py
+
+# Convert to JSON
+uv run agents_mcp_usage/evaluations/mermaid_evals/scripts/preprocess_merbench_data.py \
+  -i mermaid_eval_results/machine_a_results.csv \
+  -o results/machine_a.json
+```
+
+#### 2. Run benchmarks on Machine B:
+```bash
+# Run evaluation
+uv run agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py
+
+# Convert to JSON
+uv run agents_mcp_usage/evaluations/mermaid_evals/scripts/preprocess_merbench_data.py \
+  -i mermaid_eval_results/machine_b_results.csv \
+  -o results/machine_b.json
+```
+
+#### 3. Merge results:
+```bash
+uv run agents_mcp_usage/evaluations/mermaid_evals/scripts/merge_benchmark_results.py \
+  -i results/machine_a.json results/machine_b.json \
+  -o results/combined.json \
+  --report results/merge_summary.json
+```
+
+### Notes
+- The script automatically recalculates all statistics from the merged raw data
+- Cost calculations use the same `costs.json` configuration as the preprocessing script
+- Provider detection is based on model name patterns (gemini→Google, nova→Amazon, etc.)
+
 ## Monitoring & Debugging

 All evaluation runs are traced with **Logfire** for comprehensive monitoring:
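
As a reading aid for the deduplication strategies documented in the README diff above, here is a minimal, hypothetical sketch of how the four `--dedup` options could collapse duplicate (Model, Case) records. The record layout (flat dicts with `"Model"`, `"Case"`, and numeric metric fields) and the rule of averaging only numeric fields are assumptions for illustration; this is not the code inside `merge_benchmark_results.py`.

```python
# Illustrative sketch only: assumed record shape, not the actual merge script.
from statistics import mean


def dedup_records(records: list[dict], strategy: str = "keep-all") -> list[dict]:
    """Collapse duplicate (Model, Case) records according to a merge strategy."""
    if strategy == "keep-all":
        # No deduplication: every run from every input file is kept.
        return list(records)

    # Group records by (Model, Case), preserving input-file order.
    groups: dict[tuple, list[dict]] = {}
    for rec in records:
        groups.setdefault((rec["Model"], rec["Case"]), []).append(rec)

    merged = []
    for dupes in groups.values():
        if strategy == "keep-first":
            merged.append(dupes[0])            # earliest occurrence wins
        elif strategy == "keep-last":
            merged.append(dupes[-1])           # latest occurrence wins
        elif strategy == "average":
            base = dict(dupes[0])
            for key, value in base.items():
                # Average only numeric metrics; leave labels and metadata alone.
                if isinstance(value, (int, float)) and not isinstance(value, bool):
                    base[key] = mean(d[key] for d in dupes if key in d)
            merged.append(base)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return merged


if __name__ == "__main__":
    runs = [
        {"Model": "gemini-2.5-flash-lite", "Case": "easy_01", "score": 0.8, "cost": 0.002},
        {"Model": "gemini-2.5-flash-lite", "Case": "easy_01", "score": 1.0, "cost": 0.003},
    ]
    print(dedup_records(runs, "average"))  # one record with score 0.9, cost 0.0025
```

Whichever strategy is chosen, the README notes that the script then recalculates all summary statistics from the merged `raw_data`, rather than trying to combine the per-file aggregates.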

agents_mcp_usage/evaluations/mermaid_evals/costs.json

Lines changed: 75 additions & 7 deletions
@@ -105,6 +105,31 @@
       ]
     }
   },
+  "gemini-2.5-pro": {
+    "friendly_name": "Gemini 2.5 Pro Preview",
+    "input": [
+      {
+        "up_to": 200000,
+        "price": 1.25
+      },
+      {
+        "up_to": "inf",
+        "price": 2.5
+      }
+    ],
+    "output": {
+      "default": [
+        {
+          "up_to": 200000,
+          "price": 10.0
+        },
+        {
+          "up_to": "inf",
+          "price": 15.0
+        }
+      ]
+    }
+  },
   "gemini-1.5-pro": {
     "friendly_name": "Gemini 1.5 Pro",
     "input": [
@@ -244,10 +269,6 @@
   "gemini-2.5-flash": {
     "friendly_name": "Gemini 2.5 Flash",
     "input": [
-      {
-        "up_to": 200000,
-        "price": 0.15
-      },
       {
         "up_to": "inf",
         "price": 0.3
@@ -256,9 +277,22 @@
     "output": {
       "default": [
         {
-          "up_to": 200000,
-          "price": 1.25
-        },
+          "up_to": "inf",
+          "price": 2.5
+        }
+      ]
+    }
+  },
+  "gemini-2.5-flash-preview-09-2025": {
+    "friendly_name": "Gemini 2.5 Flash Preview (Sept)",
+    "input": [
+      {
+        "up_to": "inf",
+        "price": 0.3
+      }
+    ],
+    "output": {
+      "default": [
         {
           "up_to": "inf",
           "price": 2.5
@@ -283,6 +317,40 @@
       ]
     }
   },
+  "gemini-2.5-flash-lite": {
+    "friendly_name": "Gemini 2.5 Flash Lite",
+    "input": [
+      {
+        "up_to": "inf",
+        "price": 0.1
+      }
+    ],
+    "output": {
+      "default": [
+        {
+          "up_to": "inf",
+          "price": 0.4
+        }
+      ]
+    }
+  },
+  "gemini-2.5-flash-lite-preview-09-2025": {
+    "friendly_name": "Gemini 2.5 Flash Lite Preview (Sept)",
+    "input": [
+      {
+        "up_to": "inf",
+        "price": 0.1
+      }
+    ],
+    "output": {
+      "default": [
+        {
+          "up_to": "inf",
+          "price": 0.4
+        }
+      ]
+    }
+  },
   "openai:o4-mini": {
     "friendly_name": "OpenAI o4-mini",
     "input": [

agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py

Lines changed: 6 additions & 3 deletions
@@ -48,7 +48,10 @@
     # "gemini-2.5-pro-preview-05-06",
     # "gemini-2.5-pro-preview-03-25",
     # "gemini-2.0-flash",
-    "gemini-2.5-flash",
+    # "gemini-2.5-flash",
+    "gemini-2.5-flash-lite",
+    # "gemini-2.5-flash-preview-09-2025",
+    # "gemini-2.5-flash-lite-preview-09-2025"
     # "bedrock:us.amazon.nova-pro-v1:0",
     # "bedrock:us.amazon.nova-lite-v1:0",
     # "bedrock:us.amazon.nova-micro-v1:0",
@@ -511,13 +514,13 @@ async def main() -> None:
     parser.add_argument(
         "--runs",
         type=int,
-        default=5,
+        default=15,
         help="Number of evaluation runs per model",
     )
     parser.add_argument(
         "--judge-model",
         type=str,
-        default="gemini-2.5-pro-preview-06-05",
+        default="gemini-2.5-pro",
        help="Model to use for LLM judging",
     )
     parser.add_argument(
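
The second hunk only changes argparse defaults: `--runs` goes from 5 to 15 and `--judge-model` becomes `gemini-2.5-pro`. As a small standalone sketch (not the actual `run_multi_evals.py` code), the snippet below mirrors those two arguments and shows that callers can still override the new defaults from the command line:

```python
# Minimal argparse sketch mirroring the two defaults changed in this commit.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--runs", type=int, default=15, help="Number of evaluation runs per model")
parser.add_argument("--judge-model", type=str, default="gemini-2.5-pro", help="Model to use for LLM judging")

print(parser.parse_args([]))               # Namespace(runs=15, judge_model='gemini-2.5-pro')
print(parser.parse_args(["--runs", "5"]))  # explicit flags still override the new defaults
```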

pyproject.toml

Lines changed: 3 additions & 3 deletions
@@ -14,12 +14,12 @@ dependencies = [
     "langgraph>=0.3.31",
     "logfire>=3.20.0",
     "loguru>=0.7.3",
-    "mcp==1.9.0",
+    "mcp>=1.12.3",
     "openai-agents>=0.0.12",
     "pandas>=2.3.0",
     "plotly>=6.1.2",
-    "pydantic-ai-slim[bedrock,mcp]>=0.2.15",
-    "pydantic-evals[logfire]>=0.2.15",
+    "pydantic-ai-slim[bedrock,mcp]>=1.0.17",
+    "pydantic-evals[logfire]>=1.0.17",
     "python-dotenv>=1.1.0",
     "ruff>=0.11.10",
     "streamlit>=1.45.1",
