Initial merbench release #6
Merged
Initial release of Merbench, a comprehensive evaluation dashboard and benchmarking toolkit for LLM agents using Model Context Protocol (MCP) integration.
Problem:
The project lacked systematic evaluation capabilities for MCP-enabled agents. There was no way to benchmark multiple LLMs on complex multi-server tasks, compare cost/performance trade-offs, or validate agent behaviour with real-world MCP server interactions. Additionally, AWS Bedrock model support was missing, and the existing cost tracking was incomplete.
Solution:
Built Merbench, a production-ready evaluation platform that tests LLM agents on Mermaid diagram generation tasks using MCP servers for validation and error correction. Added comprehensive multi-model support including AWS Bedrock, created an interactive Streamlit dashboard with leaderboards and Pareto analysis, and implemented sophisticated cost tracking with real pricing data across providers.
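At its core, the evaluation pairs each model with an MCP server that validates the generated diagram. A minimal sketch of that loop, assuming a pydantic-ai style `Agent` with a stdio MCP server attached; the validator command is hypothetical, and the keyword for attaching servers (`toolsets=` here, `mcp_servers=` in older releases) varies by pydantic-ai version:

```python
# Sketch of the evaluation flow: the agent drafts a Mermaid diagram
# and calls an MCP tool to validate / correct it before answering.
from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStdio

# Hypothetical validator command; the real MCP server used by Merbench
# is not named in this PR description.
mermaid_validator = MCPServerStdio("npx", args=["-y", "mermaid-validator-mcp"])

agent = Agent(
    "openai:gpt-4o",  # any supported model string, including Bedrock models
    toolsets=[mermaid_validator],
    system_prompt="Produce a valid Mermaid diagram; validate it before answering.",
)

async def run_case(prompt: str) -> str:
    result = await agent.run(prompt)  # older versions: wrap in agent.run_mcp_servers()
    return result.output              # `.data` on older pydantic-ai releases
```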
Unlocks:
Detailed breakdown of changes:
- Added new `.env` variables for AWS configuration (`AWS_REGION`, `AWS_PROFILE`)
- Moved `costs.json` from `eval()` parsing to a secure JSON format; added friendly model names and multi-tier pricing structures (see the loading sketch below)
- Enhanced `merbench_ui.py` with smart label positioning for Pareto plots, richer UI descriptions, and better cost/performance visualisation options (see the Pareto sketch below)
- Extended `evals_pydantic_mcp.py` with Bedrock model creation logic and improved error handling (see the Bedrock sketch below); updated `run_multi_evals.py` with refined parallelism (see the concurrency sketch below) and new default model configurations
- Updated `mermaid_diagrams.py`
- Added a `make leaderboard` command, updated dependencies for Bedrock support (boto3, botocore, s3transfer), and improved schema validation in `dashboard_config.py`
- Removed the `eval_basic_mcp_use` directory from README.md to align documentation with the actual codebase structure
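On the `costs.json` change: the security point is that `json.loads` parses data only and never executes code, unlike `eval()`. A sketch under an assumed schema; the PR only states the file carries friendly names and multi-tier pricing, so the field names below are hypothetical:

```python
import json
from pathlib import Path

# Hypothetical entry shape for one model:
# {
#   "anthropic.claude-3-5-sonnet-20240620-v1:0": {
#     "friendly_name": "Claude 3.5 Sonnet (Bedrock)",
#     "tiers": [{"max_tokens": 200000, "input_per_mtok": 3.0, "output_per_mtok": 15.0}]
#   }
# }

def load_costs(path: str = "costs.json") -> dict:
    # json.loads accepts only data, never executable Python, unlike eval().
    return json.loads(Path(path).read_text())

def run_cost(entry: dict, input_toks: int, output_toks: int) -> float:
    tier = entry["tiers"][0]  # single-tier case, for brevity
    return (input_toks * tier["input_per_mtok"]
            + output_toks * tier["output_per_mtok"]) / 1_000_000
```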
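The dashboard's Pareto view reduces to a standard frontier computation: a model is on the frontier if no cheaper model scores at least as well. A sketch with assumed column names (`cost_usd`, `score` are not confirmed by the PR):

```python
import pandas as pd

def pareto_frontier(df: pd.DataFrame) -> pd.DataFrame:
    # Lower cost is better, higher score is better. Walk models from
    # cheapest to most expensive; keep each one that beats the best
    # score seen so far among cheaper models.
    ordered = df.sort_values(["cost_usd", "score"], ascending=[True, False])
    frontier, best = [], float("-inf")
    for _, row in ordered.iterrows():
        if row["score"] > best:
            frontier.append(row)
            best = row["score"]
    return pd.DataFrame(frontier)
```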
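For the Bedrock path, model creation presumably builds a boto3 client from the new `.env` variables. A sketch; the `us-east-1` fallback and the pydantic-ai wiring in the trailing comments are assumptions about this repo, not confirmed details:

```python
import os
import boto3

def make_bedrock_client():
    # boto3 also reads AWS_PROFILE / AWS_REGION from the environment on
    # its own; passing them explicitly makes the dependency visible.
    session = boto3.Session(
        profile_name=os.environ.get("AWS_PROFILE"),
        region_name=os.environ.get("AWS_REGION", "us-east-1"),
    )
    return session.client("bedrock-runtime")

# Plausible pydantic-ai wiring (an assumption about evals_pydantic_mcp.py):
# from pydantic_ai.models.bedrock import BedrockConverseModel
# from pydantic_ai.providers.bedrock import BedrockProvider
# model = BedrockConverseModel(
#     "anthropic.claude-3-5-sonnet-20240620-v1:0",
#     provider=BedrockProvider(bedrock_client=make_bedrock_client()),
# )
```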
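Finally, "refined parallelism" in `run_multi_evals.py` most plausibly means bounding the number of in-flight model calls. A generic concurrency sketch with a hypothetical `evaluate` coroutine standing in for the real per-case run:

```python
import asyncio

async def evaluate(model: str, case: str) -> dict:
    ...  # placeholder for the real per-case eval (agent run + scoring)

async def run_all(models: list[str], cases: list[str], limit: int = 4) -> list:
    sem = asyncio.Semaphore(limit)  # cap concurrent API requests

    async def bounded(model: str, case: str):
        async with sem:
            return await evaluate(model, case)

    tasks = [bounded(m, c) for m in models for c in cases]
    # return_exceptions=True keeps one failing case from cancelling the batch
    return await asyncio.gather(*tasks, return_exceptions=True)
```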