mcp-eval is your "flight simulator" for tool-using LLMs. Connect agents to real MCP servers, run realistic scenarios, and get production-grade insights into behavior and performance.
What is mcp-eval?
mcp-eval is a developer-first evaluation framework designed specifically for testing Model Context Protocol (MCP) servers and the agents that use them. Unlike traditional testing approaches that mock interactions or test components in isolation, mcp-eval exercises your complete system in the environment it actually runs in: an LLM/agent calling real MCP tools.
Think of it this way: If unit tests are like testing car parts on a bench, mcp-eval is like taking the whole car to a test track. You see how everything works together under realistic conditions.
Why mcp-eval exists
The challenge
As AI agents become more sophisticated and MCP servers proliferate, teams face critical questions:
- For MCP server developers: "Will my server handle real agent requests correctly? What about edge cases?"
- For agent developers: "Is my agent using tools effectively? Does it recover from errors?"
- For both: "How do we measure quality, performance, and reliability before production?"
The solution
mcp-eval addresses these challenges by providing:
- Real environment testing - No mocks, actual agent-to-server communication
- Full observability - OpenTelemetry traces capture detailed agent execution to run evals over
- Rich assertion library - From tool checks to sophisticated path analysis
- Multiple test styles - Choose what fits your workflow: pytest, datasets, or @task decorators
- Language agnostic - Test MCP servers written in any language
Core capabilities
🧪 Comprehensive Assertions
- Content validation: Pattern matching, regex, contains/not-contains
- Tool verification: Was called, call counts, arguments, outputs
- Performance gates: Response time, iteration limits, token usage
- Quality judges: LLM-based evaluation with custom rubrics
- Path analysis: Efficiency, backtracking, optimal sequences
📊 OpenTelemetry Metrics
- Automatic capture: Every tool call, LLM interaction, timing
- Span tree analysis: Visualize execution flow and bottlenecks
- Cost tracking: Token usage and estimated costs per test
- Performance breakdown: LLM time vs tool time vs overhead
- Error recovery: Track retry patterns and failure handling
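As a purely hypothetical illustration of what these processed metrics enable, the sketch below assumes a session accessor (get_metrics) and field names that are not taken from the library:

```python
# Hypothetical sketch only: the get_metrics() accessor and every field name
# below are assumptions used to illustrate what the parsed traces contain.
def summarize_run(session) -> None:
    metrics = session.get_metrics()  # assumed accessor over the parsed OTEL spans
    print(f"tool calls:     {len(metrics.tool_calls)}")
    print(f"total tokens:   {metrics.total_tokens}")
    print(f"estimated cost: ${metrics.estimated_cost:.4f}")
    print(f"LLM vs tool ms: {metrics.llm_time_ms} / {metrics.tool_time_ms}")
```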
🎨 Flexible Test Authoring
- Decorator style: Simple @task decorators for quick tests
- Pytest integration: Use familiar pytest fixtures and markers, run with uv run pytest (see the sketch after this list)
- Dataset driven: Systematic evaluation with test matrices
- AI generation: Let Claude/GPT generate test scenarios
- Parameterization: Test variations with minimal code
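For the pytest integration, a minimal sketch; the fixture names (mcp_agent, mcp_session) are assumptions and may differ in your installed version:

```python
import pytest

from mcp_eval import Expect

# Assumes pytest-asyncio (or an equivalent async plugin) and the fixture names
# mcp_agent / mcp_session, which are illustrative assumptions.
@pytest.mark.asyncio
async def test_fetch_reports_title(mcp_agent, mcp_session):
    response = await mcp_agent.generate_str(
        "Fetch https://example.com and tell me the page title"
    )
    await mcp_session.assert_that(Expect.tools.was_called("fetch"))
    await mcp_session.assert_that(
        Expect.content.contains("Example Domain"), response=response
    )
```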
🚀 Developer Experience
- Quick start: mcp-eval init sets up everything
- Smart CLI: Discover servers, generate tests, validate config
- Rich reports: Console, JSON, Markdown, interactive HTML
- CI/CD ready: GitHub Actions, exit codes, artifact uploads
- Helpful diagnostics: doctor and validate commands
How mcp-eval works
Architecture overview
Execution flow
1. Configure your environment - Define which MCP servers are available and configure your agent with appropriate tools and instructions.
2. Write test scenarios - Create tests that give your agent realistic tasks requiring tool use.
3. Execute tests - mcp-eval orchestrates the agent, captures all interactions via OpenTelemetry, and records comprehensive traces.
4. Process traces into metrics - OTEL traces are parsed to extract tool calls, timings, token usage, error patterns, and execution paths.
5. Apply assertions - Your assertions run against the response content and extracted metrics to verify behavior.
6. Generate reports - Results are compiled into multiple formats for different audiences and use cases.
A complete example
Here's a real-world test that showcases mcp-eval's capabilities:
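The sketch below shows the decorator style against an assumed "fetch" server; the specific Expect checks are illustrative rather than a verbatim excerpt.

```python
# Decorator-style test sketch; the "fetch" server name and the exact Expect
# checks are illustrative assumptions.
from mcp_eval import task, Expect

@task("Fetch example.com and summarize it in two sentences")
async def test_fetch_and_summarize(agent, session):
    response = await agent.generate_str(
        "Fetch https://example.com and summarize it in two sentences"
    )

    # Tool verification: the agent actually used the fetch tool
    await session.assert_that(Expect.tools.was_called("fetch"))

    # Content validation against the final response
    await session.assert_that(
        Expect.content.contains("Example Domain"), response=response
    )

    # Quality judge: an LLM grades the summary against a rubric
    await session.assert_that(
        Expect.judge.llm("The summary accurately reflects the fetched page"),
        response=response,
    )
```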
Key features in depth
🎯 Unified assertion API
All assertions use a single, discoverable API pattern:
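In practice the pattern is: build a check from the Expect catalog and hand it to session.assert_that. The family and check names in this sketch are assumptions for illustration:

```python
from mcp_eval import Expect

# One call shape for every assertion family; the specific names are assumptions.
async def check_everything(session, response: str) -> None:
    await session.assert_that(Expect.content.contains("Example Domain"), response=response)  # content
    await session.assert_that(Expect.tools.was_called("fetch"))                              # tools
    await session.assert_that(Expect.performance.max_iterations(3))                          # performance
    await session.assert_that(                                                               # LLM judge
        Expect.judge.llm("The answer is grounded in the fetched page"), response=response
    )
    await session.assert_that(Expect.path.efficiency())                                      # path analysis
```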
📡 OpenTelemetry integration
Every interaction generates detailed traces:
- Tool invocations with arguments and results
- LLM calls with token counts
- Timing breakdowns for each operation
- Error and retry patterns
- Nested span relationships
🤖 Multiple test styles
Choose the approach that fits your team:
- Decorator Style
- Pytest Style
- Dataset Style (see the sketch after this list)
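For the Dataset Style, a minimal two-case matrix; the Case/Dataset fields and the evaluate() call are assumptions based on the pattern described here:

```python
# Dataset-style sketch; field names and the evaluate() signature are assumptions.
from mcp_eval import Case, Dataset, Expect

dataset = Dataset(
    name="fetch_smoke_tests",
    cases=[
        Case(
            name="summarize_example",
            inputs="Fetch https://example.com and summarize it",
            evaluators=[Expect.tools.was_called("fetch")],
        ),
        Case(
            name="handle_missing_page",
            inputs="Fetch https://example.com/missing and explain what went wrong",
            evaluators=[Expect.tools.was_called("fetch")],
        ),
    ],
)

async def fetch_task(inputs, agent, session):
    # One agent turn per case; the runner applies each case's evaluators.
    return await agent.generate_str(inputs)

# Typically run from an async entry point:
#   report = await dataset.evaluate(fetch_task)
#   report.print()
```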
📈 Production-ready reporting
Generate reports in multiple formats:
- Console: Real-time progress and summaries
- JSON: Machine-readable for CI/CD pipelines
- Markdown: PR comments and documentation
- HTML: Interactive exploration with filtering
When to use mcp-eval
Perfect for
✅ MCP server development - Ensure your server handles agent requests correctly
✅ Agent development - Verify your agent uses tools effectively
✅ Integration testing - Test agent + server combinations before deployment
✅ Regression testing - Catch breaking changes in CI/CD
✅ Performance optimization - Identify bottlenecks and inefficiencies
✅ Quality gating - Enforce standards before merging code
Not designed for
❌ Unit testing - Use standard testing frameworks for isolated functions
❌ Load testing - Consider specialized tools for high-volume testing
❌ Security testing - Use dedicated security scanning tools
Next steps
Ready to start testing? Here's your path:
Quickstart Guide
Get your first test running in 5 minutes
Common Workflows
Learn testing patterns and best practices
Example Tests
Browse real-world test implementations
API Reference
Explore the complete assertion catalog
Technical foundations
mcp-eval is built on solid technical foundations:
- Async-first Python for performance and concurrency
- OpenTelemetry for vendor-neutral observability
- Pydantic for configuration and validation
- Rich CLI powered by Click and Rich libraries
- Extensible architecture for custom evaluators and reporters (see the sketch below)
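Because evaluators are pluggable, teams can encode project-specific checks. The sketch below is hypothetical: the metrics shape and the EvaluatorResult/evaluate() interface are assumptions, not the library's documented classes.

```python
# Hypothetical custom evaluator; the interface and metrics shape are assumptions.
from dataclasses import dataclass

@dataclass
class EvaluatorResult:
    passed: bool
    reason: str

class NoDestructiveTools:
    """Fails a test run if the agent called any tool on a deny-list."""

    def __init__(self, denied: set[str]):
        self.denied = denied

    def evaluate(self, metrics) -> EvaluatorResult:
        called = {call.name for call in metrics.tool_calls}  # assumed metrics shape
        bad = sorted(called & self.denied)
        return EvaluatorResult(
            passed=not bad,
            reason=f"denied tools called: {bad}" if bad else "ok",
        )
```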