What is mcp-eval?
Think of mcp-eval as your “flight simulator” for tool‑using LLMs. You plug in an agent, connect it to real MCP servers (tools), and run realistic scenarios. The framework captures OTEL traces as the single source of truth, turns them into metrics, and gives you expressive assertions for both content and behavior.
Core pieces
- TestSession and TestAgent
- Decorators and task runner
- Dataset and Case
- Expect (assertion catalog)
- Evaluators and Metrics
- Runner and CLI
Getting Started
- Install mcp-eval globally: `uv tool install mcpevals` (recommended) or `pip install mcpevals`
- Initialize your project: `mcp-eval init` (interactive setup for API keys and configuration)
- Add your MCP server: `mcp-eval server add` (configure the server you want to test)
- Run tests: `mcp-eval run tests/` (execute your test suite)
Test servers written in any language: your MCP server can be written in Python, TypeScript, Go, Rust, Java, or any other language. mcp-eval connects to it via the MCP protocol, making testing completely language-agnostic.
Styles of Tests
- Decorator style: register scenarios with the task decorators and runner
- Pytest style: run the same assertions from inside an existing pytest suite
- Dataset style: evaluate a task function across Dataset/Case collections
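A minimal decorator-style test looks roughly like this (a sketch based on the documented `task`/`Expect` API; treat exact module paths and signatures as assumptions and check the Assertions page):

```python
from mcp_eval import task, Expect
from mcp_eval.session import TestAgent, TestSession


@task("Agent fetches a page and reports its title")
async def test_fetch(agent: TestAgent, session: TestSession):
    # Drive the agent with a realistic prompt; all tool activity
    # is captured in the session's OTEL trace.
    response = await agent.generate_str(
        "Fetch https://example.com and give me the page title"
    )

    # Behavioral check against the trace: the fetch tool was actually used.
    await session.assert_that(Expect.tools.was_called("fetch"))

    # Content check against the final response text.
    await session.assert_that(
        Expect.content.contains("Example Domain"), response=response
    )
```

The pytest and dataset styles exercise the same assertion catalog, via pytest fixtures and `Dataset`/`Case` collections respectively.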
Assertions and Timing
Immediate vs. deferred execution of evaluators is handled automatically, based on whether final metrics are required. See Assertions.
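For example, a content assertion can be checked as soon as the response exists, while an iteration-budget assertion has to wait for the run's final metrics (a sketch reusing the hypothetical test above; `Expect.performance.max_iterations` follows the documented catalog, but verify the name against your installed version):

```python
# Evaluated immediately: only needs the response text.
await session.assert_that(
    Expect.content.contains("Example Domain"), response=response
)

# Deferred automatically: needs the run's final metrics, so it is
# evaluated once the session's trace is complete.
await session.assert_that(Expect.performance.max_iterations(5))
```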
Agent Evaluation
Define your agent as the system under test via `use_agent` and `with_agent`. See Agent Evaluation for patterns and metrics to watch.
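For instance, `use_agent` can pin one agent definition for a whole test module (a sketch; it assumes `use_agent` accepts an mcp-agent `Agent`, and the import path is an assumption to check against the docs):

```python
import mcp_eval
from mcp_agent.agents.agent import Agent

# Every test discovered in this module now runs against this agent.
mcp_eval.use_agent(
    Agent(
        name="fetch-tester",
        instruction="Fetch URLs on request and report their contents accurately.",
        server_names=["fetch"],  # servers defined in mcp-agent config
    )
)
```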
Server Evaluation
Connect an MCP server, then write scenarios that exercise it through an agent. Use tool/path/efficiency assertions. See Server Evaluation.
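A server-focused scenario leans on behavioral assertions derived from the trace rather than on the response text (a sketch; the `efficiency` parameter name is an assumption, so check the assertion catalog):

```python
@task("Server survives a bad URL and recovers")
async def test_recovery(agent: TestAgent, session: TestSession):
    await agent.generate_str(
        "Fetch https://no-such-host.invalid, then fetch https://example.com"
    )

    # Tool-level checks read from the trace, not the prose.
    await session.assert_that(Expect.tools.was_called("fetch"))
    await session.assert_that(
        Expect.path.efficiency(expected_tool_sequence=["fetch", "fetch"])
    )
```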
Metrics & Tracing
OTEL is the source of truth. After a run, explore the metrics and the span tree for loops, path inefficiency, and recovery behavior. See Metrics & Tracing.
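After a run, the derived metrics are available on the session (a sketch; `get_metrics` appears in the docs, but the field names below are assumptions for illustration):

```python
metrics = session.get_metrics()

# Hypothetical fields: loop count, cost, and per-tool-call timings.
print(metrics.iteration_count)
print(metrics.total_cost)
for call in metrics.tool_calls:
    print(call.name, call.duration_ms, call.is_error)
```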
Test Generation with LLMs
Use `mcp-eval generate` to bootstrap comprehensive tests. We recommend Anthropic Sonnet/Opus. See Test Generation.
CI/CD
Run in GitHub Actions and publish artifacts/badges. See CI/CD.
Troubleshooting
Use `mcp-eval doctor`, `validate`, and `issue` for diagnosis. See Troubleshooting.
Best Practices
- Prefer objective, structural checks alongside LLM judges
- Keep prompts clear and deterministic; gate performance separately (nightly)
- Use parametrization to widen coverage (see the sketch after this list)
- Keep servers in mcp‑agent config; use `mcpeval.yaml` for eval knobs
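A parametrized variant of the earlier sketch, assuming a `parametrize` decorator like the one shown in the docs (pytest-style tests can use `pytest.mark.parametrize` instead):

```python
from mcp_eval import task, parametrize, Expect


@parametrize("url", ["https://example.com", "https://httpbin.org/html"])
@task("Agent can fetch {url}")
async def test_fetch_many(agent: TestAgent, session: TestSession, url: str):
    # One case is generated per URL, widening coverage from a single test.
    await agent.generate_str(f"Fetch {url} and summarize it")
    await session.assert_that(Expect.tools.was_called("fetch"))
```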