This is a long-form guide adapted from and expanding on GUIDE.md. It’s organized for readers who prefer a single page.

What is mcp-eval?

Think of mcp-eval as your “flight simulator” for tool‑using LLMs. You plug in an agent, connect it to real MCP servers (tools), and run realistic scenarios. The framework captures OpenTelemetry (OTEL) traces as the single source of truth, turns them into metrics, and gives you expressive assertions for both content and behavior.

Core pieces

  • An agent (the system under test) connected to one or more real MCP servers, i.e. the tools
  • OTEL traces captured for every run, which serve as the single source of truth
  • Metrics derived from those traces
  • Expressive assertions over both response content and agent behavior

Getting Started

  1. Install mcp-eval globally: uv tool install mcpevals (recommended) or pip install mcpevals
  2. Initialize your project: mcp-eval init - interactive setup for API keys and configuration
  3. Add your MCP server: mcp-eval server add - configure the server you want to test
  4. Run tests: mcp-eval run tests/ - execute your test suite

Test servers written in any language: Your MCP server can be written in Python, TypeScript, Go, Rust, Java, or any other language. mcp-eval connects to it via the MCP protocol, making testing completely language-agnostic.

See the Quickstart page for detailed setup instructions.

Styles of Tests

Decorator style

from mcp_eval import Expect, task, setup, teardown, parametrize

@task("Test basic URL fetching functionality")
async def test_basic_fetch(agent, session):
    response = await agent.generate_str("Fetch the content from https://example.com")
    await session.assert_that(Expect.tools.was_called("fetch"), name="fetch_called", response=response)
    await session.assert_that(Expect.content.contains("Example Domain"), name="contains_domain", response=response)
Full example: test_decorator_style.py
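
The decorator API also exports setup, teardown, and parametrize. The sketch below shows parametrize widening coverage over several inputs; it assumes parametrize takes a pytest-style (argnames, argvalues) pair, so check the decorator reference for the exact signature in your version.

from mcp_eval import Expect, task, parametrize

# Run the same scenario against several inputs (pytest-style signature assumed).
@task("Fetch several example URLs")
@parametrize("url,expected", [
    ("https://example.com", "Example Domain"),
    ("https://example.org", "Example Domain"),
])
async def test_fetch_parametrized(agent, session, url, expected):
    response = await agent.generate_str(f"Fetch the content from {url}")
    await session.assert_that(Expect.tools.was_called("fetch"), name="fetch_called", response=response)
    await session.assert_that(Expect.content.contains(expected), name="contains_expected", response=response)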

Pytest style

import pytest
from mcp_eval import Expect

@pytest.mark.asyncio
async def test_basic_fetch_with_pytest(mcp_agent):
    response = await mcp_agent.generate_str("Fetch the content from https://example.com")
    await mcp_agent.session.assert_that(Expect.tools.was_called("fetch"), name="fetch_called", response=response)
    await mcp_agent.session.assert_that(Expect.content.contains("Example Domain"), name="contains_text", response=response)
Full example: test_pytest_style.py

Dataset style

from mcp_eval import Case, Dataset, ToolWasCalled, ResponseContains

cases = [
    Case(
        name="fetch_example",
        inputs="Fetch https://example.com",
        evaluators=[ToolWasCalled("fetch"), ResponseContains("Example Domain")],
    ),
]
dataset = Dataset(name="Fetch Suite", cases=cases)
report = await dataset.evaluate(lambda inputs, agent, session: agent.generate_str(inputs))
report.print(include_input=True, include_output=True)
Full example: test_dataset_style.py and basic_fetch_dataset.yaml

Assertions and Timing

mcp-eval decides automatically whether each evaluator runs immediately against the response or is deferred until the run's final metrics are available. See Assertions.
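
As a concrete illustration, the test below mixes both kinds of checks; the split between immediate and deferred evaluation is exactly the rule above, not extra API, and mcp-eval applies it without any change at the call site.

from mcp_eval import Expect, task

@task("Immediate and deferred checks in one test")
async def test_assertion_timing(agent, session):
    response = await agent.generate_str("Fetch the content from https://example.com")
    # A content check can be evaluated immediately: the response text is already in hand.
    await session.assert_that(Expect.content.contains("Example Domain"), name="contains_domain", response=response)
    # A tool-usage check needs the run's final metrics, so mcp-eval defers it to the
    # end of the test automatically; the call looks the same either way.
    await session.assert_that(Expect.tools.was_called("fetch"), name="fetch_called", response=response)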

Agent Evaluation

Define your agent as the system under test via use_agent and with_agent. See Agent Evaluation for patterns and metrics to watch.
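
A minimal sketch of pinning the system under test with use_agent, assuming use_agent is importable from mcp_eval like the other helpers and accepts an mcp-agent Agent built with name, instruction, and server_names (adjust the import and fields to your setup):

from mcp_agent.agents.agent import Agent
from mcp_eval import use_agent

# Register the agent under test once; subsequent tests run against it.
use_agent(
    Agent(
        name="fetcher",
        instruction="You fetch URLs and summarize their content concisely.",
        server_names=["fetch"],
    )
)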

Server Evaluation

Connect an MCP server, then write scenarios that exercise it through an agent. Use tool/path/efficiency assertions. See Server Evaluation.

Metrics & Tracing

OTEL is the source of truth. After a run, explore metrics and the span tree for loops, path inefficiency, and recovery. See Metrics & Tracing.
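
If you want to inspect the numbers from inside a test rather than in the generated report, something like the sketch below may work; the get_metrics() accessor and the exact fields it returns are assumptions here, so consult Metrics & Tracing for the real API.

from mcp_eval import task

@task("Inspect trace-derived metrics")
async def test_inspect_metrics(agent, session):
    await agent.generate_str("Fetch the content from https://example.com")
    # Hypothetical accessor over the OTEL-derived metrics for this run
    # (tool call counts, iterations, latency, and so on).
    metrics = session.get_metrics()
    print(metrics)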

Test Generation with LLMs

Use mcp-eval generate to bootstrap a comprehensive test suite. We recommend Anthropic's Claude Sonnet or Opus models for generation. See Test Generation.

CI/CD

Run in GitHub Actions and publish artifacts/badges. See CI/CD.

Troubleshooting

Use the mcp-eval doctor, mcp-eval validate, and mcp-eval issue commands to diagnose problems. See Troubleshooting.

Best Practices

  • Prefer objective, structural checks alongside LLM judges (see the sketch after this list)
  • Keep prompts clear and deterministic; gate performance checks separately (for example, in nightly runs)
  • Use parametrization to widen coverage
  • Keep server definitions in your mcp-agent config; use mcpeval.yaml for evaluation settings
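
For the first point, here is a hedged sketch of pairing a structural check with an LLM judge; Expect.judge.llm and its rubric argument are assumptions about the judge API, so see Assertions for the exact form.

from mcp_eval import Expect, task

@task("Summarize a fetched page")
async def test_summary_quality(agent, session):
    response = await agent.generate_str("Fetch https://example.com and summarize it in one sentence.")
    # Objective, structural check: the right tool was actually called.
    await session.assert_that(Expect.tools.was_called("fetch"), name="fetch_called", response=response)
    # LLM judge layered on top for subjective quality (judge API assumed).
    await session.assert_that(
        Expect.judge.llm("The response accurately and concisely summarizes the fetched page."),
        name="summary_quality",
        response=response,
    )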