What is mcp-eval?
Think of mcp-eval as your “flight simulator” for tool‑using LLMs. You plug in an agent, connect it to real MCP servers (tools), and run realistic scenarios. The framework captures OTEL traces as the single source of truth, turns them into metrics, and gives you expressive assertions for both content and behavior.
Core pieces
- TestSession and TestAgent
- Decorators and task runner
- Dataset and Case
- Expect (assertion catalog)
- Evaluators and Metrics
- Runner and CLI
Getting Started
- Install mcp-eval globally: `uv tool install mcpevals` (recommended) or `pip install mcpevals`
- Initialize your project: `mcp-eval init` (interactive setup for API keys and configuration)
- Add your MCP server: `mcp-eval server add` (configure the server you want to test)
- Run tests: `mcp-eval run tests/` (execute your test suite)
Test servers written in any language: your MCP server can be written in Python, TypeScript, Go, Rust, Java, or any other language. mcp-eval connects to it via the MCP protocol, making testing completely language-agnostic.
Styles of Tests
- Decorator style: register scenarios with the task decorators and runner
- Pytest style: run the same assertions from inside an existing pytest suite
- Dataset style: evaluate a task function across Dataset/Case collections
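A minimal decorator-style test looks roughly like this (a sketch based on the documented `task`/`Expect` API; treat exact module paths and signatures as assumptions and check the Assertions page):

```python
from mcp_eval import task, Expect
from mcp_eval.session import TestAgent, TestSession


@task("Agent fetches a page and reports its title")
async def test_fetch(agent: TestAgent, session: TestSession):
    # Drive the agent with a realistic prompt; all tool activity
    # is captured in the session's OTEL trace.
    response = await agent.generate_str(
        "Fetch https://example.com and give me the page title"
    )

    # Behavioral check against the trace: the fetch tool was actually used.
    await session.assert_that(Expect.tools.was_called("fetch"))

    # Content check against the final response text.
    await session.assert_that(
        Expect.content.contains("Example Domain"), response=response
    )
```

The pytest and dataset styles exercise the same assertion catalog, via pytest fixtures and `Dataset`/`Case` collections respectively.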
Assertions and Timing
Immediate vs. deferred execution of evaluators is handled automatically, based on whether final metrics are required. See Assertions.
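For example, a content assertion can be checked as soon as the response exists, while an iteration-budget assertion has to wait for the run's final metrics (a sketch reusing the hypothetical test above; `Expect.performance.max_iterations` follows the documented catalog, but verify the name against your installed version):

```python
# Evaluated immediately: only needs the response text.
await session.assert_that(
    Expect.content.contains("Example Domain"), response=response
)

# Deferred automatically: needs the run's final metrics, so it is
# evaluated once the session's trace is complete.
await session.assert_that(Expect.performance.max_iterations(5))
```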
Agent Evaluation
Define your agent as the system under test via `use_agent` and `with_agent`. See Agent Evaluation for patterns and metrics to watch.
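For instance, `use_agent` can pin one agent definition for a whole test module (a sketch; it assumes `use_agent` accepts an mcp-agent `Agent`, and the import path is an assumption to check against the docs):

```python
import mcp_eval
from mcp_agent.agents.agent import Agent

# Every test discovered in this module now runs against this agent.
mcp_eval.use_agent(
    Agent(
        name="fetch-tester",
        instruction="Fetch URLs on request and report their contents accurately.",
        server_names=["fetch"],  # servers defined in mcp-agent config
    )
)
```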
Server Evaluation
Connect an MCP server, then write scenarios that exercise it through an agent. Use tool/path/efficiency assertions. See Server Evaluation.
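A server-focused scenario leans on behavioral assertions derived from the trace rather than on the response text (a sketch; the `efficiency` parameter name is an assumption, so check the assertion catalog):

```python
@task("Server survives a bad URL and recovers")
async def test_recovery(agent: TestAgent, session: TestSession):
    await agent.generate_str(
        "Fetch https://no-such-host.invalid, then fetch https://example.com"
    )

    # Tool-level checks read from the trace, not the prose.
    await session.assert_that(Expect.tools.was_called("fetch"))
    await session.assert_that(
        Expect.path.efficiency(expected_tool_sequence=["fetch", "fetch"])
    )
```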
Metrics & Tracing
OTEL is the source of truth. After a run, explore the metrics and the span tree for loops, path inefficiency, and recovery behavior. See Metrics & Tracing.
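After a run, the derived metrics are available on the session (a sketch; `get_metrics` appears in the docs, but the field names below are assumptions for illustration):

```python
metrics = session.get_metrics()

# Hypothetical fields: loop count, cost, and per-tool-call timings.
print(metrics.iteration_count)
print(metrics.total_cost)
for call in metrics.tool_calls:
    print(call.name, call.duration_ms, call.is_error)
```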
Test Generation with LLMs
Use `mcp-eval generate` to bootstrap comprehensive tests. We recommend Anthropic Sonnet/Opus. See Test Generation.
CI/CD
Run in GitHub Actions and publish artifacts/badges. See CI/CD.
Troubleshooting
Use `mcp-eval doctor`, `validate`, and `issue` for diagnosis. See Troubleshooting.
Best Practices
- Prefer objective, structural checks alongside LLM judges
- Keep prompts clear and deterministic; gate performance separately (nightly)
- Use parametrization to widen coverage (see the sketch after this list)
- Keep servers in mcp‑agent config; use `mcpeval.yaml` for eval knobs
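A parametrized variant of the earlier sketch, assuming a `parametrize` decorator like the one shown in the docs (pytest-style tests can use `pytest.mark.parametrize` instead):

```python
from mcp_eval import task, parametrize, Expect


@parametrize("url", ["https://example.com", "https://httpbin.org/html"])
@task("Agent can fetch {url}")
async def test_fetch_many(agent: TestAgent, session: TestSession, url: str):
    # One case is generated per URL, widening coverage from a single test.
    await agent.generate_str(f"Fetch {url} and summarize it")
    await session.assert_that(Expect.tools.was_called("fetch"))
```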