Learn proven patterns for testing MCP servers and agents. Each workflow includes practical examples and tips from real-world usage.
Write your first test
Choose a test style
Pick the style that fits your workflow:

Decorator (simplest):

from mcp_eval import task, Expect

@task("My first test")
async def test_basic(agent, session):
    response = await agent.generate_str("Hello")
    await session.assert_that(Expect.content.contains("Hi"))

Pytest (familiar):

import pytest

@pytest.mark.asyncio
async def test_basic(mcp_agent):
    response = await mcp_agent.generate_str("Hello")
    assert "Hi" in response
Add meaningful assertions
Start simple, then add more specific checks:

# Start with content
await session.assert_that(Expect.content.contains("success"))

# Add tool verification
await session.assert_that(Expect.tools.was_called("my_tool"))

# Include performance
await session.assert_that(Expect.performance.response_time_under(5000))
Run and iterate
# Run with verbose output
mcp-eval run test_basic.py -v
# Generate reports
mcp-eval run test_basic.py --html report.html
Review failures, adjust assertions, and rerun.
Start with simple content assertions, then gradually add tool and performance checks as you understand your system’s behavior.
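For reference, here is what a test can look like once all three layers are in place; this is a sketch that reuses only the assertions shown above, and the prompt, tool name, and 5000 threshold are placeholders:

from mcp_eval import task, Expect

@task("Layered checks: content, tools, performance")
async def test_layered(agent, session):
    # Placeholder prompt -- substitute a request your server actually handles
    response = await agent.generate_str("Hello")

    # 1. Content check
    await session.assert_that(Expect.content.contains("Hi"))
    # 2. Tool verification ("my_tool" is a placeholder tool name)
    await session.assert_that(Expect.tools.was_called("my_tool"))
    # 3. Performance budget (placeholder threshold)
    await session.assert_that(Expect.performance.response_time_under(5000))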
Test an MCP server comprehensively
Identify server capabilities
List all tools your server provides:

mcp-eval server list --verbose

Note the available tools and their expected behaviors.
Create test scenarios for each tool
Write tests covering normal and edge cases:

@task("Test calculator add - normal case")
async def test_add_normal(agent, session):
    response = await agent.generate_str("Calculate 5 + 3")
    await session.assert_that(Expect.tools.was_called("add"))
    await session.assert_that(Expect.content.contains("8"))

@task("Test calculator add - edge case")
async def test_add_overflow(agent, session):
    response = await agent.generate_str("Calculate 999999999 + 999999999")
    await session.assert_that(Expect.tools.was_called("add"))
    # Check for appropriate handling
    await session.assert_that(
        Expect.content.regex(r"(overflow|large|error)", case_sensitive=False)
    )
Test error handling
Verify graceful failure:

@task("Test invalid input handling")
async def test_error_handling(agent, session):
    response = await agent.generate_str("Calculate abc + xyz")
    # Should either fail gracefully or explain the issue
    await session.assert_that(
        Expect.content.regex(r"(invalid|error|cannot|unable)")
    )
Create a comprehensive dataset
For systematic testing:

from mcp_eval import Dataset, Case
from mcp_eval.evaluators import ToolWasCalled, ResponseContains

dataset = Dataset(
    name="Calculator Server Tests",
    cases=[
        Case(
            name="addition",
            inputs="Calculate 10 + 20",
            expected_output="30",
            evaluators=[
                ToolWasCalled("add"),
                ResponseContains("30"),
            ],
        ),
        # Add more cases...
    ],
)
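An additional edge-case entry might look like the sketch below; it reuses the Case, ToolWasCalled, and ResponseContains pieces above, and the "subtract" tool name is an assumption about your server:

# A follow-up case you might append to `cases`; assumes your server
# exposes a "subtract" tool.
Case(
    name="subtraction_negative_result",
    inputs="Calculate 7 - 20",
    expected_output="-13",
    evaluators=[
        ToolWasCalled("subtract"),
        ResponseContains("-13"),
    ],
)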
Create and enforce a golden path
Ensure your agent follows the optimal execution path:
Define the ideal tool sequence
Identify the minimal, correct sequence of tools:

# Example: validate → process → format
golden_path = ["validate_input", "process_data", "format_output"]
Add path efficiency assertion
await session.assert_that(
    Expect.path.efficiency(
        expected_tool_sequence=golden_path,
        tool_usage_limits={
            "validate_input": 1,
            "process_data": 1,
            "format_output": 1,
        },
        allow_extra_steps=0,
        penalize_backtracking=True,
    ),
    name="golden_path_check",
)
Debug path violations
When tests fail, examine the actual path:

# In your test
metrics = session.get_metrics()
actual_sequence = [call.name for call in metrics.tool_calls]
print(f"Expected: {golden_path}")
print(f"Actual: {actual_sequence}")
Refine agent instructions
If the agent deviates, improve its instructions:

agent = Agent(
    instruction="""
    IMPORTANT: Follow this exact sequence:
    1. First validate the input
    2. Then process the validated data
    3. Finally format the output
    Never skip steps or backtrack.
    """
)
Golden paths work best for deterministic workflows. For creative tasks, consider allowing extra steps via allow_extra_steps, or check only critical waypoints, as sketched below.
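A minimal waypoint check can reuse the Expect.tools.was_called assertion from earlier; the tool names here are placeholders:

# Verify only the critical waypoints rather than the full golden path.
# "validate_input" and "format_output" are placeholder tool names.
for waypoint in ("validate_input", "format_output"):
    await session.assert_that(Expect.tools.was_called(waypoint))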
Build quality gates with LLM judges
Start with a simple rubric
Define what “good” looks like:

judge = Expect.judge.llm(
    rubric="""
    The response should:
    - Accurately summarize the main points
    - Use clear, professional language
    - Be 2-3 sentences long
    """,
    min_score=0.8,
    include_input=True,  # Give the judge full context
)
Combine with structural checks
Don’t rely solely on judges:

# Structural check (deterministic)
await session.assert_that(
    Expect.tools.output_matches(
        tool_name="fetch",
        expected_output="Example Domain",
        match_type="contains",
    )
)

# Quality check (LLM judge)
await session.assert_that(judge, response=response)
Use multi-criteria for complex evaluation
from mcp_eval.evaluators import EvaluationCriterion

criteria = [
    EvaluationCriterion(
        name="accuracy",
        description="All facts are correct and up-to-date",
        weight=3.0,  # Most important
        min_score=0.9,
    ),
    EvaluationCriterion(
        name="completeness",
        description="Covers all requested information",
        weight=2.0,
        min_score=0.8,
    ),
    EvaluationCriterion(
        name="clarity",
        description="Easy to understand, well-organized",
        weight=1.0,
        min_score=0.7,
    ),
]

judge = Expect.judge.multi_criteria(
    criteria=criteria,
    aggregate_method="weighted",  # or "min" for strictest
    require_all_pass=False,       # Set True for strict gating
    use_cot=True,                 # Chain-of-thought reasoning
)
Calibrate thresholds
Run tests, collect scores, adjust:

# Start lenient
min_score=0.6

# After collecting data, tighten to p50 or p75
min_score=0.8  # Based on historical performance
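Once you have a set of historical judge scores, Python's statistics module can suggest the p50/p75 cut-offs; the scores below are placeholder values:

import statistics

# Judge scores collected from previous runs (placeholder values)
scores = [0.72, 0.76, 0.78, 0.81, 0.85, 0.88, 0.90]

p25, p50, p75 = statistics.quantiles(scores, n=4)
print(f"p50={p50:.2f}  p75={p75:.2f}")  # candidates for the new min_score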
Pro tip: Use Anthropic Claude (Opus or Sonnet) for best judge quality. They provide more consistent and nuanced evaluations.
Integrate with CI/CD
Add GitHub Actions workflow
Create .github/workflows/mcp-eval.yml using the reusable workflow:

name: MCP-Eval CI

on:
  push:
    branches: [main, master, trunk]
  pull_request:
  workflow_dispatch:

# Cancel redundant runs on the same ref
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  call-mcpeval:
    uses: lastmile-ai/mcp-eval/.github/workflows/mcpeval-reusable.yml@main
    with:
      deploy-pages: true
      # Optional: customize test configuration
      # python-version: '3.11'
      # tests: 'tests/'
      # run-args: '-v --max-concurrency 4'
    permissions:
      contents: read
      pages: write
      id-token: write
      pull-requests: write
    secrets: inherit
This reusable workflow automatically:
- Runs tests and generates reports
- Posts PR comments with results
- Uploads artifacts
- Deploys badges and HTML reports to GitHub Pages (on main branch)
Enable GitHub Pages
In your repository settings:
- Go to Settings → Pages
- Source: Deploy from a branch
- Branch: gh-pages (created automatically by the workflow)
- Save the settings
Your badges and reports will be available at:
- Badges: https://YOUR_USERNAME.github.io/YOUR_REPO/badges/
- Report: https://YOUR_USERNAME.github.io/YOUR_REPO/
Add test badges from GitHub Pages
After deploying to GitHub Pages, you can add badges to your README.md to show users your mcp-eval test and coverage status. Point each badge image at the files published under https://YOUR_USERNAME.github.io/YOUR_REPO/badges/ and link it to the report at https://YOUR_USERNAME.github.io/YOUR_REPO/.
These badges will automatically update after each push to main.
Configure failure conditions
Make tests fail the build appropriately:

import sys

# In your test
critical_assertions = [
    Expect.tools.success_rate(min_rate=0.95),
    Expect.performance.response_time_under(10000),
]

for assertion in critical_assertions:
    result = await session.assert_that(assertion)
    if not result.passed:
        sys.exit(1)  # Fail CI
Generate tests with AI
Use the generate command
Let AI create test scenarios:

# Generate 10 pytest-style tests
mcp-eval generate \
    --style pytest \
    --n-examples 10 \
    --provider anthropic \
    --model claude-3-5-sonnet-20241022
Review and customize
AI-generated tests are a starting point:

# Generated test
@task("Test weather fetching")
async def test_weather(agent, session):
    response = await agent.generate_str("Get weather for NYC")
    await session.assert_that(Expect.tools.was_called("weather_api"))

# Add your domain knowledge
@task("Test weather fetching with validation")
async def test_weather_enhanced(agent, session):
    response = await agent.generate_str("Get weather for NYC")
    await session.assert_that(Expect.tools.was_called("weather_api"))
    # Add specific checks
    await session.assert_that(Expect.content.regex(r"\d+°[CF]"))
    await session.assert_that(Expect.content.contains("New York"))
Update existing tests
Add new scenarios to existing files:

mcp-eval generate \
    --update tests/test_weather.py \
    --style pytest \
    --n-examples 5
Debug failing tests
Enable verbose output
mcp-eval run test_file.py -v
Shows tool calls, responses, and assertion details.
Examine OTEL traces
Look in test-reports/test_name_*/trace.jsonl:

import json

with open("test-reports/test_abc123/trace.jsonl") as f:
    for line in f:
        span = json.loads(line)
        if span["name"].startswith("tool:"):
            print(f"Tool: {span['name']}")
            print(f"Duration: {span['duration_ms']}ms")
            print(f"Input: {span['attributes'].get('input')}")
            print(f"Output: {span['attributes'].get('output')}")
Use doctor and validate commands
# Check system health
mcp-eval doctor --full
# Validate configuration
mcp-eval validate
Add debug assertions
# Temporarily add debug output
metrics = session.get_metrics()
print(f"Tool calls: {[c.name for c in metrics.tool_calls]}")
print(f"Total duration: {metrics.total_duration_ms}ms")
print(f"Token cost: ${metrics.total_cost_usd}")
Next steps
Master these advanced topics: