Learn proven patterns for testing MCP servers and agents. Each workflow includes practical examples and tips from real-world usage.

Write your first test

1. Choose a test style

Pick the style that fits your workflow.

Decorator (simplest):
from mcp_eval import task, Expect

@task("My first test")
async def test_basic(agent, session):
    response = await agent.generate_str("Hello")
    await session.assert_that(Expect.content.contains("Hi"))

Pytest (familiar):
@pytest.mark.asyncio
async def test_basic(mcp_agent):
    response = await mcp_agent.generate_str("Hello")
    assert "Hi" in response
2. Add meaningful assertions

Start simple, then add more specific checks:
# Start with content
await session.assert_that(Expect.content.contains("success"))

# Add tool verification
await session.assert_that(Expect.tools.was_called("my_tool"))

# Include performance
await session.assert_that(Expect.performance.response_time_under(5000))
3. Run and iterate

# Run with verbose output
mcp-eval run test_basic.py -v

# Generate reports
mcp-eval run test_basic.py --html report.html
Review failures, adjust assertions, and rerun.
Start with simple content assertions, then gradually add tool and performance checks as you understand your system’s behavior.

Test an MCP server comprehensively

1. Identify server capabilities

List all tools your server provides:
mcp-eval server list --verbose
Note the available tools and their expected behaviors.
2. Create test scenarios for each tool

Write tests covering normal and edge cases:
@task("Test calculator add - normal case")
async def test_add_normal(agent, session):
    response = await agent.generate_str("Calculate 5 + 3")
    await session.assert_that(Expect.tools.was_called("add"))
    await session.assert_that(Expect.content.contains("8"))

@task("Test calculator add - edge case")
async def test_add_overflow(agent, session):
    response = await agent.generate_str("Calculate 999999999 + 999999999")
    await session.assert_that(Expect.tools.was_called("add"))
    # Check for appropriate handling
    await session.assert_that(
        Expect.content.regex(r"(overflow|large|error)", case_sensitive=False)
    )
3. Test error handling

Verify graceful failure:
@task("Test invalid input handling")
async def test_error_handling(agent, session):
    response = await agent.generate_str("Calculate abc + xyz")
    
    # Should either fail gracefully or explain the issue
    await session.assert_that(
        Expect.content.regex(r"(invalid|error|cannot|unable)")
    )
4. Create a comprehensive dataset

For systematic testing:
from mcp_eval import Dataset, Case
from mcp_eval.evaluators import ToolWasCalled, ResponseContains

dataset = Dataset(
    name="Calculator Server Tests",
    cases=[
        Case(
            name="addition",
            inputs="Calculate 10 + 20",
            expected_output="30",
            evaluators=[
                ToolWasCalled("add"),
                ResponseContains("30")
            ]
        ),
        # Add more cases...
    ]
)
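How you run a dataset depends on your installed mcp-eval version; as a hedged sketch (the evaluate method and the task signature below are assumptions, not confirmed API), each case's inputs are passed to a task function and the case's evaluators score the recorded session:

# Hedged sketch — dataset-runner API assumed; check your mcp-eval version's docs.
async def calculator_task(inputs, agent, session):
    # Each case's `inputs` string is sent to the agent; the case's
    # evaluators (ToolWasCalled, ResponseContains) score the result.
    return await agent.generate_str(inputs)

report = await dataset.evaluate(calculator_task)  # assumed method name
report.print()  # assumed reporting helper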

Create and enforce a golden path

Ensure your agent follows the optimal execution path:
1. Define the ideal tool sequence

Identify the minimal, correct sequence of tools:
# Example: validate → process → format
golden_path = ["validate_input", "process_data", "format_output"]
2. Add path efficiency assertion

await session.assert_that(
    Expect.path.efficiency(
        expected_tool_sequence=golden_path,
        tool_usage_limits={
            "validate_input": 1,
            "process_data": 1,
            "format_output": 1
        },
        allow_extra_steps=0,
        penalize_backtracking=True
    ),
    name="golden_path_check"
)
3. Debug path violations

When tests fail, examine the actual path:
# In your test
metrics = session.get_metrics()
actual_sequence = [call.name for call in metrics.tool_calls]
print(f"Expected: {golden_path}")
print(f"Actual: {actual_sequence}")
4. Refine agent instructions

If the agent deviates, improve its instructions:
agent = Agent(
    instruction="""
    IMPORTANT: Follow this exact sequence:
    1. First validate the input
    2. Then process the validated data
    3. Finally format the output
    Never skip steps or backtrack.
    """
)
Golden paths work best for deterministic workflows. For creative tasks, consider using allow_extra_steps or checking only critical waypoints.
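For example, a relaxed variant of the same assertion (values here are illustrative) checks only the critical waypoints:

# Relaxed check for less deterministic flows: require only key waypoints,
# tolerate a couple of extra steps, and skip the backtracking penalty.
await session.assert_that(
    Expect.path.efficiency(
        expected_tool_sequence=["validate_input", "format_output"],  # waypoints only
        allow_extra_steps=2,
        penalize_backtracking=False
    ),
    name="relaxed_path_check"
)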

Build quality gates with LLM judges

1. Start with a simple rubric

Define what “good” looks like:
judge = Expect.judge.llm(
    rubric="""
    The response should:
    - Accurately summarize the main points
    - Use clear, professional language
    - Be 2-3 sentences long
    """,
    min_score=0.8,
    include_input=True  # Give judge full context
)
2. Combine with structural checks

Don’t rely solely on judges:
# Structural check (deterministic)
await session.assert_that(
    Expect.tools.output_matches(
        tool_name="fetch",
        expected_output="Example Domain",
        match_type="contains"
    )
)

# Quality check (LLM judge)
await session.assert_that(judge, response=response)
3. Use multi-criteria for complex evaluation

from mcp_eval.evaluators import EvaluationCriterion

criteria = [
    EvaluationCriterion(
        name="accuracy",
        description="All facts are correct and up-to-date",
        weight=3.0,  # Most important
        min_score=0.9
    ),
    EvaluationCriterion(
        name="completeness",
        description="Covers all requested information",
        weight=2.0,
        min_score=0.8
    ),
    EvaluationCriterion(
        name="clarity",
        description="Easy to understand, well-organized",
        weight=1.0,
        min_score=0.7
    )
]

judge = Expect.judge.multi_criteria(
    criteria=criteria,
    aggregate_method="weighted",  # or "min" for strictest
    require_all_pass=False,  # Set True for strict gating
    use_cot=True  # Chain-of-thought reasoning
)
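Assert the multi-criteria judge exactly like the single-rubric one:

# Same call pattern as the simple judge shown earlier
await session.assert_that(judge, response=response)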
4. Calibrate thresholds

Run tests, collect scores, adjust:
# Start lenient
min_score=0.6

# After collecting data, tighten to p50 or p75
min_score=0.8  # Based on historical performance
Pro tip: Use Anthropic Claude (Opus or Sonnet) for best judge quality. They provide more consistent and nuanced evaluations.

Integrate with CI/CD

1. Add GitHub Actions workflow

Create .github/workflows/mcp-eval.yml:
name: mcp-eval Tests

on:
  pull_request:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    
    steps:
    - uses: actions/checkout@v3
    
    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.10'
    
    - name: Install dependencies
      run: |
        pip install mcpevals
        # Or using uv (faster!):
        # uv add mcpevals
        # Or from your repo:
        # pip install -e .
    
    - name: Run mcp-eval tests
      env:
        ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      run: |
        mcp-eval run tests/ \
          --json test-reports/results.json \
          --markdown test-reports/results.md \
          --html test-reports/index.html
    
    - name: Upload test reports
      if: always()
      uses: actions/upload-artifact@v3
      with:
        name: mcp-eval-reports
        path: test-reports/
    
    - name: Comment PR with results
      if: github.event_name == 'pull_request'
      uses: actions/github-script@v6
      with:
        script: |
          const fs = require('fs');
          const markdown = fs.readFileSync('test-reports/results.md', 'utf8');
          
          github.rest.issues.createComment({
            issue_number: context.issue.number,
            owner: context.repo.owner,
            repo: context.repo.repo,
            body: `## mcp-eval Test Results\n\n${markdown}`
          });
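The workflow assumes ANTHROPIC_API_KEY is stored as a repository secret; add it under your repo's Settings → Secrets and variables → Actions before the job can pass.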
2. Add test badges

In your README.md:
![mcp-eval Tests](https://github.com/YOUR_ORG/YOUR_REPO/actions/workflows/mcp-eval.yml/badge.svg)
3. Configure failure conditions

Make tests fail the build appropriately:
# In your test
import sys

critical_assertions = [
    Expect.tools.success_rate(min_rate=0.95),
    Expect.performance.response_time_under(10000)
]

for assertion in critical_assertions:
    result = await session.assert_that(assertion)
    if not result.passed:
        sys.exit(1)  # Fail CI

Generate tests with AI

1. Use the generate command

Let AI create test scenarios:
# Generate 10 pytest-style tests
mcp-eval generate \
  --style pytest \
  --n-examples 10 \
  --provider anthropic \
  --model claude-3-5-sonnet-20241022
2. Review and customize

AI-generated tests are a starting point:
# Generated test
@task("Test weather fetching")
async def test_weather(agent, session):
    response = await agent.generate_str("Get weather for NYC")
    await session.assert_that(Expect.tools.was_called("weather_api"))

# Add your domain knowledge
@task("Test weather fetching with validation")
async def test_weather_enhanced(agent, session):
    response = await agent.generate_str("Get weather for NYC")
    await session.assert_that(Expect.tools.was_called("weather_api"))
    # Add specific checks
    await session.assert_that(Expect.content.regex(r"\d+°[CF]"))
    await session.assert_that(Expect.content.contains("New York"))
3. Update existing tests

Add new scenarios to existing files:
mcp-eval generate \
  --update tests/test_weather.py \
  --style pytest \
  --n-examples 5

Debug failing tests

1. Enable verbose output

mcp-eval run test_file.py -v
Shows tool calls, responses, and assertion details.
2. Examine OTEL traces

Look in test-reports/test_name_*/trace.jsonl:
import json

with open("test-reports/test_abc123/trace.jsonl") as f:
    for line in f:
        span = json.loads(line)
        if span["name"].startswith("tool:"):
            print(f"Tool: {span['name']}")
            print(f"Duration: {span['duration_ms']}ms")
            print(f"Input: {span['attributes'].get('input')}")
            print(f"Output: {span['attributes'].get('output')}")
3. Use doctor and validate commands

# Check system health
mcp-eval doctor --full

# Validate configuration
mcp-eval validate
4. Add debug assertions

# Temporarily add debug output
metrics = session.get_metrics()
print(f"Tool calls: {[c.name for c in metrics.tool_calls]}")
print(f"Total duration: {metrics.total_duration_ms}ms")
print(f"Token cost: ${metrics.total_cost_usd}")

Next steps

Master these advanced topics: