Learn proven patterns for testing MCP servers and agents. Each workflow includes practical examples and tips from real-world usage.
Write your first test
Choose a test style
Pick the style that fits your workflow:

Decorator (simplest):

from mcp_eval import task, Expect

@task("My first test")
async def test_basic(agent, session):
    response = await agent.generate_str("Hello")
    await session.assert_that(Expect.content.contains("Hi"))

Pytest (familiar):

import pytest

@pytest.mark.asyncio
async def test_basic(mcp_agent):
    response = await mcp_agent.generate_str("Hello")
    assert "Hi" in response
Add meaningful assertions
Start simple, then add more specific checks:

# Start with content
await session.assert_that(Expect.content.contains("success"))

# Add tool verification
await session.assert_that(Expect.tools.was_called("my_tool"))

# Include performance
await session.assert_that(Expect.performance.response_time_under(5000))
Run and iterate
# Run with verbose output
mcp-eval run test_basic.py -v
# Generate reports
mcp-eval run test_basic.py --html report.html
Review failures, adjust assertions, and rerun.
Start with simple content assertions, then gradually add tool and performance checks as you understand your system’s behavior.
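For reference, here is what a test can look like once all three layers are in place; this is a sketch that reuses only the assertions shown above, and the prompt, tool name, and 5000 threshold are placeholders:

from mcp_eval import task, Expect

@task("Layered checks: content, tools, performance")
async def test_layered(agent, session):
    # Placeholder prompt -- substitute a request your server actually handles
    response = await agent.generate_str("Hello")

    # 1. Content check
    await session.assert_that(Expect.content.contains("Hi"))
    # 2. Tool verification ("my_tool" is a placeholder tool name)
    await session.assert_that(Expect.tools.was_called("my_tool"))
    # 3. Performance budget (placeholder threshold)
    await session.assert_that(Expect.performance.response_time_under(5000))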
Test an MCP server comprehensively
Identify server capabilities
List all tools your server provides:

mcp-eval server list --verbose

Note the available tools and their expected behaviors.
Create test scenarios for each tool
Write tests covering normal and edge cases:

@task("Test calculator add - normal case")
async def test_add_normal(agent, session):
    response = await agent.generate_str("Calculate 5 + 3")
    await session.assert_that(Expect.tools.was_called("add"))
    await session.assert_that(Expect.content.contains("8"))

@task("Test calculator add - edge case")
async def test_add_overflow(agent, session):
    response = await agent.generate_str("Calculate 999999999 + 999999999")
    await session.assert_that(Expect.tools.was_called("add"))
    # Check for appropriate handling
    await session.assert_that(
        Expect.content.regex(r"(overflow|large|error)", case_sensitive=False)
    )
Test error handling
Verify graceful failure:

@task("Test invalid input handling")
async def test_error_handling(agent, session):
    response = await agent.generate_str("Calculate abc + xyz")
    # Should either fail gracefully or explain the issue
    await session.assert_that(
        Expect.content.regex(r"(invalid|error|cannot|unable)")
    )
Create a comprehensive dataset
For systematic testing:

from mcp_eval import Dataset, Case
from mcp_eval.evaluators import ToolWasCalled, ResponseContains

dataset = Dataset(
    name="Calculator Server Tests",
    cases=[
        Case(
            name="addition",
            inputs="Calculate 10 + 20",
            expected_output="30",
            evaluators=[
                ToolWasCalled("add"),
                ResponseContains("30"),
            ],
        ),
        # Add more cases...
    ],
)
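An additional edge-case entry might look like the sketch below; it reuses the Case, ToolWasCalled, and ResponseContains pieces above, and the "subtract" tool name is an assumption about your server:

# A follow-up case you might append to `cases`; assumes your server
# exposes a "subtract" tool.
Case(
    name="subtraction_negative_result",
    inputs="Calculate 7 - 20",
    expected_output="-13",
    evaluators=[
        ToolWasCalled("subtract"),
        ResponseContains("-13"),
    ],
)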
Create and enforce a golden path
Ensure your agent follows the optimal execution path:
Define the ideal tool sequence
Identify the minimal, correct sequence of tools:

# Example: validate → process → format
golden_path = ["validate_input", "process_data", "format_output"]
Add path efficiency assertion
await session.assert_that(
    Expect.path.efficiency(
        expected_tool_sequence=golden_path,
        tool_usage_limits={
            "validate_input": 1,
            "process_data": 1,
            "format_output": 1,
        },
        allow_extra_steps=0,
        penalize_backtracking=True,
    ),
    name="golden_path_check",
)
Debug path violations
When tests fail, examine the actual path:

# In your test
metrics = session.get_metrics()
actual_sequence = [call.name for call in metrics.tool_calls]
print(f"Expected: {golden_path}")
print(f"Actual: {actual_sequence}")
Refine agent instructions
If the agent deviates, improve its instructions:

agent = Agent(
    instruction="""
    IMPORTANT: Follow this exact sequence:
    1. First validate the input
    2. Then process the validated data
    3. Finally format the output
    Never skip steps or backtrack.
    """
)
Golden paths work best for deterministic workflows. For creative tasks, consider allowing extra steps via allow_extra_steps, or check only critical waypoints, as sketched below.
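A minimal waypoint check can reuse the Expect.tools.was_called assertion from earlier; the tool names here are placeholders:

# Verify only the critical waypoints rather than the full golden path.
# "validate_input" and "format_output" are placeholder tool names.
for waypoint in ("validate_input", "format_output"):
    await session.assert_that(Expect.tools.was_called(waypoint))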
Build quality gates with LLM judges
Start with a simple rubric
Define what “good” looks like:

judge = Expect.judge.llm(
    rubric="""
    The response should:
    - Accurately summarize the main points
    - Use clear, professional language
    - Be 2-3 sentences long
    """,
    min_score=0.8,
    include_input=True,  # Give the judge full context
)
Combine with structural checks
Don’t rely solely on judges:

# Structural check (deterministic)
await session.assert_that(
    Expect.tools.output_matches(
        tool_name="fetch",
        expected_output="Example Domain",
        match_type="contains",
    )
)

# Quality check (LLM judge)
await session.assert_that(judge, response=response)
Use multi-criteria for complex evaluation
from mcp_eval.evaluators import EvaluationCriterion

criteria = [
    EvaluationCriterion(
        name="accuracy",
        description="All facts are correct and up-to-date",
        weight=3.0,  # Most important
        min_score=0.9,
    ),
    EvaluationCriterion(
        name="completeness",
        description="Covers all requested information",
        weight=2.0,
        min_score=0.8,
    ),
    EvaluationCriterion(
        name="clarity",
        description="Easy to understand, well-organized",
        weight=1.0,
        min_score=0.7,
    ),
]

judge = Expect.judge.multi_criteria(
    criteria=criteria,
    aggregate_method="weighted",  # or "min" for strictest
    require_all_pass=False,       # Set True for strict gating
    use_cot=True,                 # Chain-of-thought reasoning
)
Calibrate thresholds
Run tests, collect scores, adjust:

# Start lenient
min_score=0.6

# After collecting data, tighten to p50 or p75
min_score=0.8  # Based on historical performance
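Once you have a set of historical judge scores, Python's statistics module can suggest the p50/p75 cut-offs; the scores below are placeholder values:

import statistics

# Judge scores collected from previous runs (placeholder values)
scores = [0.72, 0.76, 0.78, 0.81, 0.85, 0.88, 0.90]

p25, p50, p75 = statistics.quantiles(scores, n=4)
print(f"p50={p50:.2f}  p75={p75:.2f}")  # candidates for the new min_score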
Pro tip: Use Anthropic Claude (Opus or Sonnet) for best judge quality. They provide more consistent and nuanced evaluations.
Integrate with CI/CD
Add GitHub Actions workflow
Create .github/workflows/mcp-eval.yml using the reusable workflow:

name: MCP-Eval CI

on:
  push:
    branches: [main, master, trunk]
  pull_request:
  workflow_dispatch:

# Cancel redundant runs on the same ref
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  call-mcpeval:
    uses: lastmile-ai/mcp-eval/.github/workflows/mcpeval-reusable.yml@main
    with:
      deploy-pages: true
      # Optional: customize test configuration
      # python-version: '3.11'
      # tests: 'tests/'
      # run-args: '-v --max-concurrency 4'
    permissions:
      contents: read
      pages: write
      id-token: write
      pull-requests: write
    secrets: inherit
This reusable workflow automatically:
- Runs tests and generates reports
- Posts PR comments with results
- Uploads artifacts
- Deploys badges and HTML reports to GitHub Pages (on main branch)
Enable GitHub Pages
In your repository settings:
- Go to Settings → Pages
- Source: Deploy from a branch
- Branch: gh-pages (created automatically by the workflow)
- Save the settings
Your badges and reports will be available at:
- Badges: https://YOUR_USERNAME.github.io/YOUR_REPO/badges/
- Report: https://YOUR_USERNAME.github.io/YOUR_REPO/
Add test badges from GitHub Pages
After deploying to GitHub Pages, you can add badges to your README.md to show users your mcp-eval test and coverage status. Point each badge image at the files published under https://YOUR_USERNAME.github.io/YOUR_REPO/badges/ and link it to the report at https://YOUR_USERNAME.github.io/YOUR_REPO/.
These badges will automatically update after each push to main.
Configure failure conditions
Make tests fail the build appropriately:

import sys

# In your test
critical_assertions = [
    Expect.tools.success_rate(min_rate=0.95),
    Expect.performance.response_time_under(10000),
]

for assertion in critical_assertions:
    result = await session.assert_that(assertion)
    if not result.passed:
        sys.exit(1)  # Fail CI
Generate tests with AI
Use the generate command
Let AI create test scenarios:

# Generate 10 pytest-style tests
mcp-eval generate \
    --style pytest \
    --n-examples 10 \
    --provider anthropic \
    --model claude-3-5-sonnet-20241022
Review and customize
AI-generated tests are a starting point:

# Generated test
@task("Test weather fetching")
async def test_weather(agent, session):
    response = await agent.generate_str("Get weather for NYC")
    await session.assert_that(Expect.tools.was_called("weather_api"))

# Add your domain knowledge
@task("Test weather fetching with validation")
async def test_weather_enhanced(agent, session):
    response = await agent.generate_str("Get weather for NYC")
    await session.assert_that(Expect.tools.was_called("weather_api"))
    # Add specific checks
    await session.assert_that(Expect.content.regex(r"\d+°[CF]"))
    await session.assert_that(Expect.content.contains("New York"))
Update existing tests
Add new scenarios to existing files:

mcp-eval generate \
    --update tests/test_weather.py \
    --style pytest \
    --n-examples 5
Debug failing tests
Enable verbose output
mcp-eval run test_file.py -v
Shows tool calls, responses, and assertion details.
Examine OTEL traces
Look in test-reports/test_name_*/trace.jsonl:

import json

with open("test-reports/test_abc123/trace.jsonl") as f:
    for line in f:
        span = json.loads(line)
        if span["name"].startswith("tool:"):
            print(f"Tool: {span['name']}")
            print(f"Duration: {span['duration_ms']}ms")
            print(f"Input: {span['attributes'].get('input')}")
            print(f"Output: {span['attributes'].get('output')}")
Use doctor and validate commands
# Check system health
mcp-eval doctor --full
# Validate configuration
mcp-eval validate
Add debug assertions
# Temporarily add debug output
metrics = session.get_metrics()
print(f"Tool calls: {[c.name for c in metrics.tool_calls]}")
print(f"Total duration: {metrics.total_duration_ms}ms")
print(f"Token cost: ${metrics.total_cost_usd}")
Next steps
Master these advanced topics: