Master these core concepts to write effective tests with mcp-eval. Each concept builds on the previous ones to create a complete testing framework.

Overview

mcp-eval orchestrates interactions between three key components:
  1. Agents - AI models that can use tools
  2. MCP Servers - Tool providers implementing the Model Context Protocol
  3. Test Sessions - Orchestrators that manage execution and collect metrics
Think of it like a stage play: The agent is the actor, MCP servers provide the props and scenery, and the test session is the director capturing everything for review.

TestSession (single source of truth)

TestSession is the orchestrator that manages the entire test lifecycle. It configures OpenTelemetry tracing, runs the agent, collects spans, computes metrics, and saves artifacts.

Key responsibilities

  • Trace management: Configures and captures OTEL traces
  • Metrics extraction: Converts traces into actionable metrics (tool calls, latency, token usage, costs)
  • Assertion coordination: Manages immediate and deferred assertion evaluation
  • Report generation: Creates JSON, HTML, and Markdown reports

Metrics derived from traces

From the OTEL traces, TestSession extracts:
  • Tool invocation details (names, arguments, outputs, timing)
  • Iteration counts and conversation turns
  • Token usage and estimated costs
  • Performance breakdowns (LLM time vs tool time)
  • Error patterns and recovery sequences
Source: TestSession
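As a minimal sketch of reading these back, assuming the session exposes a get_metrics() accessor and the field names shown here (both may differ in your version):
async with test_session("metrics-demo") as session:
    await session.agent.generate_str("Fetch https://example.com")
    metrics = session.get_metrics()   # accessor name assumed
    print(metrics.tool_calls)         # tool invocation details
    print(metrics.iteration_count)    # iterations and conversation turns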

TestAgent

TestAgent is a wrapper around the runtime agent that provides testing-specific functionality and connects assertions to the session.

Key features

  • Simplified API: generate_str() for string responses
  • Direct assertion access: agent.assert_that() shortcut
  • Session integration: Automatically connected to TestSession’s metrics
Example usage:
# import path assumed; adjust if your version exports these elsewhere
from mcp_eval import Expect, test_session

async with test_session("my-test") as session:
    agent = session.agent
    response = await agent.generate_str("Do something")
    await agent.assert_that(Expect.content.contains("done"), response=response)
Source: TestAgent

Unified assertion API

mcp-eval uses a single, discoverable API pattern for all assertions:
await session.assert_that(Expect.category.check(...), response=..., when=...)

Immediate vs deferred assertions

Understanding assertion timing is crucial for debugging test failures.
Immediate assertions (run when called with a response):
  • Content checks (contains, regex)
  • LLM judges (quality evaluation)
Deferred assertions (run at session end, need complete metrics):
  • Tool usage (was_called, count, sequence)
  • Performance (response_time_under, max_iterations)
  • Path efficiency analysis
Catalog source: Expect catalog
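A short sketch contrasting the two timings; the check names come from the lists above, while the tool name "fetch" and the argument shapes are illustrative:
async with test_session("assertion-timing") as session:
    response = await session.agent.generate_str("Fetch https://example.com")
    # Immediate: needs the response, evaluated on the spot
    await session.assert_that(Expect.content.contains("Example"), response=response)
    # Deferred: recorded now, evaluated at session end from the complete metrics
    await session.assert_that(Expect.tools.was_called("fetch"))
    await session.assert_that(Expect.performance.max_iterations(3))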

Test styles

mcp-eval supports three testing approaches to fit different workflows:

Decorator style

Simple and expressive for quick tests:
from mcp_eval import Expect, task  # import path assumed

@task("Verify fetch works")
async def test_fetch(agent, session):
    response = await agent.generate_str("Fetch https://example.com")
    await session.assert_that(Expect.content.contains("Example"), response=response)
Features: @task, @setup, @teardown, @parametrize
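A hedged sketch of @parametrize; the pytest-style decorator signature is an assumption:
@task("Verify fetch works across URLs")
@parametrize("url", ["https://example.com", "https://example.org"])
async def test_fetch_many(agent, session, url):
    response = await agent.generate_str(f"Fetch {url}")
    await session.assert_that(Expect.content.contains("Example"), response=response)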

Pytest integration

Familiar for teams already using pytest:
import pytest

@pytest.mark.asyncio
async def test_with_pytest(mcp_agent):
    response = await mcp_agent.generate_str("Fetch data")
    assert "success" in response
Features: fixtures (mcp_session, mcp_agent), markers (@pytest.mark.mcp_agent)
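The unified assertion API is available here too; a sketch, assuming the mcp_agent and mcp_session fixtures can be requested together:
@pytest.mark.asyncio
async def test_with_assertions(mcp_agent, mcp_session):
    response = await mcp_agent.generate_str("Fetch https://example.com")
    await mcp_session.assert_that(Expect.content.contains("Example"), response=response)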

Dataset style

Systematic evaluation with test matrices:
dataset = Dataset(
    name="Comprehensive suite",
    cases=[
        Case("test_1", inputs="Do X", evaluators=[...]),
        Case("test_2", inputs="Do Y", evaluators=[...])
    ]
)
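To run it, a sketch assuming an evaluate() method that takes a task function and returns a printable report (both the task signature and the report API are assumptions):
async def run_case(inputs, agent, session):
    return await agent.generate_str(inputs)

report = await dataset.evaluate(run_case)
report.print()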

LLM judges

LLM-based evaluation for subjective quality assessment:

Single criterion

judge = Expect.judge.llm(
    rubric="Response should be helpful and accurate",
    min_score=0.8
)
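Judges are immediate assertions, so pass them to assert_that together with the response under evaluation:
await session.assert_that(judge, response=response)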

Multi-criteria evaluation

criteria = [
    EvaluationCriterion("accuracy", "Factually correct", weight=2.0),
    EvaluationCriterion("clarity", "Easy to understand", weight=1.5)
]
judge = Expect.judge.multi_criteria(criteria)
Judge model and provider defaults are configured via MCPEvalSettings.

Architecture flow

A typical run proceeds as follows:
  1. A test defines a task in one of the styles above
  2. TestSession configures OTEL tracing and creates the TestAgent
  3. The agent executes, calling tools on the configured MCP servers
  4. Collected spans are converted into metrics
  5. Immediate assertions run during execution; deferred assertions run at session end
  6. Reports (JSON, HTML, Markdown) are saved as artifacts

Next steps

With these concepts understood, you're ready to write your first tests.