Evaluate your agent’s reasoning, tool use, recovery, and quality by driving it through realistic tasks.

Define the test agent

  • Global default:
import mcp_eval
from mcp_agent.agents.agent_spec import AgentSpec

mcp_eval.use_agent(
    AgentSpec(name="Fetcher", instruction="You fetch.", server_names=["fetch"])  # see [Settings](https://github.com/lastmile-ai/mcp-eval/blob/main/src/mcp_eval/config.py)
)
  • Per‑test override with with_agent (place above @task):
from mcp_eval.core import with_agent, task
from mcp_agent.agents.agent import Agent

@with_agent(Agent(name="Custom", instruction="Custom", server_names=["fetch"]))  # see [Core](https://github.com/lastmile-ai/mcp-eval/blob/main/src/mcp_eval/core.py)
@task("Custom agent test")
async def test_custom(agent, session):
    resp = await agent.generate_str("Fetch https://example.com")
    await session.assert_that(Expect.tools.was_called("fetch"))  # verify the fetch tool was actually used
  • Factory for parallel safety (each test gets a fresh agent instance, so no state is shared across concurrently running tests):
from mcp_eval.config import use_agent_factory
from mcp_agent.agents.agent import Agent

def make_agent():
    return Agent(name="Isolated", instruction="...", server_names=["fetch"])  # see [Settings](https://github.com/lastmile-ai/mcp-eval/blob/main/src/mcp_eval/config.py)

use_agent_factory(make_agent)
More patterns: agent_definition_examples.py.
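
Whichever pattern you use, the selected agent is passed into your test as the agent argument. A minimal sketch using the global default registered above (the task name and prompt are illustrative):

from mcp_eval.core import task

@task("Default agent test")
async def test_default(agent, session):
    # `agent` is the Fetcher registered via mcp_eval.use_agent above
    await agent.generate_str("Fetch https://example.com")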

What to measure

  • Tool behavior: Expect.tools.was_called, called_with, sequence, output_matches
  • Efficiency and iterations: Expect.performance.max_iterations, Expect.path.efficiency
  • Quality: Expect.judge.llm, Expect.judge.multi_criteria (see the judge sketch after the examples below)
  • Performance: response times, concurrency (see "Inspecting spans and metrics" below)
# Efficiency and iteration bounds
await session.assert_that(Expect.performance.max_iterations(3))

# Tool behavior and outputs
await session.assert_that(Expect.tools.was_called("fetch"))
await session.assert_that(Expect.tools.output_matches("fetch", {"isError": False}, match_type="partial"))

# Path and sequence
await session.assert_that(Expect.tools.sequence(["fetch"], allow_other_calls=True))
await session.assert_that(Expect.path.efficiency(expected_tool_sequence=["fetch"], allow_extra_steps=1))
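
A sketch of the quality checks, assuming resp is the string returned by agent.generate_str; the rubric and min_score parameters and the response keyword are assumptions, so check Expect.judge in your mcp-eval version:

# Quality: grade the response against a rubric (parameter names are assumptions)
await session.assert_that(
    Expect.judge.llm(
        rubric="Accurately summarizes the content of https://example.com",
        min_score=0.8,
    ),
    response=resp,
)
# Expect.judge.multi_criteria works the same way with several named criteria.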

Styles for agent evals

Inspecting spans and metrics

metrics = session.get_metrics()      # metrics derived from the session's OTel trace
span_tree = session.get_span_tree()  # hierarchical span tree for the same run
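
A sketch of using these inside a task; the attribute names on metrics (iteration_count, tool_calls) are assumptions for illustration, so inspect the returned object in your mcp-eval version:

from mcp_eval.core import task

@task("Inspect metrics")
async def test_inspect_metrics(agent, session):
    await agent.generate_str("Fetch https://example.com")

    metrics = session.get_metrics()
    span_tree = session.get_span_tree()  # walk this for per-call timing and nesting

    # Attribute names below are assumptions for illustration.
    print(metrics.iteration_count)   # LLM iterations the run took
    print(len(metrics.tool_calls))   # tool calls recorded during the run
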
Sources: