Assertions use a single entrypoint and an expressive catalog. Prefer structural checks for stability, and combine them with judges only when necessary.
Use one entrypoint:

```python
await session.assert_that(Expect.content.contains("Example"), response=resp)
```
Catalog source: `catalog.py`
- Immediate (evaluated against a given `response`): content and judge
- Deferred (requires final metrics): tools, performance, path (see the sketch below)
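A minimal sketch of the difference, assuming a `session` and a `resp` like those in the examples below: an immediate check is passed the `response` directly, while a deferred check is simply registered and evaluated against final metrics when the session ends.

```python
# Immediate: evaluated right away against the given response text
await session.assert_that(Expect.content.contains("Example Domain"), response=resp)

# Deferred: no response needed; checked against final session metrics at session end
await session.assert_that(Expect.tools.was_called("fetch"))
```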
Content
- `Expect.content.contains(text, case_sensitive=False)`
- `Expect.content.not_contains(text, case_sensitive=False)`
- `Expect.content.regex(pattern, case_sensitive=False)`
Examples:

```python
resp = await agent.generate_str("Fetch https://example.com")
await session.assert_that(Expect.content.contains("Example Domain"), response=resp)
await session.assert_that(Expect.content.not_contains("Stack trace"), response=resp)
await session.assert_that(Expect.content.regex(r"Example\s+Domain"), response=resp)
```
Tools
- `Expect.tools.was_called(name, min_times=1)`
- `Expect.tools.called_with(name, {args})`
- `Expect.tools.count(name, expected_count)`
- `Expect.tools.success_rate(min_rate, tool_name=None)`
- `Expect.tools.failed(name)`
- `Expect.tools.output_matches(tool_name, expected_output, field_path?, match_type?, case_sensitive?, call_index?)`
- `Expect.tools.sequence([names], allow_other_calls=False)`
Notes:
- `field_path` supports nested dict/list access, e.g. `content[0].text` or `content.0.text`
Examples:

```python
# Verify a fetch occurred with the expected output pattern in the first content block
await session.assert_that(
    Expect.tools.output_matches(
        tool_name="fetch",
        expected_output=r"use.*examples",
        match_type="regex",
        case_sensitive=False,
        field_path="content[0].text",
    ),
    name="fetch_output_match",
)

# Verify sequence and counts
await session.assert_that(Expect.tools.sequence(["fetch"], allow_other_calls=True))
await session.assert_that(Expect.tools.count("fetch", 1))
await session.assert_that(Expect.tools.success_rate(1.0, tool_name="fetch"))
```
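The remaining tool checks follow the same pattern. A brief sketch (the `fetch` arguments and the `flaky_tool` name are illustrative, not part of the catalog):

```python
# The tool was invoked at least once
await session.assert_that(Expect.tools.was_called("fetch"))

# The tool was invoked with specific arguments (arguments shown are illustrative)
await session.assert_that(Expect.tools.called_with("fetch", {"url": "https://example.com"}))

# A tool expected to fail actually reported a failure
await session.assert_that(Expect.tools.failed("flaky_tool"))
```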
Performance
- `Expect.performance.max_iterations(n)`
- `Expect.performance.response_time_under(ms)`

Example:

```python
await session.assert_that(Expect.performance.max_iterations(3))
await session.assert_that(Expect.performance.response_time_under(10_000))
```
Judge
- `Expect.judge.llm(rubric, min_score=0.8, include_input=False, require_reasoning=True)`
- `Expect.judge.multi_criteria(criteria, aggregate_method="weighted", require_all_pass=False, include_confidence=True, use_cot=True, model=None)`
Examples:

```python
# Single-criterion rubric
judge = Expect.judge.llm(
    rubric="Response should identify JSON format and summarize main points",
    min_score=0.8,
    include_input=True,
)
await session.assert_that(judge, response=resp, name="quality_check")

# Multi-criteria
from mcp_eval.evaluators import EvaluationCriterion

criteria = [
    EvaluationCriterion(name="accuracy", description="Factual correctness", weight=2.0, min_score=0.8),
    EvaluationCriterion(name="completeness", description="Covers key points", weight=1.5, min_score=0.7),
]
judge_mc = Expect.judge.multi_criteria(criteria, aggregate_method="weighted", use_cot=True)
await session.assert_that(judge_mc, response=resp, name="multi_criteria")
```
Path
- `Expect.path.efficiency(optimal_steps?, expected_tool_sequence?, allow_extra_steps=0, penalize_backtracking=True, penalize_repeated_tools=True, tool_usage_limits?, default_tool_limit=1)`

Evaluators source: `evaluators/`
Examples:

```python
# Enforce exact order and single use of tools
await session.assert_that(
    Expect.path.efficiency(
        expected_tool_sequence=["validate", "process", "save"],
        tool_usage_limits={"validate": 1, "process": 1, "save": 1},
        allow_extra_steps=0,
        penalize_backtracking=True,
    ),
    name="golden_path",
)

# Allow a retry of a fragile step without failing the whole path
await session.assert_that(
    Expect.path.efficiency(
        expected_tool_sequence=["fetch", "parse"],
        tool_usage_limits={"fetch": 2, "parse": 1},
        allow_extra_steps=1,
        penalize_repeated_tools=False,
    ),
    name="robust_path",
)
```
Tips
- Prefer structural checks (`output_matches`) for tool outputs when possible for stability
- Use `name` in `assert_that(..., name="...")` to label checks in reports
- Combine judge + structural checks for high confidence

Combine a minimal judge (e.g., a rubric with `min_score=0.8`) with one or two structural checks (like `output_matches`) for resilient tests, as in the sketch below.
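A minimal sketch of that combination, reusing `resp` and the fetch example above (the rubric text is illustrative):

```python
# Structural checks: cheap, deterministic, stable across model changes
await session.assert_that(Expect.content.contains("Example Domain"), response=resp, name="has_domain")
await session.assert_that(
    Expect.tools.output_matches(
        tool_name="fetch",
        expected_output=r"Example\s+Domain",
        match_type="regex",
        field_path="content[0].text",
    ),
    name="fetch_output",
)

# One lightweight judge for overall quality
quality = Expect.judge.llm(
    rubric="Response accurately summarizes the fetched page",
    min_score=0.8,
)
await session.assert_that(quality, response=resp, name="summary_quality")
```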
Deferred assertions are evaluated when the session ends or when `when="end"` is used. If an assertion depends on final metrics (e.g., success rate), defer it.
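For example, a success-rate check can simply be registered without a `response` and evaluated at the end; the explicit `when="end"` form shown second is an assumption based on the note above, not a verified signature.

```python
# Deferred implicitly: success rate depends on final metrics, checked at session end
await session.assert_that(Expect.tools.success_rate(0.9, tool_name="fetch"))

# Explicit form; the when="end" keyword is assumed from the note above
await session.assert_that(Expect.tools.success_rate(0.9, tool_name="fetch"), when="end")
```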