Assertions use a single entrypoint and an expressive catalog. Prefer structural checks for stability, and combine them with judges only when necessary.
Use one entrypoint:

```python
await session.assert_that(Expect.content.contains("Example"), response=resp)
```
Catalog source: `catalog.py`
- Immediate (evaluated against a given `response`): content and judge
- Deferred (requires final metrics): tools, performance, path (see the sketch below)
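A minimal sketch of the difference, assuming a `session` and a `resp` like those in the examples below: an immediate check is passed the `response` directly, while a deferred check is simply registered and evaluated against final metrics when the session ends.

```python
# Immediate: evaluated right away against the given response text
await session.assert_that(Expect.content.contains("Example Domain"), response=resp)

# Deferred: no response needed; checked against final session metrics at session end
await session.assert_that(Expect.tools.was_called("fetch"))
```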
Content
- `Expect.content.contains(text, case_sensitive=False)`
- `Expect.content.not_contains(text, case_sensitive=False)`
- `Expect.content.regex(pattern, case_sensitive=False)`
Examples:

```python
resp = await agent.generate_str("Fetch https://example.com")
await session.assert_that(Expect.content.contains("Example Domain"), response=resp)
await session.assert_that(Expect.content.not_contains("Stack trace"), response=resp)
await session.assert_that(Expect.content.regex(r"Example\s+Domain"), response=resp)
```
Tools
- `Expect.tools.was_called(name, min_times=1)`
- `Expect.tools.called_with(name, {args})`
- `Expect.tools.count(name, expected_count)`
- `Expect.tools.success_rate(min_rate, tool_name=None)`
- `Expect.tools.failed(name)`
- `Expect.tools.output_matches(tool_name, expected_output, field_path?, match_type?, case_sensitive?, call_index?)`
- `Expect.tools.sequence([names], allow_other_calls=False)`
Notes:
- `field_path` supports nested dict/list access, e.g. `content[0].text` or `content.0.text`
Examples:

```python
# Verify a fetch occurred with the expected output pattern in the first content block
await session.assert_that(
    Expect.tools.output_matches(
        tool_name="fetch",
        expected_output=r"use.*examples",
        match_type="regex",
        case_sensitive=False,
        field_path="content[0].text",
    ),
    name="fetch_output_match",
)

# Verify sequence and counts
await session.assert_that(Expect.tools.sequence(["fetch"], allow_other_calls=True))
await session.assert_that(Expect.tools.count("fetch", 1))
await session.assert_that(Expect.tools.success_rate(1.0, tool_name="fetch"))
```
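The remaining tool checks follow the same pattern. A brief sketch (the `fetch` arguments and the `flaky_tool` name are illustrative, not part of the catalog):

```python
# The tool was invoked at least once
await session.assert_that(Expect.tools.was_called("fetch"))

# The tool was invoked with specific arguments (arguments shown are illustrative)
await session.assert_that(Expect.tools.called_with("fetch", {"url": "https://example.com"}))

# A tool expected to fail actually reported a failure
await session.assert_that(Expect.tools.failed("flaky_tool"))
```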
Performance
- `Expect.performance.max_iterations(n)`
- `Expect.performance.response_time_under(ms)`

Example:

```python
await session.assert_that(Expect.performance.max_iterations(3))
await session.assert_that(Expect.performance.response_time_under(10_000))
```
Judge
- `Expect.judge.llm(rubric, min_score=0.8, include_input=False, require_reasoning=True)`
- `Expect.judge.multi_criteria(criteria, aggregate_method="weighted", require_all_pass=False, include_confidence=True, use_cot=True, model=None)`
Examples:

```python
# Single-criterion rubric
judge = Expect.judge.llm(
    rubric="Response should identify JSON format and summarize main points",
    min_score=0.8,
    include_input=True,
)
await session.assert_that(judge, response=resp, name="quality_check")

# Multi-criteria
from mcp_eval.evaluators import EvaluationCriterion

criteria = [
    EvaluationCriterion(name="accuracy", description="Factual correctness", weight=2.0, min_score=0.8),
    EvaluationCriterion(name="completeness", description="Covers key points", weight=1.5, min_score=0.7),
]
judge_mc = Expect.judge.multi_criteria(criteria, aggregate_method="weighted", use_cot=True)
await session.assert_that(judge_mc, response=resp, name="multi_criteria")
```
Path
- `Expect.path.efficiency(optimal_steps?, expected_tool_sequence?, allow_extra_steps=0, penalize_backtracking=True, penalize_repeated_tools=True, tool_usage_limits?, default_tool_limit=1)`

Evaluators source: `evaluators/`
Examples:

```python
# Enforce exact order and single use of tools
await session.assert_that(
    Expect.path.efficiency(
        expected_tool_sequence=["validate", "process", "save"],
        tool_usage_limits={"validate": 1, "process": 1, "save": 1},
        allow_extra_steps=0,
        penalize_backtracking=True,
    ),
    name="golden_path",
)

# Allow a retry of a fragile step without failing the whole path
await session.assert_that(
    Expect.path.efficiency(
        expected_tool_sequence=["fetch", "parse"],
        tool_usage_limits={"fetch": 2, "parse": 1},
        allow_extra_steps=1,
        penalize_repeated_tools=False,
    ),
    name="robust_path",
)
```
Tips
- Prefer structural checks (`output_matches`) for tool outputs when possible for stability
- Use `name` in `assert_that(..., name="...")` to label checks in reports
- Combine judge + structural checks for high confidence

Combine a minimal judge (e.g., a rubric with `min_score=0.8`) with one or two structural checks (like `output_matches`) for resilient tests, as in the sketch below.
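A minimal sketch of that combination, reusing `resp` and the fetch example above (the rubric text is illustrative):

```python
# Structural checks: cheap, deterministic, stable across model changes
await session.assert_that(Expect.content.contains("Example Domain"), response=resp, name="has_domain")
await session.assert_that(
    Expect.tools.output_matches(
        tool_name="fetch",
        expected_output=r"Example\s+Domain",
        match_type="regex",
        field_path="content[0].text",
    ),
    name="fetch_output",
)

# One lightweight judge for overall quality
quality = Expect.judge.llm(
    rubric="Response accurately summarizes the fetched page",
    min_score=0.8,
)
await session.assert_that(quality, response=resp, name="summary_quality")
```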
Deferred assertions are evaluated when the session ends or when `when="end"` is used. If an assertion depends on final metrics (e.g., success rate), defer it.
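For example, a success-rate check can simply be registered without a `response` and evaluated at the end; the explicit `when="end"` form shown second is an assumption based on the note above, not a verified signature.

```python
# Deferred implicitly: success rate depends on final metrics, checked at session end
await session.assert_that(Expect.tools.success_rate(0.9, tool_name="fetch"))

# Explicit form; the when="end" keyword is assumed from the note above
await session.assert_that(Expect.tools.success_rate(0.9, tool_name="fetch"), when="end")
```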