examples/mcp_server_fetch/
Overview
This example validates a simple MCP server that exposes a `fetch` tool to retrieve web content. The tests illustrate three styles (decorators, pytest, and legacy assertions) and demonstrate how to combine structural assertions, path constraints, and LLM judges.
Goals
- Verify the agent calls the `fetch` tool when appropriate
- Check extracted content for known signals (e.g., “Example Domain”)
- Ensure efficient paths (no unnecessary steps)
- Evaluate quality with rubric-based judges
What you’ll learn
- Choosing assertions per outcome type (structural, tool, path, judge)
- Designing resilient tests using immediate vs deferred checks
- Reading metrics and span trees to diagnose behavior
Structure
- `datasets/` – YAML and Python datasets
- `tests/` – pytest style, decorators, assertions style
- `golden_paths/` – expected sequences
- `mcpeval.yaml` – config for provider, reports
Run
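The exact invocation depends on your setup; since the example includes pytest-style tests, something along these lines should work from this directory, with provider and report settings read from `mcpeval.yaml`:

```bash
# Run the pytest-style tests; adjust paths and flags for your environment
pytest tests/ -v
```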
Assertion design and rationale
1) Prove the right tool was used
When a prompt requires reading a URL, we assert the `fetch` tool was called, and use `Expect.tools.count("fetch", 1)` to detect duplicate calls.
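A minimal sketch of such a test in the decorator style. Aside from `Expect.tools.count`, the names here (`@task`, the `agent`/`session` fixtures, `generate_str`, `Expect.tools.was_called`, `session.assert_that`) are assumptions based on this page's snippets; see `tests/` for the real signatures:

```python
from mcp_eval import Expect, task  # assumed import path

@task("calls fetch exactly once for a single-URL prompt")
async def test_fetch_is_called(agent, session):
    response = await agent.generate_str("Fetch https://example.com and summarize it")

    # Deferred checks: evaluated against the session's recorded tool calls
    await session.assert_that(Expect.tools.was_called("fetch"), response=response)
    await session.assert_that(Expect.tools.count("fetch", 1), response=response)
```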
2) Validate output structure rather than brittle text
For tool outputs, prefer structural checks over raw substring matching.
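For instance, the tool's result itself can be checked for an expected signal rather than matching the assistant's prose. The `Expect.tools.output_matches` name and its parameters are assumptions for illustration:

```python
from mcp_eval import Expect, task  # assumed import path

@task("fetch output contains the expected page signal")
async def test_fetch_output_structure(agent, session):
    response = await agent.generate_str("Fetch https://example.com")

    # Check the tool's own output, not the assistant's prose around it
    await session.assert_that(
        Expect.tools.output_matches(
            tool_name="fetch",
            expected_output="Example Domain",  # signal expected in the tool result
            match_type="contains",             # structural "contains", not exact text
        ),
        response=response,
    )
```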
3) Check content cues in the assistant’s final message
After tool use, assert the answer includes expected signals.
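A sketch of an immediate content check; `Expect.content.contains` and its `case_sensitive` flag are assumed names:

```python
from mcp_eval import Expect, task  # assumed import path

@task("final answer mentions the page's key signal")
async def test_answer_content(agent, session):
    response = await agent.generate_str("Fetch https://example.com and summarize it")

    # Immediate check against the assistant's final message
    await session.assert_that(
        Expect.content.contains("Example Domain", case_sensitive=False),
        response=response,
    )
```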
4) Constrain the path and efficiency
For simple fetch tasks, we expect a single `fetch` and minimal steps.
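One way this constraint might be written; `Expect.path.efficiency` and its parameters are assumptions:

```python
from mcp_eval import Expect, task  # assumed import path

@task("takes the direct path: one fetch, no detours")
async def test_fetch_path(agent, session):
    response = await agent.generate_str("Fetch https://example.com")

    # Deferred path check against the recorded tool sequence
    await session.assert_that(
        Expect.path.efficiency(
            expected_tool_sequence=["fetch"],  # the golden path for this task
            allow_extra_steps=1,               # small tolerance for a retry
        ),
        response=response,
    )
```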
5) Enforce iteration and latency budgets
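Budgets can be expressed as deferred performance checks. The names and thresholds below are illustrative assumptions, not the example's actual values:

```python
from mcp_eval import Expect, task  # assumed import path

@task("stays within iteration and latency budgets")
async def test_budgets(agent, session):
    response = await agent.generate_str("Fetch https://example.com")

    # Deferred performance checks; thresholds are illustrative
    await session.assert_that(Expect.performance.max_iterations(3), response=response)
    await session.assert_that(
        Expect.performance.response_time_under(10_000),  # assumed milliseconds
        response=response,
    )
```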
6) Use judges when “quality” is subjective
Some checks need subjective evaluation (e.g., “good summary”). Use rubric‑based judges; the sketch below covers both a single rubric and the multi‑criteria variant described next.
7) Multi‑criteria judges for richer rubrics
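A sketch covering both styles; `Expect.judge.llm`, `Expect.judge.multi_criteria`, and their parameters are assumptions for illustration:

```python
from mcp_eval import Expect, task  # assumed import path

@task("summary quality judged against a rubric")
async def test_summary_quality(agent, session):
    response = await agent.generate_str("Fetch https://example.com and summarize it")

    # Single-rubric judge
    await session.assert_that(
        Expect.judge.llm(
            rubric="The answer accurately summarizes the fetched page and mentions 'Example Domain'.",
            min_score=0.8,
        ),
        response=response,
    )

    # Multi-criteria variant for richer rubrics
    await session.assert_that(
        Expect.judge.multi_criteria(
            criteria={
                "accuracy": "Reflects what the page actually says",
                "completeness": "Mentions the key signal 'Example Domain'",
                "clarity": "Summary is concise and readable",
            },
            min_score=0.75,
        ),
        response=response,
    )
```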
Immediate vs deferred: how these fit together
- Immediate: content/judge checks that rely on `response`
- Deferred: tools/path/performance checks that need session metrics
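Putting both kinds together in one test might look like the following, using the same assumed names as the sketches above:

```python
from mcp_eval import Expect, task  # assumed import path

@task("fetch example.com and summarize it")
async def test_fetch_and_summarize(agent, session):
    response = await agent.generate_str("Fetch https://example.com and summarize it")

    # Immediate: evaluated directly against the returned response text
    await session.assert_that(Expect.content.contains("Example Domain"), response=response)

    # Deferred: evaluated against tool-call and timing metrics recorded for the session
    await session.assert_that(Expect.tools.was_called("fetch"), response=response)
    await session.assert_that(Expect.performance.max_iterations(3), response=response)
```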
What it demonstrates
- Fetch tool end‑to‑end scenarios
- Dataset style configs and generated cases
- Tool sequence and output matching
- Judge rubric for quality checks
Placeholder: add screenshots of the HTML report for a passing run and for a failure showing a mismatched tool output.