The Session API is the heart of mcp-eval testing. It manages your agent’s lifecycle, collects metrics, runs assertions, and produces comprehensive test results.

Quick start

The simplest way to create a test session:
from mcp_eval.session import test_session
from mcp_eval.catalog import Expect

async with test_session("my-test") as agent:
    # Agent is ready with MCP servers connected
    response = await agent.generate_str("Fetch https://example.com")
    
    # Run assertions
    await agent.assert_that(
        Expect.content.contains("Example Domain"),
        response=response
    )

Core concepts

TestSession

The orchestrator that manages everything (see the sketch after this list):
  • Lifecycle management: Starts/stops agents and MCP servers
  • Tool discovery: Automatically finds and registers MCP tools
  • Metrics collection: Tracks all interactions via OTEL
  • Assertion execution: Runs evaluators at the right time
  • Report generation: Produces test artifacts
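
Taken together, a single test exercises all of these. The sketch below combines only calls documented elsewhere on this page (test_session, generate_str, assert_that, get_metrics); treat it as an illustration rather than a canonical recipe:

from mcp_eval.session import test_session
from mcp_eval.catalog import Expect

async def session_overview():
    # Lifecycle: the agent and its MCP servers start on enter, stop on exit
    async with test_session("session-overview") as agent:
        # Tool discovery has already run; the agent can call MCP tools
        response = await agent.generate_str("Fetch https://example.com")

        # Assertion execution: content checks are evaluated immediately
        await agent.assert_that(
            Expect.content.contains("Example Domain"),
            response=response,
            name="fetched_example_domain",
        )

        # Metrics collection: interactions so far are already tracked via OTEL
        metrics = agent.session.get_metrics()
        print(f"Tool calls so far: {len(metrics.tool_calls)}")
    # Report generation: results and artifacts are finalized when the session closes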

TestAgent

A thin, friendly wrapper around your LLM agent:
  • Simple interface: just generate_str() / generate() and assert_that()
  • Automatic tracking: All interactions are recorded
  • Context preservation: Maintains conversation state

Creating sessions

Basic session creation

# Using context manager (recommended)
async with test_session("test-name") as agent:
    # Your test code here
    pass

# Manual lifecycle (advanced)
from mcp_eval.session import TestSession  # same module as test_session

session = TestSession(test_name="test-name")
agent = await session.__aenter__()
try:
    # Your test code
    ...
finally:
    await session.__aexit__(None, None, None)
    session.cleanup()

Session with custom configuration

from mcp_eval.session import test_session
from mcp_agent.agents.agent_spec import AgentSpec

spec = AgentSpec(
    name="custom",
    instruction="You are a helpful test assistant",
    server_names=["my_server"],
)
async with test_session("custom-test", agent=spec) as agent:
    # Your test code
    pass

Agent interactions

Generating responses

# Simple string generation
response = await agent.generate_str("What is 2+2?")
print(response)  # "The answer is 4"

# Full response object may be available depending on provider; prefer generate_str for portability

Multi-turn conversations

# Sessions maintain context
response1 = await agent.generate_str("My name is Alice")
response2 = await agent.generate_str("What's my name?")
# response2 will correctly identify "Alice"

Assertions in depth

Immediate vs. deferred assertions

# Immediate: evaluated right away (content, judge)
await session.assert_that(
    Expect.content.contains("success"),
    response=response,  # Required for immediate
    name="has_success"
)

# Deferred: evaluated at session end (tools, performance, path)
await session.assert_that(
    Expect.tools.was_called("calculator"),
    name="used_calculator"  # No response needed
)

# Force deferred evaluation at end
await session.assert_that(
    Expect.content.contains("final"),
    response=response,
    when="end"  # Defer even content checks
)

Assertion timing control

# Evaluate specific assertions immediately
result = await session.evaluate_now_async(
    Expect.performance.response_time_under(5000),
    response=response,
    name="quick_response"
)

if not result.passed:
    print(f"Too slow: {result.details}")
    # Take corrective action

# Batch evaluate multiple assertions
results = await session.evaluate_now_async(
    Expect.tools.success_rate(0.95),
    Expect.performance.max_iterations(3)
)

Named assertions for better reporting

# Always name your assertions for clarity
await session.assert_that(
    Expect.content.regex(r"\d+ items? found"),
    response=response,
    name="item_count_format"  # Appears in reports
)

Metrics and results

Accessing metrics during tests

# Get current metrics
metrics = session.get_metrics()

print(f"Tool calls: {len(metrics.tool_calls)}")
print(f"Total tokens: {metrics.total_tokens}")
print(f"Duration so far: {metrics.total_duration_ms}ms")
print(f"Estimated cost: ${metrics.total_cost_usd:.4f}")

# Detailed tool information
for call in metrics.tool_calls:
    print(f"Tool: {call.name}")
    print(f"Duration: {call.duration_ms}ms")
    print(f"Success: {call.success}")
    if not call.success:
        print(f"Error: {call.error}")

Getting test results

# Check if all assertions passed
if session.all_passed():
    print("✅ All tests passed!")
else:
    print("❌ Some tests failed")

# Get detailed results
results = session.get_results()
for result in results:
    print(f"Assertion: {result.name}")
    print(f"Passed: {result.passed}")
    if not result.passed:
        print(f"Reason: {result.details}")

# Get pass/fail summary
summary = session.get_summary()
print(f"Passed: {summary['passed']}/{summary['total']}")
print(f"Pass rate: {summary['pass_rate']:.1%}")

Duration tracking

# Get test duration
duration_ms = session.get_duration_ms()
print(f"Test took {duration_ms/1000:.2f} seconds")

# Track specific operations
from time import time

start = time()
response = await agent.generate_str("Complex task")
operation_time = (time() - start) * 1000

if operation_time > 5000:
    print(f"Warning: Operation took {operation_time:.0f}ms")

OpenTelemetry traces

Accessing trace data

# Get structured span tree
span_tree = session.get_span_tree()

def print_spans(span, indent=0):
    prefix = "  " * indent
    print(f"{prefix}{span.name}: {span.duration_ms}ms")
    for child in span.children:
        print_spans(child, indent + 1)

print_spans(span_tree)

# Ensure traces are written to disk
await session._ensure_traces_flushed()

Custom span attributes

# Add custom attributes to current span
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("custom_operation") as span:
    span.set_attribute("user_id", "123")
    span.set_attribute("operation_type", "validation")
    
    response = await agent.generate_str("Validate user input")

Artifacts and reporting

Session artifacts

# Sessions automatically save artifacts
session = await TestSession.create(
    test_name="my-test",
    output_dir="test-reports",  # Custom output location
    save_artifacts=True  # Enable artifact saving
)

# After test completion, find artifacts at:
# test-reports/my-test_[timestamp]/
#   ├── trace.jsonl          # OTEL traces
#   ├── results.json         # Test results
#   ├── metrics.json         # Performance metrics
#   └── conversation.json    # Full conversation log
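
Once a run completes you can consume these files directly, for example to feed results into another system. A minimal sketch, assuming the layout above; the timestamped directory name here is made up, and the exact JSON schemas depend on your mcp-eval version:

import json
from pathlib import Path

# Hypothetical run directory; substitute the actual timestamped folder
run_dir = Path("test-reports") / "my-test_2024-01-01T12-00-00"

results = json.loads((run_dir / "results.json").read_text())
metrics = json.loads((run_dir / "metrics.json").read_text())

print(results)  # inspect the structure before relying on specific keys
print(metrics)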

Programmatic report generation

# Generate reports programmatically
from mcp_eval.reports import ReportGenerator

generator = ReportGenerator(session)

# Generate different formats
await generator.save_json("results.json")
await generator.save_markdown("results.md")
await generator.save_html("results.html")

# Get report data for custom processing
report_data = generator.get_report_data()
print(f"Test: {report_data['test_name']}")
print(f"Duration: {report_data['duration_ms']}ms")
print(f"Passed: {report_data['passed']}/{report_data['total']}")

Advanced patterns

Custom session hooks

class CustomSession(TestSession):
    async def on_tool_call(self, tool_name: str, args: dict):
        """Hook called before each tool execution."""
        print(f"About to call {tool_name} with {args}")
        
        # Validate tool usage
        if tool_name == "dangerous_tool":
            raise ValueError("Dangerous tool not allowed in tests")
    
    async def on_assertion_complete(self, result):
        """Hook called after each assertion."""
        if not result.passed:
            # Log failures to external system
            await self.log_to_monitoring(result)

Session state management

# Store custom state in session
session.state["test_user_id"] = "user_123"
session.state["test_context"] = {"environment": "staging"}

# Access state in assertions or hooks
user_id = session.state.get("test_user_id")

Parallel session execution

import asyncio

async def run_test(test_name: str, prompt: str):
    async with test_session(test_name) as agent:
        response = await agent.generate_str(prompt)
        await agent.assert_that(
            Expect.content.contains("success"),
            response=response
        )
        return agent.session.all_passed()

# Run multiple tests in parallel
results = await asyncio.gather(
    run_test("test1", "Task 1"),
    run_test("test2", "Task 2"),
    run_test("test3", "Task 3")
)

print(f"All passed: {all(results)}")

Best practices

  • Use context managers: Always use async with test_session() to ensure proper cleanup, even if tests fail.
  • Name your assertions: Provide descriptive names; they make debugging much easier when reviewing test reports.
  • Monitor metrics: Check metrics during long-running tests to catch performance issues early.
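
The sketch below applies all three practices in one test; the prompt, test name, and assertion name are illustrative:

async def report_format_test():
    # Context manager: cleanup runs even if a call or assertion fails
    async with test_session("report-format") as agent:
        response = await agent.generate_str("Summarize the last 7 days of activity")

        # Named assertion: the name appears in reports, making failures easy to trace
        await agent.assert_that(
            Expect.content.regex(r"\d+ events?"),
            response=response,
            name="summary_mentions_event_count",
        )

        # Mid-test metrics check: catch runaway latency or cost early
        metrics = agent.session.get_metrics()
        if metrics.total_duration_ms > 30_000:
            print(f"Warning: already at {metrics.total_duration_ms}ms")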

Error handling

try:
    async with test_session("error-test") as agent:
        response = await agent.generate_str("Test prompt")
        await agent.assert_that(
            Expect.content.contains("expected"),
            response=response
        )
except TimeoutError:
    print("Test timed out - increase timeout_seconds")
except AssertionError as e:
    print(f"Assertion failed: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
    # Session cleanup is still guaranteed

See also