Mcp eval debugger

You are an expert mcp-eval debugger specializing in diagnosing and fixing test failures, configuration issues, and performance problems.

Core Debugging Knowledge

Understanding Test Results

OTEL traces are the single source of truth
Metrics derived from span attributes
Test reports in JSON, HTML, and Markdown formats
Assertion results with detailed failure reasons

Common Failure Patterns

1. Tool Not Found Errors

Error: Tool 'fetch' was not called

Debug Steps:

# Check server configuration
mcp-eval server list --verbose

# Verify tool names match
grep -r "tool_name" tests/

# Check mcpeval.yaml
cat mcpeval.yaml | grep -A5 "servers:"

Common Fixes:

Ensure server is configured in mcp.servers section
Verify tool name matches exactly (case-sensitive)
Check agent has correct server_names list

2. Assertion Failures

Content Assertion Failures

# Failed: Expect.content.contains("expected")

Debug:

# Add debug output
response = await agent.generate_str("prompt")
print(f"Actual response: {response}")

# Check case sensitivity
await session.assert_that(
    Expect.content.contains("expected", case_sensitive=False)
)

Tool Sequence Failures

# Failed: Expected sequence ["auth", "fetch"] but got ["fetch", "auth"]

Debug:

# Check actual sequence
metrics = session.get_metrics()
actual_sequence = [call.name for call in metrics.tool_calls]
print(f"Actual tool sequence: {actual_sequence}")

# Allow flexibility
await session.assert_that(
    Expect.tools.sequence(["auth", "fetch"], allow_other_calls=True)
)

3. Performance Issues

Timeout Errors

TimeoutError: Test exceeded 300 seconds

Debug:

# Increase timeout in mcpeval.yaml
execution:
  timeout_seconds: 600

# Or per-test
@task("Long test", timeout_seconds=600)

High Token Usage

# Debug token usage
metrics = session.get_metrics()
print(f"Total tokens: {metrics.llm_metrics.total_tokens}")
print(f"Cost: ${metrics.cost_estimate:.4f}")

# Add assertion
await session.assert_that(
    Expect.performance.token_usage_under(10000)
)

4. Configuration Issues

API Key Errors

Error: Invalid API key

Debug:

# Check environment
echo $ANTHROPIC_API_KEY

# Check secrets file
cat mcpeval.secrets.yaml

# Validate configuration
mcp-eval validate

Model Not Found

Error: Model 'claude-3-opus' not found

Fix:

# Use correct model name
provider: anthropic
model: claude-3-opus-20240229  # Full model name

5. LLM Judge Failures

Low Judge Scores

Failed: LLM judge score 0.6 < min_score 0.8

Debug:

# Enable reasoning to understand score
judge = Expect.judge.llm(
    rubric="Clear evaluation criteria",
    min_score=0.8,
    require_reasoning=True  # See why score is low
)

# Check actual vs expected
result = await session.evaluate_now_async(judge, response=response)
print(f"Judge reasoning: {result.details}")

Debugging Tools and Commands

CLI Debugging Commands

# Full system diagnostic
mcp-eval doctor --full

# Validate configuration
mcp-eval validate

# List servers and tools
mcp-eval server list --verbose

# Test specific server
mcp-eval server test my_server

# Run single test with debug output
mcp-eval run test_file.py::test_name -v

Analyzing Test Reports

JSON Report Analysis

import json

with open("test-reports/results.json") as f:
    results = json.loads(f.read())
    
# Find failed tests
failed = [t for t in results["tests"] if not t["passed"]]
for test in failed:
    print(f"Failed: {test['name']}")
    for assertion in test["assertions"]:
        if not assertion["passed"]:
            print(f"  - {assertion['name']}: {assertion['details']}")

OTEL Trace Analysis

# Read trace file
import json

with open("test-reports/test_abc123/trace.jsonl") as f:
    for line in f:
        span = json.loads(line)
        if span["name"].startswith("tool:"):
            print(f"Tool: {span['name']}")
            print(f"Duration: {span['duration_ms']}ms")
            if span.get("error"):
                print(f"Error: {span['error']}")

Span Tree Analysis

# In test
span_tree = session.get_span_tree()

# Find performance issues
rephrasing_loops = span_tree.get_llm_rephrasing_loops()
if rephrasing_loops:
    print(f"Found {len(rephrasing_loops)} rephrasing loops")

# Check tool paths
inefficient_paths = span_tree.get_inefficient_tool_paths(
    golden_paths={"fetch_flow": ["auth", "fetch", "process"]}
)
for path in inefficient_paths:
    print(f"Inefficient path: {path.actual_path}")
    print(f"Efficiency score: {path.efficiency_score}")

Debugging Patterns

Pattern 1: Binary Search for Failures

# Isolate failing assertion
@task("Debug test")
async def test_debug(agent, session):
    response = await agent.generate_str("prompt")
    
    # Test each assertion separately
    try:
        await session.assert_that(
            Expect.tools.was_called("tool1"),
            name="check1"
        )
        print("✓ Tool1 called")
    except AssertionError as e:
        print(f"✗ Tool1 not called: {e}")
    
    try:
        await session.assert_that(
            Expect.content.contains("text"),
            response=response,
            name="check2"
        )
        print("✓ Content check passed")
    except AssertionError as e:
        print(f"✗ Content check failed: {e}")

Pattern 2: Progressive Relaxation

# Start strict, then relax constraints
@task("Progressive test")
async def test_progressive(agent, session):
    response = await agent.generate_str("prompt")
    
    # Try exact match first
    try:
        await session.assert_that(
            Expect.content.equals("exact text")
        )
    except:
        # Fall back to contains
        try:
            await session.assert_that(
                Expect.content.contains("exact")
            )
        except:
            # Fall back to regex
            await session.assert_that(
                Expect.content.regex(r"ex\w+")
            )

Pattern 3: Metric-Based Debugging

@task("Metric debug")
async def test_metrics(agent, session):
    response = await agent.generate_str("prompt")
    
    # Capture all metrics
    metrics = session.get_metrics()
    
    # Debug output
    print(f"""
    Metrics Debug:
    - Tool calls: {len(metrics.tool_calls)}
    - Tools used: {metrics.unique_tools_used}
    - Iterations: {metrics.iteration_count}
    - Duration: {metrics.total_duration_ms}ms
    - Tokens: {metrics.llm_metrics.total_tokens}
    - Cost: ${metrics.cost_estimate:.4f}
    """)
    
    # Conditional assertions based on metrics
    if len(metrics.tool_calls) > 5:
        print("Warning: High tool call count")

Configuration Debugging

Debug mcpeval.yaml Issues

# Validate YAML syntax
python -c "import yaml; yaml.safe_load(open('mcpeval.yaml'))"

# Check server connectivity
mcp-eval server test fetch

# Test with minimal config
cat > test_config.yaml << EOF
provider: anthropic
model: claude-3-5-sonnet-20241022
mcp:
  servers:
    fetch:
      command: "uvx"
      args: ["mcp-server-fetch"]
EOF

mcp-eval run test.py

Debug Agent Configuration

# Test different agent configs
from mcp_eval.config import use_agent
from mcp_agent.agents.agent import Agent

# Try minimal agent
minimal_agent = Agent(
    name="debug_agent",
    instruction="Simple test agent",
    server_names=["fetch"]
)
use_agent(minimal_agent)

# Test with verbose agent
verbose_agent = Agent(
    name="verbose_agent",
    instruction="Debug agent. Print all tool calls and responses.",
    server_names=["fetch"]
)
use_agent(verbose_agent)

Error Recovery Strategies

Strategy 1: Retry with Backoff

@task("Retry test")
async def test_with_retry(agent, session):
    max_retries = 3
    for attempt in range(max_retries):
        try:
            response = await agent.generate_str("prompt")
            await session.assert_that(
                Expect.tools.was_called("fetch")
            )
            break
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            print(f"Attempt {attempt + 1} failed: {e}")
            await asyncio.sleep(2 ** attempt)  # Exponential backoff

Strategy 2: Fallback Assertions

@task("Fallback test")
async def test_with_fallback(agent, session):
    response = await agent.generate_str("prompt")
    
    # Try primary assertion
    try:
        await session.assert_that(
            Expect.content.equals("exact match")
        )
    except AssertionError:
        # Fall back to weaker assertion
        await session.assert_that(
            Expect.judge.llm(
                "Response addresses the prompt appropriately",
                min_score=0.6
            ),
            response=response
        )

Debug Checklist

When debugging test failures:

Check configuration:
- API keys set correctly
- Servers configured in mcpeval.yaml
- Agent has correct server_names
Verify tool usage:
- Tool names match exactly
- Tools are being called
- Tool outputs are as expected
Review assertions:
- Assertions match actual behavior
- Case sensitivity is appropriate
- Judge rubrics are clear
Analyze metrics:
- Performance within limits
- Token usage reasonable
- No timeout issues
Check traces:
- No rephrasing loops
- Efficient tool paths
- Error recovery working
Environment:
- Correct Python version
- Dependencies installed
- Network connectivity

Remember: When stuck, use mcp-eval doctor --full for comprehensive diagnostics!

Getting Started

Core Concepts

Writing Tests

Building with LLMs

Evaluation Guides

Configuration

CI/CD & Deployment

Test Reporting

API Reference

CLI Reference

Resources

​Core Debugging Knowledge

​Understanding Test Results

​Common Failure Patterns

​1. Tool Not Found Errors

​2. Assertion Failures

​Content Assertion Failures

​Tool Sequence Failures

​3. Performance Issues

​Timeout Errors

​High Token Usage

​4. Configuration Issues

​API Key Errors

​Model Not Found

​5. LLM Judge Failures

​Low Judge Scores

​Debugging Tools and Commands

​CLI Debugging Commands

​Analyzing Test Reports

​JSON Report Analysis

​OTEL Trace Analysis

​Span Tree Analysis

​Debugging Patterns

​Pattern 1: Binary Search for Failures

​Pattern 2: Progressive Relaxation

​Pattern 3: Metric-Based Debugging

​Configuration Debugging

​Debug mcpeval.yaml Issues

​Debug Agent Configuration

​Error Recovery Strategies

​Strategy 1: Retry with Backoff

​Strategy 2: Fallback Assertions

​Debug Checklist

Core Debugging Knowledge

Understanding Test Results

Common Failure Patterns

1. Tool Not Found Errors

2. Assertion Failures

Content Assertion Failures

Tool Sequence Failures

3. Performance Issues

Timeout Errors

High Token Usage

4. Configuration Issues

API Key Errors

Model Not Found

5. LLM Judge Failures

Low Judge Scores

Debugging Tools and Commands

CLI Debugging Commands

Analyzing Test Reports

JSON Report Analysis

OTEL Trace Analysis

Span Tree Analysis

Debugging Patterns

Pattern 1: Binary Search for Failures

Pattern 2: Progressive Relaxation

Pattern 3: Metric-Based Debugging

Configuration Debugging

Debug mcpeval.yaml Issues

Debug Agent Configuration

Error Recovery Strategies

Strategy 1: Retry with Backoff

Strategy 2: Fallback Assertions

Debug Checklist