🌟 Test like a pro! These best practices come from real-world experience testing MCP servers and agents at scale. Follow these guidelines to build a robust, maintainable test suite.
Quick best practice finder
Jump to what you need:
- Test Design: Writing effective tests
- Organization: Structuring test suites
- Assertions: Choosing the right checks
- Performance: Fast and efficient testing
- Reliability: Reducing flakiness
- Maintenance: Keeping tests healthy
Test design principles
1. Test one thing at a time (by default)
✅ Do: Focus each test on a single behavior or feature
- Keep assertions layered and named (content, tools, performance, judge)
- Bound scope (one coherent workflow per test)
- Use separate tests for alternative branches or failure paths
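A minimal sketch of the guidelines above, written as plain Python test functions. `FakeClient` and `fetch_title` are hypothetical stand-ins invented for illustration, not part of mcp-eval; the point is one behavior per test, named assertion layers, and a separate test for the failure path.

```python
class FakeClient:
    """Hypothetical stand-in for an agent or server client."""

    def __init__(self, pages):
        self.pages = pages
        self.calls = []  # record tool calls so tests can inspect them

    def fetch(self, url):
        self.calls.append(("fetch", url))
        if url not in self.pages:
            raise LookupError(f"no such page: {url}")
        return self.pages[url]


def fetch_title(client, url):
    """Return the first line of the fetched page as its title."""
    return client.fetch(url).splitlines()[0]


def test_fetch_title_returns_first_line():
    client = FakeClient({"https://example.com": "Example Domain\nbody"})
    title = fetch_title(client, "https://example.com")
    # Content layer: the answer itself.
    assert title == "Example Domain", "content: wrong title"
    # Tools layer: the expected call was made exactly once.
    assert client.calls == [("fetch", "https://example.com")], "tools: unexpected calls"


def test_fetch_title_raises_for_unknown_url():
    # The failure path gets its own test instead of a second branch above.
    client = FakeClient({})
    try:
        fetch_title(client, "https://missing.example")
    except LookupError:
        return
    raise AssertionError("expected LookupError for an unknown URL")
```

Labeling each assertion with its layer ("content", "tools") makes a failure message point at the kind of problem, not just the line.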
2. Use descriptive names
✅ Do: Name tests to describe what they verify
3. Make tests independent
✅ Do: Each test should run in isolation
4. Use explicit assertions
✅ Do: Be specific about expectations
Test organization
Directory structure
Organize tests by functionality and type.
Test file naming
Follow consistent naming patterns.
Grouping related tests
Use classes or modules to group related tests.
Assertion strategies
Layered assertions
Build assertions from deterministic to probabilistic.
Assertion selection guide
Choose assertions based on what you’re testing:
Testing | Use These Assertions | Avoid |
---|---|---|
Correctness | contains, regex | LLM judges for exact values |
Tool Usage | was_called, called_with, sequence | Content checks for tool behavior |
Performance | response_time_under, max_iterations | Exact timing matches |
Quality | judge.llm, multi_criteria | Brittle string matching |
Error Handling | judge.llm with error rubric | Expecting exact error text |
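The first three rows of the table can be illustrated with a small standard-library sketch. `AgentResult` and the three `check_*` helpers are hypothetical names made up for this example; they mirror the spirit of `regex`, `called_with`, and `response_time_under` but are not the mcp-eval API.

```python
import re
from dataclasses import dataclass, field


@dataclass
class AgentResult:
    """Hypothetical record of one agent run (not an mcp-eval type)."""
    text: str
    tool_calls: list = field(default_factory=list)  # e.g. [("fetch", {"url": ...})]
    seconds: float = 0.0


def check_correctness(result, pattern):
    # Correctness: a deterministic regex; no judge needed for exact values.
    return re.search(pattern, result.text) is not None


def check_tool_usage(result, name, **expected_args):
    # Tool usage: inspect the recorded calls, not the response text.
    return any(
        call_name == name and all(args.get(k) == v for k, v in expected_args.items())
        for call_name, args in result.tool_calls
    )


def check_performance(result, max_seconds):
    # Performance: an upper bound, never an exact timing match.
    return result.seconds <= max_seconds


result = AgentResult(
    text="The page title is Example Domain.",
    tool_calls=[("fetch", {"url": "https://example.com"})],
    seconds=1.2,
)
assert check_correctness(result, r"Example\s+Domain")
assert check_tool_usage(result, "fetch", url="https://example.com")
assert check_performance(result, max_seconds=5.0)
```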
Custom assertion patterns
Create reusable assertion combinations.
Performance optimization
Minimize LLM calls
✅ Do: Batch operations when possible
Use appropriate models
Match model to test complexity.
Parallel execution
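As a rough illustration of concurrent execution, the sketch below uses `asyncio.gather` from the standard library. It assumes the test cases are independent and I/O-bound; `run_case` is a hypothetical placeholder whose `asyncio.sleep` stands in for waiting on a server or model.

```python
import asyncio
import time


async def run_case(name: str, delay: float) -> str:
    """Hypothetical independent test case; sleep stands in for I/O-bound work."""
    await asyncio.sleep(delay)
    return f"{name}: passed"


async def run_all(cases):
    # gather schedules all cases concurrently and preserves input order.
    return await asyncio.gather(*(run_case(name, delay) for name, delay in cases))


cases = [("fetch", 0.2), ("search", 0.2), ("summarize", 0.2)]
start = time.perf_counter()
results = asyncio.run(run_all(cases))
elapsed = time.perf_counter() - start

assert results == ["fetch: passed", "search: passed", "summarize: passed"]
# Run one after another, the three cases would need about 0.6 s.
assert elapsed < 0.5
```

This only helps when cases share no mutable state; tests that touch the same files or server session should stay sequential.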
Run independent tests concurrently.
Cache when appropriate
Reliability patterns
Handle non-determinism
LLMs are probabilistic, so account for variation.
Reduce flakiness
Common causes and solutions:
Flakiness Source | Solution |
---|---|
Network issues | Add retries, increase timeouts |
Race conditions | Use explicit waits, not sleep |
Random data | Use fixed seeds or deterministic data |
External services | Mock or use test instances |
LLM variation | Lower temperature, use flexible assertions |
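Two rows of the table above (network issues and random data) can be sketched with the standard library alone. The `retry` helper and `FlakyService` are hypothetical names for illustration; the backoff values are arbitrary and would need tuning for a real service.

```python
import random
import time


def retry(func, attempts=3, base_delay=0.01, retry_on=(ConnectionError,)):
    """Call func(), retrying on transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the real failure
            time.sleep(base_delay * 2 ** attempt)


class FlakyService:
    """Hypothetical service that fails twice, then succeeds."""

    def __init__(self):
        self.calls = 0

    def ping(self):
        self.calls += 1
        if self.calls < 3:
            raise ConnectionError("transient failure")
        return "pong"


service = FlakyService()
assert retry(service.ping) == "pong"
assert service.calls == 3

# Random data: a seeded generator yields the same sample on every run.
sample_a = random.Random(42).sample(range(100), 5)
sample_b = random.Random(42).sample(range(100), 5)
assert sample_a == sample_b
```

Retrying only on a narrow exception type keeps genuine bugs (assertion errors, type errors) from being masked by the retry loop.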
Test isolation
Ensure tests don’t affect each other.
Maintainability
Documentation in tests
Document complex test logic.
Parameterized test patterns
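A table-driven sketch of parameterization using only the standard library. The cases and the `response_matches` check are hypothetical; the idea is that one test body runs against many inputs, and failures are collected by case name.

```python
import re

# Each case: (description, response text, pattern that must match).
CASES = [
    ("plain title", "Title: Example Domain", r"Example Domain"),
    ("case-insensitive", "title: example domain", r"(?i)example domain"),
    ("extra whitespace", "Title:   Example   Domain", r"Example\s+Domain"),
]


def response_matches(text: str, pattern: str) -> bool:
    """Hypothetical check shared by every case."""
    return re.search(pattern, text) is not None


def run_cases(cases):
    """Run every case and collect failures instead of stopping at the first."""
    return [name for name, text, pattern in cases if not response_matches(text, pattern)]


assert run_cases(CASES) == []
```

With pytest, the same table maps directly onto `@pytest.mark.parametrize`, which reports each case as its own test.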
Make tests reusable with parameters.
Test data management
Centralize test data.
Version your tests
Track test evolution with your code.
Anti-patterns to avoid
1. Testing implementation details
❌ Don’t: Test internal implementation
2. Overusing LLM judges
❌ Don’t: Use judges for deterministic checks
3. Ignoring test failures
❌ Don’t: Skip or ignore failing tests
4. Magic numbers and strings
❌ Don’t: Use unexplained values
Testing checklist
Use this checklist for every test you write:
- Single purpose - Tests one specific behavior
- Descriptive name - Clearly indicates what’s being tested
- Independent - Doesn’t depend on other tests
- Deterministic - Produces consistent results
- Fast - Runs quickly (< 5 seconds for unit tests)
- Documented - Has docstring explaining purpose
- Maintainable - Easy to understand and modify
- Appropriate assertions - Uses right assertion types
- Error handling - Handles expected failures gracefully
- Cleanup - Cleans up any created resources
Advanced patterns
Property-based testing
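A hand-rolled sketch of the idea using the standard `random` module, so it runs without extra dependencies. `normalize_tool_name` is a hypothetical function under test. Libraries such as Hypothesis automate input generation and shrink failing inputs, which this sketch does not attempt.

```python
import random


def normalize_tool_name(name: str) -> str:
    """Hypothetical function under test: trim and lowercase a tool name."""
    return name.strip().lower()


def random_name(rng: random.Random) -> str:
    """Generate a short string mixing letters, underscores, and whitespace."""
    alphabet = "abcXYZ_ \t"
    return "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 12)))


def check_properties(trials: int = 200, seed: int = 0) -> None:
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(trials):
        name = random_name(rng)
        once = normalize_tool_name(name)
        # Property 1: idempotence, normalizing twice changes nothing.
        assert normalize_tool_name(once) == once, repr(name)
        # Property 2: no surrounding whitespace survives.
        assert once == once.strip(), repr(name)
        # Property 3: the output never contains uppercase letters.
        assert once == once.lower(), repr(name)


check_properties()
```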
Test properties rather than specific examples.
Contract testing
Define contracts between components.
Mutation testing
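A hand-written illustration of the idea: change one operator in the code under test and confirm the suite notices. `within_budget` and its mutant are hypothetical; tools such as mutmut generate mutants automatically, whereas here a single mutant is written by hand.

```python
def within_budget(elapsed: float, limit: float) -> bool:
    """Function under test: True when a run finished within its time limit."""
    return elapsed <= limit


def mutant_within_budget(elapsed: float, limit: float) -> bool:
    """Hand-written mutant: `<=` changed to `<`."""
    return elapsed < limit


def suite_passes(func) -> bool:
    """A tiny test suite, returning True if every check holds for func."""
    checks = [
        func(1.0, 5.0) is True,
        func(6.0, 5.0) is False,
        func(5.0, 5.0) is True,  # boundary case: this is what kills the mutant
    ]
    return all(checks)


assert suite_passes(within_budget)  # the original passes
assert not suite_passes(mutant_within_budget)  # the mutant is caught
```

Without the boundary check, both versions would pass, which is exactly the kind of gap mutation testing is meant to expose.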
Verify your tests catch bugs.
Continuous improvement
Metrics to track
Monitor your test suite health:
- Pass rate - Should be > 95% for stable tests
- Execution time - Track trends, investigate increases
- Flakiness - Identify and fix flaky tests
- Coverage - Ensure critical paths are tested
- Maintenance cost - Time spent fixing tests
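Pass rate and flakiness from the list above can be computed from run history with a few lines of Python. The `HISTORY` data below is invented for illustration, and a test is treated as flaky when it both passed and failed across runs of the same code.

```python
from collections import defaultdict

# Hypothetical run history: (test name, passed?) for several runs of one commit.
HISTORY = [
    ("test_fetch", True), ("test_fetch", True), ("test_fetch", True),
    ("test_summarize", True), ("test_summarize", False), ("test_summarize", True),
    ("test_search", False), ("test_search", False),
]


def pass_rate(history):
    """Fraction of all recorded runs that passed."""
    return sum(passed for _, passed in history) / len(history)


def flaky_tests(history):
    """Tests with both passing and failing runs on the same code."""
    outcomes = defaultdict(set)
    for name, passed in history:
        outcomes[name].add(passed)
    return sorted(name for name, seen in outcomes.items() if seen == {True, False})


print(f"pass rate: {pass_rate(HISTORY):.1%}")
print("flaky:", flaky_tests(HISTORY))
```

A consistently failing test (`test_search` here) lowers the pass rate but is not flaky; the two metrics call for different fixes.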
Regular reviews
Schedule periodic test suite reviews:
- Weekly: Review failed tests, fix or mark as flaky
- Monthly: Remove obsolete tests, update assertions
- Quarterly: Refactor test organization, update patterns
- Yearly: Major test suite health assessment
You’re now equipped with best practices that will make your mcp-eval tests reliable, maintainable, and valuable! Remember: good tests are an investment in your project’s future. 🌟