Why evaluate tools?
Evaluations ensure AI models use your tools correctly in production. Unlike traditional testing, evaluations measure two key aspects:
- Tool selection: Does the model choose the right tools for the task?
- Parameter accuracy: Does the model provide correct arguments?
Arcade’s evaluation framework helps you validate tool-calling capabilities before deployment, ensuring reliability in real-world applications. You can evaluate tools from servers, Arcade Gateways, or custom implementations.
What can go wrong?
Without proper evaluation, AI models might:
- Misinterpret intent, selecting the wrong tools
- Provide incorrect arguments, causing failures or unexpected behavior
- Skip necessary tool calls, missing steps in multi-step tasks
- Make incorrect assumptions about parameter defaults or formats (see the sketch below)
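To make the last two failure modes concrete, here is a minimal sketch of an expected call next to a call a model might actually produce. The dictionary shapes are illustrative, not Arcade data structures.

```python
# Expected call defined in a test case (illustrative shapes, not Arcade's types).
expected = {"tool": "send_message", "args": {"channel": "#general", "text": "Hello"}}

# What the model actually produced: right tool, but it assumed the wrong
# channel format and skipped a required argument.
actual = {"tool": "send_message", "args": {"channel": "general"}}

# Arguments that do not match the expected call.
wrong_args = {k for k, v in expected["args"].items() if actual["args"].get(k) != v}
print(sorted(wrong_args))  # ['channel', 'text']
```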
How evaluation works
Evaluations compare the model's actual tool calls with the expected tool calls for each test case.
Scoring components
- Tool selection: Did the model choose the correct tool?
- Parameter evaluation: Are the arguments correct? (evaluated by critics)
- Weighted scoring: Each aspect has a weight that affects the final score (see the sketch below)
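The arithmetic behind weighted scoring can be sketched in plain Python. Everything below (the CriticResult container and the score_tool_call helper) is illustrative rather than Arcade's API; it simply shows per-component scores combined by weight and normalized so that weights can be any positive value.

```python
from dataclasses import dataclass

@dataclass
class CriticResult:
    """Illustrative container: one critic's verdict on one scored component."""
    name: str      # e.g. "city" or "tool_selection"
    score: float   # 0.0-1.0 as judged by the critic
    weight: float  # any positive value; normalized below

def score_tool_call(expected_tool: str, actual_tool: str,
                    critic_results: list[CriticResult],
                    tool_selection_weight: float = 1.0) -> float:
    """Combine tool selection and per-argument critic scores into a single
    weighted score, normalized so that weights can be any positive value."""
    selection = CriticResult(
        name="tool_selection",
        score=1.0 if actual_tool == expected_tool else 0.0,
        weight=tool_selection_weight,
    )
    components = [selection, *critic_results]
    total_weight = sum(c.weight for c in components)
    return sum(c.score * c.weight for c in components) / total_weight

# Correct tool, one exact-match argument, and one partially matched argument.
score = score_tool_call(
    expected_tool="get_weather",
    actual_tool="get_weather",
    critic_results=[
        CriticResult(name="city", score=1.0, weight=2.0),
        CriticResult(name="units", score=0.5, weight=1.0),
    ],
)
print(f"{score:.2f}")  # 0.88
```

Because the total is divided by the summed weights, doubling every weight leaves the score unchanged; only the relative weights matter.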
Evaluation results
Each test case receives:
- Score: Calculated from weighted critic scores, normalized proportionally (weights can be any positive value)
- Status (see the sketch below):
  - Passed: Score meets or exceeds the warn threshold (default: 0.9)
  - Warned: Score falls between the fail threshold (default: 0.8) and the warn threshold
  - Failed: Score falls below the fail threshold
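Status assignment is then a straight comparison against the two thresholds. A minimal sketch using the documented defaults; classify_result is an illustrative helper, not part of Arcade's API:

```python
def classify_result(score: float,
                    fail_threshold: float = 0.8,
                    warn_threshold: float = 0.9) -> str:
    """Map a normalized score to a status using the default thresholds."""
    if score < fail_threshold:
        return "FAILED"
    if score < warn_threshold:
        return "WARNED"
    return "PASSED"

for s in (1.00, 0.85, 0.50):
    print(classify_result(s), s)
# PASSED 1.0
# WARNED 0.85
# FAILED 0.5
```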
Example output:
PASSED Get weather for city -- Score: 1.00
WARNED Send message with typo -- Score: 0.85
FAILED Wrong tool selected -- Score: 0.50
Next steps
- Create an evaluation suite to start testing your tools
- Run evaluations with multiple providers
- Explore capture mode to bootstrap test expectations
- Compare sources with comparative evaluations
Advanced features
Once you’re comfortable with basic evaluations, explore these advanced capabilities:
Capture mode
Record tool calls without scoring to discover which tools models actually call. Useful for bootstrapping test expectations and debugging. Learn more →
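Conceptually, captured calls become a starting point for expected tool calls. The snippet below is a purely illustrative sketch of that idea, not Arcade's capture implementation or file format.

```python
import json

# Illustrative only: promote calls captured from a model run into a draft of
# the expected tool calls for a test case (not Arcade's capture format).
captured = [{"tool": "get_weather", "args": {"city": "Paris", "units": "celsius"}}]

with open("expected_calls_draft.json", "w") as f:
    json.dump(captured, f, indent=2)  # review and trim before using as expectations
```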
Comparative evaluations
Test the same cases against different sources (tracks) with isolated registries. Compare how models perform with different tool implementations. Learn more →
Output formats
Save results in multiple formats (txt, md, html, json) for reporting and analysis. Mix formats with --format md,html,json or use --format all. Learn more →