
Run evaluations

The arcade evals command discovers and executes evaluation suites with support for multiple providers, models, and output formats.

Backward compatibility: All new features (multi-provider support, capture mode, output formats) work with existing evaluation suites. No code changes required.

Basic usage

Run all evaluations in the current directory:

Terminal
arcade evals .

The command searches for files starting with eval_ and ending with .py.
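
For example, in a directory laid out like the one below, only files whose names start with eval_ and end with .py are collected (the file names here are illustrative):

PLAINTEXT
my_toolkit/
├── eval_weather.py     # discovered
├── eval_calendar.py    # discovered
├── conftest.py         # ignored
└── utils.py            # ignored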

Show detailed results with critic feedback:

Terminal
arcade evals . --details

Filter to show only failures:

Terminal
arcade evals . --failed-only

Multi-provider support

Single provider with default model

Use OpenAI with default model (gpt-4o):

Terminal
export OPENAI_API_KEY=sk-...
arcade evals .

Use Anthropic with default model (claude-sonnet-4-5-20250929):

Terminal
export ANTHROPIC_API_KEY=sk-ant-...
arcade evals . --use-provider anthropic

Specific models

Specify one or more models for a provider:

Terminal
arcade evals . --use-provider openai:gpt-4o,gpt-4o-mini

Multiple providers

Compare performance across providers:

Terminal
arcade evals . \
  --use-provider openai:gpt-4o \
  --use-provider anthropic:claude-sonnet-4-5-20250929 \
  --openai-key sk-... \
  --anthropic-key sk-ant-...

When you specify multiple models, results show side-by-side comparisons.

API keys

API keys are resolved in the following order:

Priority           OpenAI                 Anthropic
1. Explicit flag   --openai-key           --anthropic-key
2. Environment     OPENAI_API_KEY         ANTHROPIC_API_KEY
3. .env file       OPENAI_API_KEY=...     ANTHROPIC_API_KEY=...

Create a .env file in your directory to avoid setting keys in every terminal session.
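
For example, a .env file that provides both keys might look like this (replace the placeholder values with your own keys):

PLAINTEXT
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...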

Capture mode

Record calls without scoring to bootstrap test expectations:

Terminal
arcade evals . --capture --file captures/baseline --format json

Include conversation in captured output:

Terminal
arcade evals . --capture --add-context --file captures/detailed

Capture mode is useful for:

  • Creating initial test expectations
  • Debugging model behavior
  • Understanding call patterns

See Capture mode for details.

Output formats

Save results to files

Save results in one or more formats:

Terminal
arcade evals . --file results/out --format md,html

Save in all formats:

Terminal
arcade evals . --file results/out --format all

Available formats

Format   Extension   Description
txt      .txt        Plain text, pytest-style output
md       .md         Markdown with tables and collapsible sections
html     .html       Interactive HTML report
json     .json       Structured JSON for programmatic use

Multiple formats generate separate files:

  • results/out.txt
  • results/out.md
  • results/out.html
  • results/out.json

Command options

Quick reference

Flag               Purpose                  Example
--use-provider     Select provider/model    --use-provider openai:gpt-4o
--capture          Record without scoring   --capture --file out
--details          Show critic feedback     --details
--failed-only      Filter failures          --failed-only
--format           Output format(s)         --format md,html,json
--max-concurrent   Parallel limit           --max-concurrent 10

--use-provider

Specify which provider(s) and model(s) to use:

Terminal
--use-provider <provider>[:<model1>,<model2>,...]

Supported providers:

  • openai (default: gpt-4o)
  • anthropic (default: claude-sonnet-4-5-20250929)

Anthropic model names include date stamps. Check Anthropic's model documentation for the latest model versions.

Examples:

Terminal
# Default model for provider
arcade evals . --use-provider anthropic

# Specific model
arcade evals . --use-provider openai:gpt-4o-mini

# Multiple models from same provider
arcade evals . --use-provider openai:gpt-4o,gpt-4o-mini

# Multiple providers
arcade evals . \
  --use-provider openai:gpt-4o \
  --use-provider anthropic:claude-sonnet-4-5-20250929

--openai-key, --anthropic-key

Provide API keys explicitly on the command line:

Terminal
arcade evals . --use-provider openai --openai-key sk-...

--capture

Enable capture mode to record calls without scoring:

Terminal
arcade evals . --capture

--add-context

Include system messages and conversation history in output:

Terminal
arcade evals . --add-context --file out --format md

--file

Specify output file base name:

Terminal
arcade evals . --file results/evaluation

--format

Choose output format(s):

Terminal
arcade evals . --format md,html,json

Use all for all formats:

Terminal
arcade evals . --format all

--details, -d

Show detailed results including critic feedback:

Terminal
arcade evals . --details

--failed-only

Show only failed test cases:

Terminal
arcade evals . --failed-only

--max-concurrent, -c

Set maximum concurrent evaluations:

Terminal
arcade evals . --max-concurrent 10

Default is 5 concurrent evaluations.

--debug

Show debug information for troubleshooting:

Terminal
arcade evals . --debug

Displays detailed error traces and connection information.

Understanding results

Results are formatted based on evaluation type (regular, multi-model, or comparative) and selected flags.

Summary format

Results show overall performance:

PLAINTEXT
Summary -- Total: 5 -- Passed: 4 -- Failed: 1

How flags affect output:

  • --details: Adds per-critic breakdown for each case
  • --failed-only: Filters to show only failed cases (summary shows original totals)
  • --add-context: Includes system messages and conversation history
  • Multiple models: Switches to comparison table format
  • Comparative tracks: Shows side-by-side track comparison

Case results

Each case displays status and score:

PLAINTEXT
PASSED Get weather for city -- Score: 1.00
FAILED Weather with invalid city -- Score: 0.65

Detailed feedback

Use --details to see critic-level analysis:

PLAINTEXT
Details:
  location:
    Match: False, Score: 0.00/0.70
    Expected: Seattle
    Actual: Seatle
  units:
    Match: True, Score: 0.30/0.30

Multi-model results

When using multiple models, results show comparison tables:

PLAINTEXT
Case: Get weather for city
  Model: gpt-4o -- Score: 1.00 -- PASSED
  Model: gpt-4o-mini -- Score: 0.95 -- WARNED

Advanced usage

High concurrency for fast execution

Increase concurrent evaluations:

Terminal
arcade evals . --max-concurrent 20

High concurrency may hit API rate limits. Start with default (5) and increase gradually.

Save comprehensive results

Generate all formats with full details:

Terminal
arcade evals . \
  --details \
  --add-context \
  --file results/full-report \
  --format all

Troubleshooting

Missing dependencies

If you see ImportError: MCP SDK is required, install the full package:

Terminal
pip install 'arcade-mcp[evals]'

For Anthropic support:

Terminal
pip install anthropic

Tool name mismatches

Tool names are normalized (dots become underscores). If you see unexpected tool names, check your tool definitions and your expected tool calls.
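
For example, a hypothetical tool registered as weather.current would be reported under its normalized name:

PLAINTEXT
weather.current  ->  weather_current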

API rate limits

Reduce --max-concurrent value:

Terminal
arcade evals . --max-concurrent 2

No evaluation files found

Ensure your evaluation files:

  • Start with eval_
  • End with .py
  • Contain functions decorated with @tool_eval() (a minimal sketch follows below)
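
A minimal evaluation file might look like the sketch below. The import path, class names, and constructor arguments shown here are assumptions for illustration only; check the evaluation suite documentation for the exact API of your installed package.

PYTHON
# eval_weather.py -- illustrative sketch only. The import path and the
# EvalSuite / ExpectedToolCall / BinaryCritic names and signatures are
# assumptions, not a confirmed API; adjust them to your installed package.
from arcade_evals import BinaryCritic, EvalSuite, ExpectedToolCall, tool_eval


@tool_eval()
def weather_eval_suite() -> EvalSuite:
    """Define one case that expects a single weather tool call."""
    suite = EvalSuite(
        name="Weather tools",
        system_message="You are a helpful weather assistant.",
    )
    suite.add_case(
        name="Get weather for city",
        user_message="What's the weather in Seattle?",
        expected_tool_calls=[
            ExpectedToolCall(
                name="Weather_Current",
                args={"location": "Seattle", "units": "metric"},
            ),
        ],
        critics=[
            # Weights mirror the 0.70 / 0.30 split shown in the detailed
            # feedback example earlier on this page.
            BinaryCritic(critic_field="location", weight=0.7),
            BinaryCritic(critic_field="units", weight=0.3),
        ],
    )
    return suite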
