
Capture mode

Capture mode records tool calls without evaluating them. Use it to bootstrap test expectations or debug model behavior.

Backward compatibility: Capture mode works with existing evaluation suites. Simply add the --capture flag to any arcade evals command. No code changes needed.

When to use capture mode

Bootstrapping test expectations: When you don’t know what calls to expect, run capture mode to see what the model actually calls.

Debugging model behavior: When evaluations fail unexpectedly, capture mode shows exactly what the model is doing.

Exploring new tools: When adding new tools, capture mode helps you understand how models interpret them.

Documenting usage: Create examples of how models use your tools in different scenarios.

Typical workflow

PLAINTEXT
1. Create suite with empty expected_tool_calls
2. Run: arcade evals . --capture --format json
3. Review captured tool calls in output file
4. Copy tool calls into expected_tool_calls
5. Add critics for validation
6. Run: arcade evals . --details

Basic usage

Create an evaluation suite without expectations

Create a suite with test cases but empty expected_tool_calls:

Python
from arcade_evals import EvalSuite, tool_eval

@tool_eval()
async def capture_weather_suite():
    suite = EvalSuite(
        name="Weather Capture",
        system_message="You are a weather assistant.",
    )
    await suite.add_mcp_stdio_server(["python", "weather_server.py"])

    # Add cases without expected tool calls
    suite.add_case(
        name="Simple weather query",
        user_message="What's the weather in Seattle?",
        expected_tool_calls=[],  # Empty for capture
    )
    suite.add_case(
        name="Multi-city comparison",
        user_message="Compare the weather in Seattle and Portland",
        expected_tool_calls=[],
    )
    return suite

Run in capture mode

Run evaluations with the --capture flag:

Terminal
arcade evals . --capture --file captures/weather --format json

This creates captures/weather.json with all captured tool calls.
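
To skim the captured calls without opening the file by hand, a short script works; this is a minimal sketch that assumes the captured_cases / tool_calls structure shown in the next section:

Python
import json

# Print each captured tool call for quick review. Assumes the
# captured_cases / tool_calls structure shown below.
with open("captures/weather.json") as f:
    capture = json.load(f)

for case in capture["captured_cases"]:
    print(f"Case: {case['case_name']}")
    for call in case["tool_calls"]:
        print(f"  {call['name']}({call['args']})")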

Review captured output

Open the JSON file to see which tools the model called:

JSON
{ "suite_name": "Weather Capture", "model": "gpt-4o", "provider": "openai", "captured_cases": [ { "case_name": "Simple weather query", "user_message": "What's the weather in Seattle?", "tool_calls": [ { "name": "Weather_GetCurrent", "args": { "location": "Seattle", "units": "fahrenheit" } } ] } ] }

Convert to test expectations

Copy the captured tool calls into your evaluation suite:

Python
from arcade_evals import ExpectedMCPToolCall, BinaryCritic

suite.add_case(
    name="Simple weather query",
    user_message="What's the weather in Seattle?",
    expected_tool_calls=[
        ExpectedMCPToolCall(
            "Weather_GetCurrent",
            {"location": "Seattle", "units": "fahrenheit"}
        )
    ],
    critics=[
        BinaryCritic(critic_field="location", weight=0.7),
        BinaryCritic(critic_field="units", weight=0.3),
    ],
)

CLI options

Basic capture

Record tool calls to JSON:

Terminal
arcade evals . --capture --file captures/baseline --format json

Include conversation context

Capture system messages and conversation history:

Terminal
arcade evals . --capture --add-context --file captures/detailed --format json

Output includes:

JSON
{ "case_name": "Weather with context", "user_message": "What about the weather there?", "system_message": "You are a weather assistant.", "additional_messages": [ {"role": "user", "content": "I'm traveling to Tokyo"}, {"role": "assistant", "content": "Tokyo is a great city!"} ], "tool_calls": [...] }

Multiple formats

Save captures in multiple formats:

Terminal
arcade evals . --capture --file captures/out --format json,md

Markdown format is more readable for quick review:

Markdown
## Weather Capture

### Model: gpt-4o

#### Case: Simple weather query

**Input:** What's the weather in Seattle?

**Tool Calls:**

- `Weather_GetCurrent`
  - location: Seattle
  - units: fahrenheit

Multiple providers

Capture from multiple providers to compare behavior:

Terminal
arcade evals . --capture \
  --use-provider openai:gpt-4o \
  --use-provider anthropic:claude-sonnet-4-5-20250929 \
  --file captures/comparison --format json

Capture with comparative tracks

Capture from multiple sources to see how different implementations behave:

Python
@tool_eval()
async def capture_comparative():
    suite = EvalSuite(
        name="Weather Comparison",
        system_message="You are a weather assistant.",
    )

    # Register different tool sources
    await suite.add_mcp_server(
        "http://weather-api-1.example/mcp", track="Weather API v1"
    )
    await suite.add_mcp_server(
        "http://weather-api-2.example/mcp", track="Weather API v2"
    )

    # Capture will run against each track
    suite.add_case(
        name="get_weather",
        user_message="What's the weather in Seattle?",
        expected_tool_calls=[],
    )
    return suite

Run capture:

Terminal
arcade evals . --capture --file captures/apis --format json

Output shows captures per track:

JSON
{ "captured_cases": [ { "case_name": "get_weather", "track_name": "Weather API v1", "tool_calls": [ {"name": "GetCurrentWeather", "args": {...}} ] }, { "case_name": "get_weather", "track_name": "Weather API v2", "tool_calls": [ {"name": "Weather_Current", "args": {...}} ] } ] }

Best practices

Start with broad queries

Begin with open-ended prompts to see natural model behavior:

Python
suite.add_case(
    name="explore_weather_tools",
    user_message="Show me everything you can do with weather",
    expected_tool_calls=[],
)

Capture edge cases

Record model behavior on unusual inputs:

Python
suite.add_case(
    name="ambiguous_location",
    user_message="What's the weather in Portland?",  # OR or ME?
    expected_tool_calls=[],
)

Include context variations

Capture with different conversation histories:

Python
suite.add_case(
    name="weather_from_context",
    user_message="How about the weather there?",
    additional_messages=[
        {"role": "user", "content": "I'm going to Seattle"},
    ],
    expected_tool_calls=[],
)

Capture multiple providers

Compare how different models interpret your tools:

Terminal
arcade evals . --capture \
  --use-provider openai:gpt-4o,gpt-4o-mini \
  --use-provider anthropic:claude-sonnet-4-5-20250929 \
  --file captures/models --format json,md

Converting captures to tests

Step 1: Identify patterns

Review the captured tool calls to find patterns:

JSON
// Most queries use "fahrenheit"
{"location": "Seattle", "units": "fahrenheit"}
{"location": "Portland", "units": "fahrenheit"}

// Some use "celsius"
{"location": "Tokyo", "units": "celsius"}
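
For larger capture files, a short script can tally parameter values instead of eyeballing them. A minimal sketch, assuming the capture file structure shown earlier:

Python
import json
from collections import Counter

# Count how often each (tool, parameter, value) combination appears
# across captured cases, to surface patterns worth turning into
# expectations. Assumes the capture structure shown earlier.
with open("captures/weather.json") as f:
    capture = json.load(f)

counts = Counter()
for case in capture["captured_cases"]:
    for call in case["tool_calls"]:
        for param, value in call["args"].items():
            counts[(call["name"], param, value)] += 1

for (tool, param, value), n in counts.most_common():
    print(f"{tool}.{param} = {value!r}: {n}x")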

Step 2: Create base expectations

Create expected calls based on patterns:

Python
# Default to fahrenheit for US cities
ExpectedMCPToolCall("GetWeather", {"location": "Seattle", "units": "fahrenheit"})

# Use celsius for international cities
ExpectedMCPToolCall("GetWeather", {"location": "Tokyo", "units": "celsius"})

Step 3: Add critics

Add critics to validate parameters. See Critics for options.
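
For example, building on the captured patterns above, you can weight exact-match critics by how much each parameter matters; this sketch reuses the BinaryCritic shown earlier, treating location as more important than units:

Python
from arcade_evals import BinaryCritic, ExpectedMCPToolCall

suite.add_case(
    name="International city uses celsius",
    user_message="What's the weather in Tokyo?",
    expected_tool_calls=[
        ExpectedMCPToolCall("GetWeather", {"location": "Tokyo", "units": "celsius"})
    ],
    critics=[
        # Location is essential; a units mismatch costs less
        BinaryCritic(critic_field="location", weight=0.8),
        BinaryCritic(critic_field="units", weight=0.2),
    ],
)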

Step 4: Run evaluations

Test with real evaluations:

Terminal
arcade evals . --details

Step 5: Iterate

Use evaluation failures to refine your suite:

  • Adjust expected values
  • Change critic weights
  • Modify tool descriptions
  • Add more test cases

Troubleshooting

No tool calls captured

Symptom: Empty tool_calls arrays

Possible causes:

  1. Model didn’t call any tools
  2. Tools not properly registered
  3. System message doesn’t encourage tool use

Solution:

Python
suite = EvalSuite(
    name="Weather",
    system_message="You are a weather assistant. Use the available weather tools to answer questions.",
)

Missing parameters

Symptom: Some parameters are missing from captured calls

Explanation: Models may omit optional parameters.

Solution: Check if parameters have defaults in your schema. The evaluation framework applies defaults automatically.
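
For example, if your MCP server derives tool schemas from Python function signatures, a parameter default typically becomes the schema default. A hypothetical weather tool:

Python
# Hypothetical tool definition: the default on "units" becomes a
# schema default, so captured calls that omit it still evaluate
# consistently.
def get_current_weather(location: str, units: str = "fahrenheit") -> dict:
    """Return the current weather for a location."""
    ...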

Different results per provider

Symptom: OpenAI and Anthropic capture different calls

Explanation: Providers interpret tool descriptions differently.

Solution: This is expected. Use captures to understand provider-specific behavior, then create provider-agnostic tests.
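
One way to do that is to assert only on the parameters every provider produced and validate nothing else; a minimal sketch using the APIs shown earlier:

Python
# Both providers agreed on "location"; validate only that field and
# ignore provider-specific extras such as "units".
suite.add_case(
    name="Weather query (provider-agnostic)",
    user_message="What's the weather in Seattle?",
    expected_tool_calls=[
        ExpectedMCPToolCall("Weather_GetCurrent", {"location": "Seattle"})
    ],
    critics=[
        BinaryCritic(critic_field="location", weight=1.0),
    ],
)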
