Agent Skill Testing
A test suite where each case has: (1) an input (prompt, tool environment, retrieved context), (2) an expected behavior (correct answer, valid tool calls, refusal where appropriate), and (3) graders (deterministic checks for structured output, LLM judges for free-form output, human reviewers for ambiguous cases). The suite is run on every agent change and on every model update. Pass rate over time is the capability trend.
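Concretely, each case can be captured in a small record. Below is a minimal sketch in Python; the names (`SkillCase`, `GraderKind`, `case_type`) are illustrative, not from any particular framework:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class GraderKind(Enum):
    DETERMINISTIC = "deterministic"  # schema/regex checks on structured output
    LLM_JUDGE = "llm_judge"          # model-graded rubric for free-form text
    HUMAN = "human"                  # reviewer queue for ambiguous cases

@dataclass
class SkillCase:
    case_id: str
    case_type: str                # "common" | "edge" | "adversarial" | "refusal"
    prompt: str                   # the input prompt
    tool_env: dict                # mocked tools the agent may call
    context: list[str]            # retrieved documents injected into the prompt
    expected: str                 # correct answer, tool-call shape, or "REFUSE"
    grader: GraderKind
    check: Callable[[str], bool]  # returns True if the agent's output passes
```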
Why it matters
An agent's behavior changes with every prompt edit, model swap, or provider patch. Without a test suite, those changes ship as silent regressions. With one, regression is detected at integration time, not in production. Skill testing is to LLM agents what unit tests are to software — non-optional in any system that handles money.
How it works
Build a curated test set covering: (1) common cases (the agent should do these well), (2) edge cases (the agent should handle gracefully), (3) adversarial cases (prompt injection, tool misuse, ambiguous instruction), (4) refusal cases (the agent should decline). Mix automated graders (structured-output validators, fact-checkers) with LLM-judge panels (for free-form outputs) and a small human-review sample to validate the LLM judges. Run on every change.
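As a sketch of how that grader mix comes together, here is a minimal runner under the `SkillCase` assumptions above; `grade_structured` is a real deterministic check, while `grade_free_form` is a stub where a judge-model call would go:

```python
import json
from collections import defaultdict

def grade_structured(output: str, required_keys: set[str]) -> bool:
    """Deterministic grader: output must be valid JSON containing the expected keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())

def grade_free_form(output: str, rubric: str) -> bool:
    """LLM-judge grader (stub): score the output against a rubric with a judge model.
    Wire this to the judge of your choice, and validate it against human labels
    before trusting its verdicts (see Key Takeaways)."""
    raise NotImplementedError

def run_suite(cases, agent) -> dict[str, float]:
    """Run every case and return the pass rate per case type."""
    passed, total = defaultdict(int), defaultdict(int)
    for case in cases:
        output = agent(case.prompt, case.tool_env, case.context)
        total[case.case_type] += 1
        if case.check(output):
            passed[case.case_type] += 1
    return {t: passed[t] / total[t] for t in total}
```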
Example
Trading-research agent regression suite
| Metric | Value |
| --- | --- |
| Total cases | 240 |
| Common cases pass rate | 97% |
| Edge cases pass rate | 82% |
| Adversarial cases pass rate | 91% (refusal correct) |
| Time per full run | 12 minutes, $3.40 in API cost |
Trend over weeks shows whether new prompts improve or degrade behavior. A 5-point drop in the adversarial pass rate is a red flag worth blocking a deploy on.
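That blocking rule is a few lines to enforce in CI. A sketch, using the per-type pass rates from `run_suite` above and the table's numbers as the baseline:

```python
def regressions(baseline: dict[str, float],
                current: dict[str, float],
                max_drop: float = 0.05) -> list[str]:
    """Case types whose pass rate fell by max_drop or more versus the
    last accepted baseline; any hit should block the deploy."""
    return [t for t in baseline
            if baseline[t] - current.get(t, 0.0) >= max_drop]

baseline = {"common": 0.97, "edge": 0.82, "adversarial": 0.91}
current  = {"common": 0.98, "edge": 0.83, "adversarial": 0.86}
print(regressions(baseline, current))  # ['adversarial'] -> block the deploy
```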
Key Takeaways
LLM-judge graders need their own validation — sample human-review the judges before trusting them.
Track per-case-type pass rates, not just overall pass rate; the aggregate hides important regressions.
Re-run the full suite on every model-version change, including silent provider patches.
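The first takeaway is cheap to quantify: human-review a sample of cases and measure raw agreement with the judge before trusting it. A minimal sketch; the ~0.90 threshold is an illustrative assumption, not a standard:

```python
def judge_agreement(judge_verdicts: list[bool], human_labels: list[bool]) -> float:
    """Fraction of a human-reviewed sample where the LLM judge agrees."""
    assert len(judge_verdicts) == len(human_labels) and human_labels
    return sum(j == h for j, h in zip(judge_verdicts, human_labels)) / len(human_labels)

# e.g. a 40-case sample; below ~0.90 agreement, fix the judge rubric
# before letting its pass/fail verdicts count toward the capability trend.
```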
Try These Tools
Run the numbers next
Agent Skill Tester for Markets
Paste a SKILL.md definition + sample input + your Anthropic API key. See structured extraction, token cost, and latency — all in your browser. No signup.
Prompt Regression Tester
Run the same prompt against multiple models (Claude 4.5/4.6/4.7, GPT-5, Gemini 2.5) with your own keys. Diff outputs, score drift, catch regressions.
Calibration Dojo
Train your probabilistic intuition. Answer binary forecasting questions at any confidence level; track Brier score and reliability curve over time.
Related Content
Keep the topic connected
Model Drift
Model drift: when an LLM's behavior changes between calls, versions, or weeks. The monitoring stack that catches it before production breaks.
Hallucination Detection
Detecting LLM hallucinations in financial outputs: the verifiable-claim approach, citation grounding, and cross-model agreement signals that work.