Agent Skill Testing
A test suite where each case has: (1) an input (prompt, tool environment, retrieved context), (2) an expected behavior (correct answer, valid tool calls, refusal where appropriate), and (3) graders (deterministic checks for structured output, LLM judges for free-form output, human reviewers for ambiguous cases). The suite is run on every agent change and on every model update. Pass rate over time is the capability trend.
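Concretely, each case can be captured in a small record. Below is a minimal sketch in Python; the names (`SkillCase`, `GraderKind`, `case_type`) are illustrative, not from any particular framework:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class GraderKind(Enum):
    DETERMINISTIC = "deterministic"  # schema/regex checks on structured output
    LLM_JUDGE = "llm_judge"          # model-graded rubric for free-form text
    HUMAN = "human"                  # reviewer queue for ambiguous cases

@dataclass
class SkillCase:
    case_id: str
    case_type: str                # "common" | "edge" | "adversarial" | "refusal"
    prompt: str                   # the input prompt
    tool_env: dict                # mocked tools the agent may call
    context: list[str]            # retrieved documents injected into the prompt
    expected: str                 # correct answer, tool-call shape, or "REFUSE"
    grader: GraderKind
    check: Callable[[str], bool]  # returns True if the agent's output passes
```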
Why it matters
An agent's behavior changes with every prompt edit, model swap, or provider patch. Without a test suite, those changes ship as silent regressions. With one, regression is detected at integration time, not in production. Skill testing is to LLM agents what unit tests are to software — non-optional in any system that handles money.
How it works
Build a curated test set covering: (1) common cases (the agent should do these well), (2) edge cases (the agent should handle gracefully), (3) adversarial cases (prompt injection, tool misuse, ambiguous instruction), (4) refusal cases (the agent should decline). Mix automated graders (structured-output validators, fact-checkers) with LLM-judge panels (for free-form outputs) and a small human-review sample to validate the LLM judges. Run on every change.
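As a sketch of how that grader mix comes together, here is a minimal runner under the `SkillCase` assumptions above; `grade_structured` is a real deterministic check, while `grade_free_form` is a stub where a judge-model call would go:

```python
import json
from collections import defaultdict

def grade_structured(output: str, required_keys: set[str]) -> bool:
    """Deterministic grader: output must be valid JSON containing the expected keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())

def grade_free_form(output: str, rubric: str) -> bool:
    """LLM-judge grader (stub): score the output against a rubric with a judge model.
    Wire this to the judge of your choice, and validate it against human labels
    before trusting its verdicts (see Key Takeaways)."""
    raise NotImplementedError

def run_suite(cases, agent) -> dict[str, float]:
    """Run every case and return the pass rate per case type."""
    passed, total = defaultdict(int), defaultdict(int)
    for case in cases:
        output = agent(case.prompt, case.tool_env, case.context)
        total[case.case_type] += 1
        if case.check(output):
            passed[case.case_type] += 1
    return {t: passed[t] / total[t] for t in total}
```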
Example
Trading-research agent regression suite
| Metric | Value |
| --- | --- |
| Total cases | 240 |
| Common cases pass rate | 97% |
| Edge cases pass rate | 82% |
| Adversarial cases pass rate | 91% (refusal correct) |
| Time per full run | 12 minutes, $3.40 in API cost |
Trend over weeks shows whether new prompts improve or degrade behavior. A 5-point drop in the adversarial pass rate is a red flag worth blocking a deploy on.
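That blocking rule is a few lines to enforce in CI. A sketch, using the per-type pass rates from `run_suite` above and the table's numbers as the baseline:

```python
def regressions(baseline: dict[str, float],
                current: dict[str, float],
                max_drop: float = 0.05) -> list[str]:
    """Case types whose pass rate fell by max_drop or more versus the
    last accepted baseline; any hit should block the deploy."""
    return [t for t in baseline
            if baseline[t] - current.get(t, 0.0) >= max_drop]

baseline = {"common": 0.97, "edge": 0.82, "adversarial": 0.91}
current  = {"common": 0.98, "edge": 0.83, "adversarial": 0.86}
print(regressions(baseline, current))  # ['adversarial'] -> block the deploy
```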
Key Takeaways
LLM-judge graders need their own validation — sample human-review the judges before trusting them.
Track per-case-type pass rates, not just overall pass rate; the aggregate hides important regressions.
Re-run the full suite on every model-version change, including silent provider patches.
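The first takeaway is cheap to quantify: human-review a sample of cases and measure raw agreement with the judge before trusting it. A minimal sketch; the ~0.90 threshold is an illustrative assumption, not a standard:

```python
def judge_agreement(judge_verdicts: list[bool], human_labels: list[bool]) -> float:
    """Fraction of a human-reviewed sample where the LLM judge agrees."""
    assert len(judge_verdicts) == len(human_labels) and human_labels
    return sum(j == h for j, h in zip(judge_verdicts, human_labels)) / len(human_labels)

# e.g. a 40-case sample; below ~0.90 agreement, fix the judge rubric
# before letting its pass/fail verdicts count toward the capability trend.
```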
Try These Tools
Run the numbers next
Agent Skill Tester for Markets
Paste a SKILL.md definition + sample input + your Anthropic API key. See structured extraction, token cost, and latency — all in your browser. No signup.
Prompt Regression Tester
Run the same prompt against multiple models (Claude 4.5/4.6/4.7, GPT-5, Gemini 2.5) with your own keys. Diff outputs, score drift, catch regressions.
Calibration Dojo
Train your probabilistic intuition. Answer binary forecasting questions at any confidence level; track Brier score and reliability curve over time.
Related Content
Keep the topic connected
Model Drift
Model drift: when an LLM's behavior changes between calls, versions, or weeks. The monitoring stack that catches it before production breaks.
Hallucination Detection
Detecting LLM hallucinations in financial outputs: the verifiable-claim approach, citation grounding, and cross-model agreement signals that work.