Why test cost and latency, not just accuracy?

Because an agent calls a skill repeatedly, so the skill's per-invocation token cost and latency are multiplied across every loop that uses it. An accurate skill that is expensive can be unaffordable at the agent's volume, and a slow one can break the agent's latency budget. Measuring cost and latency per call lets you decide whether to optimize the skill, cache part of its prompt, or accept the cost, rather than discovering the problem only when the agent runs at scale.

Should I test a skill against injection attacks?

Yes, if the skill processes external or user-supplied content, which most finance document skills do. A skill that reads a filing, transcript, or web page can encounter text that tries to override its instructions or trigger unintended behavior. Include known injection patterns in the adversarial test set, treat retrieved content as untrusted, and confirm the skill reports rather than acts on embedded instructions. A skill that is accurate but injectable is a security hole the agent inherits.

How does skill testing relate to a prompt regression suite?

They are the same discipline at different scopes. Skill testing validates a single reusable capability before the agent relies on it; a regression suite re-runs that validation on every change to the skill, its prompt, or the underlying model. Once a skill passes its initial tests, those tests should become a standing regression suite so a model update or a skill edit cannot silently degrade it. Test once to validate, then keep testing to catch drift.

AI in Markets Guide

How to Test an Agent Skill for a Finance Task

An agent skill is a reusable capability you hand to an LLM agent, such as extracting a structured field from a document or running a defined analysis. Before an agent relies on it in a finance workflow, the skill needs the same scrutiny as any component: does it produce correct, structured output on the inputs it will actually see, and what does each invocation cost in tokens and time. How to test a skill so the agent that depends on it is built on something measured, not assumed, is covered below.

8 MIN READPublished May 26, 2026Live Content

By AI Fin Hub Research · AI Fin Hub Team

On This Page

Before you start 5 steps Common mistakes FAQ

Before You Start

Set up the inputs that make the next steps easier

A defined skill: the instruction or specification, its expected input, and its expected structured output.

A set of representative and adversarial sample inputs the skill will encounter.

Known correct outputs for those inputs, so the skill's results can be scored.

Guide Steps

Move through it in order

Each step focuses on one decision so you can keep momentum without losing the thread.

1

Define the skill and its contract

Write the skill specification precisely: what it takes in, what it does, and the exact structure it returns. The clearer the contract, the more testable the skill. A vague skill that summarizes a document is hard to validate; a skill that extracts a named set of fields in a defined schema is checkable against known answers. Treat the skill definition as an interface contract that both the agent and your tests rely on.

Specify the output schema as part of the skill, not as an afterthought. A skill with a defined structured output is testable; one that returns prose is not.

Use The ToolPlaygrounds
Agent Skill Tester for Markets
Paste a SKILL.md definition + sample input + your Anthropic API key. See structured extraction, token cost, and latency — all in your browser. No signup.
ToolOpen ->
2

Run representative inputs

Feed the skill the inputs it will actually encounter in production: real or realistic documents spanning the normal range of formats and sizes. This establishes the baseline behavior. A skill validated only on a clean demo input gives false confidence, because production inputs are messier. The representative set should reflect the real distribution the agent will hand the skill, not a curated easy subset chosen to make it pass.

Use real production-shaped inputs, not idealized examples. The messy formatting of real filings and transcripts is exactly what reveals whether a skill is robust.
3

Add adversarial and edge-case inputs

Deliberately test the hard cases: malformed inputs, ambiguous content, footnote-buried figures, and any known failure pattern for the task. In finance, include inputs that might carry injection attempts if the skill processes external content. These cases are where a skill breaks, and finding the breaks before the agent does is the point of testing. A skill that handles the easy inputs but collapses on the edge cases is not ready to be relied on.

Every way you can imagine the input being malformed is a test case. The agent will eventually hand the skill exactly the input you did not test for.
4

Score the structured output

Compare the skill's output against the known correct answers, field by field for extraction tasks. Score both whether each field is correct and whether the output validates against the skill's schema. A skill can produce a value that is in the right shape but wrong, or a correct value in a malformed structure; both are failures the agent would propagate. Per-field scoring localizes which parts of the skill are reliable and which need work.

Score per field, not just pass or fail. A skill that nails most fields but reliably misses one needs a targeted fix, not a rewrite.

Use The ToolPlaygrounds
Structured Schema Validator for Finance
Paste LLM JSON output and validate against four pre-built finance schemas — research output, trade decision, risk snapshot, peer comparison — with sanity.
ToolOpen ->
5

Measure token cost and latency per invocation

A skill the agent calls repeatedly contributes its token cost and latency to every loop that uses it. Measure both per invocation: the tokens consumed and the time taken. A skill that is accurate but expensive or slow may be unaffordable at the volume the agent runs it, or may blow the agent's latency budget. Knowing the per-call cost lets you decide whether to optimize the skill, cache part of it, or accept the cost as worthwhile.

A skill's cost gets multiplied by every agent loop that calls it. A small per-call inefficiency becomes a large bill at the scale an agent runs it.

Common Mistakes

The misses that undo good inputs

Validating a skill on one demo input

A clean demo hides the failure modes that matter. The skill that aces a demo can break on messy production inputs and edge cases, which the agent will eventually feed it.

Scoring only whether output is well-formed

A skill can return a perfectly structured output with a wrong value. Checking schema validity without checking correctness against known answers lets wrong-but-well-formed results propagate through the agent.

Ignoring per-invocation cost and latency

A skill the agent calls in every loop multiplies its cost and latency across the whole workload. An accurate skill that is expensive or slow can be unaffordable at scale or break the agent's latency budget.

Try These Tools

Run the numbers next

PlaygroundsCalculator

Prompt Regression Tester

Run the same prompt against multiple models (Claude 4.5/4.6/4.7, GPT-5, Gemini 2.5) with your own keys. Diff outputs, score drift, catch regressions.

Launch toolOpen ->

CalculatorsCalculator

Agent Cost Envelope Calculator

Model an LLM research loop end-to-end — steps, tool calls, convergence checks, markets per day — and see per-loop, daily, and monthly cost with cost-cap.

Launch toolOpen ->

FAQ

Questions people ask next

The short answers readers usually want after the first pass.

A precise contract: a defined input, a clear task, and a structured output schema. A skill that returns prose or has a vague specification is hard to validate because there is no objective standard to score against. A skill that extracts named fields in a defined schema can be checked field by field against known correct answers. Specifying the structured output as part of the skill definition, rather than as an afterthought, is what makes rigorous testing possible.

Sources & References

Building Effective Agents — Anthropic (2024)
Agent Skills — Anthropic

Keep the topic connected

AI in Markets1 FAQS

Agent Skill Testing

Agent skill testing: the regression-test discipline for LLM-driven agents. What to test, how to score, and the difference between pass-rate and capability.

Keep readingRead ->

AI in Markets2 FAQS

MCP (Model Context Protocol)

Model Context Protocol: Anthropic's open standard for letting LLMs discover and call tools — the interface, why it matters, and finance MCP server checks.

Keep readingRead ->

AI in Markets1 FAQS

LLM Hallucination Detection in Finance

How to detect LLM hallucinations in financial outputs: citation grounding, verifiable-claim checks, and cross-model agreement that flag fabricated data.

Keep readingRead ->

AI in Markets14 ITEMS

LLM for Finance Deployment Checklist

A pre-flight checklist for putting a large language model into a finance workflow: scoping, grounding, input security, numerical verification, and drift monitoring.

Keep readingRead ->

Set up the inputs that make the next steps easier

Move through it in order

Define the skill and its contract

Run representative inputs

Add adversarial and edge-case inputs

Score the structured output

Measure token cost and latency per invocation

The misses that undo good inputs

Validating a skill on one demo input

Scoring only whether output is well-formed

Ignoring per-invocation cost and latency

Run the numbers next

Prompt Regression Tester

Agent Cost Envelope Calculator

Questions people ask next

Keep the topic connected

Agent Skill Testing

MCP (Model Context Protocol)

LLM Hallucination Detection in Finance

LLM for Finance Deployment Checklist