How to use Agent Skill Tester for Markets
Paste a SKILL.md definition, a sample input, and your Anthropic API key. The page runs the skill in your browser and returns the structured extraction, token cost, and latency — useful for evaluating Claude skills before wiring them into production.
What It Does
Use the tester with intent
The tester is built for engineers iterating on Claude skill definitions who want fast feedback without standing up a backend or burning CI minutes per change.
Interpreting Results
Validate that the extracted output matches your schema: every required field is present and every type is correct. Watch latency as well, since skills with deep context or many tool calls can easily exceed the p95 latency budget for interactive use cases.
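The schema check described above can be sketched offline in a few lines. This is a minimal illustration, not the tester's own validator and not a full JSON Schema implementation: it only checks required fields and primitive types. The schema and its field names (`ticker`, `rating`, `price_target`, `rationale`) are hypothetical.

```python
import json

# Hypothetical four-field schema, for illustration only.
SCHEMA = {
    "required": ["ticker", "rating", "price_target", "rationale"],
    "properties": {
        "ticker": {"type": "string"},
        "rating": {"type": "string"},
        "price_target": {"type": "number"},
        "rationale": {"type": "string"},
    },
}

# JSON Schema primitive type names mapped to Python types.
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def check_output(raw, schema):
    """Return a list of problems; an empty list means the basic checks passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for name in schema.get("required", []):
        if name not in data:
            problems.append(f"missing required field: {name}")

    for name, spec in schema.get("properties", {}).items():
        if name not in data:
            continue
        type_name = spec.get("type")
        if not isinstance(type_name, str) or type_name not in JSON_TYPES:
            continue  # type form not covered by this sketch
        value = data[name]
        # bool is a subclass of int in Python, so reject it for numeric fields.
        bool_as_number = isinstance(value, bool) and type_name in ("number", "integer")
        if bool_as_number or not isinstance(value, JSON_TYPES[type_name]):
            problems.append(f"wrong type for {name}: expected {type_name}")
    return problems


good = '{"ticker": "ABC", "rating": "hold", "price_target": 12.5, "rationale": "flat guidance"}'
bad = '{"ticker": "ABC", "price_target": "12.5"}'
print(check_output(good, SCHEMA))
print(check_output(bad, SCHEMA))
```

For production use, a complete JSON Schema validator is the safer choice. The point here is only to show what "required field present, types correct" means in practice.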
Input Steps
Field by field
1. Paste your SKILL.md definition into the editor. The schema spec must be valid JSON Schema for strict-mode validation to pass.
2. Paste a sample input that matches your skill's input schema. Use a realistic example, not a minimal one.
3. Enter your Anthropic API key. The key stays in browser memory only; it is not persisted or logged.
4. Click Run. Watch the structured output, the token cost (input and output tokens, each at current pricing), and the end-to-end latency.
5. Re-run several times. Variance in outputs is informative: high variance suggests the prompt is under-constrained.
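One way to make "variance in outputs" concrete is to collect the structured output from each re-run and measure, per field, how often the runs agree with the most common value. The sketch below assumes each run's output has already been parsed into a dict. The function name `field_agreement` and the sample data are hypothetical, not part of the tool.

```python
import json
from collections import Counter


def field_agreement(outputs):
    """Map each field to the share of runs that produced its most common value."""
    fields = sorted({key for output in outputs for key in output})
    agreement = {}
    for field in fields:
        # Serialise values so lists and nested objects can be counted too.
        counts = Counter(json.dumps(output.get(field), sort_keys=True) for output in outputs)
        top_count = counts.most_common(1)[0][1]
        agreement[field] = top_count / len(outputs)
    return agreement


# Hypothetical outputs from four runs of the same skill on the same input.
runs = [
    {"rating": "buy", "price_target": 10},
    {"rating": "buy", "price_target": 12},
    {"rating": "hold", "price_target": 10},
    {"rating": "buy", "price_target": 10},
]
for field, share in field_agreement(runs).items():
    print(f"{field}: {share:.0%} of runs agree")
```

Fields with low agreement are a reasonable first place to tighten the skill's instructions or schema.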
Common Scenarios
Use realistic starting points
Simple structured extraction skill
Skill input length: ~500 tokens
Expected output: JSON with 4 fields
A healthy run validates against the schema, returns in under ~3s, and costs under one cent per call. If any of these is missed, the skill probably needs trimming.
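The one-cent threshold can be checked with simple arithmetic once the token counts are known. The sketch below assumes input and output tokens are billed at separate per-million-token rates. The prices and the 100-token output figure are placeholders, not actual pricing or measured values, and the real input count would also include the SKILL.md definition itself.

```python
def call_cost_usd(input_tokens, output_tokens, input_price_per_mtok, output_price_per_mtok):
    """Cost of one call in USD, with prices quoted per million tokens."""
    input_cost = input_tokens * input_price_per_mtok / 1_000_000
    output_cost = output_tokens * output_price_per_mtok / 1_000_000
    return input_cost + output_cost


# Placeholder prices in USD per million tokens. Substitute current pricing.
INPUT_PRICE = 3.00
OUTPUT_PRICE = 15.00

# ~500 input tokens as in the scenario; 100 output tokens is an assumed figure.
cost = call_cost_usd(500, 100, INPUT_PRICE, OUTPUT_PRICE)
print(f"{cost:.4f} USD per call, under one cent: {cost < 0.01}")
```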
Multi-step research skill
Skill input length: ~3000 tokens
Expected output: Bulleted analysis
Latency 5–15s expected; check token cost against your per-decision budget before scaling.
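A single run says little about where a skill sits in that 5–15s range, so one option is to record latency over several runs and compare a high percentile against your budget. The sketch below uses the nearest-rank method. The latency samples and the 15-second budget are hypothetical.

```python
def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer form of ceil(pct * n / 100), avoiding float rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical end-to-end latencies in seconds from ten runs.
latencies = [6.1, 7.4, 5.8, 12.9, 8.2, 6.6, 9.3, 14.2, 7.0, 6.4]
BUDGET_SECONDS = 15.0  # assumed budget, for illustration

p50 = percentile_nearest_rank(latencies, 50)
p95 = percentile_nearest_rank(latencies, 95)
print(f"p50={p50}s p95={p95}s within budget: {p95 <= BUDGET_SECONDS}")
```

With only ten samples, the p95 is simply the slowest run. More runs give a more meaningful tail estimate.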
Try These Tools
Run the numbers next
Prompt Regression Tester
Run the same prompt against multiple models (Claude 4.5/4.6/4.7, GPT-5, Gemini 2.5) with your own keys. Diff outputs, score drift, catch regressions.
Hallucination Detector
Paste a source document + an LLM's extraction. Every numeric claim in the output is checked against the source. Client-side. Catches silent fabrication.
Structured Schema Validator for Finance
Paste LLM JSON output and validate against four pre-built finance schemas — research output, trade decision, risk snapshot, peer comparison — with sanity.
Related Content
Keep the topic connected
Agent Skill Testing
Agent skill testing: the regression-test discipline for LLM-driven agents. What to test, how to score, and the difference between pass-rate and capability.
Hallucination Detection
Detecting LLM hallucinations in financial outputs: the verifiable-claim approach, citation grounding, and cross-model agreement signals that work.
Model Drift
Model drift: when an LLM's behavior changes between calls, versions, or weeks. The monitoring stack that catches it before production breaks.