Prompt Regression Tester
Run the same prompt across Claude 4.5/4.6/4.7, GPT-5, Gemini 2.5 with your own keys. Diff outputs; catch drift before production. Browser-only. Free.
- Inputs: Prompt / input + API key
- Runtime: 2–15 s per model call
- Privacy: Client-side · no upload
- API key: BYO key (Anthropic · OpenAI · Google)
- Methodology: Open →
BYO keys · client-side only
You provide your own API key for each target. Keys stay in your browser's React state for the session; nothing is persisted, and each key is sent only to its own provider's API.
How to use
Step-by-step
1. Define a test set: pairs of (prompt, expected output category or schema).
2. Run the test set against the current model. Save the outputs as the golden set.
3. On a schedule (or after a model upgrade announcement), re-run the same set against the new model.
4. Read the diff: per-test pass/fail, plus aggregate similarity scores. Drops in similarity flag potential regressions.
5. Investigate flagged tests. Some drops are legitimate model improvements; others are subtle behavior changes that need prompt adjustments.
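The diff step above can be sketched in a few lines. This is a minimal illustration, not the tool's actual engine: the function names, the golden-set shape (`id`, `output`), the 0.8 pass threshold, and the token-overlap metric are all assumptions made for the example.

```javascript
// Hypothetical golden-set shape: [{ id, output }]. `candidate` maps id -> new output.
// `similarity` is any function returning a score in [0, 1]; the threshold is an assumed default.
function diffAgainstGolden(golden, candidate, similarity, threshold = 0.8) {
  const results = golden.map((g) => {
    const output = candidate[g.id] ?? ""; // a missing output scores as an empty string
    const score = similarity(g.output, output);
    return { id: g.id, score, pass: score >= threshold };
  });
  const total = results.reduce((sum, r) => sum + r.score, 0);
  return {
    results, // per-test pass/fail
    meanScore: results.length ? total / results.length : 1, // aggregate similarity
    flagged: results.filter((r) => !r.pass).map((r) => r.id), // tests to investigate
  };
}

// Token-overlap (Jaccard) similarity, used here only as a simple stand-in metric.
function jaccard(a, b) {
  const ta = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const tb = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (ta.size === 0 && tb.size === 0) return 1;
  let shared = 0;
  for (const t of ta) if (tb.has(t)) shared++;
  return shared / (ta.size + tb.size - shared);
}
```

Calling `diffAgainstGolden(goldenSet, newOutputs, jaccard)` returns the per-test results, the mean score, and the ids that fell below the threshold. Flagged ids are candidates for review, not confirmed regressions.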
For agents
Use in an agent
Same math and same result shape as the UI above, packaged as a static ES module. No HTTP request, no auth, no rate limit.
import { compute } from "https://aifinhub.io/engines/prompt-regression-tester.js";
Contract: /contracts/prompt-regression-tester.json
Full agent guide →
Questions people ask next
FAQ
What's prompt regression?
Prompt regression is when a prompt that previously produced good output starts producing degraded output, usually because the model behind the API changed. New model versions, even minor ones, can shift outputs subtly. A regression tester catches this before users do.
How does the tester decide outputs are equivalent?
Three checks: (1) string similarity (Levenshtein distance for short outputs, cosine similarity of semantic embeddings for long ones), (2) schema match (does the output still validate against the structured-output spec?), (3) downstream-metric stability (does the agent's eventual answer change?). A failure on any check flags a regression.
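A rough sketch of the first two checks follows, under stated assumptions. The 200-character cutoff between "short" and "long", the 0.9 threshold, the `requiredFields` shape, and the synchronous `embed` callback are hypothetical; a real embedding call would be asynchronous and provider-specific. The third check (downstream-metric stability) depends on the agent under test and is left out.

```javascript
// Normalized Levenshtein similarity in [0, 1]; 1 means identical strings.
function levenshteinSimilarity(a, b) {
  if (a === b) return 1;
  const m = a.length, n = b.length;
  if (m === 0 || n === 0) return 0;
  let prev = Array.from({ length: n + 1 }, (_, j) => j);
  for (let i = 1; i <= m; i++) {
    const curr = [i];
    for (let j = 1; j <= n; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return 1 - prev[n] / Math.max(m, n);
}

// Cosine similarity of two equal-length numeric vectors.
function cosineSimilarity(u, v) {
  if (u.length !== v.length) throw new Error("vector length mismatch");
  let dot = 0, nu = 0, nv = 0;
  for (let i = 0; i < u.length; i++) {
    dot += u[i] * v[i];
    nu += u[i] * u[i];
    nv += v[i] * v[i];
  }
  return nu === 0 || nv === 0 ? 0 : dot / Math.sqrt(nu * nv);
}

// Minimal structural check (not full JSON Schema validation): the output must
// parse as a JSON object and carry each required key with the expected typeof.
function matchesSchema(output, requiredFields) {
  let parsed;
  try {
    parsed = JSON.parse(output);
  } catch {
    return false;
  }
  if (parsed === null || typeof parsed !== "object") return false;
  return Object.entries(requiredFields).every(([key, type]) => typeof parsed[key] === type);
}

// Combine the checks. `embed` is a hypothetical text -> vector function;
// without it, long outputs fall back to Levenshtein similarity.
function isEquivalent({ golden, candidate, requiredFields, embed, shortLimit = 200, threshold = 0.9 }) {
  const schemaOk = requiredFields ? matchesSchema(candidate, requiredFields) : true;
  const isShort = Math.max(golden.length, candidate.length) <= shortLimit;
  const similarity = isShort || !embed
    ? levenshteinSimilarity(golden, candidate)
    : cosineSimilarity(embed(golden), embed(candidate));
  return { schemaOk, similarity, pass: schemaOk && similarity >= threshold };
}
```

Returning `schemaOk` and `similarity` separately, rather than only a boolean, makes it easier to tell a formatting break from a wording shift when a test is flagged.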
How do I baseline?
Run your test suite against the current model and store outputs as the golden set. Re-run on schedule (or on every API update notice) and diff. The tool tracks deltas over time so you can see drift trajectory, not just point-in-time changes.
How often should I re-baseline?
After every confirmed model upgrade in production, plus quarterly if you're tracking ambient drift on a 'static' model version. Do not re-baseline reactively after a regression alert — investigate first.
What if my prompts produce non-deterministic outputs?
Use temperature=0 for the regression tests (the tool does this by default). For inherently variable outputs (creative tasks), run N samples and check distribution overlap rather than exact match. The methodology page documents both modes.
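One way to read "distribution overlap" for the N-sample mode is sketched below. It is an illustration under assumptions, not the method the methodology page documents: each sampled output is assumed to have already been mapped to a category label, and overlap is measured as the sum of the smaller probability per label (1 minus the total variation distance).

```javascript
// Turn a list of category labels (one per sampled output) into a probability map.
function categoryDistribution(labels) {
  const counts = new Map();
  for (const label of labels) counts.set(label, (counts.get(label) ?? 0) + 1);
  const dist = new Map();
  for (const [label, count] of counts) dist.set(label, count / labels.length);
  return dist;
}

// Overlap coefficient in [0, 1]: 1 means identical distributions, 0 means disjoint.
function distributionOverlap(p, q) {
  let overlap = 0;
  for (const [label, prob] of p) overlap += Math.min(prob, q.get(label) ?? 0);
  return overlap;
}
```

For example, baseline samples labeled `["bullish", "bullish", "neutral", "bearish"]` against new samples labeled `["bullish", "neutral", "neutral", "neutral"]` give an overlap of 0.5. What counts as an acceptable overlap is a judgment call that depends on N and on how variable the task is.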
Related deep dive
Read further
Long-form context behind the tool output.
- Methodology · Opinion · 8 min
The Price-Blind LLM Research Harness
Price-blind LLM research: most harnesses leak the current price and the model confabulates. The architectural fix and a 30-line Python scaffold.
- Methodology · Opinion · 9 min
The 8-Step LLM Research Prompt Template
Free-form prompts yield uncalibrated LLM output. An 8-step template makes research reproducible and better-calibrated across model versions.
- Methodology · Opinion · 8 min
The Token-Cost Reality of LLM Trading Research
What LLM trading research costs per idea and per validated trade across Claude, GPT-5, and Gemini 2.5. Pricing, caching, model-mix under $200/month.
Complementary tools
Users of this tool often explore
Agent Skill Tester for Markets
Paste a SKILL.md definition + sample input + your Anthropic API key. See structured extraction, token cost, and latency — all in your browser. No signup.
Prompt Injection Tester
Red-team a finance agent against 24 documented prompt-injection attacks — direct override, role confusion, indirect injection via retrieved content.
Token-Cost Optimizer
Compute the dollar cost of a trading research loop across Claude, GPT, and Gemini. Prompt length × model × retry × call volume → cost per idea and per.