Playground

Prompt Regression Tester

Author: AI Fin Hub Research

Run the same prompt across Claude 4.5/4.6/4.7, GPT-5, Gemini 2.5 with your own keys. Diff outputs; catch drift before production. Browser-only. Free.

AI Fin Hub Research · Published Apr 20, 2026

Inputs: Prompt / input + API key
Runtime: 2–15 s per model call
Privacy: Client-side · no upload
API key: BYO key (Anthropic · OpenAI · Google)

Education · Not investment advice. BaFin/EU framework. Past performance does not indicate future results.

BYO keys · client-side only

You provide your own API keys for each target. Keys stay in your browser's React state for the session; nothing is persisted or sent anywhere except the respective provider's API.
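As a rough illustration of "sent only to the respective provider's API", the sketch below builds one request per provider without sending it. The helper name `buildRequest` and the `max_tokens` value are hypothetical and not taken from the tool; the endpoints and header names are the providers' public REST conventions as commonly documented, so check the current provider docs before relying on them.

```javascript
// Hypothetical helper (not the tool's code): builds, but does not send,
// a fetch-style request. The key only ever appears in that provider's headers.
function buildRequest(provider, model, apiKey, prompt) {
  const json = { "content-type": "application/json" };
  switch (provider) {
    case "anthropic":
      return {
        url: "https://api.anthropic.com/v1/messages",
        init: {
          method: "POST",
          headers: {
            ...json,
            "x-api-key": apiKey,
            "anthropic-version": "2023-06-01",
            // Opt-in header Anthropic expects for direct browser calls.
            "anthropic-dangerous-direct-browser-access": "true",
          },
          body: JSON.stringify({
            model,
            max_tokens: 1024, // assumed cap, chosen for illustration
            messages: [{ role: "user", content: prompt }],
          }),
        },
      };
    case "openai":
      return {
        url: "https://api.openai.com/v1/chat/completions",
        init: {
          method: "POST",
          headers: { ...json, authorization: `Bearer ${apiKey}` },
          body: JSON.stringify({
            model,
            messages: [{ role: "user", content: prompt }],
          }),
        },
      };
    case "google":
      return {
        url:
          "https://generativelanguage.googleapis.com/v1beta/models/" +
          `${encodeURIComponent(model)}:generateContent`,
        init: {
          method: "POST",
          headers: { ...json, "x-goog-api-key": apiKey },
          body: JSON.stringify({ contents: [{ parts: [{ text: prompt }] }] }),
        },
      };
    default:
      throw new Error(`Unknown provider: ${provider}`);
  }
}
```

In a browser, the result would be passed straight to `fetch(url, init)`; the key lives only in memory and in the outgoing headers, never in the request body or in storage.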

1 · Prompt

2 · Targets

Toggle · Provider · Model · API key, one row each for Anthropic, OpenAI, and Google.

How to use

Step-by-step


1. Define a test set: pairs of (prompt, expected output category or schema).
2. Run the test set against the current model. Save outputs as the golden set.
3. On schedule (or after a model upgrade announcement), re-run the same set against the new model.
4. Read the diff: per-test pass/fail, plus aggregate similarity scores. Drops in similarity flag potential regressions.
5. Investigate flagged tests. Some drops are legitimate model improvements; others are subtle behavior changes that need prompt adjustments.
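The diff step can be sketched roughly as follows. This is a minimal illustration, not the tool's implementation: `diffRuns`, the token-level Jaccard metric, and the 0.8 pass threshold are all assumptions chosen to keep the example short.

```javascript
// Token-level Jaccard similarity: a simple stand-in metric for illustration.
function jaccard(a, b) {
  const ta = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const tb = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (ta.size === 0 && tb.size === 0) return 1; // two empty outputs match
  let shared = 0;
  for (const t of ta) if (tb.has(t)) shared++;
  return shared / (ta.size + tb.size - shared);
}

// golden and candidate map test id -> output text.
// threshold is a hypothetical pass mark, not a value from the tool.
function diffRuns(golden, candidate, threshold = 0.8) {
  const tests = Object.keys(golden).map((id) => {
    // A test missing from the new run counts as similarity 0.
    const similarity = id in candidate ? jaccard(golden[id], candidate[id]) : 0;
    return { id, similarity, pass: similarity >= threshold };
  });
  const meanSimilarity = tests.length
    ? tests.reduce((sum, t) => sum + t.similarity, 0) / tests.length
    : 1;
  const flagged = tests.filter((t) => !t.pass).map((t) => t.id);
  return { tests, meanSimilarity, flagged };
}
```

The `flagged` list is what feeds the investigation step: a flagged test is a candidate regression to look at, not a confirmed one.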

For agents

Use in an agent

Same math, same result shape as the UI above — as a static ES module. No HTTP request, no auth, no rate limit.

import { compute } from "https://aifinhub.io/engines/prompt-regression-tester.js";

Contract: /contracts/prompt-regression-tester.json


Questions people ask next

FAQ

What's prompt regression?

Prompt regression is when a prompt that previously produced good output starts producing degraded output, usually because the model behind the API changed. New model versions, even minor ones, can shift outputs subtly. A regression tester catches this before users do.

How does the tester decide outputs are equivalent?

Three checks: (1) string similarity (Levenshtein for short outputs, semantic embedding cosine for long), (2) schema match (does the output still validate against the structured-output spec?), (3) downstream-metric stability (does the agent's eventual answer change?). Any failure flags a regression.
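To make the first two checks concrete, here is a minimal sketch of each building block. The function names are hypothetical, the embedding vectors are assumed to come from whichever embedding model you use, and the schema check is deliberately simplified to "required keys with the expected `typeof`" rather than full JSON Schema validation.

```javascript
// Classic dynamic-programming edit distance, keeping only two rows.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Normalise distance to a 0..1 similarity (1 means identical strings).
function levenshteinSimilarity(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(u, v) {
  if (u.length !== v.length) throw new Error("Vectors must have equal length");
  let dot = 0;
  let normU = 0;
  let normV = 0;
  for (let i = 0; i < u.length; i++) {
    dot += u[i] * v[i];
    normU += u[i] * u[i];
    normV += v[i] * v[i];
  }
  return normU === 0 || normV === 0 ? 0 : dot / Math.sqrt(normU * normV);
}

// Simplified schema check: output parses as a JSON object and every
// required key has the expected typeof.
function matchesSchema(output, requiredTypes) {
  let parsed;
  try {
    parsed = JSON.parse(output);
  } catch {
    return false;
  }
  if (parsed === null || typeof parsed !== "object") return false;
  return Object.entries(requiredTypes).every(
    ([key, type]) => typeof parsed[key] === type
  );
}
```

For example, `matchesSchema(text, { ticker: "string", score: "number" })` would reject an output that drifted from JSON to free prose even if its wording stayed similar, which is why the schema check is useful alongside string similarity.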

How do I baseline?

Run your test suite against the current model and store the outputs as the golden set. Re-run on schedule (or on every API update notice) and diff against it. The tool tracks deltas over time so you can see the drift trajectory, not just point-in-time changes.
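One way to picture "deltas over time" is a per-run series of aggregate similarity against the same golden set. The sketch below assumes a hypothetical record shape `{ date, meanSimilarity }` with ISO dates; the tool's own storage format is not documented here.

```javascript
// runs: [{ date: "2026-01-05", meanSimilarity: 0.97 }, ...]
// Each entry is the aggregate similarity of one re-run against the golden set.
function driftTrajectory(runs) {
  // ISO dates (YYYY-MM-DD) sort correctly as plain strings.
  const sorted = [...runs].sort((a, b) => a.date.localeCompare(b.date));
  return sorted.map((run, i) => ({
    date: run.date,
    meanSimilarity: run.meanSimilarity,
    // Change since the previous run (0 for the first run).
    delta: i === 0 ? 0 : run.meanSimilarity - sorted[i - 1].meanSimilarity,
    // Cumulative change since the earliest run.
    sinceFirst: run.meanSimilarity - sorted[0].meanSimilarity,
  }));
}
```

A string of small negative `delta` values with a growing negative `sinceFirst` suggests gradual drift, while a single large drop points at a discrete model change.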

How often should I re-baseline?

After every confirmed model upgrade in production, plus quarterly if you're tracking ambient drift on a 'static' model version. Do not re-baseline reactively after a regression alert — investigate first.

What if my prompts produce non-deterministic outputs?

Use temperature=0 for the regression tests (the tool does this by default). For inherently variable outputs (creative tasks), run N samples and check distribution overlap rather than exact match. The methodology page documents both modes.
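For the N-sample mode, one simple overlap measure is the sum of the smaller probability for each label, which equals 1 minus the total variation distance. The sketch assumes each sampled output has already been mapped to a category label (for example "bullish" or "neutral"); that mapping and the function names are illustrative assumptions, and the tool's methodology may use a different measure.

```javascript
// Turn a list of category labels into label -> relative frequency.
function distribution(samples) {
  const counts = new Map();
  for (const s of samples) counts.set(s, (counts.get(s) || 0) + 1);
  const dist = new Map();
  for (const [label, count] of counts) dist.set(label, count / samples.length);
  return dist;
}

// Overlap = sum over labels of min(p, q).
// 1 means identical distributions, 0 means no label in common.
function distributionOverlap(goldenSamples, newSamples) {
  const p = distribution(goldenSamples);
  const q = distribution(newSamples);
  let overlap = 0;
  for (const [label, prob] of p) {
    overlap += Math.min(prob, q.get(label) || 0);
  }
  return overlap;
}
```

With small N the estimate is noisy, so a low overlap on a handful of samples is a reason to sample more before treating it as a regression.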


Used in

Decision workflows that use this tool

Goal-driven flows that bundle this tool with adjacent ones.

Audit Your Pipeline
Catch hallucinations, prompt injections, and regression drift before they ship.

Complementary tools

Agent Skill Tester for Markets

Paste a SKILL.md definition + sample input + your Anthropic API key. See structured extraction, token cost, and latency — all in your browser. No signup.


Prompt Injection Tester

Red-team a finance agent against 24 documented prompt-injection attacks — direct override, role confusion, indirect injection via retrieved content.