aifinhub

Prompt Regression Tester

Run the same prompt across Claude 4.5/4.6/4.7, GPT-5, and Gemini 2.5 using your own API keys. Diff the outputs and catch drift before it reaches production. Browser-only. Free.

Inputs: Prompt / input + API key
Runtime: 2–15 s per model call
Privacy: Client-side · no upload
API key: BYO key (Anthropic · OpenAI · Google)

Education · Not investment advice. BaFin/EU framework. Past performance does not indicate future results.

BYO keys · client-side only

You provide your own API keys for each target. Keys stay in your browser's React state for the session; nothing is persisted or sent anywhere except the respective provider's API.

How to use

Step-by-step

  1. Define a test set: pairs of (prompt, expected output category or schema).
  2. Run the test set against the current model. Save outputs as the golden set.
  3. On schedule (or after a model upgrade announcement), re-run the same set against the new model.
  4. Read the diff: per-test pass/fail, plus aggregate similarity scores. Drops in similarity flag potential regressions.
  5. Investigate flagged tests. Some drops are legitimate model improvements; others are subtle behavior changes that need prompt adjustments.
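
The golden-set workflow above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the tool's engine: the names DiffRow, wordOverlap and diffAgainstGolden are hypothetical, the 0.8 threshold is an assumed cutoff, and a simple word-set (Jaccard) overlap stands in for whatever similarity score the tool actually computes.

```typescript
// Hypothetical names throughout; this is not the tool's actual engine.
interface DiffRow {
  id: string;
  similarity: number; // 0 (disjoint) to 1 (identical word sets)
  pass: boolean;
}

// Unique lowercase words in a text.
function uniqueWords(text: string): string[] {
  const words = text.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
  return words.filter((w, i) => words.indexOf(w) === i);
}

// Stand-in similarity: Jaccard overlap of the two word sets.
function wordOverlap(a: string, b: string): number {
  const wa = uniqueWords(a);
  const wb = uniqueWords(b);
  if (wa.length === 0 && wb.length === 0) return 1;
  const shared = wa.filter((w) => wb.indexOf(w) !== -1).length;
  return shared / (wa.length + wb.length - shared);
}

// Compare a fresh run against the stored golden set, keyed by test id.
function diffAgainstGolden(
  golden: Map<string, string>,
  current: Map<string, string>,
  threshold: number = 0.8, // assumed cutoff; tune per test suite
): { rows: DiffRow[]; meanSimilarity: number } {
  const rows: DiffRow[] = [];
  golden.forEach((expected, id) => {
    const actual = current.get(id) ?? ""; // a missing output counts as empty
    const similarity = wordOverlap(expected, actual);
    rows.push({ id, similarity, pass: similarity >= threshold });
  });
  const total = rows.reduce((sum, row) => sum + row.similarity, 0);
  return { rows, meanSimilarity: rows.length === 0 ? 1 : total / rows.length };
}
```

Each row gives the per-test pass/fail from step 4, and meanSimilarity is one possible aggregate score. A drop in either is a prompt to investigate, not a verdict.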

For agents

Use in an agent

Same logic and same result shape as the UI above, exposed as a static ES module. No HTTP request, no auth, no rate limit.

import { compute } from "https://aifinhub.io/engines/prompt-regression-tester.js";

Contract: /contracts/prompt-regression-tester.json

Questions people ask next

FAQ

What's prompt regression?

Prompt regression is when a prompt that previously produced good output starts producing degraded output, usually because the model behind the API changed. New model versions, even minor ones, can shift outputs subtly. A regression tester detects this before users do.

How does the tester decide outputs are equivalent?

Three checks: (1) string similarity (Levenshtein distance for short outputs, cosine similarity of semantic embeddings for long ones), (2) schema match (does the output still validate against the structured-output spec?), (3) downstream-metric stability (does the agent's eventual answer change?). Any failure flags a regression.
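
For illustration, the first two checks might look roughly like the sketch below. It rests on several assumptions: the function names are hypothetical, the embedding vectors are supplied by the caller (no provider embedding API is called here), and the schema check only verifies that the output parses as a JSON object with the required keys, which is far weaker than validating against a full structured-output spec.

```typescript
// Levenshtein edit distance, two-row dynamic-programming variant.
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Normalize the distance to a 0..1 similarity (1 means identical strings).
function levenshteinSimilarity(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

// Cosine similarity of two embedding vectors supplied by the caller.
function cosineSimilarity(u: number[], v: number[]): number {
  if (u.length !== v.length) throw new Error("vectors must have equal length");
  let dot = 0;
  let normU = 0;
  let normV = 0;
  for (let i = 0; i < u.length; i++) {
    dot += u[i] * v[i];
    normU += u[i] * u[i];
    normV += v[i] * v[i];
  }
  if (normU === 0 || normV === 0) return 0; // undefined for zero vectors; treat as no similarity
  return dot / Math.sqrt(normU * normV);
}

// Simplified schema check (assumption): output is a JSON object with every required key.
function hasRequiredKeys(output: string, requiredKeys: string[]): boolean {
  try {
    const parsed: unknown = JSON.parse(output);
    if (typeof parsed !== "object" || parsed === null) return false;
    return requiredKeys.every((key) => key in parsed);
  } catch {
    return false; // not valid JSON at all
  }
}
```

The cutoff between "short" and "long" outputs, and the similarity thresholds that count as a failure, are not stated here, so any values plugged into a sketch like this would be the caller's own choice.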

How do I baseline?

Run your test suite against the current model and store outputs as the golden set. Re-run on schedule (or on every API update notice) and diff. The tool tracks deltas over time so you can see drift trajectory, not just point-in-time changes.

How often should I re-baseline?

After every confirmed model upgrade in production, plus quarterly if you're tracking ambient drift on a 'static' model version. Do not re-baseline reactively after a regression alert — investigate first.

What if my prompts produce non-deterministic outputs?

Use temperature=0 for the regression tests (the tool does this by default). For inherently variable outputs (creative tasks), run N samples and check distribution overlap rather than exact match. The methodology page documents both modes.
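
One way to read "distribution overlap" is sketched below, under stated assumptions: each sampled output has already been reduced to a category label, and overlap is taken as the sum over labels of the smaller of the two probabilities (equivalently, 1 minus the total variation distance). The methodology page may define it differently, and the function names are hypothetical.

```typescript
// Empirical probability of each label in a batch of N samples.
function labelDistribution(samples: string[]): Map<string, number> {
  if (samples.length === 0) throw new Error("need at least one sample");
  const counts = new Map<string, number>();
  samples.forEach((label) => counts.set(label, (counts.get(label) ?? 0) + 1));
  const dist = new Map<string, number>();
  counts.forEach((count, label) => dist.set(label, count / samples.length));
  return dist;
}

// Overlap in [0, 1]: 1 means identical label distributions, 0 means disjoint.
function distributionOverlap(baseline: string[], candidate: string[]): number {
  const p = labelDistribution(baseline);
  const q = labelDistribution(candidate);
  let overlap = 0;
  p.forEach((pProb, label) => {
    overlap += Math.min(pProb, q.get(label) ?? 0);
  });
  return overlap;
}
```

With small N the estimate is noisy, so a low overlap on a handful of samples is a reason to draw more samples before treating it as a regression.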

Planning estimates only — not financial, tax, or investment advice.