aifinhub

Methodology · Playground · Last updated 2026-04-20

How the Prompt Regression Tester works

How the Prompt Regression Tester works: its assumptions, algorithms, and limitations.

What it does

Sends the same prompt to multiple LLM endpoints in parallel, using API keys you supply (BYO keys). Renders the outputs side by side, along with a pairwise drift matrix.

Drift metric

Jaccard distance (1 minus Jaccard similarity) between the sets of character 3-grams of the two outputs.

A = {all length-3 substrings of output_A (lowercased, whitespace-normalized)}
B = {all length-3 substrings of output_B}
drift(A, B) = 1 − |A ∩ B| / |A ∪ B|

0 = the two outputs have identical trigram sets (near-identical text). 1 = no trigrams in common. This is a cheap syntactic gauge, not a semantic judge: two paraphrases of the same answer can score surprisingly high drift. For semantic equivalence testing, an embedding-based similarity is the right next step.
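The definition above can be sketched in TypeScript as follows. This is a minimal illustration, not the tool's actual source: the function names `trigrams` and `drift` are hypothetical, and two choices are assumptions of the sketch, namely trimming leading and trailing whitespace and returning 0 when neither output has any trigram.

```typescript
// Build the set of character trigrams of a text after lowercasing and
// collapsing whitespace runs to a single space.
// Assumption: leading/trailing whitespace is trimmed as part of normalization.
function trigrams(text: string): Set<string> {
  const normalized = text.toLowerCase().replace(/\s+/g, " ").trim();
  const grams = new Set<string>();
  for (let i = 0; i + 3 <= normalized.length; i++) {
    grams.add(normalized.slice(i, i + 3));
  }
  return grams;
}

// drift(A, B) = 1 - |A ∩ B| / |A ∪ B|
function drift(outputA: string, outputB: string): number {
  const a = trigrams(outputA);
  const b = trigrams(outputB);
  let shared = 0;
  a.forEach((gram) => {
    if (b.has(gram)) shared++;
  });
  const union = a.size + b.size - shared;
  // Assumption: if neither output has a trigram, treat them as identical.
  if (union === 0) return 0;
  return 1 - shared / union;
}
```

As a worked case, "abcd" yields {abc, bcd} and "abce" yields {abc, bce}. One trigram is shared out of three distinct ones, so the drift is 1 − 1/3 ≈ 0.67, even though the strings differ by a single character.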

Parallelization

All enabled targets run in parallel via Promise.all(). The wall-clock latency reported for the batch is therefore that of the slowest call. Per-target latency is shown on each tile.
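A minimal sketch of this fan-out pattern is shown below. The `Target` and `TargetResult` types and the `runTarget` and `runBatch` names are hypothetical, not taken from the tool. Because Promise.all() rejects as soon as any input promise rejects, the sketch assumes each call is wrapped so that a failure becomes a per-target error result instead of aborting the whole batch.

```typescript
interface TargetResult {
  name: string;
  ok: boolean;
  output?: string;
  error?: string;
  latencyMs: number; // per-target latency, as shown on each tile
}

interface Target {
  name: string;
  call: (prompt: string) => Promise<string>; // hypothetical provider call
}

// Run one target and time it. Errors are captured, never rethrown,
// so one failing provider does not reject the surrounding Promise.all().
async function runTarget(target: Target, prompt: string): Promise<TargetResult> {
  const start = Date.now();
  try {
    const output = await target.call(prompt);
    return { name: target.name, ok: true, output, latencyMs: Date.now() - start };
  } catch (err) {
    return { name: target.name, ok: false, error: String(err), latencyMs: Date.now() - start };
  }
}

// Start all targets at once; the batch finishes when the slowest call settles.
async function runBatch(targets: Target[], prompt: string) {
  const start = Date.now();
  const results = await Promise.all(targets.map((t) => runTarget(t, prompt)));
  return { results, batchLatencyMs: Date.now() - start };
}
```

Results come back in the same order as the input targets, which keeps the tiles and the drift matrix aligned regardless of which call finishes first.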

API calls

  • Anthropic: POST /v1/messages with the anthropic-dangerous-direct-browser-access: true header, which is required for direct calls from a browser.
  • OpenAI: POST /v1/chat/completions.
  • Google Gemini: POST /v1beta/models/{model}:generateContent.
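The three requests might be assembled roughly as in the sketch below. Only the paths and the anthropic-dangerous-direct-browser-access header come from the list above. The host names, the authentication headers, the anthropic-version value, the max_tokens value, and the minimal request bodies follow the providers' public API documentation as I understand it and are assumptions here, not a description of the tool's code. `buildRequest` and `RequestSpec` are hypothetical names.

```typescript
type Provider = "anthropic" | "openai" | "gemini";

interface RequestSpec {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Build the URL and fetch() options for one provider.
// Hosts, auth headers, and body shapes are assumptions (see lead-in).
function buildRequest(provider: Provider, model: string, apiKey: string, prompt: string): RequestSpec {
  const json = { "content-type": "application/json" };
  switch (provider) {
    case "anthropic":
      return {
        url: "https://api.anthropic.com/v1/messages",
        init: {
          method: "POST",
          headers: {
            ...json,
            "x-api-key": apiKey,
            "anthropic-version": "2023-06-01",
            "anthropic-dangerous-direct-browser-access": "true",
          },
          // max_tokens of 1024 is an arbitrary illustrative value.
          body: JSON.stringify({ model, max_tokens: 1024, messages: [{ role: "user", content: prompt }] }),
        },
      };
    case "openai":
      return {
        url: "https://api.openai.com/v1/chat/completions",
        init: {
          method: "POST",
          headers: { ...json, authorization: `Bearer ${apiKey}` },
          body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
        },
      };
    case "gemini":
      return {
        url: `https://generativelanguage.googleapis.com/v1beta/models/${encodeURIComponent(model)}:generateContent`,
        init: {
          method: "POST",
          headers: { ...json, "x-goog-api-key": apiKey },
          body: JSON.stringify({ contents: [{ parts: [{ text: prompt }] }] }),
        },
      };
  }
}
```

A caller would then issue `fetch(spec.url, spec.init)` for each enabled target. Keeping request construction in a pure function makes it easy to check that a key is only ever attached to a request bound for its own provider.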

Privacy + key handling

  • Each API key is held only in React state (in memory). It is never persisted and never sent to any origin other than the respective provider's API.
  • Refreshing the page clears all keys.

Limitations

  1. Rate limits + retries. No backoff is implemented. A provider rate-limit error surfaces as a per-target error without retry.
  2. Timeouts. No explicit timeout is set on requests, so only the browser's own network timeout applies; very slow responses hang until the browser gives up.
  3. Token counts. Reported directly from each provider. Gemini's token accounting uses SentencePiece, Anthropic uses its own tokenizer, and OpenAI uses BPE, so raw token counts are not directly comparable across providers.
  4. No cross-run history. Each run starts fresh; there is no "regression against last release" replay mode. Save your prompt + outputs locally for longitudinal comparisons.
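For readers who script their own comparisons outside the tool, the first two limitations can be worked around with small wrappers like the ones sketched below. None of this exists in the tool. `withTimeout` and `retryWithBackoff` are hypothetical helpers, and the default attempt count and base delay are arbitrary illustrative values.

```typescript
// Hypothetical helper: abort `task` if it has not settled within `ms`.
// The task receives an AbortSignal that it should forward, for example
// as fetch(url, { ...init, signal }), so the request is actually cancelled.
async function withTimeout<T>(task: (signal: AbortSignal) => Promise<T>, ms: number): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await task(controller.signal);
  } finally {
    clearTimeout(timer);
  }
}

// Hypothetical helper: retry while shouldRetry(err) is true,
// doubling the wait after each failed attempt (exponential backoff).
async function retryWithBackoff<T>(
  attempt: () => Promise<T>,
  shouldRetry: (err: unknown) => boolean,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await attempt();
    } catch (err) {
      // Give up on the last attempt or on errors that are not retryable.
      if (i + 1 >= maxAttempts || !shouldRetry(err)) throw err;
      const waitMs = baseDelayMs * Math.pow(2, i);
      await new Promise<void>((resolve) => setTimeout(resolve, waitMs));
    }
  }
}
```

The two compose naturally: wrap a single timed call in `withTimeout`, then pass that as the `attempt` of `retryWithBackoff` with a predicate that returns true only for rate-limit responses.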