Last updated 2026-04-20
How the Prompt Regression Tester works
How the tool actually works: its assumptions, algorithms, and limitations.
What it does
Sends the same prompt to multiple LLM endpoints in parallel using your bring-your-own (BYO) API keys. Renders the outputs side-by-side + a pairwise drift matrix.
Drift metric
One minus the Jaccard similarity of the character 3-gram sets of the two outputs (i.e., the Jaccard distance):
A = {all length-3 substrings of output_A (lowercased, whitespace-normalized)}
B = {all length-3 substrings of output_B}
drift(A, B) = 1 − |A ∩ B| / |A ∪ B|

A drift of 0 means the outputs share all trigrams (near-identical text); 1 means zero overlap. This is a cheap syntactic gauge, not a semantic judge: two paraphrases of the same answer can score surprisingly high drift. An embedding-based similarity is the right next step for semantic equivalence testing.
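A minimal sketch of the metric in TypeScript (function names are illustrative, not the tool's actual internals):

```ts
// Build the set of lowercased, whitespace-normalized character trigrams.
function trigrams(text: string): Set<string> {
  const norm = text.toLowerCase().replace(/\s+/g, " ").trim();
  const grams = new Set<string>();
  for (let i = 0; i + 3 <= norm.length; i++) {
    grams.add(norm.slice(i, i + 3));
  }
  return grams;
}

// drift = 1 − |A ∩ B| / |A ∪ B|  (Jaccard distance on trigram sets)
function drift(outputA: string, outputB: string): number {
  const a = trigrams(outputA);
  const b = trigrams(outputB);
  if (a.size === 0 && b.size === 0) return 0; // two empty outputs: identical
  let intersection = 0;
  for (const g of a) if (b.has(g)) intersection++;
  const union = a.size + b.size - intersection;
  return 1 - intersection / union;
}
```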
Parallelization
All enabled targets run in parallel via Promise.all(). The wall-clock latency reported for the batch is that of the slowest call; per-target latency is reported on each tile.
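A minimal sketch of the batching logic, assuming a per-target callTarget() helper (hypothetical name). Each call is wrapped in its own try/catch so a failing target resolves to a per-target error instead of rejecting the whole Promise.all():

```ts
interface TargetResult {
  target: string;
  ok: boolean;
  output?: string;
  error?: string;
  latencyMs: number; // per-target latency shown on each tile
}

async function runBatch(
  prompt: string,
  targets: string[],
  callTarget: (target: string, prompt: string) => Promise<string>,
): Promise<{ results: TargetResult[]; batchMs: number }> {
  const batchStart = performance.now();
  const results = await Promise.all(
    targets.map(async (target): Promise<TargetResult> => {
      const start = performance.now();
      try {
        const output = await callTarget(target, prompt);
        return { target, ok: true, output, latencyMs: performance.now() - start };
      } catch (e) {
        // Provider errors (e.g. rate limits) surface per-target without retry.
        return { target, ok: false, error: String(e), latencyMs: performance.now() - start };
      }
    }),
  );
  // Batch wall-clock time is gated by the slowest call.
  return { results, batchMs: performance.now() - batchStart };
}
```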
API calls
- Anthropic: POST /v1/messages with the anthropic-dangerous-direct-browser-access header.
- OpenAI: POST /v1/chat/completions.
- Google Gemini: POST /v1beta/models/{model}:generateContent.
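For example, the direct-from-browser Anthropic call looks roughly like this; the model id and max_tokens are illustrative, and apiKey/prompt are assumed to be in scope:

```ts
const response = await fetch("https://api.anthropic.com/v1/messages", {
  method: "POST",
  headers: {
    "content-type": "application/json",
    "x-api-key": apiKey, // from React state, never persisted
    "anthropic-version": "2023-06-01",
    // Required for direct browser (CORS) access to the Anthropic API.
    "anthropic-dangerous-direct-browser-access": "true",
  },
  body: JSON.stringify({
    model: "claude-sonnet-4-20250514", // illustrative model choice
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  }),
});
const data = await response.json();
```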
Privacy + key handling
- Each API key stays in React state (see the sketch below). Never persisted. Never sent to any origin other than the respective provider's API.
- Refreshing the page clears all keys.
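A minimal sketch of the key-handling pattern, assuming a single useState-backed record (the shape and names are hypothetical):

```ts
import { useState } from "react";

interface Keys {
  anthropic: string;
  openai: string;
  gemini: string;
}

// Keys live only in component state: no localStorage, no sessionStorage,
// no cookies. A page refresh re-mounts the app and every key is gone.
function useProviderKeys() {
  const [keys, setKeys] = useState<Keys>({ anthropic: "", openai: "", gemini: "" });
  const setKey = (provider: keyof Keys, value: string) =>
    setKeys((prev) => ({ ...prev, [provider]: value }));
  return { keys, setKey };
}
```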
Limitations
- Rate limits + retries. No backoff is implemented. A provider rate-limit error surfaces as a per-target error without retry.
- Timeouts. No request timeout is set, and fetch() has no default timeout of its own; a very slow response hangs until the browser's network stack gives up. Wiring an AbortController deadline into each call would fix this.
- Token counts. Reported directly from each provider (see the sketch after this list). Gemini's token accounting uses SentencePiece, Anthropic uses its own tokenizer, and OpenAI uses BPE, so raw token counts are not directly comparable across providers.
- No cross-run history. Each run starts fresh; there is no "regression against last release" replay mode. Save your prompt + outputs locally for longitudinal comparisons.
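Each provider also nests its usage counts under different field names. A thin normalizer might look like this (the normalized shape is illustrative; the raw field names match each provider's documented response format):

```ts
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

function normalizeUsage(
  provider: "anthropic" | "openai" | "gemini",
  body: any,
): TokenUsage {
  switch (provider) {
    case "anthropic": // POST /v1/messages
      return { inputTokens: body.usage.input_tokens, outputTokens: body.usage.output_tokens };
    case "openai": // POST /v1/chat/completions
      return { inputTokens: body.usage.prompt_tokens, outputTokens: body.usage.completion_tokens };
    case "gemini": // POST /v1beta/models/{model}:generateContent
      return {
        inputTokens: body.usageMetadata.promptTokenCount,
        outputTokens: body.usageMetadata.candidatesTokenCount,
      };
  }
}
```

Even normalized this way, the numbers come from different tokenizers, so compare counts within a provider rather than across providers.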