Last updated 2026-04-20
How the Prompt Regression Tester works
How the tool actually works: its assumptions, algorithms, and limitations.
What it does
Sends the same prompt to multiple LLM endpoints in parallel using your bring-your-own (BYO) API keys. Renders the outputs side-by-side + a pairwise drift matrix.
Drift metric
One minus the Jaccard similarity of the character 3-gram sets of the two outputs (i.e., the Jaccard distance):
A = {all length-3 substrings of output_A (lowercased, whitespace-normalized)}
B = {all length-3 substrings of output_B}
drift(A, B) = 1 − |A ∩ B| / |A ∪ B|

A drift of 0 means the outputs share all trigrams (near-identical text); 1 means zero overlap. This is a cheap syntactic gauge, not a semantic judge: two paraphrases of the same answer can score surprisingly high drift. An embedding-based similarity is the right next step for semantic equivalence testing.
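A minimal sketch of the metric in TypeScript (function names are illustrative, not the tool's actual internals):

```ts
// Build the set of lowercased, whitespace-normalized character trigrams.
function trigrams(text: string): Set<string> {
  const norm = text.toLowerCase().replace(/\s+/g, " ").trim();
  const grams = new Set<string>();
  for (let i = 0; i + 3 <= norm.length; i++) {
    grams.add(norm.slice(i, i + 3));
  }
  return grams;
}

// drift = 1 − |A ∩ B| / |A ∪ B|  (Jaccard distance on trigram sets)
function drift(outputA: string, outputB: string): number {
  const a = trigrams(outputA);
  const b = trigrams(outputB);
  if (a.size === 0 && b.size === 0) return 0; // two empty outputs: identical
  let intersection = 0;
  for (const g of a) if (b.has(g)) intersection++;
  const union = a.size + b.size - intersection;
  return 1 - intersection / union;
}
```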
Parallelization
All enabled targets run in parallel via Promise.all(). The wall-clock latency reported for the batch is that of the slowest call; per-target latency is reported on each tile.
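A minimal sketch of the batching logic, assuming a per-target callTarget() helper (hypothetical name). Each call is wrapped in its own try/catch so a failing target resolves to a per-target error instead of rejecting the whole Promise.all():

```ts
interface TargetResult {
  target: string;
  ok: boolean;
  output?: string;
  error?: string;
  latencyMs: number; // per-target latency shown on each tile
}

async function runBatch(
  prompt: string,
  targets: string[],
  callTarget: (target: string, prompt: string) => Promise<string>,
): Promise<{ results: TargetResult[]; batchMs: number }> {
  const batchStart = performance.now();
  const results = await Promise.all(
    targets.map(async (target): Promise<TargetResult> => {
      const start = performance.now();
      try {
        const output = await callTarget(target, prompt);
        return { target, ok: true, output, latencyMs: performance.now() - start };
      } catch (e) {
        // Provider errors (e.g. rate limits) surface per-target without retry.
        return { target, ok: false, error: String(e), latencyMs: performance.now() - start };
      }
    }),
  );
  // Batch wall-clock time is gated by the slowest call.
  return { results, batchMs: performance.now() - batchStart };
}
```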
API calls
- Anthropic: POST /v1/messages with the anthropic-dangerous-direct-browser-access header.
- OpenAI: POST /v1/chat/completions.
- Google Gemini: POST /v1beta/models/{model}:generateContent.
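For example, the direct-from-browser Anthropic call looks roughly like this; the model id and max_tokens are illustrative, and apiKey/prompt are assumed to be in scope:

```ts
const response = await fetch("https://api.anthropic.com/v1/messages", {
  method: "POST",
  headers: {
    "content-type": "application/json",
    "x-api-key": apiKey, // from React state, never persisted
    "anthropic-version": "2023-06-01",
    // Required for direct browser (CORS) access to the Anthropic API.
    "anthropic-dangerous-direct-browser-access": "true",
  },
  body: JSON.stringify({
    model: "claude-sonnet-4-20250514", // illustrative model choice
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  }),
});
const data = await response.json();
```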
Privacy + key handling
- Each API key stays in React state (see the sketch below). Never persisted. Never sent to any origin other than the respective provider's API.
- Refreshing the page clears all keys.
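A minimal sketch of the key-handling pattern, assuming a single useState-backed record (the shape and names are hypothetical):

```ts
import { useState } from "react";

interface Keys {
  anthropic: string;
  openai: string;
  gemini: string;
}

// Keys live only in component state: no localStorage, no sessionStorage,
// no cookies. A page refresh re-mounts the app and every key is gone.
function useProviderKeys() {
  const [keys, setKeys] = useState<Keys>({ anthropic: "", openai: "", gemini: "" });
  const setKey = (provider: keyof Keys, value: string) =>
    setKeys((prev) => ({ ...prev, [provider]: value }));
  return { keys, setKey };
}
```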
Limitations
- Rate limits + retries. No backoff is implemented. A provider rate-limit error surfaces as a per-target error without retry.
- Timeouts. No request timeout is set, and fetch() has no default timeout of its own; a very slow response hangs until the browser's network stack gives up. Wiring an AbortController deadline into each call would fix this.
- Token counts. Reported directly from each provider (see the sketch after this list). Gemini's token accounting uses SentencePiece, Anthropic uses its own tokenizer, and OpenAI uses BPE, so raw token counts are not directly comparable across providers.
- No cross-run history. Each run starts fresh; there is no "regression against last release" replay mode. Save your prompt + outputs locally for longitudinal comparisons.
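Each provider also nests its usage counts under different field names. A thin normalizer might look like this (the normalized shape is illustrative; the raw field names match each provider's documented response format):

```ts
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

function normalizeUsage(
  provider: "anthropic" | "openai" | "gemini",
  body: any,
): TokenUsage {
  switch (provider) {
    case "anthropic": // POST /v1/messages
      return { inputTokens: body.usage.input_tokens, outputTokens: body.usage.output_tokens };
    case "openai": // POST /v1/chat/completions
      return { inputTokens: body.usage.prompt_tokens, outputTokens: body.usage.completion_tokens };
    case "gemini": // POST /v1beta/models/{model}:generateContent
      return {
        inputTokens: body.usageMetadata.promptTokenCount,
        outputTokens: body.usageMetadata.candidatesTokenCount,
      };
  }
}
```

Even normalized this way, the numbers come from different tokenizers, so compare counts within a provider rather than across providers.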