Model Drift
Two flavors. Version drift: the provider updates the model and behavior changes (gpt-4-0613 → gpt-4-1106 was a famous case). Within-version drift: the same model name produces different outputs over time as the provider tunes it without changing the version string. Both flavors break agents whose prompts, parsers, or downstream rules were calibrated against the old behavior.
Why it matters
An agent that worked yesterday can fail today with no code change. The provider patched the model, the structured output format shifted slightly, your parser fails, your trading rule receives a malformed input. Without explicit drift monitoring, the failure mode is silent — outputs degrade, you don't know, performance drifts down.
How it works
Maintain a regression test suite of input/output pairs. Run it on every model update or on a periodic schedule. Track output-distribution metrics: response length distribution, structured-output validity rate, citation rate, refusal rate. Pin models by exact version where possible. When a provider deprecates, run the regression suite against the new version before cutover, not after.
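A minimal sketch of that regression loop, assuming a hypothetical `call_model(prompt, model=...)` wrapper around your provider SDK and a small golden set of prompt/schema pairs (both names are illustrative, not a real API):

```python
import json

# Hypothetical golden set: prompts paired with the keys the structured
# output must contain. Re-run on every provider update or on a schedule.
GOLDEN_CASES = [
    {"prompt": "Summarize: revenue rose 12% YoY.", "required_keys": {"summary", "sentiment"}},
    {"prompt": "Summarize: margins fell 3 points.", "required_keys": {"summary", "sentiment"}},
]

def run_regression(call_model, model="gpt-4-0613"):
    """call_model(prompt, model) -> raw string. Pin `model` to an exact
    version string, never a floating alias like 'gpt-4'."""
    failures = []
    for case in GOLDEN_CASES:
        raw = call_model(case["prompt"], model=model)
        try:
            out = json.loads(raw)
        except json.JSONDecodeError:
            failures.append((case["prompt"], "invalid JSON"))
            continue
        missing = case["required_keys"] - out.keys()
        if missing:
            failures.append((case["prompt"], f"missing keys: {missing}"))
    validity = 1 - len(failures) / len(GOLDEN_CASES)
    return validity, failures

# Usage with a stub in place of a real API call:
def _stub(prompt, model):
    return '{"summary": "flat quarter", "sentiment": "neutral"}'

validity, failures = run_regression(_stub)
```

Running this before a deprecation cutover, against the replacement version, is the "before, not after" step: the failure list tells you which prompts or parsers need recalibration while the old version is still available.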
Example
Agent summarizes earnings calls, monitored over 3 months
Week 1 — JSON validity rate: 99.2%
Week 8 — JSON validity rate: 94.7%
Week 12 — JSON validity rate: 88.1%
Same model name, same prompt, same temperature. Validity fell 11.1 percentage points over 12 weeks, silently, until someone measured it. Drift monitoring catches this early; without it you are debugging a failed pipeline at the 9:30am market open.
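The alert behind a table like this can be as simple as a fixed tolerance against the baseline week. A sketch using the numbers above (the 2-point tolerance is an illustrative choice, not a standard):

```python
def check_drift(weekly_rates, baseline, tolerance_pp=2.0):
    """Flag weeks whose validity rate fell more than tolerance_pp
    percentage points below the baseline rate."""
    return [
        (week, rate)
        for week, rate in weekly_rates
        if baseline - rate > tolerance_pp
    ]

# Weekly JSON-validity rates from the monitored agent:
alerts = check_drift([(1, 99.2), (8, 94.7), (12, 88.1)], baseline=99.2)
```

With this threshold the week-8 reading already trips the alert, four weeks before the failure would otherwise surface in production.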
Key Takeaways
Pin model versions exactly. Provider auto-update is unsafe in production.
Maintain a regression test suite and run it continuously, not just at provider-update time.
Track distribution-level metrics, not just task-pass rates — drift shows up in distributions before it shows up in pass/fail.
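One way to make a distribution-level metric concrete: bucket response lengths and compute a population stability index (PSI) between a baseline window and the current window. Stdlib-only sketch; the bucket edges and the common "investigate above ~0.2" rule of thumb are assumptions, not a standard:

```python
import math
from collections import Counter

def length_distribution(lengths, bins=(0, 50, 100, 200, 400)):
    """Bucket response lengths (in tokens or chars) and return the
    proportion falling in each bucket. Lengths >= the last edge land
    in the final bucket."""
    counts = Counter()
    for n in lengths:
        bucket = sum(n >= edge for edge in bins) - 1
        counts[bucket] += 1
    total = len(lengths)
    return {b: counts.get(b, 0) / total for b in range(len(bins))}

def psi(baseline, current, eps=1e-6):
    """Population stability index between two bucket distributions.
    0 means identical; values above ~0.2 commonly mean 'investigate'."""
    score = 0.0
    for b in baseline:
        p = baseline[b] + eps
        q = current.get(b, 0.0) + eps
        score += (q - p) * math.log(q / p)
    return score

# Baseline week vs a later week whose responses ballooned in length:
baseline = length_distribution([12, 60, 75, 150, 320])
current = length_distribution([210, 240, 260, 300, 450])
drift_score = psi(baseline, current)
```

The same pattern applies to refusal rate, citation rate, or any other bucketable output property: a shift in the distribution raises the PSI before individual task checks start failing.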
Try These Tools
Prompt Regression Tester
Run the same prompt against multiple models (Claude 4.5/4.6/4.7, GPT-5, Gemini 2.5) with your own keys. Diff outputs, score drift, catch regressions.
Agent Skill Tester for Markets
Paste a SKILL.md definition + sample input + your Anthropic API key. See structured extraction, token cost, and latency — all in your browser. No signup.
Calibration Dojo
Train your probabilistic intuition. Answer binary forecasting questions at any confidence level; track Brier score and reliability curve over time.
Related Content
Hallucination Detection
Detecting LLM hallucinations in financial outputs: the verifiable-claim approach, citation grounding, and cross-model agreement signals that work.
Agent Skill Testing
Agent skill testing: the regression-test discipline for LLM-driven agents. What to test, how to score, and the difference between pass-rate and capability.