Should I fine-tune a model or use a general model with good prompting?

For most finance pipelines, a capable general model with retrieval and strong verification outperforms a fine-tune that you then have to maintain. Fine-tuning helps for narrow, high-volume extraction tasks with stable formats. Start with retrieval and prompting, measure where it falls short, and only fine-tune the specific subtask that justifies the maintenance cost.

Where is a human still required in the loop?

Keep a human gate in front of any output that triggers an irreversible or money-moving action, and on a continuous sample of routine outputs for drift monitoring. The model can draft, extract, and summarize at scale, but accountability for decisions has to rest with a person, which is also what most financial regulators expect.

How do I know when the pipeline has drifted?

Track the verification disagreement rate, the citation-faithfulness rate, and the volume of human corrections over time. A sustained rise in any of these is the drift signal. It commonly follows a provider model update or a change in the upstream document source, so log the model version alongside outputs to localize the cause.

AI in Markets Guide

How to Deploy an LLM in a Finance Pipeline

Large language models are useful in finance for summarizing filings, drafting memos, extracting structured data, and routing questions. They are also confident when wrong, and finance is unforgiving of small numerical and citation errors. The reliability comes from the system around the model, not the model alone. The layers that turn a capable but fallible model into a deployable pipeline are laid out below.

9 MIN READPublished May 26, 2026Live Content

By AI Fin Hub Research · AI Fin Hub Team

On This Page

Before you start 6 steps Common mistakes FAQ

Before You Start

Set up the inputs that make the next steps easier

A clearly scoped task with a defined input and a checkable output, not an open-ended assistant brief.

Access to the source documents or data the answers must be grounded in.

A deterministic way to compute any number the model is allowed to state, so its arithmetic can be verified rather than trusted.

Guide Steps

Move through it in order

Each step focuses on one decision so you can keep momentum without losing the thread.

1

Scope the task so errors are catchable

The first decision is what the model is allowed to do. A pipeline that extracts a named field from a filing, or summarizes a document with citations, has outputs you can check against the source. A pipeline that freely generates investment conclusions does not. Pick tasks where a wrong answer is detectable by a downstream check or a human reviewer, and keep the model away from final decisions that move money without a gate in front of them.

Prefer extraction and summarization with citations over open generation. The narrower the task, the cheaper the verification.

Use The ToolComparators
Model Selector for Finance
Input task, latency budget, cost budget, context size, and quality sensitivity; get ranked model recommendations with rationale — grounded in published.
ToolOpen ->
2

Ground answers in retrieved sources

Connect the model to the documents it must answer from using retrieval, so it works from the actual filing or dataset rather than its training memory. Grounding lowers fabrication and lets you trace each claim to a passage. Require the model to cite the passage it used, and reject answers whose citations do not support the claim. Retrieval reduces hallucination but does not eliminate it, so the citation check is what makes grounding trustworthy.

Chunk source documents at a size that keeps each retrieved passage self-contained. Over-large chunks dilute relevance; over-small chunks lose context.

Use The ToolGenerators
SEC Filing Chunk Optimizer
Pick a filing archetype, tune chunk size and overlap, and see chunk count, embedding cost, and structural-boundary warnings across three chunking strategies.
ToolOpen ->
3

Guard the input against prompt injection

When the model reads external content, that content can contain instructions aimed at the model rather than at you. A filing, an email, or a web page can carry text that tries to override the system prompt or exfiltrate data. Treat all retrieved content as untrusted, separate instructions from data in the prompt structure, and test the pipeline against known injection patterns before it goes live. This is a security control, not a tuning step.

Never let retrieved text occupy the same trust level as your system instructions. Keep tool permissions least-privilege so a successful injection has little to act on.

Use The ToolPlaygrounds
Prompt Injection Tester
Red-team a finance agent against 24 documented prompt-injection attacks — direct override, role confusion, indirect injection via retrieved content.
ToolOpen ->
4

Verify every number against a deterministic engine

Do not let the model do the arithmetic that matters. If an answer states a ratio, a total, or a projection, compute that number with a deterministic engine and have the model present the verified value rather than its own. Models are weak at multi-step numerical reasoning and the errors compound across steps. Routing all figures through a calculator turns the model into a presenter of checked numbers, which is the role it is actually reliable in.

Flag any answer where the model's stated number and the engine's number disagree beyond a tiny tolerance, and surface the disagreement rather than silently overriding it.

Use The ToolPlaygrounds
Hallucination Detector
Paste a source document + an LLM's extraction. Every numeric claim in the output is checked against the source. Client-side. Catches silent fabrication.
ToolOpen ->
5

Pin the prompt and add a regression suite

Once the pipeline works, freeze the prompt and the model version, then build a regression suite of representative inputs with expected outputs. Run it on every prompt change and every model update. Model providers change behavior between versions, and a prompt that worked can quietly degrade. The regression suite is what tells you a change broke something before your users find out in production.

Include adversarial and edge-case inputs in the suite, not just the happy path. The cases that break are the ones you did not think to test manually.

Use The ToolPlaygrounds
Prompt Regression Tester
Run the same prompt against multiple models (Claude 4.5/4.6/4.7, GPT-5, Gemini 2.5) with your own keys. Diff outputs, score drift, catch regressions.
ToolOpen ->
6

Log outputs and monitor for drift

Persist inputs, retrieved context, and outputs so you can audit decisions and detect drift. Track quality metrics over time: citation faithfulness, verification disagreement rate, and reviewer corrections. A rising disagreement rate is an early signal that the model version, the data distribution, or an upstream source has changed. Monitoring closes the loop and makes the pipeline maintainable rather than a one-time deployment that silently rots.

Sample a fraction of production outputs for human review continuously, not just at launch. Drift shows up in the tails first.

Common Mistakes

The misses that undo good inputs

Letting the model state numbers it computed itself

Models make frequent multi-step arithmetic errors that compound, and they state wrong numbers with full confidence. Any figure that matters must be computed deterministically and verified.

Treating retrieved content as trusted input

External documents can carry prompt-injection instructions. Mixing them with system instructions at the same trust level invites the model to follow attacker text instead of your rules.

Shipping without a regression suite

Provider model updates and prompt edits silently change behavior. With no regression suite, a degradation reaches production undetected and is discovered by users rather than by tests.

FAQ

Questions people ask next

The short answers readers usually want after the first pass.

No. Grounding a model in retrieved sources reduces fabrication substantially, but published evaluations show grounded systems still produce unsupported claims and occasionally cite passages that do not back the statement. That is why a citation-faithfulness check, where you reject answers whose cited source does not support the claim, sits on top of retrieval rather than replacing it.

Sources & References

Survey of Hallucination in Natural Language Generation — Ziwei Ji et al., ACM Computing Surveys (2023)
OWASP Top 10 for Large Language Model Applications — OWASP Foundation (2023)
Artificial intelligence in UK financial services 2024 — Bank of England and Financial Conduct Authority (2024)

Keep the topic connected

AI in Markets1 FAQS

LLM Hallucination Detection in Finance

How to detect LLM hallucinations in financial outputs: citation grounding, verifiable-claim checks, and cross-model agreement that flag fabricated data.

Keep readingRead ->

AI in Markets1 FAQS

Prompt Injection

Prompt injection: when untrusted text in a prompt overrides system instructions. The attack patterns and the structural defenses that work in production.

Keep readingRead ->

AI in Markets2 FAQS

MCP (Model Context Protocol)

Model Context Protocol: Anthropic's open standard for letting LLMs discover and call tools — the interface, why it matters, and finance MCP server checks.

Keep readingRead ->

AI in Markets14 ITEMS

LLM for Finance Deployment Checklist

A pre-flight checklist for putting a large language model into a finance workflow: scoping, grounding, input security, numerical verification, and drift monitoring.

Keep readingRead ->

Set up the inputs that make the next steps easier

Move through it in order

Scope the task so errors are catchable

Ground answers in retrieved sources

Guard the input against prompt injection

Verify every number against a deterministic engine

Pin the prompt and add a regression suite

Log outputs and monitor for drift

The misses that undo good inputs

Letting the model state numbers it computed itself

Treating retrieved content as trusted input

Shipping without a regression suite

Questions people ask next

Keep the topic connected

LLM Hallucination Detection in Finance

Prompt Injection

MCP (Model Context Protocol)

LLM for Finance Deployment Checklist