RAG vs Fine-Tuning Economics Statistics
Fine-tuning alone lifted domain-task accuracy by over 6 percentage points and RAG alone by about 5; combining both produced gains beyond either individually. That additive relationship is the key finding. The data comes from a Microsoft research case study, peer-reviewed surveys, and benchmark papers, each cited with its source and year; none was generated by this site. These are accuracy effects on specific tasks, not dollar costs, so the practical question the numbers answer is which approach captures more of the gain for less effort, not what each one costs in absolute terms.
Statistics
The numbers worth quoting
Fine-tuning alone improved domain-task accuracy by over 6 percentage points, and RAG alone by about 5 percentage points, in a Microsoft case study on agriculture (Balaguer et al., 2024)
The study evaluated Llama2-13B, GPT-3.5, and GPT-4. RAG captured most of the accuracy gain of the more expensive fine-tuning route, an economically relevant finding for build-versus-retrieve decisions.
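To make "captured most of the accuracy gain" concrete, here is a minimal sketch of the arithmetic. It rounds the reported figures to point values (6.0 and 5.0 percentage points, where the study reports "over 6" and "about 5"), and the `relative_effort` numbers are hypothetical placeholders to replace with your own estimates; neither the names nor the effort values come from the study.

```python
from dataclasses import dataclass


@dataclass
class Approach:
    name: str
    gain_pp: float          # accuracy gain in percentage points (rounded from the study)
    relative_effort: float  # hypothetical effort units; NOT from the study


def gain_per_effort(approach: Approach) -> float:
    """Accuracy gain per unit of (assumed) effort."""
    if approach.relative_effort <= 0:
        raise ValueError("relative_effort must be positive")
    return approach.gain_pp / approach.relative_effort


# Gains are rounded point values; effort figures are illustrative assumptions only.
fine_tuning = Approach("fine-tuning", gain_pp=6.0, relative_effort=3.0)
rag = Approach("RAG", gain_pp=5.0, relative_effort=1.0)

# Share of the fine-tuning gain that RAG alone recovers.
share = rag.gain_pp / fine_tuning.gain_pp
print(f"RAG recovers about {share:.0%} of the fine-tuning gain")

for approach in (fine_tuning, rag):
    print(f"{approach.name}: {gain_per_effort(approach):.1f} pp per effort unit")
```

The share figure depends only on the two reported gains. The per-effort ranking depends entirely on the effort values you supply, so treat it as a template rather than a result.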
Combining RAG and fine-tuning produced cumulative accuracy gains beyond either approach alone
The two techniques addressed different failure modes (knowledge retrieval versus behavioral adaptation), so their gains added rather than overlapped, supporting hybrid pipelines where budget allows.
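One way to read "added rather than overlapped" is as a simple overlap model, sketched below. The `combined_gain` function and its `overlap_pp` parameter are assumptions introduced for illustration, and the inputs are the rounded point values 6.0 and 5.0; the study does not report its combined result in this form.

```python
def combined_gain(ft_gain_pp: float, rag_gain_pp: float, overlap_pp: float = 0.0) -> float:
    """Combined accuracy gain when the two techniques fix partly the same errors.

    overlap_pp is the portion of the gain that both techniques would have
    delivered on their own; it is zero when they address disjoint failure modes.
    """
    if not 0.0 <= overlap_pp <= min(ft_gain_pp, rag_gain_pp):
        raise ValueError("overlap must lie between 0 and the smaller gain")
    return ft_gain_pp + rag_gain_pp - overlap_pp


# Disjoint failure modes: the gains simply add.
additive = combined_gain(6.0, 5.0)

# Fully overlapping failure modes: the combination is no better than fine-tuning alone.
fully_overlapping = combined_gain(6.0, 5.0, overlap_pp=5.0)

print(f"additive case: {additive:.1f} pp")
print(f"fully overlapping case: {fully_overlapping:.1f} pp")
```

Under this model, the case for a hybrid pipeline rests on the overlap being small, which is what the study's cumulative result suggests for its tasks.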
Retrieval-augmented generation reduces but does not eliminate hallucination, with grounded systems still producing unsupported claims
The Ji et al. (2023) survey of hallucination in natural language generation documents that grounding answers in retrieved context lowers fabrication yet leaves residual faithfulness errors, including citing sources that do not support the claim. That residual risk matters acutely in finance.
Domain-adapted financial models such as BloombergGPT outperformed comparable general models on finance tasks, illustrating what the heaviest form of adaptation, domain pre-training, can deliver
BloombergGPT represents the heaviest end of the adaptation spectrum (domain pre-training), more costly than fine-tuning or RAG, and improved finance accuracy without solving general reasoning.
FinBen added a dedicated RAG evaluation track, reflecting that retrieval is now a standard component of financial LLM systems
Including a RAG track in a finance benchmark signals that grounded generation, not raw model recall, is the assumed deployment pattern for finance tasks.
Methodology
Figures are drawn from a Microsoft research case study, peer-reviewed surveys, and benchmark papers, each reported with its source and year. The accuracy effects are task-specific (the Microsoft case study used agricultural data) and should not be read as dollar costs for finance workloads. No statistic on this page is derived from data collected by this site.
Sources & References
- RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture — Angels Balaguer et al., Microsoft (2024)
- Survey of Hallucination in Natural Language Generation — Ziwei Ji et al., ACM Computing Surveys (2023)
- BloombergGPT: A Large Language Model for Finance — Shijie Wu et al. (2023)
- FinBen: A Holistic Financial Benchmark for Large Language Models — Qianqian Xie et al., NeurIPS Datasets and Benchmarks (2024)