TL;DR
Batch APIs from Anthropic (Message Batches) and OpenAI (Batch API) return results within a 24-hour window at roughly fifty percent of real-time pricing1. Finance workloads with soft deadlines, such as nightly 10-K triage, multi-year earnings-call backfills, weekly macro-data synthesis, and monthly sector retrospectives, are a near-perfect fit. Real-time APIs remain the only option for earnings-day reaction, intraday agent research, and tight prompt-regression loops. The break-even test is one question: can the job tolerate up to 24 hours of latency? If yes, batch halves the bill. If no, real-time is mandatory. Hybrid is the default pattern for production loops: batch the long tail, then escalate the top quantile to real-time on demand.
What batch APIs actually do
A batch API accepts a large payload of independent requests, runs them asynchronously at lower priority, and returns results within a published service window at a discounted rate. The trade the practitioner makes is latency for price. The batch endpoints are not a different model or a degraded response; the same model weights produce the same output quality, and only the queuing behavior and the invoice change. Pricing and SLAs evolve; the numbers below reflect rates published as of 2026-04 and should be re-checked on vendor pricing pages before a production rollout.
Anthropic Message Batches
Anthropic's Message Batches API accepts up to 100,000 requests or 256 MB per batch file[^2]. Each request is a full Messages API payload embedded inside a batch envelope with a custom_id for correlation. The service target is processing within 24 hours; in practice, batches of a few thousand requests typically complete in minutes to a few hours, and the 24-hour figure is the published upper bound rather than the typical case. Pricing is fifty percent off the real-time input and output rates, and prompt caching still applies inside batch jobs: cache reads are billed at the cached rate, and the fifty-percent batch discount compounds on top. Result retrieval is a streaming JSONL pull keyed by custom_id; partial failures inside a batch are reported per-request without failing the whole submission.
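For orientation, a single request envelope and the result row it maps back to look roughly like this; the shapes are abbreviated, field names follow the Message Batches documentation, and the custom_id value is a placeholder:

```python
# One request inside a batch: a caller-chosen custom_id plus a full Messages payload.
request = {
    "custom_id": "filing-0000123-risk-v3",  # placeholder correlation key
    "params": {
        "model": "claude-sonnet-4-6",
        "max_tokens": 2000,
        "messages": [{"role": "user", "content": "…filing text…"}],
    },
}

# One row of the results JSONL, keyed back to the same custom_id.
result_row = {
    "custom_id": "filing-0000123-risk-v3",
    "result": {
        "type": "succeeded",  # or "errored", "canceled", "expired"
        "message": {"role": "assistant", "content": [{"type": "text", "text": "…"}]},
    },
}
```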
OpenAI Batch API
OpenAI's Batch API ingests a JSONL file of Chat Completions or Responses API requests uploaded through the Files endpoint, then processes the job within a 24-hour window at fifty percent of real-time pricing[^3]. Each line of the JSONL carries a custom_id, a method, a url, and a body. Results come back as a second JSONL file with success rows and an errors file for rejected requests. Jobs can be canceled while queued or in progress, but already-processed requests are still billed. Rate-limit quotas for batch are tracked separately from real-time quotas, so a batch submission does not eat into the synchronous request budget.
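A minimal submission sketch with the OpenAI Python SDK looks like the following; the corpus, model name, and custom_id scheme are illustrative, and the endpoint string should match whichever API the JSONL bodies target:

```python
import json
from openai import OpenAI

client = OpenAI()
transcripts = ["…call text one…", "…call text two…"]  # placeholder corpus

# Each JSONL line is a self-contained request with a caller-chosen custom_id.
lines = [
    {
        "custom_id": f"transcript-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",  # illustrative model choice
            "max_tokens": 2000,
            "messages": [{"role": "user", "content": text}],
        },
    }
    for i, text in enumerate(transcripts)
]
with open("batch_input.jsonl", "w") as fh:
    fh.writelines(json.dumps(line) + "\n" for line in lines)

# Upload the file, then create the batch job against it.
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# Later: poll the job, then pull the success and error files.
job = client.batches.retrieve(job.id)
if job.status == "completed":
    results_jsonl = client.files.content(job.output_file_id).text
```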
Google Vertex AI batch prediction
Google's batch prediction on Vertex AI supports Gemini 2.5 Pro and Flash via BigQuery or Cloud Storage input sources. Pricing and SLAs differ from the Anthropic and OpenAI pattern: Vertex batch historically published per-node-hour rates or per-token discounts that vary by region and model, rather than a flat fifty-percent cut. The published documentation is the source of truth and should be checked at the time of sizing[^4]. The operational pattern mirrors the others: submit a file of inputs, wait for the job, pull results.
Finance workloads that are batch-friendly
The pattern is consistent across four common retail-research loops. Dollar figures below are derived from Anthropic pricing as published in 2026-04 for Sonnet 4.6 (real-time input $3/MT, output $15/MT; batch input $1.50/MT, output $7.50/MT)[^1][^2], assuming 8K input + 2K output per document and no prompt caching.
| Workload | Volume | Real-time cost | Batch cost | Savings |
|---|---|---|---|---|
| Nightly 10-K triage, 500 filings | 500 docs/mo | $27.00 | $13.50 | $13.50 |
| Earnings-call backfill, 5,000 transcripts | 5,000 docs (one-shot) | $270.00 | $135.00 | $135.00 |
| Weekly macro-data synthesis, 40 reports | 160 docs/mo | $8.64 | $4.32 | $4.32 |
| Monthly sector rotation research, 200 notes | 200 docs/mo | $10.80 | $5.40 | $5.40 |
The per-workload savings look modest on a single model pass. The real leverage shows up at larger context windows and in multi-pass pipelines. A 10-K triage loop that pushes the full filing through a 60K-token context window (Sonnet 4.6, no caching) costs roughly $0.21 per document at real-time rates and $0.105 at batch rates, so over 500 filings per month that's $105 vs $52.50. On a 5,000-transcript backfill pass, the same math yields $1,050 real-time vs $525 batch. The absolute spread scales linearly with token count; the relative discount stays at fifty percent.
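A back-of-the-envelope helper keeps this arithmetic next to the pipeline code. A minimal sketch using the same assumed Sonnet 4.6 rates as the table above; re-check the vendor pricing page before relying on the constants:

```python
# Assumed Sonnet 4.6 real-time rates in USD per million tokens (verify before use).
SONNET_REALTIME = {"input": 3.00, "output": 15.00}
BATCH_DISCOUNT = 0.5  # batch endpoints bill at half the real-time rate

def doc_cost(input_tokens: int, output_tokens: int, rates: dict, batch: bool = False) -> float:
    """Cost in USD of one model pass over one document."""
    cost = input_tokens / 1e6 * rates["input"] + output_tokens / 1e6 * rates["output"]
    return cost * BATCH_DISCOUNT if batch else cost

# Table row: nightly 10-K triage at 8K in / 2K out per filing.
print(500 * doc_cost(8_000, 2_000, SONNET_REALTIME))               # 27.00 real-time
print(500 * doc_cost(8_000, 2_000, SONNET_REALTIME, batch=True))   # 13.50 batch

# Full-filing pass: 60K in / 2K out per filing.
print(doc_cost(60_000, 2_000, SONNET_REALTIME))                    # 0.21 per filing, real-time
print(500 * doc_cost(60_000, 2_000, SONNET_REALTIME, batch=True))  # 52.50 per month, batch
```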
Batch also changes the risk profile of backfill jobs. A one-shot 5,000-transcript sweep at real-time rates is an HTTP retry nightmare: thousands of synchronous calls subject to transient 429s, network errors, and connection resets. The batch pathway converts that into a single file upload and one polling loop. Fewer moving parts, fewer places for a script to die halfway through and leave a partial invoice.
The earnings-call backfill case deserves a closer look because it is the single largest one-shot workload most retail research operations ever run. A typical S&P 500 five-year backfill covers roughly 10,000 transcripts at an average of 35,000 tokens per transcript, call it 350 million input tokens in total, plus whatever structured output the extractor emits. At real-time Sonnet 4.6 input pricing, that's $1,050 just for the input side before output and retries. The batch equivalent is $525. Adding an output budget of 2,000 tokens per transcript lifts the real-time total to roughly $1,350 and the batch total to $675. These are not exotic numbers: they describe the cost of building a minimal earnings-call research corpus once and never paying for re-ingestion again. A practitioner who batches the initial load and then incrementally processes each new quarterly release in real time pays the batch rate on 95% of the corpus and the real-time rate only on the thin current-quarter margin.
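The doc_cost helper from the sketch above reproduces the backfill totals under the same corpus sizing (10,000 transcripts, roughly 35K input tokens and a 2K output budget each):

```python
# Five-year S&P 500 earnings-call backfill, sized as in the text.
n_transcripts, in_tokens, out_tokens = 10_000, 35_000, 2_000
print(n_transcripts * doc_cost(in_tokens, 0, SONNET_REALTIME))                       # 1050.00 input only, real-time
print(n_transcripts * doc_cost(in_tokens, out_tokens, SONNET_REALTIME))              # 1350.00 with output, real-time
print(n_transcripts * doc_cost(in_tokens, out_tokens, SONNET_REALTIME, batch=True))  # 675.00 batch
```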
One nuance that rewards attention is the interplay between batch and prompt caching on Anthropic's endpoints. Cached reads bill at roughly one-tenth the real-time input price, and that reduction compounds with the batch discount, so a cached token read inside a batch job lands at one-twentieth the uncached real-time rate. A practitioner batching a 500-filing nightly sweep can keep the extraction schema, the role definition, and the examples in a shared 4,000-token cached block and pay the 10x-cheaper cache rate on that prefix for every request in the batch. Cache breakpoints behave identically to the synchronous case; only the invoice differs. The combined lever is why well-run research loops often land at effective token costs an order of magnitude below the naive frontier-model headline rate, as documented in Prompt Caching Economics for Finance.
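In request terms, the shared prefix carries a cache_control breakpoint inside the batch request's params, exactly as it would on the synchronous endpoint. A sketch, assuming the schema, role definition, and examples live in a roughly 4,000-token system block; the block text and filing content are placeholders:

```python
def cached_batch_request(filing_id: str, filing_text: str, shared_prefix: str) -> dict:
    """Batch request whose shared system prefix is cache-eligible."""
    return {
        "custom_id": filing_id,
        "params": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 2000,
            # Identical across every request in the batch, so reads bill at the
            # cache rate and the batch discount stacks on top of that.
            "system": [
                {
                    "type": "text",
                    "text": shared_prefix,  # schema + role definition + examples
                    "cache_control": {"type": "ephemeral"},
                }
            ],
            # Per-filing content sits after the breakpoint and is billed normally.
            "messages": [{"role": "user", "content": filing_text}],
        },
    }
```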
Finance workloads that must NOT be batch
Three categories should never run through batch endpoints regardless of cost.
Earnings-day reaction. When an issuer releases a quarterly report at market open, the usable window for a research agent is measured in minutes. A batch job returning four hours later is returning answers to questions the market has already priced. Real-time endpoints are the only viable path, and for the top-of-queue case a high-speed model like Haiku 4.5 or Gemini Flash often beats a batch-queued Sonnet because latency dominates quality.
Intraday agent research. An agent loop that makes iterative calls (tool use, search-and-synthesize, multi-turn reasoning) cannot batch its calls, because each step depends on the previous step's output. Trying to force batch semantics onto an agent graph creates a 24-hour round trip per decision node, which is unusable. Agents belong on real-time endpoints.
Prompt-regression sweeps with tight iteration. When tuning a prompt, the practitioner runs the same request dozens of times with small variations and reads the output immediately. A 24-hour turnaround on a regression sweep is the development equivalent of waiting a business day for a unit test to fail. Real-time is correct here, and cost is manageable because regression sweeps are short-duration tasks by design.
The boundary between batch-friendly and batch-hostile is fundamentally about the shape of the feedback loop. If a human or a scheduler is going to look at the results tomorrow, batch. If the next step of the system or the market is going to act in the next hour, real-time. The full decision tree lives in Model Selection Framework for Finance.
Operational patterns
A production batch loop has four moving parts: build the request payload (a request list for Anthropic, a JSONL file for OpenAI) with stable idempotency keys, submit it, poll for completion, and validate every returned row against a schema. The loop below uses Anthropic's batch API; the OpenAI pattern is structurally identical.
```python
import hashlib
import json
import time

from anthropic import Anthropic

client = Anthropic()


def idempotency_key(filing_id: str, prompt_version: str) -> str:
    # Deterministic: the same filing + prompt version always maps to the same custom_id.
    raw = f"{filing_id}:{prompt_version}".encode()
    return hashlib.sha256(raw).hexdigest()[:32]


def build_batch(filings: list[dict], prompt_version: str) -> list[dict]:
    # Wrap each filing in a batch envelope carrying a full Messages payload.
    requests = []
    for f in filings:
        requests.append({
            "custom_id": idempotency_key(f["id"], prompt_version),
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 2000,
                "system": "Extract risk factors as JSON matching the schema.",
                "messages": [{"role": "user", "content": f["text"]}],
            },
        })
    return requests


def submit_and_wait(requests: list[dict], poll_seconds: int = 60):
    # Submit once, then poll until processing ends for every request in the batch.
    batch = client.messages.batches.create(requests=requests)
    while True:
        status = client.messages.batches.retrieve(batch.id)
        if status.processing_status == "ended":
            return status
        time.sleep(poll_seconds)


def validate_row(row: dict, schema_keys: set[str]) -> bool:
    # A row passes only if the API succeeded AND the output parses against the schema.
    if row.get("result", {}).get("type") != "succeeded":
        return False
    try:
        body = row["result"]["message"]["content"][0]["text"]
        parsed = json.loads(body)
        return schema_keys.issubset(parsed.keys())
    except (KeyError, json.JSONDecodeError, IndexError):
        return False


def run_nightly_triage(filings: list[dict], prompt_version: str = "v3"):
    # End-to-end nightly loop: build, submit, wait, then split rows into passed/failed.
    reqs = build_batch(filings, prompt_version)
    done = submit_and_wait(reqs)
    results = client.messages.batches.results(done.id)
    schema_keys = {"risk_factors", "severity", "evidence"}
    passed, failed = [], []
    for row in results:
        (passed if validate_row(row, schema_keys) else failed).append(row)
    return {"batch_id": done.id, "passed": passed, "failed": failed}
```
Three points the code makes explicit. First, the idempotency key is deterministic. Hashing filing ID plus prompt version means re-submitting the same batch after a crash produces identical custom_id values, so downstream consumers can de-duplicate. Second, partial failures are first-class: the failed list collects both rows that errored API-side and rows that returned successfully but did not pass schema validation, the latter being the common case for structured-output workflows. Third, retries are not implicit. A batch job does not auto-retry; the caller submits the failed subset as a new batch, or upgrades it to real-time if the deadline has shifted. That manual gate is what the hybrid pattern below exploits.
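For the manual retry gate, a small helper can rebuild a batch from just the failed subset. A sketch that assumes the original filings list is still on hand, so failed custom_id values can be mapped back to their documents:

```python
def resubmit_failed(failed_rows: list[dict], filings: list[dict], prompt_version: str):
    """Re-batch only the filings whose rows errored or failed schema validation."""
    # Map each filing back to the custom_id it was originally submitted under.
    by_key = {idempotency_key(f["id"], prompt_version): f for f in filings}
    failed_keys = {row["custom_id"] for row in failed_rows}
    retry_filings = [by_key[k] for k in failed_keys if k in by_key]
    if not retry_filings:
        return None
    # Same builder, same custom_ids: downstream consumers can still de-duplicate.
    return run_nightly_triage(retry_filings, prompt_version)
```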
Cost-accrual accounting should record, per row, which endpoint served the request. A single table with columns filing_id, prompt_version, endpoint (batch or real-time), input_tokens, output_tokens, cost_usd gives the per-idea cost attribution needed for the measurements discussed in Inference Cost Attribution per Trade.
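A minimal sketch of that table in SQLite; the column names follow the prose above, and the cost_usd written at submission time is a forecast to be reconciled against the invoice line items later:

```python
import sqlite3

conn = sqlite3.connect("inference_costs.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS request_costs (
        filing_id      TEXT NOT NULL,
        prompt_version TEXT NOT NULL,
        endpoint       TEXT NOT NULL CHECK (endpoint IN ('batch', 'realtime')),
        input_tokens   INTEGER NOT NULL,
        output_tokens  INTEGER NOT NULL,
        cost_usd       REAL NOT NULL,  -- forecast at submission time
        submitted_at   TEXT DEFAULT (datetime('now'))
    )
""")

def record_cost(filing_id: str, prompt_version: str, endpoint: str,
                input_tokens: int, output_tokens: int, cost_usd: float) -> None:
    conn.execute(
        "INSERT INTO request_costs VALUES (?, ?, ?, ?, ?, ?, datetime('now'))",
        (filing_id, prompt_version, endpoint, input_tokens, output_tokens, cost_usd),
    )
    conn.commit()
```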
Gotchas
Structured outputs and schema versioning. If the downstream consumer expects a specific JSON schema, and the schema changes mid-job, the entire batch must complete on the prior schema. Submitting a batch on Monday with schema v3, then deploying schema v4 on Tuesday, means Monday's batch lands on Wednesday with v3 output into a v4-expecting consumer. The fix is to pin the schema version inside the prompt and inside the idempotency key, and to hold downstream migrations until all in-flight batches have drained.
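One way to enforce that pin is to fold the schema version into the idempotency key alongside the prompt version; the version strings here are illustrative:

```python
def idempotency_key_v2(filing_id: str, prompt_version: str, schema_version: str) -> str:
    """Key changes whenever the prompt OR the output schema changes, so a batch
    that drains late can always be matched to the schema its consumer expects."""
    raw = f"{filing_id}:{prompt_version}:{schema_version}".encode()
    return hashlib.sha256(raw).hexdigest()[:32]

# e.g. idempotency_key_v2("0000123-25-000456", "v3", "schema-v3")
```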
Cost attribution. Batch jobs are billed to the submitting organization at the time of submission, not at the time the downstream consumer uses the results. For a research loop with multiple strategy owners sharing infrastructure, the invoice can arrive weeks before anyone assigns cost to a specific idea. Per-request cost attribution has to be computed at submission time from input_tokens and output_tokens forecasts, then reconciled against the actual batch invoice line items.
Non-determinism. The same batch run twice can produce different outputs even at temperature zero, because batch endpoints do not guarantee identical sampling between runs. For research that requires reproducibility (published reports, regulatory filings, shared benchmarks), version the prompt and store the model output verbatim. Do not assume a re-run produces the same text.
Cancellation semantics. Canceling a batch while it is processing stops further dispatch but does not refund already-processed rows. Build budget alarms around estimated spend at submission, not around the assumption that a runaway job can be killed cleanly.
Rate-limit interaction. A common misconception is that a huge batch will exhaust the organization's quota. It will not; batch rate limits are separate from synchronous limits. A practitioner can submit a 10,000-request batch and continue serving real-time queries from the same API key without throttling.
Hybrid pattern: batch by default, escalate to real-time
The production pattern that dominates well-run research loops is two-stage. Stage one submits the entire universe of documents to the batch endpoint at 50% pricing. Stage two, running the next morning after results land, identifies the top quantile by some relevance signal and re-runs those through real-time endpoints, either for a deeper model pass or for latency-sensitive downstream consumption.
Consider a nightly loop processing 500 10-K filings. A flat real-time pass at Sonnet 4.6 costs roughly $105/month at the parameters above. A flat batch pass costs $52.50/month. The hybrid (batch all 500, then re-run the top 50 on real-time Opus 4.7 at input $15/MT, output $75/MT for deeper synthesis) costs $52.50 for the batch Sonnet base plus roughly $60 for the real-time Opus pass on 50 filings at 60K input and 4K output, which totals $112.50/month. That is roughly 7% more than the flat real-time Sonnet pass and a bit over double the flat batch pass, but the top-quantile output is materially higher quality than any single-model pass. Flat real-time Opus on all 500 would cost $600/month for the same depth on the whole universe, most of which is wasted on low-relevance filings.
```python
def hybrid_triage(filings, relevance_threshold=0.75):
    # Stage 1: batch all filings at the discounted rate.
    batch_result = run_nightly_triage(filings, prompt_version="v3")

    # Parse the relevance signal out of the batch output.
    top = []
    for row in batch_result["passed"]:
        body = row["result"]["message"]["content"][0]["text"]
        parsed = json.loads(body)
        if parsed.get("relevance_score", 0) >= relevance_threshold:
            top.append(parsed["filing_id"])

    # Stage 2: real-time deep pass on the top quantile only.
    deep_results = []
    for fid in top:
        resp = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=4000,
            system="Produce a full investment thesis JSON.",
            # load_filing: caller-supplied loader returning the filing text for an ID.
            messages=[{"role": "user", "content": load_filing(fid)}],
        )
        deep_results.append({"filing_id": fid, "thesis": resp.content[0].text})
    return {"batch": batch_result, "deep": deep_results}
```
The same pattern generalizes: batch-tier model for coverage, real-time frontier model for the interesting subset. The relevance signal can be an LLM-emitted score, a deterministic filter on extracted fields, or a human review checkpoint. The key property is that stage one's cost is the cost of the bulk universe, and stage two's cost scales only with the top quantile, so total spend is dominated by coverage at a discounted rate.
Estimating before submitting
The Batch vs Real-Time Cost Calculator takes document count, token sizes, and model choice and emits the break-even point for a hybrid pattern. Pairing it with the Financial Document Token Estimator gives a pre-submission estimate of batch spend down to the dollar, and the Token Cost Optimizer extends the analysis across model tiers for mixed-workload loops.
Connects to
- The 2026 Engineer's Guide to AI in Markets — pillar reference for the full stack.
- Reading Financial Filings with LLMs 2026 — upstream workload this article's batch sizing assumes.
- Prompt Caching Economics for Finance — stacks with batch pricing for additional savings.
- Model Selection Framework for Finance — full decision tree for real-time vs batch vs local.
- Inference Cost Attribution per Trade — how to book batch vs real-time spend to specific ideas.
- The Token-Cost Reality of LLM Trading Research — per-validated-trade costing with the same pricing baseline.
- Batch vs Real-Time Cost Calculator — break-even calculator for batch submissions.
- Financial Document Token Estimator — token forecasts for 10-Ks, 10-Qs, transcripts before you submit.
- Token Cost Optimizer — per-model cost table across the full price/performance curve.
References
- OpenAI. "Pricing." https://openai.com/api/pricing/ (accessed 2026-04-23). Real-time baseline used for OpenAI batch discount calculations.
- Anthropic. "Prompt caching." https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching (accessed 2026-04-23). Interaction of caching discounts with batch pricing.
- SEC EDGAR. https://www.sec.gov/edgar.shtml. Source of the 10-K and 10-Q corpora referenced in worked batch volumes.
Footnotes
[^1]: Anthropic. "Message Batches API." Anthropic Documentation. https://docs.anthropic.com/en/docs/build-with-claude/message-batches (accessed 2026-04-23). Published 50% discount on batch input and output pricing vs real-time Messages API.

[^2]: Anthropic. "Pricing." https://www.anthropic.com/pricing (accessed 2026-04-23). Batch pricing table and per-model real-time baselines used for worked examples.

[^3]: OpenAI. "Batch API." OpenAI Platform Documentation. https://platform.openai.com/docs/guides/batch (accessed 2026-04-23). 24-hour SLA, JSONL submission format, 50% discount vs synchronous endpoints.

[^4]: Google Cloud. "Get batch predictions for Gemini." Vertex AI Documentation. https://cloud.google.com/vertex-ai/generative-ai/docs/batch-prediction-api (accessed 2026-04-23). Vertex batch prediction interface and pricing posture.