TL;DR
Earnings call transcripts are not 10-Ks. Three differences break naive prompts: off-the-cuff spoken language replaces lawyer-audited prose, the Q&A section mixes multiple speakers whose statements carry different legal weight, and forward-looking language is drenched in hedging verbs ("expect", "anticipate", "likely", "could") that a summarizer flattens into false certainty. Five prompt patterns fix this: speaker-attributed extraction, hedged-guidance confidence scoring, multi-quarter guidance-delta tracking, risk-surface aggregation across management disclosures, and a forward-outlook vs historical-performance separator. Below: each pattern as a copy-paste system prompt with JSON schema, cost per call at Sonnet 4.6 published 2026-04 rates, typical failure modes, and a short Python block that runs the four single-call patterns against a synthetic SYNTHETIC_A Corp excerpt.
Why earnings calls are harder than 10-Ks
A 10-K is a lawyered, audited 100,000-word document. Every number has a source in the financials; every risk is phrased in defensive, consistent language. An earnings call transcript looks similar on the surface but breaks three assumptions filing extractors rely on.
First, the language is spoken. Sentences run on. Numbers are restated verbally ("revenue was one billion, two hundred and thirty million dollars, up roughly twelve percent"). Unit qualifiers drop off mid-sentence. A field-by-field JSON extractor tuned on 10-K prose will misparse 5-10% of numeric claims on a typical transcript.1
Second, the Q&A section is multi-speaker. A 10-K has one voice. An earnings call has prepared remarks (CEO, CFO, sometimes a segment head) followed by 30-45 minutes of Q&A where analysts from seven to fifteen firms ask pointed questions. A CFO statement under direct questioning carries different legal and analytical weight than an analyst's framing. Regulation FD's selective-disclosure rules apply differently to prepared remarks and spontaneous answers.2 An extractor that ignores speaker identity conflates all of it into one homogeneous claim stream.
Third, forward-looking statements are hedged. "We expect 5-7% EPS growth next year" is not a forecast of 5-7% EPS growth. It is a probability-weighted forecast of a range, explicitly qualified as forward-looking under the Private Securities Litigation Reform Act of 1995 safe harbor. A naive summarizer writes "SYNTHETIC_A Corp guided 5-7% EPS growth" and strips the hedging. The downstream consumer reads it as a point estimate.
The patterns below are ordered so each builds on the previous.
Pattern 1: Speaker attribution
What it extracts. Every claim in the transcript tagged with who said it, what role they hold, when in the call they said it, and which section (prepared remarks vs Q&A) the claim came from.
Why naive prompting fails. Ask for "a summary of what was said" and the output returns a fused narrative with no speaker metadata. An analyst's speculative question gets merged with the CFO's answer. The most common downstream error is attributing an analyst's framing to management. "Will operating margin compress below 20% next quarter?" asked by Morgan Chen becomes "management indicated operating margin may compress below 20%" in the summary.
The prompt:
You are an earnings call transcript extractor. For the attached transcript,
return a JSON array of claim objects. Do NOT summarize. Do NOT paraphrase
the substance. Extract every factual or forward-looking claim as a separate
object.
Each claim must have:
{
"claim_id": string, // "C001", "C002", ...
"speaker_role": string, // "ceo" | "cfo" | "coo" | "ir" |
// "segment_head" | "analyst" | "moderator"
"speaker_name": string, // full name as stated in transcript
"speaker_firm": string | null, // only non-null for analysts
"section": string, // "prepared_remarks" | "qna"
"time_in_call_seconds": integer,// estimate from transcript ordering
"claim_text": string, // verbatim sentence, <= 400 chars
"claim_type": string // "historical_fact" | "forward_guidance" |
// "qualitative_color" | "question"
}
Rules:
- An analyst's question is a "question" claim_type, NOT a fact.
- If the speaker is introduced as "CFO" use "cfo". If only a name is given,
use role from earlier in the transcript.
- Never attribute a claim to a speaker whose name does not appear in the
transcript.
Return only the JSON array.
Expected JSON (one row):
{
"claim_id": "C014",
"speaker_role": "cfo",
"speaker_name": "Bob Ruiz",
"speaker_firm": null,
"section": "qna",
"time_in_call_seconds": 2340,
"claim_text": "We expect gross margin to land in the 42 to 44 percent range for the full year, excluding one-time integration costs.",
"claim_type": "forward_guidance"
}
Model recommendation. Sonnet 4.6 is the default. A 45,000-token transcript extracts in one call at roughly $0.14 input + $0.12 output at published 2026-04 rates.3 Haiku 4.5 misattributes speakers at 3-4x the Sonnet rate on transcripts where analysts interrupt.
Typical failure mode. The model attributes an analyst's rhetorical question to the CFO. Catch it with a post-processor: for every claim with claim_type == "forward_guidance", require speaker_role to be in a whitelist of management roles.
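A minimal sketch of that post-processor, assuming the Pattern-1 schema above (the whitelist contents and function name are illustrative):

```python
MANAGEMENT_ROLES = {"ceo", "cfo", "coo", "ir", "segment_head"}

def drop_misattributed_guidance(claims: list[dict]) -> list[dict]:
    """Discard forward_guidance claims not attributed to management."""
    kept = []
    for claim in claims:
        if (claim["claim_type"] == "forward_guidance"
                and claim["speaker_role"] not in MANAGEMENT_ROLES):
            continue  # analyst framing leaked into guidance; drop it
        kept.append(claim)
    return kept
```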
Pattern 2: Hedged-guidance confidence scoring
What it extracts. Forward guidance with explicit hedging language preserved, plus a confidence level derived from the hedging verbs used.
Why naive prompting fails. Hedge flattening is the single most damaging distortion in earnings-transcript summarization. Consider the contrast:
| Source sentence (verbatim) | Naive summary | What was lost |
|---|---|---|
| "We expect 5 to 7 percent EPS growth next year." | "EPS growth will be 5-7%." | Hedging verb "expect", range framing, next-year scope. |
| "We continue to believe margins could expand modestly in the second half." | "Margins will expand in H2." | "Continue to believe" = not a new data point. "Could" = possibility. "Modestly" = magnitude cap. |
| "If current trends persist, we anticipate a return to mid-single-digit growth." | "Growth returning to mid-single digits." | Entire conditional clause dropped. |
Every flattening of hedge language is a downstream error waiting to happen. A summary that strips "we expect" treats a probability-weighted statement as a fact.
The prompt:
You are a forward-guidance extractor. For the attached transcript, return a
JSON array of guidance objects. Only extract statements by management
(ceo / cfo / coo / segment_head). Ignore analyst questions.
Each object must have:
{
"guidance_id": string,
"metric": string, // "revenue" | "eps" | "gross_margin" |
// "operating_margin" | "fcf" | "capex" |
// "headcount" | "buybacks" | "other"
"metric_detail": string, // free text if metric == "other"
"value_text": string, // verbatim: "5 to 7 percent", "north of $2B"
"value_low": number | null,
"value_high": number | null,
"value_unit": string, // "percent" | "usd" | "usd_billions" | ...
"time_horizon": string, // "q1_2026" | "fy_2026" | "long_term" | ...
"hedging_language": string[], // ["expect", "likely", "anticipate"]
"confidence_level": string, // "high" | "medium" | "low"
"source_sentence": string, // verbatim quote, <= 400 chars
"conditional_on": string | null // "if current trends persist"
}
Confidence-level rules:
- "high" = verbs like "will", "is", "reaffirm", "on track to", "committed to"
- "medium" = "expect", "anticipate", "target", "plan to", "see"
- "low" = "could", "may", "might", "possibly", "believe"
- If multiple verbs, use the weakest.
Return only the JSON array.
Expected JSON (one row):
{
"guidance_id": "G003",
"metric": "eps",
"metric_detail": "",
"value_text": "5 to 7 percent",
"value_low": 5.0,
"value_high": 7.0,
"value_unit": "percent",
"time_horizon": "fy_2026",
"hedging_language": ["expect"],
"confidence_level": "medium",
"source_sentence": "We expect 5 to 7 percent EPS growth for fiscal 2026.",
"conditional_on": null
}
Model recommendation. Sonnet 4.6. Hedging classification is subtle; Haiku over-assigns "high" confidence because it pattern-matches on the metric mention and skips past the verb. Cost: roughly $0.14 input + $0.04 output per 45,000-token call.
Typical failure mode. The model collapses a range into a midpoint and drops the hedging. Fix by asserting that value_text and source_sentence must be verbatim; the post-processor rejects rows whose source_sentence contains no hedge word while confidence_level != "high".
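A sketch of that check against the Pattern-2 schema (the hedge-word list is illustrative and should mirror the confidence rules in the prompt):

```python
HEDGE_WORDS = {"expect", "anticipate", "target", "plan", "see",
               "could", "may", "might", "possibly", "believe"}

def reject_flattened_guidance(rows: list[dict]) -> list[dict]:
    """Reject rows whose claimed hedge level is not backed by the quote."""
    kept = []
    for row in rows:
        sentence = row["source_sentence"].lower()
        if row["confidence_level"] != "high" and not any(
                word in sentence for word in HEDGE_WORDS):
            continue  # hedged confidence but no hedge word in the verbatim quote
        kept.append(row)
    return kept
```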
Pattern 3: Multi-quarter guidance-delta
What it extracts. Run Pattern 2 over four consecutive quarterly calls for the same issuer, then compute per-metric deltas: guidance revised up, down, or maintained.
Why naive prompting fails. A single-quarter extraction shows what management said this quarter. It does not show that FY revenue guidance was quietly revised from "9-11% growth" in Q1 to "7-9% growth" in Q2 to "mid-single-digit growth" in Q3. That sequence is the signal; a per-call summarizer misses it.
The prompt (to be run on a list of four Pattern-2 outputs):
You are a guidance-drift analyzer. Given four JSON arrays of guidance
objects from four consecutive earnings calls (Q1, Q2, Q3, Q4) for the same
issuer, produce a JSON array of per-metric drift records.
Each record:
{
"metric": string,
"time_horizon": string, // the target period the guidance refers to
"timeline": [
{
"quarter": string, // "q1_2026", "q2_2026", ...
"value_text": string,
"value_low": number | null,
"value_high": number | null,
"confidence_level": string,
"hedging_language": string[]
}
],
"delta_classification": string, // "revised_up" | "revised_down" |
// "maintained" | "widened" | "narrowed" |
// "withdrawn" | "new"
"delta_commentary": string // 1-2 sentences, evidence-based
}
Rules:
- Only join rows where "metric" AND "time_horizon" match across calls.
- "revised_down" requires value_high of later quarter < value_high of earlier
quarter by at least 50 bps OR confidence_level stepped from medium to low.
- "withdrawn" = guidance present in earlier call, absent in later call.
- Do NOT invent continuity. If the metric appears in Q1 and Q3 but not Q2,
keep only Q1 and Q3 in timeline.
- Never describe a drift as a fact beyond what the raw values show.
Return only the JSON array.
Expected JSON (one row):
{
"metric": "revenue",
"time_horizon": "fy_2026",
"timeline": [
{"quarter": "q1_2026", "value_text": "9 to 11 percent", "value_low": 9.0, "value_high": 11.0, "confidence_level": "medium", "hedging_language": ["expect"]},
{"quarter": "q2_2026", "value_text": "7 to 9 percent", "value_low": 7.0, "value_high": 9.0, "confidence_level": "medium", "hedging_language": ["anticipate"]},
{"quarter": "q3_2026", "value_text": "mid-single-digit", "value_low": 4.0, "value_high": 6.0, "confidence_level": "low", "hedging_language": ["could", "believe"]}
],
"delta_classification": "revised_down",
"delta_commentary": "FY revenue guidance has stepped down in each of the last three calls, with the hedging language weakening from 'expect' to 'could believe' by Q3."
}
Model recommendation. This pattern runs on already-structured JSON, not raw transcripts. Sonnet 4.6 handles it at roughly $0.04 per issuer rollup (four quarters of Pattern-2 output in, one drift array out). Haiku 4.5 is acceptable here because the input is already structured.
Typical failure mode. Phantom continuity: the model joins two rows describing different time horizons (Q1 guidance for "next quarter" vs Q2 guidance for "next quarter") and reports a spurious revision. Prevent with the time_horizon key match.
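The key match is also cheap to enforce in code before the drift prompt ever runs; a sketch, with names of my own choosing:

```python
from collections import defaultdict

def group_guidance(quarters: dict[str, list[dict]]) -> dict[tuple, list[dict]]:
    """Group Pattern-2 rows strictly on (metric, time_horizon).

    `quarters` maps a call label ("q1_2026", ...) to that call's guidance
    array. Rows with different time horizons never share a timeline, which
    blocks the phantom-continuity failure described above.
    """
    timelines = defaultdict(list)
    for quarter, rows in quarters.items():
        for row in rows:
            key = (row["metric"], row["time_horizon"])
            timelines[key].append({"quarter": quarter, **row})
    return timelines
```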
Pattern 4: Risk-surface aggregator
What it extracts. Every management-acknowledged risk mentioned across the call, bucketed into taxonomy categories (macro, competitive, operational, regulatory, financial, geopolitical, technological).
Why naive prompting fails. Risks are scattered. The CEO mentions FX headwinds in opening remarks; the CFO mentions channel-inventory normalization mid-script; an analyst asks about a competitive threat and the CFO confirms it exists. A naive summarizer surfaces three of ten risks and misses the Q&A confirmations entirely.
The prompt:
You are a risk-disclosure extractor. For the attached transcript, return a
JSON array of risk objects. Extract any statement by management
(ceo / cfo / coo / segment_head) that acknowledges a risk, headwind,
uncertainty, or unfavorable development.
Each object:
{
"risk_id": string,
"category": string, // "macro" | "competitive" | "operational" |
// "regulatory" | "financial" |
// "geopolitical" | "technological"
"subcategory": string, // free text, e.g. "fx_headwind", "supply_chain"
"description": string, // <= 200 chars, paraphrase allowed
"source_sentence": string, // verbatim, <= 400 chars
"speaker_role": string,
"speaker_name": string,
"materiality_language": string, // "significant" | "modest" | "temporary" |
// "not quantified"
"is_new_this_call": boolean | null // null if not determinable from this call alone
}
Rules:
- Include risks acknowledged in Q&A answers in addition to those raised
in prepared remarks.
- Do NOT include analyst-raised risks that management deflected or denied.
A risk where the CFO says "we don't see that as a factor" is not a risk.
- Do NOT invent risks from general knowledge of the industry.
Return only the JSON array.
Expected JSON (one row):
{
"risk_id": "R004",
"category": "macro",
"subcategory": "fx_headwind",
"description": "Dollar strength expected to compress reported growth by 150-200 bps in H2.",
"source_sentence": "The stronger dollar is likely to weigh on reported growth by roughly 150 to 200 basis points in the back half.",
"speaker_role": "cfo",
"speaker_name": "Bob Ruiz",
"materiality_language": "significant",
"is_new_this_call": null
}
Model recommendation. Sonnet 4.6. Haiku 4.5 tends to hallucinate risks not present in the transcript, drawing on training-data priors about the industry. Cost: ~$0.14 + $0.03 per call.
Typical failure mode. The model lists a risk an analyst raised that management denied. Catch with a second-pass filter: for each extracted risk, verify the source_sentence was spoken by a management role.
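One way to write that filter, reusing the management whitelist from Pattern 1 and adding a cheap groundedness check (both are heuristics, not guarantees):

```python
MANAGEMENT_ROLES = {"ceo", "cfo", "coo", "segment_head"}

def filter_ungrounded_risks(risks: list[dict], transcript: str) -> list[dict]:
    """Keep only risks grounded in a management utterance."""
    kept = []
    for risk in risks:
        spoken_by_mgmt = risk["speaker_role"] in MANAGEMENT_ROLES
        # The verbatim quote must actually appear in the transcript;
        # hallucinated or analyst-raised risks usually fail this check.
        grounded = risk["source_sentence"][:80] in transcript
        if spoken_by_mgmt and grounded:
            kept.append(risk)
    return kept
```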
Pattern 5: Forward-outlook vs historical-performance separator
What it extracts. Every sentence in the transcript tagged as HISTORICAL (describing what has already happened) or FORWARD (describing what might or will happen).
Why naive prompting fails. The single most common error in LLM-written earnings call summaries: a forward-looking statement gets restated as fact. "We expect to see margin expansion in H2" becomes "SYNTHETIC_A Corp saw margin expansion in H2." Both contain "margin expansion in H2"; the extractor stripped the tense. A quant ingesting such summaries to detect earnings-trend signals sees a hallucinated historical uptrend.
The prompt:
You are a tense classifier for earnings call transcripts. For the attached
transcript, return a JSON array, one object per sentence (skip sentences
shorter than 8 words).
Each object:
{
"sentence_id": string,
"speaker_role": string,
"speaker_name": string,
"section": string, // "prepared_remarks" | "qna"
"sentence": string, // verbatim, as spoken
"tense": string, // "HISTORICAL" | "FORWARD" | "MIXED" | "EVERGREEN"
"tense_evidence": string // the specific verbs/phrases that triggered
// the classification
}
Rules:
- HISTORICAL = past tense, perfect tense, or present tense describing a
completed period ("Q3 revenue was", "we delivered", "margins came in at").
- FORWARD = future tense, modal verbs ("will", "expect", "anticipate",
"plan to", "target"), conditional clauses.
- MIXED = sentence contains both a past-completed clause and a
forward-looking clause. Tag both in tense_evidence.
- EVERGREEN = general statements about the business with no time binding
("our platform serves enterprise customers").
- If a hedging verb ("expect", "believe") points at a future period, tense
is FORWARD even if the hedging verb itself is present tense.
Return only the JSON array.
Expected JSON (one row):
{
"sentence_id": "S087",
"speaker_role": "ceo",
"speaker_name": "Alice Carter",
"section": "prepared_remarks",
"sentence": "We expect gross margin to expand by roughly 100 basis points in the back half of the year.",
"tense": "FORWARD",
"tense_evidence": "\"expect\" + \"to expand\" + \"in the back half\""
}
Model recommendation. Sonnet 4.6. Sentence-level classification over a 45,000-token transcript produces several hundred sentence objects, on the order of 4,000 output tokens; budget $0.14 + $0.06 per transcript. Haiku 4.5 misclassifies MIXED sentences where clauses nest irregularly.
Typical failure mode. The classifier marks a FORWARD sentence as HISTORICAL because the outer verb is present tense ("we believe margins will expand"). Have the post-processor prefer the embedded clause's tense when a hedging verb points at a time-bound phrase.
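A crude regex override along those lines, assuming the Pattern-5 schema (a real deployment would want a dependency parse, but this catches the "we believe margins will expand" case; both word lists are illustrative):

```python
import re

# A hedging verb followed, somewhere in the same sentence, by a
# future-pointing marker.
FORWARD_OVERRIDE = re.compile(
    r"\b(expect|believe|anticipate)\b.*"
    r"\b(will|next (?:year|quarter)|back half|second half|going forward)\b",
    re.IGNORECASE,
)

def fix_tense(rows: list[dict]) -> list[dict]:
    """Re-tag HISTORICAL sentences that embed a forward-looking clause."""
    for row in rows:
        if row["tense"] == "HISTORICAL" and FORWARD_OVERRIDE.search(row["sentence"]):
            row["tense"] = "FORWARD"
            row["tense_evidence"] += " (post-processor override)"
    return rows
```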
Cost analysis
Five patterns applied to a single earnings call. Published Sonnet 4.6 rates as of 2026-04 are $3 per million input tokens and $15 per million output tokens.3 A typical large-cap earnings call transcript runs 30,000-50,000 tokens. Assume 45,000 input tokens and per-pattern output as shown.
| Pattern | Input tokens | Output tokens | Cost per call |
|---|---|---|---|
| 1 — Speaker attribution | 45,000 | 8,000 | $0.255 |
| 2 — Hedged-guidance scoring | 45,000 | 2,500 | $0.173 |
| 3 — Multi-quarter delta | 6,000 (4 Pattern-2 outputs) | 1,500 | $0.041 |
| 4 — Risk-surface aggregator | 45,000 | 2,000 | $0.165 |
| 5 — Forward vs historical | 45,000 | 4,000 | $0.195 |
| Per-call total | | | $0.829 |
A 20-ticker watchlist covered quarterly (80 calls per year) runs roughly $66 per year at full-quality Sonnet 4.6. Prompt caching on the system-prompt header brings that closer to $50 per year.4 Batch API delivery halves the per-call cost for non-urgent extractions. At 200 tickers the bill scales to ~$660 without caching, ~$400 with caching and batch. Token cost is not the bottleneck; quality verification is.
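The arithmetic behind the table, as a few lines you can rerun whenever the rate card changes (token counts copied from the table above):

```python
IN_RATE, OUT_RATE = 3 / 1e6, 15 / 1e6  # USD per token, Sonnet 4.6, 2026-04

PATTERNS = {  # pattern: (input_tokens, output_tokens) per call
    "speaker_attribution":   (45_000, 8_000),
    "hedged_guidance":       (45_000, 2_500),
    "guidance_delta":        (6_000,  1_500),
    "risk_surface":          (45_000, 2_000),
    "forward_vs_historical": (45_000, 4_000),
}

per_call = sum(i * IN_RATE + o * OUT_RATE for i, o in PATTERNS.values())
print(f"per call: ${per_call:.3f}")                         # ~$0.83
print(f"20 tickers x 4 quarters: ${per_call * 80:.0f}/yr")  # ~$66
```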
Runnable Python
The block below runs the four single-call patterns against a synthetic transcript excerpt (Pattern 3 needs four quarters of Pattern-2 output, so it is noted but skipped). It is end-to-end: parse transcript, call each pattern, print structured output. Swap SYNTHETIC_TRANSCRIPT for a real transcript loaded from a vendor feed in production use.
import json
import os

from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
MODEL = "claude-sonnet-4-6"

SYNTHETIC_TRANSCRIPT = """
SYNTHETIC_A Corp Q3 2026 Earnings Call
Prepared remarks:
Alice Carter (CEO): Q3 revenue was $1.23 billion, up 12 percent year over year.
We expect 5 to 7 percent EPS growth for fiscal 2026. The stronger dollar is
likely to weigh on reported growth by roughly 150 to 200 basis points in H2.
Bob Ruiz (CFO): Operating margin came in at 21.4 percent, up 80 bps. We
anticipate gross margin in the 42 to 44 percent range for the full year.
Q&A:
Morgan Chen (analyst, Placeholder Capital): Will operating margin compress
below 20 percent next quarter?
Bob Ruiz: We don't see margin dropping below 20 percent in Q4.
"""

PROMPTS = {
    "speaker_attribution": "Return JSON array of claim objects with claim_id, speaker_role, speaker_name, section, time_in_call_seconds, claim_text, claim_type.",
    "hedged_guidance": "Return JSON array of guidance objects with metric, value_text, value_low, value_high, time_horizon, hedging_language, confidence_level, source_sentence.",
    "risk_surface": "Return JSON array of risk objects with category, subcategory, description, source_sentence, speaker_role, materiality_language.",
    "forward_vs_historical": "Return JSON array of sentence objects with sentence_id, speaker_role, sentence, tense (HISTORICAL|FORWARD|MIXED|EVERGREEN), tense_evidence.",
}

def run_pattern(name: str, instruction: str, transcript: str) -> list:
    """Run one extraction pattern and parse the model's JSON reply."""
    resp = client.messages.create(
        model=MODEL,
        max_tokens=4000,
        system=f"You are an earnings call extractor. {instruction} Return ONLY valid JSON.",
        messages=[{"role": "user", "content": transcript}],
    )
    text = resp.content[0].text.strip()
    # Strip a markdown fence if the model wrapped its JSON in one.
    if text.startswith("```"):
        text = text.split("```")[1].removeprefix("json").strip()  # Python 3.9+
    return json.loads(text)

if __name__ == "__main__":
    results = {name: run_pattern(name, prompt, SYNTHETIC_TRANSCRIPT)
               for name, prompt in PROMPTS.items()}
    # Pattern 3 (multi-quarter delta) requires four quarters of Pattern-2
    # output; omitted here for the single-call demo.
    print(json.dumps(results, indent=2))
The pattern is deliberately uniform — one function call per pattern, consistent JSON schema — so that downstream tooling (/tools/structured-schema-validator-finance/) can validate every output before it reaches a decision system.
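If that validator is not wired in yet, a minimal local stand-in with the `jsonschema` package works (abridged Pattern-1 schema shown, purely illustrative; extend it per the full spec above):

```python
from jsonschema import ValidationError, validate  # pip install jsonschema

CLAIM_SCHEMA = {
    "type": "array",
    "items": {
        "type": "object",
        "required": ["claim_id", "speaker_role", "claim_text", "claim_type"],
        "properties": {
            "speaker_role": {"enum": ["ceo", "cfo", "coo", "ir",
                                      "segment_head", "analyst", "moderator"]},
            "claim_type": {"enum": ["historical_fact", "forward_guidance",
                                    "qualitative_color", "question"]},
        },
    },
}

def is_valid(payload: list) -> bool:
    """True if the extractor output conforms to the declared schema."""
    try:
        validate(payload, CLAIM_SCHEMA)
        return True
    except ValidationError:
        return False
```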
Connects to
- The 2026 Engineer's Guide to AI in Markets — pillar context for where these patterns fit in the broader LLM-in-finance stack.
- LLM Prompt Patterns for 10-K and 8-K Extraction — the 10-K counterpart to this tutorial; structurally similar but tuned for lawyered prose.
- The 8-Step LLM Research Prompt Template — broader research-prompt framework that subsumes these earnings-specific patterns.
- Structured Schema Validator for Finance — validates the JSON outputs of all five patterns against their declared schemas.
- Agent Skill Tester — exercise a prompt against curated transcript fixtures and diff outputs.
- Prompt Regression Tester — track prompt-version drift across sequential earnings seasons.
References
- Anthropic, "Tool use with Claude," https://docs.anthropic.com/en/docs/agents-and-tools/tool-use — structured tool calling is an alternative to JSON-in-prose extraction for schema-strict workflows.
- OpenAI, "Structured Outputs," https://platform.openai.com/docs/guides/structured-outputs — strict JSON schema conformance mode; directly comparable to Anthropic's tool-calling for extraction tasks.
- Private Securities Litigation Reform Act of 1995, 15 U.S.C. Sec. 78u-5. Defines the safe-harbor for forward-looking statements that earnings calls rely on; explains why hedging language carries legal weight beyond its analytical meaning.
- Bushee, B., Gow, I. D., & Taylor, D. J. (2018). "Linguistic Complexity in Firm Disclosures: Obfuscation or Information?" Journal of Accounting Research 56(1). Empirical treatment of disclosure-language complexity including conference-call Q&A.
Footnotes
1. Error-rate estimates on spoken-transcript number parsing are consistent with published evaluations of LLM extraction on conversational financial text. Reproducible benchmarks are scarce; practitioners should run their own fixtures before relying on a vendor number.
2. U.S. Securities and Exchange Commission, "Regulation FD (Fair Disclosure)," 17 CFR 243.100-243.103 (2000). Governs selective-disclosure rules that apply unevenly across prepared remarks and spontaneous Q&A answers.
3. Anthropic, "Models overview" and "Pricing," https://docs.anthropic.com/en/docs/about-claude/models, retrieved 2026-04-23. Sonnet 4.6 input $3 / output $15 per million tokens at the published rate card on that date.
4. Anthropic, "Prompt caching," https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching, retrieved 2026-04-23. Cached reads are billed at 10% of the base input rate for up to five minutes after the last cache hit.