Building a Production Claude Agent for Finance

TL;DR

Moving a Claude-driven research loop from notebook to production is not a modeling problem; it's an ops + safety problem. The working shape in April 2026: price-blind research prompt (Sonnet 4.6 with prompt caching) → structured probability → conviction-tier sizer → idempotent execution via Alpaca MCP V2 → append-only decision log + heartbeat + watchdog + circuit breaker. Under 600 lines of Python, $0/mo infra (Mac Mini + Cloudflare Pages for the dashboard), $125/mo LLM cost for a medium-universe loop. Below: the full scaffold walkthrough, file-by-file, with the specific tools on aifinhub.io that validate each layer.

What "production" means here

Not "HFT." Not "hedge fund infrastructure." The bar for a solo operator or 1–3 person team:

Runs unattended on a cheap machine (Mac Mini, $5 VPS, or a single Cloudflare Worker).
Correct under retry (no duplicate fills).
Observable when it breaks (heartbeat + watchdog + Telegram).
Auditable after the fact (append-only decision log).
Cost-bounded (LLM budget ceiling + cost-per-validated-trade target).
Vacation-proof (can run 4 weeks without intervention).
Regulatory-safe (education, not advice — see BaFin + EU guide if EU-domiciled).

The file layout

trading-system/
├── README.md
├── .env.example
├── pyproject.toml
├── scripts/
│   ├── 01_fetch_bars.py            # data source → DuckDB
│   ├── 02_research.py              # Claude call (price-blind)
│   ├── 03_decide.py                # conviction tier + sizing
│   ├── 04_execute.py               # Alpaca MCP with idempotency
│   ├── 05_reconcile.py             # post-trade reconciliation
│   ├── heartbeat.py                # cycle heartbeat
│   └── watchdog.py                 # independent health check
├── data/
│   ├── bars.duckdb                 # price + fundamentals
│   ├── heartbeat.json              # last cycle timestamp
│   ├── circuit.json                # {"paused": false, "reason": …}
│   └── tickers/{T}.md              # per-ticker research dossier
├── memory/
│   ├── decisions.jsonl             # append-only decision log
│   └── fills.jsonl                 # append-only fill log (written by 05_reconcile.py)
├── plists/
│   └── com.you.trader-*.plist      # launchd schedules
└── tests/
    ├── test_no_price_leak.py
    ├── test_sizing_caps.py
    └── test_idempotent_orders.py

Layer 1: Data

Fetch OHLCV + fundamentals once per cycle into DuckDB. Pick a vendor via Data-Vendor TCO Calculator.

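# scripts/01_fetch_bars.py
from datetime import datetime, timedelta, timezone
import os

import duckdb
from alpaca.data.historical import StockHistoricalDataClient
from alpaca.data.requests import StockBarsRequest
from alpaca.data.timeframe import TimeFrame

api_key = os.environ["ALPACA_API_KEY"]
api_secret = os.environ["ALPACA_API_SECRET"]
client = StockHistoricalDataClient(api_key, api_secret)

def fetch(tickers, lookback_days=60):
    req = StockBarsRequest(
        symbol_or_symbols=tickers,
        timeframe=TimeFrame.Day,
        # Request the lookback window explicitly.
        start=datetime.now(timezone.utc) - timedelta(days=lookback_days),
    )
    bars = client.get_stock_bars(req).df.reset_index()
    with duckdb.connect("data/bars.duckdb") as con:
        con.register("incoming", bars)
        # Create an empty table with the incoming schema on first run.
        con.execute("CREATE TABLE IF NOT EXISTS bars AS SELECT * FROM incoming LIMIT 0")
        # Insert only rows not already stored, so re-running a cycle is safe.
        con.execute("""
            INSERT INTO bars
            SELECT i.* FROM incoming i
              WHERE NOT EXISTS (
                SELECT 1 FROM bars b
                WHERE b.timestamp = i.timestamp AND b.symbol = i.symbol);
        """)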

DuckDB on disk is perfect for this scale: ACID, columnar, zero-setup, ~2GB for 5 years of daily data on 500 tickers.

Layer 2: Research (price-blind)

The critical architectural boundary: the LLM never sees the current price. It gets a research pack (filings excerpts, earnings transcripts, competitor context, macro regime summary) and returns a structured probability plus a thesis.

# scripts/02_research.py
import anthropic, json, os
from dataclasses import asdict, dataclass

SYSTEM = """You are an 8-step research analyst. Follow the template strictly.
NEVER mention prices, chart patterns, position sizes, or 'should buy/sell'.
Return only the JSON payload specified in step 8."""

@dataclass
class ResearchPack:
    ticker_anonymous: str  # e.g. SYNTHETIC_A
    filings_excerpts: list[str]
    earnings_transcripts: list[str]
    competitor_context: list[str]
    macro_regime_summary: str
    # No prices. No positions. No PnL.

def research(pack: ResearchPack) -> dict:
    client = anthropic.Anthropic()
    resp = client.messages.create(
        model="claude-sonnet-4-6-20260115",
        max_tokens=1500,
        temperature=0,
        system=[{"type": "text", "text": SYSTEM, "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": json.dumps(asdict(pack))}],
    )
    return json.loads(resp.content[0].text)
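
Calling json.loads directly on the model's text fails if the reply arrives wrapped in a code fence, and it lets an out-of-range probability through to the sizer. A minimal validation sketch, assuming the payload keys that decide() reads later (probability_yes, thesis, invalidation_conditions); parse_research_output is a hypothetical helper, not part of the Anthropic SDK:

```python
import json

REQUIRED_KEYS = ("probability_yes", "thesis", "invalidation_conditions")
FENCE = "`" * 3  # Markdown code-fence marker


def parse_research_output(text: str) -> dict:
    """Parse the model's reply and reject payloads the sizer cannot trust."""
    body = text.strip()
    # Tolerate a Markdown code fence (optionally tagged "json") around the payload.
    if body.startswith(FENCE):
        body = body.strip("`")
        if body.startswith("json"):
            body = body[len("json"):]
    payload = json.loads(body)
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    missing = [k for k in REQUIRED_KEYS if k not in payload]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    p = payload["probability_yes"]
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(p, bool) or not isinstance(p, (int, float)) or not 0.0 <= p <= 1.0:
        raise ValueError(f"probability_yes out of range: {p!r}")
    return payload
```

In research(), the last line would then become return parse_research_output(resp.content[0].text), with a ValueError treated as a failed cycle for that ticker rather than a trade.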

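The cache_control: ephemeral marker on the system prompt is the cost lever. Cache reads bill at roughly 0.1× the base input rate and cache writes at roughly 1.25×, so at a 50% cache hit rate the cached prefix costs about 67% of the uncached price (0.5 × 0.1 + 0.5 × 1.25 ≈ 0.675). Only the cached prefix benefits; the per-ticker research pack in the user message is billed at the full rate. See Token-Cost Optimizer to model your specific loop.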
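
A quick way to sanity-check the caching arithmetic is to blend the read and write prices by hit rate. The sketch below assumes cache reads bill at 0.1× and cache writes at 1.25× the base input rate; both multipliers are assumptions to confirm against current pricing, and cached_input_multiplier is a hypothetical helper:

```python
def cached_input_multiplier(
    hit_rate: float, read_mult: float = 0.1, write_mult: float = 1.25
) -> float:
    """Blended price of the cacheable prefix relative to uncached input.

    Hits are billed as cache reads and misses as cache writes. The default
    multipliers are assumptions, not values taken from this guide.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be in [0, 1]")
    return hit_rate * read_mult + (1.0 - hit_rate) * write_mult


for rate in (0.0, 0.5, 0.9):
    print(f"hit rate {rate:.0%}: {cached_input_multiplier(rate):.3f}x")
```

The multiplier applies only to the tokens inside the cached prefix, so the saving on the whole request depends on how much of the input lives in the system prompt.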

Validate this layer: paste a real research prompt into the Prompt Regression Tester and confirm it never references price across Opus 4.5/4.6/4.7. Run extractions through the Hallucination Detector to catch fabricated numbers.

Layer 3: Decide (sizing)

The LLM emits probability. Your code decides position size via conviction-scaled Kelly with a per-trade cap.

# scripts/03_decide.py
import json, time, uuid
from pathlib import Path

def conviction_tier(p: float) -> str:
    if p >= 0.85: return "SUPREME"
    if p >= 0.70: return "HIGH"
    if p >= 0.55: return "MEDIUM"
    return "LOW"

def conviction_fraction(tier: str) -> float:
    return {"SUPREME": 0.25, "HIGH": 0.15, "MEDIUM": 0.05, "LOW": 0.0}[tier]

def sized_fraction(p: float, b: float = 1.5, tier_cap: float = 0.04) -> float:
    tier = conviction_tier(p)
    kelly = max(0.0, (b * p - (1 - p)) / b)
    return min(kelly * conviction_fraction(tier), tier_cap)

def decide(research_output: dict, current_bankroll_usd: float) -> dict:
    p = research_output["probability_yes"]
    fraction = sized_fraction(p)
    notional = fraction * current_bankroll_usd
    decision = {
        "id": str(uuid.uuid4()),
        "at": time.time(),
        "ticker": research_output["ticker_real"],  # mapped back here
        "probability": p,
        "tier": conviction_tier(p),
        "fraction": fraction,
        "notional_usd": notional,
        "thesis": research_output["thesis"],
        "invalidation": research_output["invalidation_conditions"],
    }
    # Append-only log (critical for audit).
    with open("memory/decisions.jsonl", "a") as f:
        f.write(json.dumps(decision) + "\n")
    return decision

Validate: stress-test the sizing with the Kelly Sizer across thousands of Monte Carlo paths and confirm the drawdown distribution is acceptable. Verify risk metrics on historical data via Risk-Adjusted Returns.
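
Before running Monte Carlo, it helps to see what the sizer returns at a few probabilities. The sketch below repeats the three functions from 03_decide.py unchanged so it runs on its own, and evaluates them at the default payoff ratio b = 1.5:

```python
def conviction_tier(p: float) -> str:
    if p >= 0.85: return "SUPREME"
    if p >= 0.70: return "HIGH"
    if p >= 0.55: return "MEDIUM"
    return "LOW"


def conviction_fraction(tier: str) -> float:
    return {"SUPREME": 0.25, "HIGH": 0.15, "MEDIUM": 0.05, "LOW": 0.0}[tier]


def sized_fraction(p: float, b: float = 1.5, tier_cap: float = 0.04) -> float:
    tier = conviction_tier(p)
    kelly = max(0.0, (b * p - (1 - p)) / b)  # full-Kelly fraction
    return min(kelly * conviction_fraction(tier), tier_cap)


for p in (0.50, 0.55, 0.70, 0.85, 0.95):
    kelly = max(0.0, (1.5 * p - (1 - p)) / 1.5)
    print(f"p={p:.2f} tier={conviction_tier(p):8s} "
          f"kelly={kelly:.4f} sized={sized_fraction(p):.4f}")
```

At b = 1.5 the sized fraction is 0 at p = 0.50, 0.0125 at p = 0.55, and 0.04 at p = 0.70 and above. With these defaults the 4% cap binds for every HIGH and SUPREME idea, so the tier fractions only differentiate MEDIUM ideas; whether that is the intended behavior is worth checking against your own tier table.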

Layer 4: Execute (idempotent)

The ONE rule: every order carries a client-supplied idempotency key. Retry on error never produces double fills.

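# scripts/04_execute.py
import os

from alpaca.common.exceptions import APIError
from alpaca.trading.client import TradingClient
from alpaca.trading.enums import OrderSide, TimeInForce
from alpaca.trading.requests import LimitOrderRequest

client = TradingClient(
    os.environ["ALPACA_API_KEY"], os.environ["ALPACA_API_SECRET"], paper=True
)  # start paper!

def execute(decision: dict, current_price: float):
    qty = int(decision["notional_usd"] / current_price)
    if qty == 0: return None
    req = LimitOrderRequest(
        symbol=decision["ticker"],
        qty=qty,
        side=OrderSide.BUY,  # the sizer is long-only: fraction is never negative
        time_in_force=TimeInForce.DAY,
        limit_price=round(current_price * 1.001, 2),
        client_order_id=decision["id"],   # idempotency key
    )
    try:
        return client.submit_order(req)
    except APIError as err:
        # A reused client_order_id is rejected, so look up the order an earlier
        # attempt created; if there is none, surface the original error.
        try:
            return client.get_order_by_client_id(decision["id"])
        except APIError:
            raise err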

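Crucial: the client_order_id is the decision UUID from the log. Alpaca requires this ID to be unique and rejects a second submission that reuses it. If the request fails with a network error and you retry, the retry either places the order for the first time or is rejected as a duplicate, in which case you look up the existing order by its client_order_id. Either way there is no duplicate fill.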

Validate: verify your MCP server / broker supports idempotency via the Finance MCP Directory. Alpaca V2 = grade A, supports it natively. Tradier community MCP = grade C, does not.
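
The dangerous case is an order that the broker accepted while the response was lost on the way back. The sketch below exercises that path against an in-memory stand-in. FakeBroker, DuplicateOrderId, and submit_idempotent are hypothetical names, and the stand-in assumes a broker that rejects a reused client order ID rather than returning the earlier order:

```python
class DuplicateOrderId(Exception):
    """Raised by the stand-in broker when a client order ID is reused."""


class FakeBroker:
    """In-memory stand-in for a broker that enforces unique client order IDs."""

    def __init__(self):
        self.orders = {}
        self.drop_next_response = False

    def submit(self, client_order_id: str, symbol: str, qty: int) -> dict:
        if client_order_id in self.orders:
            raise DuplicateOrderId(client_order_id)
        order = {"id": f"broker-{len(self.orders) + 1}", "symbol": symbol, "qty": qty}
        self.orders[client_order_id] = order
        if self.drop_next_response:
            # The order was accepted, but the caller never hears about it.
            self.drop_next_response = False
            raise ConnectionError("response lost after the order was accepted")
        return order

    def get_by_client_id(self, client_order_id: str) -> dict:
        return self.orders[client_order_id]


def submit_idempotent(broker, client_order_id, symbol, qty, attempts=3):
    """Retry on network errors; resolve duplicates to the existing order."""
    for _ in range(attempts):
        try:
            return broker.submit(client_order_id, symbol, qty)
        except DuplicateOrderId:
            return broker.get_by_client_id(client_order_id)
        except ConnectionError:
            continue
    raise RuntimeError(f"gave up on {client_order_id} after {attempts} attempts")


broker = FakeBroker()
broker.drop_next_response = True
order = submit_idempotent(broker, "decision-1", "AAPL", 10)
print(order["id"], len(broker.orders))
```

The first attempt stores the order and then fails; the retry is rejected as a duplicate and resolves to the stored order, so exactly one order exists at the end.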

Layer 5: Reconcile

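Post-trade: fetch actual fills and append the realized state to a separate fills log keyed by decision ID, so the decision log itself stays append-only.

# scripts/05_reconcile.py
import json, os
from pathlib import Path

from alpaca.trading.client import TradingClient
from alpaca.trading.enums import QueryOrderStatus
from alpaca.trading.requests import GetOrdersRequest

client = TradingClient(
    os.environ["ALPACA_API_KEY"], os.environ["ALPACA_API_SECRET"], paper=True
)
FILLS = Path("memory/fills.jsonl")

def reconcile():
    # The default status filter returns open orders only; ask for all of them.
    orders = client.get_orders(GetOrdersRequest(status=QueryOrderStatus.ALL))
    by_client_id = {o.client_order_id: o for o in orders}
    with open("memory/decisions.jsonl") as f:
        decisions = [json.loads(l) for l in f]
    # Decisions are never rewritten, so "already reconciled" is read from the fills log.
    done = set()
    if FILLS.exists():
        done = {json.loads(l)["decision_id"] for l in FILLS.read_text().splitlines() if l}
    for d in decisions:
        order = by_client_id.get(d["id"])
        # Skip decisions already logged, never submitted, or not yet filled.
        if d["id"] in done or order is None or order.filled_at is None:
            continue
        with FILLS.open("a") as f:
            f.write(json.dumps({
                "decision_id": d["id"],
                "filled_at": order.filled_at.isoformat(),
                "filled_qty": float(order.filled_qty),
                "filled_avg_price": float(order.filled_avg_price),
            }) + "\n")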
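
Because decisions and fills live in separate append-only files, the realized state of each decision has to be assembled at read time. A minimal sketch of that join, assuming the record shapes written above (id on decisions, decision_id on fills); load_jsonl and realized_state are hypothetical helpers:

```python
import json
from pathlib import Path


def load_jsonl(path) -> list:
    """Read a JSON Lines file; a missing file is treated as an empty log."""
    p = Path(path)
    if not p.exists():
        return []
    return [json.loads(line) for line in p.read_text().splitlines() if line.strip()]


def realized_state(decisions: list, fills: list) -> list:
    """Attach each decision's fill record, or None if it has not filled."""
    # If a decision appears more than once in the fills log, the last entry wins.
    by_decision = {f["decision_id"]: f for f in fills}
    return [{**d, "fill": by_decision.get(d["id"])} for d in decisions]


state = realized_state(
    load_jsonl("memory/decisions.jsonl"),
    load_jsonl("memory/fills.jsonl"),
)
unfilled = [d["id"] for d in state if d["fill"] is None]
```

Neither file is modified, so the audit trail stays intact while dashboards and reports work from the joined view.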

Layer 6: Robustness

The layer that determines whether you can sleep.

# scripts/heartbeat.py
import json, time
from pathlib import Path

def write():
    Path("data/heartbeat.json").write_text(json.dumps({
        "at": time.time(),
        "iso": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "status": "ok",
    }))

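# scripts/watchdog.py (independent process)
import json, time, subprocess
from datetime import datetime
from pathlib import Path
from zoneinfo import ZoneInfo

MAX_AGE = 900  # 15 min during market hours

def in_session(now: datetime) -> bool:
    # Mon-Fri 09:30-16:00 ET, starting one MAX_AGE after the open so the first
    # cycle can write its heartbeat. Exchange holidays are not handled here.
    minutes = now.hour * 60 + now.minute
    return now.weekday() < 5 and 9 * 60 + 30 + MAX_AGE // 60 <= minutes < 16 * 60

hb_path = Path("data/heartbeat.json")
# A missing heartbeat counts as stale instead of crashing the watchdog.
age = time.time() - json.loads(hb_path.read_text())["at"] if hb_path.exists() else float("inf")
# Outside the session the pipeline is idle, so a stale heartbeat is expected.
if in_session(datetime.now(ZoneInfo("America/New_York"))) and age > MAX_AGE:
    subprocess.run(["/usr/local/bin/telegram-notify", f"heartbeat stale: {age:.0f}s"])
    Path("data/circuit.json").write_text(json.dumps({
        "paused": True,
        "reason": f"stale heartbeat ({age:.0f}s)",
        "since": time.time(),
    }))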

Every pipeline layer checks data/circuit.json at startup:

def check_circuit():
    if not Path("data/circuit.json").exists(): return True
    return not json.loads(Path("data/circuit.json").read_text()).get("paused", False)

Resume is deliberately manual. See Heartbeats, Watchdogs, and Circuit Breakers for the full pattern.
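
Two processes touch these JSON files, so a reader can catch a file half-written. One common remedy is to write to a temporary file and rename it into place, and to treat an unreadable circuit file as paused. A sketch under those assumptions; write_json_atomic and trading_allowed are hypothetical helpers:

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path, payload: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename into place."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is atomic on POSIX when both paths share a filesystem.
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def trading_allowed(path="data/circuit.json") -> bool:
    """True when no pause is recorded; a corrupt file fails safe (paused)."""
    p = Path(path)
    if not p.exists():
        return True
    try:
        return not json.loads(p.read_text()).get("paused", False)
    except (json.JSONDecodeError, AttributeError):
        return False
```

Failing safe on a corrupt file matches the manual-resume policy: when the state is unknown, the pipeline stays paused until a person looks at it.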

Layer 7: Schedule (launchd)

<!-- com.you.trader-run.plist — every 5 minutes during market hours -->
<plist>
  <dict>
    <key>Label</key><string>com.you.trader-run</string>
    <key>ProgramArguments</key>
    <array>
      <string>/opt/homebrew/bin/python3</string>
      <string>/Users/you/trader/scripts/run.py</string>
    </array>
    <key>StartCalendarInterval</key>
    <array>
      <!-- ... entries for every 5 minutes 09:30-16:00 ET Mon-Fri ... -->
    </array>
  </dict>
</plist>

Three plists: run (pipeline), watchdog (every 15 min 24/7), notify-resume (daily 09:00). Total monthly cost: $0.
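
Writing several hundred StartCalendarInterval entries by hand is error-prone, so the run plist can be generated instead. A sketch using the standard-library plistlib; market_hours_intervals and build_plist are hypothetical helpers, and it assumes the machine's clock is set to US Eastern time, since launchd calendar intervals are taken here to fire in local time:

```python
import plistlib


def market_hours_intervals(start=(9, 30), end=(16, 0), step_min=5) -> list:
    """Calendar entries for Mon-Fri, start inclusive, end exclusive."""
    first = start[0] * 60 + start[1]
    last = end[0] * 60 + end[1]
    return [
        {"Weekday": wd, "Hour": m // 60, "Minute": m % 60}
        for wd in range(1, 6)  # launchd: 1 = Monday ... 5 = Friday
        for m in range(first, last, step_min)
    ]


def build_plist(label: str, program_args: list) -> bytes:
    """Serialize a launchd job definition as XML plist bytes."""
    return plistlib.dumps({
        "Label": label,
        "ProgramArguments": program_args,
        "StartCalendarInterval": market_hours_intervals(),
    })


xml = build_plist(
    "com.you.trader-run",
    ["/opt/homebrew/bin/python3", "/Users/you/trader/scripts/run.py"],
)
print(len(market_hours_intervals()), "entries,", len(xml), "bytes")
```

The generated bytes can be written to the plists/ directory and loaded like a hand-written file. Holidays and half-days are not covered; the pipeline itself has to notice a closed market.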

Cost envelope

At Sonnet 4.6 with 50% prompt caching, running 10 ideas/day × 5 calls × 8K input / 1.5K output with 15% retry:

Line	Monthly
Data (Alpaca Algo Trader Plus or Polygon Starter)	$29–99
LLM (Sonnet 4.6, 50% cache)	~$125
Broker commissions	$0 (PFOF)
Infra (Mac Mini + Cloudflare Pages free)	$0
Total	~$150–225/month

At 30% validation rate, that's roughly $2/validated-trade. If your strategy's expected value per trade exceeds that + execution costs + capital cost, the stack is paying for itself.

Verify via Token-Cost Optimizer.
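
The cost-per-validated-trade figure is simple division, shown here so the inputs can be swapped for your own. The sketch assumes a 30-day month, which the figures above do not state; cost_per_validated_trade is a hypothetical helper:

```python
def cost_per_validated_trade(
    monthly_cost_usd: float,
    ideas_per_day: float,
    validation_rate: float,
    days_per_month: int = 30,  # assumption: calendar days, not trading days
) -> float:
    """Monthly stack cost divided by the number of ideas that pass validation."""
    validated = ideas_per_day * days_per_month * validation_rate
    if validated <= 0:
        raise ValueError("no validated trades in the period")
    return monthly_cost_usd / validated


low = cost_per_validated_trade(150, ideas_per_day=10, validation_rate=0.30)
high = cost_per_validated_trade(225, ideas_per_day=10, validation_rate=0.30)
print(f"${low:.2f} to ${high:.2f} per validated trade")
```

With 10 ideas per day and a 30% validation rate this gives about $1.67 to $2.50, consistent with the rough $2 figure. Counting only about 21 trading days per month raises the range to roughly $2.38 to $3.57.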

The tests that matter

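# tests/test_no_price_leak.py
import json
from dataclasses import asdict
# fetch_research_pack is the project's own pack builder (not shown here).
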
def test_research_pack_has_no_price():
    pack = fetch_research_pack("AAPL")  # internal ticker map
    payload = json.dumps(asdict(pack))
    assert "price" not in payload.lower()
    assert "$" not in payload
    assert "chart" not in payload.lower()
    assert "momentum" not in payload.lower()

# tests/test_sizing_caps.py
def test_sized_fraction_never_exceeds_tier_cap():
    for p in [0.5, 0.7, 0.85, 0.95, 0.99]:
        for b in [1.0, 1.5, 2.0, 3.0]:
            assert sized_fraction(p, b, tier_cap=0.04) <= 0.04

# tests/test_idempotent_orders.py
def test_same_client_order_id_never_double_fills():
    # Runs against the paper account, so the symbol must be a tradable one.
    decision = {"id": "test-uuid-1", "ticker": "AAPL", "notional_usd": 1000, "fraction": 0.01}
    result1 = execute(decision, current_price=100)
    result2 = execute(decision, current_price=100)  # retry
    assert result1.id == result2.id

These three tests catch the three most dangerous failure modes. Run them in CI on every commit. If they don't exist in your production trader, you have a ticking time bomb.

What changes for scale

The shape above works up to ~$500K bankroll. Beyond that:

Data: move to Databento for tick granularity. See Data-Vendor TCO.
Execution: layer smart-order routing on top of broker primitives — simple POV / TWAP wrappers.
Research: staged cascade (Haiku filter → Sonnet extract → Opus synthesize) — see Token-Cost Reality.
Ops: move scheduler off launchd to managed cron (GitHub Actions scheduled workflow works for most solo scale); add Sentry for error tracking.
Risk: layer Walk-Forward Validator into nightly CI — gate live trading on passing WF efficiency threshold.
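
For the execution item, a TWAP wrapper is mostly bookkeeping: split the parent quantity into child orders and give each child its own idempotency key. A minimal sketch; twap_slices and child_order_ids are hypothetical helpers, and the scheduling of the children over time is left out:

```python
def twap_slices(total_qty: int, n_slices: int) -> list:
    """Split a parent quantity into child quantities that differ by at most one."""
    if total_qty < 0 or n_slices <= 0:
        raise ValueError("total_qty must be >= 0 and n_slices > 0")
    base, extra = divmod(total_qty, n_slices)
    # The first `extra` children carry one additional share each.
    return [base + 1 if i < extra else base for i in range(n_slices)]


def child_order_ids(decision_id: str, n_slices: int) -> list:
    """Derive one deterministic client order ID per child from the decision ID."""
    return [f"{decision_id}-{i}" for i in range(n_slices)]


slices = twap_slices(103, 4)
ids = child_order_ids("decision-1", 4)
for qty, cid in zip(slices, ids):
    print(cid, qty)
```

Deriving the child IDs from the decision ID keeps the retry guarantee from Layer 4: re-sending any single child cannot create a second order for that slice.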

Go-live checklist

Run paper for 90 days with the full pipeline above.
Verify that the probability of backtest overfitting (PBO) is below 0.3 and the deflated Sharpe ratio (DSR) is above 90% on the paper run.
Confirm heartbeat + watchdog fire Telegram alerts correctly via induced failure drills.
Read the BaFin + EU guide if EU-domiciled and ensure the Impressum and compliance banner are right.
Go live with 10% of your intended bankroll. Hold for another 30 days. Scale up only if live matches paper within ±20%.

Connects to

The 2026 Engineer's Guide to AI in Markets — the strategic overview this tactical guide operationalizes.
Price-Blind LLM Research Harness — deeper dive on Layer 2.
Heartbeats, Watchdogs, and Circuit Breakers — Layer 6 in detail.
Conviction-Scaled Kelly — Layer 3 math.