How to use Prompt Injection Tester
Red-team your finance agent against 24 documented prompt-injection attacks — direct override, role confusion, indirect injection via retrieved content. The page reports which attacks the agent followed and which it correctly refused.
What It Does
Use the calculator with intent
Red-team your finance agent against 24 documented prompt-injection attacks — direct override, role confusion, indirect injection via retrieved content. The page reports which attacks the agent followed and which it correctly refused.
Engineers deploying agents that read external content (news, filings, social posts) and need to know which prompt-injection vectors break their guardrails.
Interpreting Results
Any 'followed' attack is a critical bug — patch the system prompt or input sanitization before deployment. 'Refused' attacks are passing the bar; 'partial' means the agent didn't follow but also didn't flag the attempt, which is a softer fail.
Input Steps
Field by field
- 1
Pick option
Pick the prompt being tested (e.g., your agent's system prompt + sample user message).
- 2
Run calculation
Run the injection battery: instruction override, context smuggling, role confusion, data exfiltration, refusal bypass.
- 3
Read outputs
Read the per-attack pass/fail. Aggregate pass rate is the high-level metric; per-attack details show specific weaknesses.
- 4
Investigate
Investigate failures. Each failure shows the attack prompt and the model's compromised output — useful for hardening the system prompt.
- 5
Re-run
Re-run after every system prompt change. Injection resistance is fragile; small prompt edits can introduce regressions.
Common Scenarios
Use realistic starting points
Naive agent (no defenses)
System prompt
minimal
Input sanitization
none
Most attacks succeed. Direct-override attacks score 80%+. Use this as a baseline before hardening.
Hardened agent (post-defenses)
System prompt
explicit refusal rules
Input sanitization
yes
Direct attacks largely refused; indirect (RAG-based) attacks still leak through if retrieved content can include instructions.
Try These Tools
Run the numbers next
Agent Skill Tester for Markets
Paste a SKILL.md definition + sample input + your Anthropic API key. See structured extraction, token cost, and latency — all in your browser. No signup.
Hallucination Detector
Paste a source document + an LLM's extraction. Every numeric claim in the output is checked against the source. Client-side. Catches silent fabrication.
Price-Blind Research Auditor
Paste a research prompt or agent context bundle. The auditor flags price numbers, directional words, and outcome-leaking phrases that cause LLMs.
FAQ
Questions people ask next
The short answers readers usually want after the first pass.
Related Content
Keep the topic connected
Prompt Injection
Prompt injection: when untrusted text in a prompt overrides system instructions. The attack patterns and the structural defenses that work in production.
Agent Skill Testing
Agent skill testing: the regression-test discipline for LLM-driven agents. What to test, how to score, and the difference between pass-rate and capability.
Model Drift
Model drift: when an LLM's behavior changes between calls, versions, or weeks. The monitoring stack that catches it before production breaks.