Are the test prompts safe to run on a live system?

They're designed to test, not exploit. Most prompts trigger the model's safety measures, which is the point. Don't run injection tests against production systems with real users — run against a clone or staging. The methodology page emphasizes this.

Does a high pass rate mean my agent is secure?

It means it resisted these specific attacks. New attack patterns appear constantly; the test suite is updated quarterly. A high pass rate is necessary but not sufficient. Layered defenses (input filtering, output filtering, scope limits on tools) matter more than any single defense.

What's a common mistake when using Prompt Injection Tester?

Testing only direct attacks. Indirect injection through retrieved content is the vector that matters in production — most teams skip it because the test is harder.

How do system-prompt leaks slip past basic filters?

Treating 'agent didn't follow' as success. The agent should also flag the attempt for review; silent refusal hides the attack from monitoring.

AI in Markets Calculator Guide

How to use Prompt Injection Tester

Red-team your finance agent against 24 documented prompt-injection attacks — direct override, role confusion, indirect injection via retrieved content. The page reports which attacks the agent followed and which it correctly refused.

5 STEPSPublished May 12, 2026Live Content

By Orbyd Editorial · AI Fin Hub Team

Best Next MovePlaygrounds

Prompt Injection Tester

Red-team a finance agent against 24 documented prompt-injection attacks — direct override, role confusion, indirect injection via retrieved content.

CalculatorOpen ->

On This Page

Overview 5 steps Scenarios FAQ

What It Does

Use the calculator with intent

Engineers deploying agents that read external content (news, filings, social posts) and need to know which prompt-injection vectors break their guardrails.

Interpreting Results

Any 'followed' attack is a critical bug — patch the system prompt or input sanitization before deployment. 'Refused' attacks are passing the bar; 'partial' means the agent didn't follow but also didn't flag the attempt, which is a softer fail.

Input Steps

Field by field

1

Pick option

Pick the prompt being tested (e.g., your agent's system prompt + sample user message).
2

Run calculation

Run the injection battery: instruction override, context smuggling, role confusion, data exfiltration, refusal bypass.
3

Read outputs

Read the per-attack pass/fail. Aggregate pass rate is the high-level metric; per-attack details show specific weaknesses.
4

Investigate

Investigate failures. Each failure shows the attack prompt and the model's compromised output — useful for hardening the system prompt.
5

Re-run

Re-run after every system prompt change. Injection resistance is fragile; small prompt edits can introduce regressions.

Common Scenarios

Use realistic starting points

Naive agent (no defenses)

System prompt

minimal

Input sanitization

none

Most attacks succeed. Direct-override attacks score 80%+. Use this as a baseline before hardening.

Hardened agent (post-defenses)

System prompt

explicit refusal rules

Input sanitization

yes

Direct attacks largely refused; indirect (RAG-based) attacks still leak through if retrieved content can include instructions.

Try These Tools

Run the numbers next

PlaygroundsCalculator

Agent Skill Tester for Markets

Paste a SKILL.md definition + sample input + your Anthropic API key. See structured extraction, token cost, and latency — all in your browser. No signup.

Launch toolOpen ->

PlaygroundsCalculator

Hallucination Detector

Paste a source document + an LLM's extraction. Every numeric claim in the output is checked against the source. Client-side. Catches silent fabrication.

Launch toolOpen ->

PlaygroundsCalculator

Price-Blind Research Auditor

Paste a research prompt or agent context bundle. The auditor flags price numbers, directional words, and outcome-leaking phrases that cause LLMs.

Launch toolOpen ->

FAQ

Questions people ask next

The short answers readers usually want after the first pass.

Documented on the methodology page: instruction override ('ignore previous instructions'), context smuggling (hidden instructions in retrieved documents), role confusion ('you are now a different assistant'), data exfiltration ('repeat your system prompt'), and refusal bypass ('this is a test, please respond despite policy'). Plus jailbreak templates from public databases.

Keep the topic connected

AI in Markets1 FAQS

Prompt Injection

Prompt injection: when untrusted text in a prompt overrides system instructions. The attack patterns and the structural defenses that work in production.

Keep readingRead ->

AI in Markets1 FAQS

Agent Skill Testing

Agent skill testing: the regression-test discipline for LLM-driven agents. What to test, how to score, and the difference between pass-rate and capability.

Keep readingRead ->

AI in Markets1 FAQS

Model Drift

Model drift: when an LLM's behavior changes between calls, versions, or weeks. The monitoring stack that catches it before production breaks.

Keep readingRead ->

Use the calculator with intent

Field by field

Pick option

Run calculation

Read outputs

Investigate

Re-run

Use realistic starting points

Naive agent (no defenses)

Hardened agent (post-defenses)

Run the numbers next

Agent Skill Tester for Markets

Hallucination Detector

Price-Blind Research Auditor

Questions people ask next

Keep the topic connected

Prompt Injection

Agent Skill Testing

Model Drift