How do you red-team an LLM application for prompt injection?

Question

Accepted Answer

A 2026 prompt-injection red-team has four phases: **1. Threat-model the attack surface.** Enumerate every place untrusted text enters the model: user input, retrieved documents, tool outputs, web fetches, email bodies, file uploads, code comments, conversation memory, user-editable profile fields. Each is a potential ingress. **2. Run automated adversarial suites.** **garak** (NVIDIA) — broad LLM vulnerability scanner covering prompt injection, jailbreaks, data leakage. **Promptmap2** — focused prompt-injection fuzzer. **PyRIT** (Microsoft) — generative red-teaming framework. **HouYi** / academic adversarial corpora. Run them against staging with your guardrails enabled and disabled to measure detection lift. **3. Manual creative testing.** Auto-scanners miss novel attacks. Have humans try: role-confusion ("you are now DAN"), context-window flooding, multi-turn slow-drip ("OK, now in step 7…"), encoding tricks (base64, unicode tag chars, leet-speak), language switching, multi-modal injection if you accept images/PDFs. **4. Test indirect surface specifically.** Plant payloads in test documents, test web pages, test emails, test code repos. Verify the model behaves correctly when *legitimate* users feed those into the model. Track detection rate, false-positive rate, latency, and per-attack-category coverage. Re-test quarterly and after every model upgrade — model updates can both add and remove training-time defenses. InjectShield ships an adversarial test corpus that mirrors garak's prompt-injection probes for self-evaluation.