How do you red-team an LLM app for prompt injection (end-to-end playbook)?

Question

Accepted Answer

As of 2026, an end-to-end prompt-injection red-team exercise has six phases (this goes deeper than the introductory treatment in question 13).

**1. Threat-model.** Enumerate every ingress: user input, retrieved documents, tool outputs, fetched web pages, email/calendar bodies, file uploads (PDFs, images), conversation memory, user-editable profile fields. Map each ingress to OWASP LLM01 (Prompt Injection) and to an attack sub-type: direct, indirect, stored, multi-turn, role-confusion, jailbreak, tool-misuse.

**2. Asset inventory.** What can the model do? List every tool, every database, every external API, every email/payment/file capability. Each becomes a "blast radius" objective for red-team scenarios.

**3. Automated scanning.** Run **garak** (NVIDIA, a broad LLM vulnerability scanner), **Promptmap2** (an injection-focused fuzzer), **PyRIT** (Microsoft, a generative red-team framework), and **HouYi** or other academic adversarial corpora against staging. Run each suite twice, once with guardrails ON and once with them OFF, to document the baseline and measure the detection lift.

**4. Manual creative testing.** Automated scanners miss novel attacks. Have humans try: role-confusion (DAN family), context-window flooding, multi-turn slow-drip, encoding tricks (base64, ROT13, Unicode lookalikes, zero-width chars), language switching, multimodal inputs (images, PDFs, audio if applicable), and tool-output poisoning via crafted external content.

**5. Indirect-surface testing.** Plant payloads in test PDFs, test web pages, test emails, test calendar invites, test code repos, and test RAG documents. Verify the model behaves correctly when *legitimate* users feed those in.

**6. Report and remediate.** For each attack category, report the detection rate, false-positive rate, latency, and blast radius. Re-test quarterly and after every model upgrade. Track metrics in the InjectShield dashboard or your SIEM. InjectShield ships an open-source adversarial test corpus mirroring garak's prompt-injection probes for self-evaluation.
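The guardrails ON vs OFF comparison and the per-category report metrics can be computed from a flat log of test trials. The following is a minimal sketch, not the output format of any of the tools named above: the `Trial` record, its field names, and the `summarize` helper are all hypothetical, and it assumes you have already normalized scanner and manual-test results into one row per attempt. It covers detection rate, detection lift, and false-positive rate only; latency and blast radius would need their own fields.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One test attempt (hypothetical schema, not a scanner's native format)."""
    category: str      # e.g. "direct", "indirect", "encoding"
    malicious: bool    # True for attack payloads, False for benign controls
    guardrails: bool   # True if this run had guardrails ON
    flagged: bool      # True if the app blocked or flagged the input


def rate(hits: int, total: int) -> float:
    """Return hits/total, or 0.0 when there were no trials."""
    return hits / total if total else 0.0


def summarize(trials: list[Trial]) -> dict[str, dict[str, float]]:
    """Per-category detection rate (ON and OFF), lift, and false-positive rate."""
    counts = defaultdict(lambda: defaultdict(int))
    for t in trials:
        c = counts[t.category]
        if t.malicious:
            key = "on" if t.guardrails else "off"
            c[f"attacks_{key}"] += 1
            c[f"caught_{key}"] += t.flagged
        elif t.guardrails:
            # Benign controls only matter for the guarded configuration.
            c["benign_on"] += 1
            c["false_pos_on"] += t.flagged

    report = {}
    for category, c in counts.items():
        det_on = rate(c["caught_on"], c["attacks_on"])
        det_off = rate(c["caught_off"], c["attacks_off"])
        report[category] = {
            "detection_on": det_on,
            "detection_off": det_off,
            "lift": det_on - det_off,  # what the guardrails actually add
            "false_positive_rate": rate(c["false_pos_on"], c["benign_on"]),
        }
    return report
```

Calling `summarize(trials)` returns one dictionary per attack category, which can be pushed to a dashboard or SIEM as-is. Including benign control inputs in every category matters: without them the false-positive rate is undefined, and a guardrail that blocks everything would look perfect.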