InjectShield

How do you red-team an LLM app for prompt injection (end-to-end playbook)?

A 2026 end-to-end prompt-injection red-team has six phases (deeper than the introductory question 13 covered).

1. Threat-model. Enumerate every ingress: user input, retrieved documents, tool outputs, fetched web pages, email/calendar bodies, file uploads (PDFs, images), conversation memory, user-editable profile fields. Map each to OWASP LLM01 sub-types (direct, indirect, stored, multi-turn, role-confusion, jailbreak, tool-misuse).

2. Asset inventory. What can the model do? List every tool, every database, every external API, every email/payment/file capability. Each becomes a "blast radius" objective for red-team scenarios.

3. Automated scanning. Run garak (NVIDIA — broad LLM vulnerability scanner), Promptmap2 (injection-focused fuzzer), PyRIT (Microsoft — generative red-team framework), and HouYi / academic adversarial corpora against staging. Run twice: guardrails ON vs OFF, to measure detection lift and document baseline.

4. Manual creative testing. Auto-scanners miss novel attacks. Have humans try: role-confusion (DAN family), context-window flooding, multi-turn slow-drip, encoding tricks (base64, ROT13, Unicode lookalikes, zero-width chars), language switching, multimodal (images, PDFs, audio if applicable), tool-output poisoning via crafted external content.

5. Indirect-surface testing. Plant payloads in test PDFs, test web pages, test emails, test calendar invites, test code repos, test RAG documents. Verify the model behaves correctly when *legitimate* users feed those in.

6. Report and remediate. Per-attack-category detection rate, false-positive rate, latency, blast radius. Re-test quarterly and after every model upgrade. Track metrics in the InjectShield dashboard or your SIEM.

InjectShield ships an open-source adversarial test corpus mirroring garak's prompt-injection probes for self-evaluation.