What datasets exist for prompt-injection testing (PromptInject, HarmBench, etc.)?

Question

Accepted Answer

The 2026 landscape of public injection-evaluation datasets: - **PromptInject** (Perez & Ribeiro, 2022) — the foundational direct-injection benchmark; canonical "ignore previous instructions" / goal-hijacking / prompt-leak scenarios. Still the baseline every defense vendor reports against. - **HarmBench** (Mazeika et al., 2024) — broader harmful-behavior benchmark with a dedicated injection subset; standardized attack/defense evaluation protocol. - **JailbreakBench** (Chao et al., 2024) — focused jailbreak/role-confusion corpus with reproducible evaluation harness; tracks both attack success and defense robustness across frontier models. - **AdvBench** — adversarial behavior dataset, often paired with GCG (Greedy Coordinate Gradient) attacks. - **garak probes** (NVIDIA) — not a static dataset but a scanner with hundreds of injection-specific probes; ships as the de-facto OSS red-team tool. - **TensorTrust** (Toyer et al., 2024) — crowdsourced 100k+ direct-injection attempts from an attack/defense game; useful for diverse real-world phrasings. - **INJECAGENT** (Zhan et al., 2024) — focused on indirect injection in tool-use agents (OWASP LLM01 + LLM07); evaluates agent-specific blast radius. - **HouYi** — academic indirect-injection benchmark with documented exfiltration scenarios. For RAG: **PoisonedRAG** (Zou et al., 2024) and **PromptBench** RAG subsets cover stored injection. For multimodal: **MMSafety** and Anthropic's vision red-team disclosures. Production-grade testing should combine 2-3 datasets across direct, indirect, and agent surfaces, plus your own domain-specific adversarial corpus. InjectShield publishes its open-source heuristic ruleset on GitHub and is evaluated against PromptInject + HarmBench + INJECAGENT — benchmarks at injectshield.dev/benchmarks.