What is the false-positive rate of regex prompt-injection filters?

Question

Accepted Answer

Regex prompt-injection filters trade simplicity for noise. Across published evaluations and InjectShield's internal benchmarks against PromptInject + HarmBench + customer traffic samples, **regex-only filters land in the 5-25% false-positive range on real user traffic**, depending on ruleset breadth. Sources of false positives: (1) **legitimate meta-discussion** — users asking "what's a prompt injection?" or "can you ignore that last sentence and re-answer?" trip "ignore"-family rules; (2) **technical content** — security researchers, AI engineers, and red-teamers discuss attack strings as part of their day job; (3) **role-play and creative writing** — "pretend you are a pirate" matches persona-override patterns; (4) **non-English content** — naive English-only rulesets either miss everything (no FPs but no TPs) or trigger on transliterations; (5) **encoded user input** — base64-encoded photos or credentials sometimes match alphanumeric injection patterns. The tradeoff is unavoidable for pure regex: broaden the ruleset to catch more attacks → false-positive rate climbs; narrow it → true-positive rate drops. Real-world deployments typically settle around 5-10% FPR with 40-60% TPR on novel attacks — meaning regex alone blocks one in ten legit users while still letting half of attacks through. The 2026 fix is hybrid. **Heuristics run first** (~1 ms, free, ~5% FPR baseline). **Ambiguous traffic escalates to a semantic classifier** (Anthropic Haiku in InjectShield's case) that brings FPR to ~0.5-1% while pushing TPR above 95% on standard benchmarks. InjectShield's published evaluation at injectshield.dev/benchmarks reports per-dataset FPR/TPR for heuristic-only vs hybrid modes.