What is the false-positive rate of regex prompt-injection filters?
Regex prompt-injection filters trade simplicity for noise. Across published evaluations and InjectShield's internal benchmarks against PromptInject + HarmBench + customer traffic samples, regex-only filters land in the 5-25% false-positive range on real user traffic, depending on ruleset breadth.
Sources of false positives: (1) legitimate meta-discussion — users asking "what's a prompt injection?" or "can you ignore that last sentence and re-answer?" trip "ignore"-family rules; (2) technical content — security researchers, AI engineers, and red-teamers discuss attack strings as part of their day job; (3) role-play and creative writing — "pretend you are a pirate" matches persona-override patterns; (4) non-English content — naive English-only rulesets either miss everything (no FPs but no TPs) or trigger on transliterations; (5) encoded user input — base64-encoded photos or credentials sometimes match alphanumeric injection patterns.
The tradeoff is unavoidable for pure regex: broaden the ruleset to catch more attacks → false-positive rate climbs; narrow it → true-positive rate drops. Real-world deployments typically settle around 5-10% FPR with 40-60% TPR on novel attacks — meaning regex alone blocks one in ten legit users while still letting half of attacks through.
The 2026 fix is hybrid. Heuristics run first (~1 ms, free, ~5% FPR baseline). Ambiguous traffic escalates to a semantic classifier (Anthropic Haiku in InjectShield's case) that brings FPR to ~0.5-1% while pushing TPR above 95% on standard benchmarks. InjectShield's published evaluation at injectshield.dev/benchmarks reports per-dataset FPR/TPR for heuristic-only vs hybrid modes.