Can a regex or keyword filter stop prompt injection?

Question

Accepted Answer

Short answer: no. Regex and keyword filters catch the most naive attacks ("ignore previous instructions," "you are now DAN," "system:") and that's worth doing — InjectShield's open-source heuristic layer includes hundreds of such patterns and runs in ~1 ms. But regex alone is insufficient for three reasons. **Semantic equivalence.** "Disregard everything above" / "set aside the prior guidance" / a paragraph of polite English ending in a behavior change all bypass keyword filters. LLMs are semantic systems; defense needs semantic understanding too. **Encoding bypasses.** Base64, ROT13, leet-speak, unicode tag characters, zero-width joiners, language switching, and prompt translation all defeat literal pattern matching while remaining executable instructions to the model. **Indirect and multi-turn.** A payload distributed across five turns, or split between a user message and a retrieved document, will never match a single regex. The 2026 best practice is a hybrid stack: cheap, transparent heuristics as a first pass (fast, free, auditable), with a semantic classifier (small LLM like Haiku, Llama Guard, or a fine-tuned encoder) for the cases regex can't see. InjectShield runs heuristics first and only escalates ambiguous traffic to the Haiku semantic layer — this keeps cost near-zero on benign traffic while catching the long tail.