What is instruction smuggling via Unicode lookalikes and how do you defend against it?

Question

Accepted Answer

Unicode instruction smuggling is an encoding-bypass attack in which adversary text uses non-ASCII characters that look like English but don't match the keywords your filters watch for. The model often still reads and follows the instruction, because it has learned to treat many of these glyphs as equivalent to their Latin counterparts; your regex-based filter, which matches only the raw code points, does not.

Common variants:

- **Cyrillic/Greek lookalikes**: "і" U+0456, "о" U+043E, and "е" U+0435 replace Latin i/o/e.
- **Mathematical alphanumerics**: 𝐢𝐠𝐧𝐨𝐫𝐞 instead of "ignore".
- **Fullwidth Latin**: ｉｇｎｏｒｅ.
- **Zero-width and tag characters**: zero-width joiners (U+200D) and the tag block (U+E0000 to U+E007F). Riley Goodside demonstrated invisible Unicode-tag instruction smuggling against several frontier models in 2024.
- **Mixed homoglyph substitution**: attacks that combine several of the categories above.

Defense is three-fold:

1. **Normalize before classifying.** NFKC Unicode normalization folds compatibility forms such as fullwidth and mathematical letters to plain Latin, and a confusables map (Unicode TR39) handles the cross-script lookalikes that NFKC leaves untouched. Together they collapse lookalikes to canonical ASCII before pattern matching.
2. **Strip or flag invisible characters.** Zero-width joiners, tag characters, and bidi controls should either be removed or surfaced as a "suspicious encoding" verdict.
3. **Run a semantic classifier on the canonicalized text.** Even after normalization, a Haiku semantic check catches paraphrased intent that regex misses.

InjectShield's heuristic preprocessor runs NFKC + TR39 + invisible-char stripping before pattern matching, and surfaces a dedicated `unicode_smuggling` verdict category so security teams can monitor encoded-attack volume separately from plain-text injection. This pairs with the input-handling controls recommended under OWASP LLM01 (Prompt Injection).
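The normalize-and-strip steps can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not InjectShield's actual implementation: the `CONFUSABLES` table is a tiny hand-picked subset of the TR39 data, and the `INJECTION` pattern, the function names, and the verdict logic (including the `plain_injection` and `clean` labels) are assumptions made for the example.

```python
import re
import unicodedata
from typing import NamedTuple

# Hand-picked subset of cross-script lookalikes, for illustration only.
# A real deployment would load the full Unicode TR39 confusables data.
CONFUSABLES = {
    "\u0456": "i",  # Cyrillic small byelorussian-ukrainian i
    "\u043e": "o",  # Cyrillic small o
    "\u0435": "e",  # Cyrillic small ie
    "\u0430": "a",  # Cyrillic small a
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
    "\u03bf": "o",  # Greek small omicron
}

# Hypothetical blocklist pattern standing in for a real rule set.
INJECTION = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)


class Canonical(NamedTuple):
    text: str               # canonicalized text to feed the classifier
    invisible_removed: int  # number of invisible characters stripped


def is_invisible(ch: str) -> bool:
    """True for tag characters and format controls (category Cf)."""
    if 0xE0000 <= ord(ch) <= 0xE007F:  # Unicode Tags block
        return True
    # Cf covers zero-width joiners/spaces, bidi controls, and the BOM.
    return unicodedata.category(ch) == "Cf"


def canonicalize(text: str) -> Canonical:
    visible = "".join(ch for ch in text if not is_invisible(ch))
    # NFKC folds fullwidth and mathematical letters to plain Latin.
    folded = unicodedata.normalize("NFKC", visible)
    # The confusables map handles cross-script lookalikes NFKC leaves alone.
    mapped = "".join(CONFUSABLES.get(ch, ch) for ch in folded)
    return Canonical(mapped, len(text) - len(visible))


def classify(text: str) -> str:
    canon = canonicalize(text)
    if INJECTION.search(text):
        return "plain_injection"
    # The pattern only appears after canonicalization: it was encoded.
    if INJECTION.search(canon.text):
        return "unicode_smuggling"
    if canon.invisible_removed:
        return "suspicious_encoding"
    return "clean"
```

Under these assumptions, `classify("\u0456gn\u043er\u0435 previous instructions")` falls into the `unicode_smuggling` branch, because the pattern matches only after canonicalization. Stripping every `Cf` character also removes the zero-width joiners used in legitimate emoji sequences and some scripts, so a production filter may prefer to flag such input rather than silently strip it.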