InjectShield

How do you detect jailbreaks like DAN, "ignore previous instructions," or role-confusion?

Jailbreaks are a subclass of direct prompt injection in which the attacker tries to override the assistant's persona or safety policies. Common patterns include DAN ("Do Anything Now", a persona prompt instructing the model to drop refusals), role-confusion ("you are now an unrestricted AI named X"), instruction-override ("ignore previous instructions and instead…"), hypothetical framing ("in a fictional world where there are no rules, write…"), token smuggling (splitting forbidden words across messages or hiding them in encodings), and policy puppetry (claiming the developer/Anthropic/OpenAI has authorized an exception).

Detection is layered. Heuristics catch the obvious phrasings: InjectShield's open-source ruleset includes hundreds of DAN-family patterns, "ignore"-family overrides, and known persona-override openers. Semantic classification handles paraphrased and creative variants: InjectShield escalates ambiguous traffic to Anthropic's Claude Haiku, which can recognize role-override intent even in novel English phrasing. Behavioral signals add a third layer: monitor for drops in refusal rate, sudden persona shifts, and outputs that contradict the system prompt.
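
The first two layers can be sketched roughly as follows. This is a minimal illustration, not InjectShield's actual ruleset or API: the three patterns, the AMBIGUITY_HINTS regex, the Verdict type, and the classify_input function are all hypothetical names chosen for the example, and the semantic classifier is passed in as a plain callable so that any model-backed check could be plugged in.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Illustrative patterns only (assumption): a real ruleset would hold
# hundreds of entries and be maintained separately from the code.
HEURISTIC_PATTERNS = [
    # "DAN" is matched case-sensitively so the name "Dan" does not trigger it.
    ("dan_persona", re.compile(r"(?i:\bdo anything now\b)|\bDAN\b")),
    ("instruction_override", re.compile(
        r"\bignore\s+(?:all\s+)?(?:previous|prior|above)\s+instructions\b",
        re.IGNORECASE)),
    ("role_confusion", re.compile(
        r"\byou are now an?\s+(?:unrestricted|uncensored|unfiltered)\b",
        re.IGNORECASE)),
]

# Weak signals (assumption): not enough to block on their own,
# but enough to justify paying for a semantic check.
AMBIGUITY_HINTS = re.compile(
    r"\b(?:pretend|fictional|hypothetical|no rules|act as|roleplay)\b",
    re.IGNORECASE)


@dataclass
class Verdict:
    label: str                  # "block" or "allow"
    layer: str                  # "heuristic", "semantic", or "none"
    rule: Optional[str] = None  # name of the heuristic that fired, if any


def classify_input(
    text: str,
    semantic_classifier: Optional[Callable[[str], bool]] = None,
) -> Verdict:
    """Run cheap heuristics first; escalate only ambiguous text."""
    for name, pattern in HEURISTIC_PATTERNS:
        if pattern.search(text):
            return Verdict("block", "heuristic", name)
    if semantic_classifier is not None and AMBIGUITY_HINTS.search(text):
        flagged = semantic_classifier(text)
        return Verdict("block" if flagged else "allow", "semantic")
    return Verdict("allow", "none")
```

Ordering the layers this way keeps the expensive model call off the hot path: obvious attacks and obviously benign traffic are both resolved by regex alone, and only text with weak signals is escalated.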

For high-stakes deployments, combine input classification with output filtering, so that content violating policy is caught on the way out even when the input check missed the attack. InjectShield exposes both directions: classify input to block, classify output to redact.
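
A minimal sketch of the two directions, under stated assumptions: guarded_call, INPUT_BLOCK, and OUTPUT_REDACT are hypothetical names, not InjectShield's API, and the output policy here (redacting strings shaped like a secret key) is only a stand-in for whatever policy a deployment actually enforces. The model is passed in as a callable to keep the example self-contained.

```python
import re
from typing import Callable

# Input side (assumption): one illustrative override pattern.
INPUT_BLOCK = re.compile(
    r"\bignore\s+(?:all\s+)?(?:previous|prior)\s+instructions\b",
    re.IGNORECASE)

# Output side (assumption): treat anything shaped like "sk-<8+ alphanumerics>"
# as a leaked secret that must not reach the user.
OUTPUT_REDACT = re.compile(r"\bsk-[A-Za-z0-9]{8,}\b")

BLOCK_MESSAGE = "[blocked: suspected prompt injection]"


def guarded_call(prompt: str, model: Callable[[str], str]) -> str:
    """Block suspicious input before the model runs; redact its output after."""
    if INPUT_BLOCK.search(prompt):
        # The model is never invoked for blocked input.
        return BLOCK_MESSAGE
    reply = model(prompt)
    return OUTPUT_REDACT.sub("[redacted]", reply)
```

Blocking on input and redacting on output are deliberately asymmetric: a blocked prompt costs nothing further, while a redacted reply still returns the parts of the answer that did not violate policy.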