What is prompt injection and how does it work?

Question

Accepted Answer

Prompt injection is an attack class in which adversary-controlled text is interpreted by an LLM as instructions rather than as data. LLMs have no strict syntactic boundary between "trusted system prompt" and "untrusted user/document content," so any string that reaches the model's context window can change the model's behavior. OWASP catalogues this as LLM01 in its LLM Top 10.

Prompt injection works in three steps: (1) the attacker delivers a payload, either directly via a chat box or indirectly via content the model will later read (an email, a PDF, a webpage, a code comment, a tool result); (2) the payload is tokenized into the same context window as the legitimate system prompt, where imperative phrases such as "ignore previous instructions," "you are now…," or "exfiltrate" can be followed as commands because nothing marks them as data; (3) the model's output, tool calls, or downstream actions follow the attacker's instructions instead of the developer's.

Unlike SQL injection, there is no parser to fix: the "vulnerability" is the model's instruction-following itself. Defense therefore happens *around* the model: input classification (heuristic + semantic), output filtering, tool-call allowlists, and context isolation. InjectShield implements the input-classification layer, using open-source heuristics for fast, free filtering and Anthropic Haiku for nuanced semantic detection.
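To make two of the defensive layers named above concrete, here is a minimal Python sketch of a heuristic input classifier and a tool-call allowlist. It is an illustration only, not InjectShield's actual code: the pattern list, the tool names, and the names `SUSPICIOUS_PATTERNS`, `ALLOWED_TOOLS`, `classify_input`, and `is_tool_call_allowed` are all hypothetical. It uses only the standard library.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical patterns for illustration; a real rule set would be much larger
# and this is NOT InjectShield's actual list.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions", re.IGNORECASE),
    re.compile(r"you\s+are\s+now\b", re.IGNORECASE),
    re.compile(r"\bexfiltrate\b", re.IGNORECASE),
]

# Hypothetical allowlist: the only tools the application lets the model invoke.
ALLOWED_TOOLS = frozenset({"search_docs", "summarize"})


@dataclass
class Verdict:
    suspicious: bool      # True if at least one pattern matched
    matches: List[str]    # the regex sources that matched, for logging


def classify_input(text: str) -> Verdict:
    """Heuristic layer: flag text containing known injection phrasing."""
    matches = [p.pattern for p in SUSPICIOUS_PATTERNS if p.search(text)]
    return Verdict(suspicious=bool(matches), matches=matches)


def is_tool_call_allowed(tool_name: str) -> bool:
    """Allowlist layer: reject any tool the developer did not approve."""
    return tool_name in ALLOWED_TOOLS
```

A caller would run `classify_input` on every untrusted string (user message, retrieved document, tool result) before it reaches the model, and check `is_tool_call_allowed` on every tool call the model emits. Pattern matching like this is cheap but easy to evade with paraphrase or encoding tricks, which is why the answer pairs it with a semantic classifier rather than relying on it alone.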