How does Claude defend against prompt injection out of the box?

Question

Accepted Answer

Anthropic's Claude models are trained with Constitutional AI, which makes them comparatively resistant to many direct jailbreak patterns and explicit role-override attempts. Claude is trained to recognize the conversational hierarchy of system / user / assistant turns and to treat the system prompt as higher-trust than user content. Anthropic has also published research on classifier-based defenses and on tool-use safety for models such as Claude 3.5 and Claude Opus 4.

That training is **not** a substitute for an application-layer defense, for three reasons.

First, indirect injection (payloads embedded in retrieved documents, tool outputs, or web pages) still reaches the model as user-role content, and Claude cannot reliably tell that the user did not author it.

Second, novel multi-turn and stored injections evolve faster than model training cycles.

Third, agent deployments expand the blast radius: a Claude agent with MCP tools can call external systems, so a successful tool-misuse injection can move money, send email, or query a database.

Best practice: treat Claude's training as a *defense-in-depth* baseline, then add a dedicated input classifier (InjectShield or equivalent) at the application layer, plus tool-call allowlisting and output schema validation. InjectShield's MCP integration is specifically designed for Claude + MCP agent stacks.
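The tool-call allowlisting and schema validation recommended above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not InjectShield's or Anthropic's API: the tool names, the `ToolCall` type, and `validate_tool_call` are all hypothetical, and a real deployment would likely use a fuller schema validator. The idea is that every tool call the model proposes passes through a gate your application controls before anything is executed.

```python
from dataclasses import dataclass

# Hypothetical allowlist: tool name -> required argument names and their types.
# Any tool not listed here is refused, whatever the model asks for.
ALLOWED_TOOLS = {
    "search_docs": {"query": str},
    "get_order_status": {"order_id": str},
}


@dataclass
class ToolCall:
    """A tool call proposed by the model (illustrative shape, not a real SDK type)."""
    name: str
    arguments: dict


class ToolCallRejected(Exception):
    """Raised when a model-proposed tool call fails validation."""


def validate_tool_call(call: ToolCall) -> ToolCall:
    """Return the call unchanged if it passes the allowlist and schema checks."""
    schema = ALLOWED_TOOLS.get(call.name)
    if schema is None:
        raise ToolCallRejected(f"tool not allowlisted: {call.name}")

    # Reject extra arguments so an injected payload cannot smuggle in parameters.
    unexpected = set(call.arguments) - set(schema)
    if unexpected:
        raise ToolCallRejected(f"unexpected arguments: {sorted(unexpected)}")

    # Every declared argument must be present and of the declared type.
    for key, expected_type in schema.items():
        if key not in call.arguments:
            raise ToolCallRejected(f"missing argument: {key}")
        if not isinstance(call.arguments[key], expected_type):
            raise ToolCallRejected(f"wrong type for argument: {key}")

    return call
```

With this gate in place, a call such as `ToolCall("send_email", {...})` raises `ToolCallRejected` because `send_email` is not in the allowlist, even if an injected document persuaded the model to request it. The check is deliberately deny-by-default: adding a capability means editing the allowlist, not relaxing a filter.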