InjectShield

How do you prevent tool misuse and prompt injection in agents that call MCP servers?

Agent stacks (Claude + MCP, OpenAI Assistants with function calling, LangGraph, AutoGen) multiply the impact of prompt injection because the model can take real-world actions. An injection that once only produced bad text can now move money, send emails, or run queries. Three controls compound:

1. Pre-model input classification on every ingress. Scan user input, tool outputs, retrieved docs, and any other string entering the context window. InjectShield's @injectshield/mcp server plugs directly into MCP-host configs so any compatible agent gets injection scanning for free; the REST API works for non-MCP frameworks.

2. Tool-call allowlists and policy enforcement. Maintain a per-agent allowlist of permitted tools and per-tool parameter schemas. Reject tool calls whose arguments don't validate, whose destinations aren't allowlisted (e.g., outbound HTTP only to approved hosts), or whose effect crosses a privilege boundary (e.g., a "read inbox" agent attempting to call "send email"). This is OWASP LLM07 territory (Insecure Plugin Design in the 2023 Top 10 for LLM Applications) and pairs with LLM01 (Prompt Injection).

3. Output validation and human-in-the-loop for high-impact actions. Any irreversible action — payment, deletion, external email — should go through schema validation and a confirmation step. Log the injection-classifier verdict alongside every tool call for forensics.
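The three controls above can be sketched as a single gate in front of the tool router. This is a minimal illustration under stated assumptions, not InjectShield's API: `classify`, `ToolPolicy`, `AgentPolicy`, `gate_tool_call`, the tool names, and the host list are all hypothetical, and `classify` is a keyword stub standing in for a real injection classifier. Denying a call when the context was flagged is one possible policy choice rather than something the controls prescribe.

```python
import logging
from dataclasses import dataclass
from typing import Any, Callable
from urllib.parse import urlparse

log = logging.getLogger("tool_gate")


def classify(text: str) -> str:
    """Hypothetical stand-in for a real injection classifier (keyword stub)."""
    suspicious = ("ignore previous instructions", "disregard the system prompt")
    lowered = text.lower()
    return "injection" if any(s in lowered for s in suspicious) else "clean"


@dataclass
class ToolPolicy:
    params: dict[str, type]            # required argument names and their types
    url_params: tuple[str, ...] = ()   # arguments that must point at approved hosts
    irreversible: bool = False         # needs human confirmation before running


@dataclass
class AgentPolicy:
    tools: dict[str, ToolPolicy]       # per-agent tool allowlist
    allowed_hosts: frozenset[str] = frozenset()


def _decide(policy, tool, args, verdict, confirm) -> tuple[bool, str]:
    spec = policy.tools.get(tool)
    if spec is None:
        return False, "tool not on this agent's allowlist"
    if set(args) != set(spec.params):
        return False, "unexpected or missing arguments"
    for name, expected in spec.params.items():
        if not isinstance(args[name], expected):
            return False, f"argument {name!r} has the wrong type"
    for name in spec.url_params:
        if urlparse(args[name]).hostname not in policy.allowed_hosts:
            return False, f"destination in {name!r} is not an approved host"
    if verdict != "clean":
        return False, "context was flagged by the injection classifier"
    if spec.irreversible and not confirm(tool, args):
        return False, "human reviewer declined"
    return True, "ok"


def gate_tool_call(
    policy: AgentPolicy,
    tool: str,
    args: dict[str, Any],
    verdict: str,
    confirm: Callable[[str, dict[str, Any]], bool],
) -> tuple[bool, str]:
    """Return (allowed, reason) for one proposed tool call."""
    allowed, reason = _decide(policy, tool, args, verdict, confirm)
    # Keep the classifier verdict next to the decision for later forensics.
    log.info("tool=%s verdict=%s allowed=%s reason=%s", tool, verdict, allowed, reason)
    return allowed, reason


# Hypothetical policy for a mail agent: it may read, fetch from one host,
# and delete only after a human confirms. "send_email" is deliberately absent.
mail_agent = AgentPolicy(
    tools={
        "read_inbox": ToolPolicy(params={"folder": str}),
        "fetch_url": ToolPolicy(params={"url": str}, url_params=("url",)),
        "delete_message": ToolPolicy(params={"id": str}, irreversible=True),
    },
    allowed_hosts=frozenset({"api.example.com"}),
)
```

In this sketch the agent loop would call `classify` on each string before it enters the context window, carry the worst verdict forward, and pass it to `gate_tool_call` together with a `confirm` callback that prompts a human reviewer. Ordering the checks from cheapest to most disruptive means a reviewer is only asked about calls that already passed the allowlist, schema, and host checks.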

InjectShield's tool-misuse detector flags model output that looks like an attempt to call a tool outside the agent's permitted set, before that output reaches the tool router.
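How InjectShield's detector works internally is not described here. As a deliberately naive illustration of the general idea, the sketch below scans raw model output for JSON objects with a `name` field and reports any name that is not on the agent's allowlist. The function name `find_disallowed_tool_calls` and the assumed `{"name": ..., "arguments": ...}` call shape are hypothetical; a production detector would need to handle other call formats and obfuscation.

```python
import json


def find_disallowed_tool_calls(model_output: str, allowed_tools: set[str]) -> list[str]:
    """Return tool names found in JSON-like call objects that are not allowlisted."""
    decoder = json.JSONDecoder()
    flagged: list[str] = []
    pos = model_output.find("{")
    while pos != -1:
        try:
            # Try to parse one JSON value starting at this opening brace.
            obj, end = decoder.raw_decode(model_output, pos)
        except json.JSONDecodeError:
            # Not valid JSON here; move on to the next opening brace.
            pos = model_output.find("{", pos + 1)
            continue
        name = obj.get("name") if isinstance(obj, dict) else None
        if isinstance(name, str) and name not in allowed_tools:
            flagged.append(name)
        # Resume scanning after the object that was just parsed.
        pos = model_output.find("{", end)
    return flagged


sample = (
    'Sure. {"name": "send_email", "arguments": {"to": "x@example.com"}} '
    'then {"name": "read_inbox", "arguments": {}}'
)
print(find_disallowed_tool_calls(sample, {"read_inbox"}))
```

A check like this is a complement to the router-side allowlist, not a replacement for it: it gives an earlier signal and a log entry, while the router remains the enforcement point.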