What metrics should I track for prompt injection defense in production?

Question

Accepted Answer

A 2026 production monitoring set for any prompt-injection guardrail:

**Detection metrics.** Injection rate per hour, broken out by category (direct, indirect, stored, multi-turn, jailbreak, role-confusion, tool-misuse) and by ingress surface (user input, retrieved doc, tool output, memory). Spikes usually point to an active attack; sustained shifts suggest adversary adaptation.

**Quality metrics.** True-positive rate (measured against a labeled red-team corpus and customer reports), false-positive rate (measured via negative user feedback, e.g. "this was blocked but shouldn't have been"), and precision/recall per category. Re-baseline monthly.

**Performance metrics.** P50/P95/P99 classifier latency, classifier error rate (timeouts, 5xx), heuristic-vs-semantic escalation rate, per-request cost.

**Downstream impact.** Tool-call refusal rate after a classifier verdict, conversation-abandonment rate after a block (a proxy for false positives), customer-support tickets mentioning blocks.

**Forensics.** Full request payloads (subject to your data-retention policy) for any positive verdict, classifier verdict logs joined to tool-call logs and model-output logs, and a document-provenance trail for any RAG-mediated injection.

**Business signals.** Cost per request, total monthly classifier spend, security-team SLA on triaging high-confidence injection alerts.

InjectShield's dashboard exposes all of the above out of the box; the REST API can stream events to SIEM/Datadog/Honeycomb for teams that want to roll their own views.
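If you do roll your own views, the quality and performance numbers above can be computed from a joined verdict log along these lines. This is a minimal sketch, not InjectShield's API: the `Verdict` record and its field names are hypothetical, and it assumes each classifier decision has already been joined to a ground-truth label (red-team corpus or human review). It segments by ingress surface because benign traffic has no attack category, so a false-positive rate only makes sense per surface; the same counting gives per-category precision and recall if you swap the key.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Verdict:
    """One classifier decision joined with its label (hypothetical schema)."""
    surface: str         # ingress surface, e.g. "user_input", "retrieved_doc"
    flagged: bool        # classifier said "injection"
    is_injection: bool   # ground truth from red-team corpus or human review
    latency_ms: float    # classifier latency for this request


def quality_by_surface(verdicts):
    """Return {surface: {"precision", "recall", "fpr"}}; None where undefined."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for v in verdicts:
        if v.flagged:
            key = "tp" if v.is_injection else "fp"
        else:
            key = "fn" if v.is_injection else "tn"
        counts[v.surface][key] += 1

    def ratio(num, den):
        # Avoid dividing by zero when a surface has no relevant samples.
        return num / den if den else None

    return {
        surface: {
            "precision": ratio(c["tp"], c["tp"] + c["fp"]),
            "recall": ratio(c["tp"], c["tp"] + c["fn"]),
            "fpr": ratio(c["fp"], c["fp"] + c["tn"]),
        }
        for surface, c in counts.items()
    }


def latency_percentiles(verdicts):
    """P50/P95/P99 of classifier latency; needs at least two samples."""
    cuts = quantiles([v.latency_ms for v in verdicts], n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Returning `None` instead of `0.0` for an undefined ratio keeps a surface with no labeled injections from showing up on a dashboard as "0% recall". In practice you would run this over a fixed window (hourly for alerting, monthly for the re-baseline) rather than over the whole log.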