Skip to content

    LLM Agent Incident Response Playbook for a Compromised MCP Server

    When an LLM agent goes rogue, the failure mode does not look like a classic intrusion. There is no port scan, no malware dropper, no obvious lateral movement. Instead, a trusted agent with a valid token starts doing slightly wrong things at machine speed: reading files it has never touched, calling a Model Context Protocol (MCP) tool in an unusual order, or exfiltrating data through a perfectly legitimate API. By the time a human notices, the agent may have executed thousands of actions. This playbook is built for that reality.

    This is an operational incident response (IR) playbook for security engineers who run or defend MCP servers and the LLM agents that connect to them. It adapts the classic NIST SP 800-61 incident lifecycle to the specific physics of AI agents: prompt injection as an initial access vector, tool invocation as the primary action-on-objectives, and short-lived OAuth tokens as the credential that must be revoked. Every phase below includes concrete signals, commands, and decisions you can lift directly into your own runbook.

    The structure follows NIST's four phases, with detection and triage broken out because in agent incidents the two are tightly coupled. Use the phase-by-phase checklist table as your in-incident reference, and the example detection queries as a starting point for your SIEM or log pipeline.

    What counts as an LLM agent incident

    Before you can respond, you need a shared definition of what you are responding to. An LLM agent incident is any event where an MCP server, its tools, or the agent driving it behaves outside its intended authority. The most common scenarios fall into a handful of categories:

    • Indirect prompt injection — malicious instructions embedded in a document, web page, email, or tool result hijack the agent's behavior. The agent itself is not compromised in the traditional sense; its instructions are.
    • Token or credential theft — an attacker obtains the OAuth token, API key, or session the agent uses to reach the MCP server and replays it directly.
    • Malicious or compromised MCP server — a tool server that the agent trusts returns poisoned tool descriptions (tool poisoning) or silently shadows a legitimate tool.
    • Excessive agency / confused deputy — the agent is tricked into using its legitimate, over-broad permissions to perform an action on the attacker's behalf.
    • Supply-chain compromise — a dependency, MCP package, or model update introduces hostile behavior.

    The unifying property is that the actor abuses trusted machinery. That is why containment leans heavily on revoking trust (tokens, tool access, network paths) rather than removing malware. If you have not yet inventoried which of these apply to your environment, an MCP attack surface assessment and a review of the MCP threat matrix will tell you where you are exposed before an incident forces the question.

    Detection: the signals that an agent has been compromised

    Detection for agents is fundamentally behavioral. You are looking for deviations from an established baseline of normal tool usage, not signatures. Build your detection logic around four signal families: tool-call anomalies, data-access anomalies, output anomalies, and identity anomalies.

    High-value detection signals

    • Tool sequence deviation — an agent that normally calls search → summarize suddenly calls read_file → http_post to an external host. Sequence and ordering matter more than any single call.
    • Privilege or scope escalation — invocation of a high-impact tool (delete, admin, send_email, execute) by an agent or session that has never used it.
    • Volume and velocity spikes — hundreds of tool calls per minute, large result-set reads, or pagination loops that pull entire datasets.
    • Prompt-injection fingerprints — tool results or user inputs containing imperative override phrases ("ignore previous instructions", "you are now", "system:"), encoded payloads, or invisible Unicode.
    • Egress to new destinations — the agent's tools making outbound calls to domains or IPs not on the allowlist.
    • Token misuse — the same token used from two geographies, an unusual user agent, or use outside the agent's normal operating window.

    Instrument every MCP tool call as a structured event before you need it. At minimum log: timestamp, agent/session ID, principal (token subject), tool name, argument hash, result size, destination, and a decision verdict from your guardrail layer. The depth of this telemetry is the single biggest factor in how fast you can triage; our guidance on MCP audit logging covers the schema in detail.

    Example detection queries

    The following are illustrative, written in a SIEM-agnostic pseudo-SQL against a normalized mcp_tool_calls table. Adapt the column names to your pipeline.

    -- 1. Sensitive tool used by a session that never used it before
    SELECT session_id, tool_name, principal_sub, MIN(ts) AS first_seen
    FROM mcp_tool_calls
    WHERE tool_name IN ('fs.delete','shell.exec','email.send','iam.grant')
      AND session_id NOT IN (
        SELECT DISTINCT session_id FROM mcp_tool_calls
        WHERE tool_name IN ('fs.delete','shell.exec','email.send','iam.grant')
          AND ts < NOW() - INTERVAL '24 hours')
    GROUP BY session_id, tool_name, principal_sub;
    
    -- 2. Egress to a destination not on the allowlist
    SELECT session_id, tool_name, dest_host, COUNT(*) AS calls
    FROM mcp_tool_calls
    WHERE tool_name LIKE 'http.%'
      AND dest_host NOT IN (SELECT host FROM egress_allowlist)
    GROUP BY session_id, tool_name, dest_host
    HAVING COUNT(*) > 0;
    
    -- 3. Prompt-injection phrase appearing in a tool RESULT (not user input)
    SELECT session_id, tool_name, ts
    FROM mcp_tool_calls
    WHERE result_text ~* '(ignore (all )?previous instructions|you are now|disregard the (system|above))'
       OR result_text ~ '[​-‏‪-‮]'   -- invisible/bidi Unicode
    ORDER BY ts DESC;
    
    -- 4. Velocity spike: tool calls per session per minute
    SELECT session_id, date_trunc('minute', ts) AS m, COUNT(*) AS c
    FROM mcp_tool_calls
    GROUP BY session_id, m
    HAVING COUNT(*) > 120
    ORDER BY c DESC;

    For continuous, automated checks against a server's configuration and exposed tools, the free open-source mcp-security-scanner can flag many of the misconfigurations that turn these signals into incidents in the first place.

    Triage: confirm, scope, and classify in the first 15 minutes

    Triage answers three questions fast: is this real, how far has it spread, and how bad is it. Resist the urge to contain before you have a scope, because premature, partial containment can tip off an attacker who still holds other credentials.

    • Confirm — pull the raw tool-call timeline for the suspect session. Distinguish a genuine compromise from a benign anomaly (a new workflow, a noisy retry loop). Look for intent: data leaving the boundary, destructive actions, or persistence attempts.
    • Scope — pivot on the principal (token subject), the agent ID, and the source IP. Find every session that shares those identifiers. One stolen token often drives many sessions.
    • Identify the entry vector — was the trigger a poisoned document (injection), a replayed token (credential theft), or a hostile tool server? The vector determines containment.
    • Classify severity — use a simple matrix below to set the response tier and who gets paged.

    Severity classification matrix

    TierIndicatorsExampleResponse
    SEV-1 CriticalConfirmed data exfiltration, destructive actions, or production credential theftAgent exported customer records to an external hostFull IR, revoke now, notify legal/leadership
    SEV-2 HighActive prompt injection with sensitive tool access, no confirmed exfil yetInjected doc made agent attempt iam.grantContain session, freeze token, investigate
    SEV-3 MediumAnomalous behavior, contained blast radius, low-sensitivity toolsVelocity spike on read-only search toolMonitor, rate-limit, root-cause
    SEV-4 LowSuspicious but likely benign or fully blocked by guardrailsInjection phrase blocked at the guardrailLog, tune detection, no page

    Document the classification decision and timestamp it. This record becomes the spine of your post-incident timeline and any compliance reporting under frameworks discussed in MCP compliance.

    Containment: revoke tokens and isolate tools

    Containment for an agent incident is about cutting trust paths in the right order. Because agents act through valid credentials and trusted tools, your levers are credential revocation, tool isolation, and network egress control. Prefer reversible, surgical actions first, then escalate to broad ones if the blast radius warrants it.

    Short-term containment, in priority order

    • Revoke the token, do not just rotate it. Rotation issues a new credential but a long-lived stolen token may still be valid. Hit the OAuth revocation endpoint and invalidate the refresh token so it cannot mint new access tokens.
    • Kill the live session(s). Terminate the agent's active MCP sessions and the underlying connection so in-flight tool calls stop.
    • Disable the specific tools the agent abused — pull shell.exec, fs.delete, or the relevant connector out of the agent's allowlist rather than taking the whole server offline if the rest is needed.
    • Block egress to the exfiltration destination at the proxy or firewall, and tighten the egress allowlist for the affected agent to deny-by-default.
    • Quarantine the agent identity — move the principal to a deny policy so any newly issued credential is also blocked.
    # Revoke an OAuth access AND refresh token (RFC 7009)
    curl -s -X POST https://auth.example.com/oauth/revoke \
      -u "$CLIENT_ID:$CLIENT_SECRET" \
      -d "token=$LEAKED_REFRESH_TOKEN" \
      -d "token_type_hint=refresh_token"
    
    # Terminate the agent's MCP sessions for a principal
    mcpctl sessions list --principal agent-7f3a | \
      awk 'NR>1 {print $1}' | xargs -n1 mcpctl sessions kill
    
    # Pull dangerous tools from the agent's policy (deny-by-default)
    mcpctl policy set --agent agent-7f3a \
      --deny-tools 'shell.exec,fs.delete,iam.*,http.post' \
      --reason "IR-2026-0142 containment"
    
    # Block the exfil destination at the egress proxy
    egressctl deny --agent agent-7f3a --host attacker-c2.example --ttl 72h

    Preserve evidence before you wipe

    Before recycling the agent runtime, capture volatile state: the in-memory conversation/context window, the system prompt actually in use, the loaded tool manifest, environment variables, and recent tool-call logs. The poisoned context window is often the only artifact that proves prompt injection, and it disappears when the process restarts. Snapshot it to write-once storage with a chain-of-custody note.

    Eradication and recovery: remove the cause, restore trust

    Eradication removes whatever allowed the incident; recovery brings the agent back online with confidence that it will not immediately re-compromise. Skipping straight to recovery is the most common cause of repeat incidents.

    Eradication

    • Remove the malicious input. If the vector was indirect prompt injection, purge the poisoned document, cache entry, RAG chunk, or memory record so the agent does not re-ingest it. Search your vector store and conversation memory for the injection fingerprint.
    • Distrust the hostile tool server. If a malicious or compromised MCP server was involved, remove it from the registry, pin tool definitions to known-good hashes, and verify no other agents trust it.
    • Close the credential gap. Rotate any secrets the agent could read, shorten token TTLs, and confirm the leaked token family is fully invalidated.
    • Patch excessive agency. Reduce the agent's scopes to least privilege so the same trick cannot reach high-impact tools next time.
    • Fix the detection gap. If a signal should have fired and did not, write the rule now while the incident is fresh.

    Recovery

    • Restore from a known-good agent configuration, system prompt, and tool manifest — not the compromised state.
    • Issue fresh, narrowly-scoped credentials with short TTLs.
    • Re-enable tools incrementally, starting with read-only, while watching the detection queries above.
    • Run the agent in an enhanced-monitoring window (lower thresholds, human-in-the-loop for sensitive tools) for a defined period before declaring full restoration.
    • Validate the fix by reproducing the original injection or attack in a sandbox and confirming the guardrails now block it — this is where a structured red-team test or the methodology in our AI red-teaming guide pays off.

    Recovery is complete only when you can articulate, in writing, what would stop a replay. If you cannot, you are still in eradication. Hardening the broader fleet against the same class of issue is the job of a focused hardening sprint and the MCP hardening checklist.

    Post-incident: turn the incident into controls

    NIST's final phase is post-incident activity, and for AI agents it is where most of the durable value lives. The goal is to convert a one-off response into permanent detective and preventive controls.

    • Build the timeline from your tool-call logs: first malicious action, detection, containment, eradication, recovery. Mean time to detect and contain are your key metrics.
    • Run a blameless retrospective. Focus on the control gaps: missing egress allowlist, over-broad token scope, no injection detection on tool results.
    • Codify new detections into your SIEM and guardrail layer so the same pattern is caught automatically. Convert ad-hoc queries from this incident into standing alerts.
    • Update the runbook with anything that was slow or ambiguous — especially the revocation and session-kill commands for your specific stack.
    • Reduce standing privilege across all agents, not just the affected one. One agent's incident usually reveals a fleet-wide pattern.
    • Report and document for any regulatory or contractual obligations.

    The strongest preventive control coming out of most agent incidents is a real-time guardrail layer that inspects tool inputs and outputs and enforces policy before a call executes; see MCP guardrails and prompt injection defense. Pair it with standing monitoring runbooks so the next anomaly is caught by a rule rather than a human noticing too late.

    Phase-by-phase IR checklist (NIST 800-61 adapted for AI agents)

    This table maps each NIST SP 800-61 phase to the concrete actions, owner, and AI-specific artifacts for an MCP/agent incident. Print it, paste it into your incident channel, and work top to bottom.

    NIST phaseAI-agent actionKey controls / commandsArtifact to capture
    PreparationInstrument every tool call; define agent baselines; pre-stage revocation scriptsStructured tool-call logging, egress allowlist, least-privilege scopes, guardrail policyBaseline of normal tool sequences per agent
    Detection & AnalysisAlert on tool-sequence, scope, velocity, egress, and injection-fingerprint anomaliesSIEM queries above; guardrail verdicts; token-misuse rulesTool-call timeline for the suspect session
    Triage (Analysis)Confirm intent, scope by principal/agent/IP, identify vector, classify severitySeverity matrix; pivot on token subjectSeverity decision + timestamp
    ContainmentRevoke token (not rotate), kill sessions, isolate abused tools, block egressoauth/revoke, sessions kill, deny-tools policy, egress denyContext window + system prompt + tool manifest snapshot
    EradicationPurge poisoned input, distrust hostile MCP server, rotate secrets, cut excess agencyVector-store cleanup, tool-hash pinning, scope reductionRoot-cause statement
    RecoveryRestore known-good config, fresh short-TTL creds, incremental tool re-enable, enhanced monitoringRead-only first, human-in-the-loop for sensitive toolsValidation test proving replay is blocked
    Post-IncidentBlameless retro, codify detections, fleet-wide privilege reduction, reportNew standing alerts; updated runbook; MTTD/MTTC metricsFinal timeline + lessons-learned record

    For a deeper, scenario-driven companion to this checklist see our write-up on MCP incident response, and when you need outside hands during a live event, our incident response service operates from exactly this playbook.

    Frequently Asked Questions

    What is an LLM agent incident response playbook?
    It is a structured, phase-by-phase procedure for detecting, containing, and recovering from incidents where an LLM agent or its MCP server behaves outside its intended authority — for example via prompt injection, token theft, or a malicious tool server. It adapts the NIST SP 800-61 lifecycle (preparation, detection and analysis, containment/eradication/recovery, and post-incident activity) to the specifics of AI agents, where the abused machinery is trusted and containment centers on revoking tokens and isolating tools rather than removing malware.
    What is the first thing to do when an MCP server is compromised?
    Triage before you contain: pull the suspect session's tool-call timeline to confirm the incident is real and determine its scope by pivoting on the token subject, agent ID, and source IP. Then, for confirmed compromise, the first containment action is to revoke the OAuth token (access and refresh) rather than merely rotate it, kill the live sessions, and disable the specific tools that were abused — while snapshotting the agent's context window and system prompt as evidence first.
    How do you detect a compromised LLM agent?
    Detection is behavioral, not signature-based. Baseline each agent's normal tool usage, then alert on deviations: unusual tool sequences, first-time use of high-impact tools, velocity and volume spikes, egress to destinations not on the allowlist, token misuse across geographies, and prompt-injection fingerprints (override phrases or invisible Unicode) appearing in tool results. This requires logging every tool call as a structured event with principal, tool name, arguments, result size, and destination.
    Should you rotate or revoke a leaked agent token?
    Revoke it. Rotation issues a new credential but can leave a stolen long-lived token valid until it expires, so the attacker keeps access. Call the OAuth revocation endpoint for both the access token and the refresh token so no new tokens can be minted, then move the agent's identity to a deny policy so any freshly issued credential is also blocked until eradication is complete.
    How does NIST 800-61 apply to AI agents?
    The four NIST phases map cleanly: preparation becomes tool-call instrumentation and least-privilege scoping; detection and analysis becomes behavioral anomaly detection on tool usage; containment, eradication, and recovery become token revocation, session termination, removal of poisoned inputs or hostile tool servers, and staged restoration with enhanced monitoring; and post-incident activity becomes codifying new guardrail rules and reducing standing agent privilege fleet-wide.
    What evidence should you preserve during an agent incident?
    Capture volatile state before restarting the runtime: the in-memory conversation and context window, the actual system prompt in use, the loaded tool manifest, environment variables, and recent tool-call logs. The poisoned context window is frequently the only artifact that proves indirect prompt injection occurred, and it is lost when the process restarts, so snapshot it to write-once storage with a chain-of-custody note.

    Related reading

    Secure your MCP deployment

    MCP Defense runs attack-surface assessments, hardening sprints, and 24/7 incident response for Model Context Protocol and AI-agent infrastructure.

    /* deployed 2026-04-08T12:08 */