Current runbook version
Trigger Conditions And Preconditions
Invoke the runbook only when the signal, scope, and ownership conditions below are all met, so that execution can proceed in a structured way under the severity framework.
| Signal / Check | Threshold Or Rule | Source | Runbook Decision |
|---|---|---|---|
| Remote access latency sustained | P95 latency above 280 ms for 10 minutes | Gateway telemetry + synthetic check | Open a shared-service incident, declare severity, and attach the runbook |
| Cross-division impact | 25+ affected users or 3+ divisions impacted | Ticket correlation + service desk spike | Treat as a shared service issue, not a local workstation issue |
| No planned change in effect | No active maintenance, routing, or patch window | Change calendar | Prevent false-positive execution |
| Human ownership established | Incident commander and service owner named before step 3 (Draft Initial Communications) | On-call roster | Allow AI execution under accountable oversight |
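The trigger table can be read as a small decision function. The sketch below is illustrative only: the thresholds come from the table, while the `Signal` fields and the function names are hypothetical and not part of any existing tooling.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

P95_THRESHOLD_MS = 280   # from the trigger table
SUSTAINED_MINUTES = 10   # from the trigger table
MIN_AFFECTED_USERS = 25  # from the trigger table
MIN_DIVISIONS = 3        # from the trigger table


@dataclass
class Signal:
    """Hypothetical snapshot of the inputs named in the trigger table."""
    p95_latency_ms: float
    minutes_above_threshold: float
    affected_users: int
    affected_divisions: int
    change_window_active: bool


def evaluate_trigger(s: Signal) -> Tuple[bool, List[str]]:
    """Return (invoke_runbook, list of conditions that were not met)."""
    unmet = []
    if not (s.p95_latency_ms > P95_THRESHOLD_MS
            and s.minutes_above_threshold >= SUSTAINED_MINUTES):
        unmet.append("latency not sustained above threshold")
    if not (s.affected_users >= MIN_AFFECTED_USERS
            or s.affected_divisions >= MIN_DIVISIONS):
        unmet.append("no cross-division impact")
    if s.change_window_active:
        unmet.append("planned change in effect")
    return (not unmet, unmet)


def ownership_established(incident_commander: Optional[str],
                          service_owner: Optional[str]) -> bool:
    """Both roles must be named before step 3 proceeds."""
    return bool(incident_commander) and bool(service_owner)


# Values taken from the sample incident in the Execution Prompt.
invoke, unmet = evaluate_trigger(Signal(312, 14, 41, 3, False))
```

Returning the unmet conditions alongside the boolean keeps the decision auditable when it is written back to the ticket.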
Execution Sequence
Each step defines what AI may do, what a human must confirm, and what evidence gets written back to the incident.
Detect And Open Incident
Validate shared-service scope and attach the runbook.
AI Action
Correlate gateway latency, session failure rate, and service desk spike; draft the incident summary and severity proposal.
Human Checkpoint
Incident commander confirms severity and accountable owner.
Write-Back
Incident summary, blast-radius estimate, and runbook attachment logged to the ticket.
Guardrail
Stop if the issue is isolated to one user, one endpoint, or a planned maintenance window.
Gather Diagnostics And Compare Baseline
Build a consistent evidence set before deeper action.
AI Action
Pull gateway, broker, and authentication telemetry; compare current state to the last-known-good baseline.
Human Checkpoint
Infrastructure lead validates that the evidence pack is complete enough to support next actions.
Write-Back
Diagnostics bundle, baseline diff, and suspected failure domain attached to the incident.
Guardrail
Escalate to manual triage if telemetry sources are missing, stale, or contradictory.
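The guardrail for this step could be approximated with a freshness check over the three telemetry sources. This is a minimal sketch: the five-minute staleness limit and the function name are assumptions, since the runbook does not define "stale", and detecting contradictory sources is left to the infrastructure lead.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

REQUIRED_SOURCES = ("gateway", "broker", "authentication")  # from the AI Action
MAX_AGE = timedelta(minutes=5)  # assumption: not specified in the runbook


def missing_or_stale_sources(last_seen: Dict[str, Optional[datetime]],
                             now: datetime) -> List[str]:
    """List required telemetry sources that are absent or older than MAX_AGE."""
    problems = []
    for source in REQUIRED_SOURCES:
        ts = last_seen.get(source)
        if ts is None:
            problems.append(f"{source}: missing")
        elif now - ts > MAX_AGE:
            problems.append(f"{source}: stale")
    return problems


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # illustrative timestamp
seen = {
    "gateway": now - timedelta(minutes=1),
    "broker": now - timedelta(minutes=9),
}
problems = missing_or_stale_sources(seen, now)
escalate_to_manual_triage = bool(problems)
```

Any non-empty result would be the cue to stop and hand over to manual triage rather than continue with a partial evidence pack.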
Draft Initial Communications
Prepare clear updates without letting AI publish them autonomously.
AI Action
Prepare an internal support note, stakeholder holding statement, and short leadership summary using current facts only.
Human Checkpoint
Service owner approves anything sent beyond the internal support note.
Write-Back
Communication drafts saved with the requested send-or-hold decision for each.
Guardrail
AI may draft messages, but never send leadership or stakeholder communication autonomously.
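One way to picture the send-or-hold decision recorded with each draft is sketched below. The `Audience` values, the `Draft` type, and the decision strings are hypothetical names chosen for illustration; in every case the AI only labels the draft and a human performs the send.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Audience(Enum):
    INTERNAL_SUPPORT = "internal_support"
    STAKEHOLDER = "stakeholder"
    LEADERSHIP = "leadership"


@dataclass
class Draft:
    audience: Audience
    body: str
    approved_by: Optional[str] = None  # service owner's name once approved


def send_or_hold(draft: Draft) -> str:
    """Return the decision to save with the draft; nothing is sent here."""
    if draft.audience is Audience.INTERNAL_SUPPORT:
        # The internal support note does not need service-owner approval.
        return "ready_for_human_send"
    if draft.approved_by:
        return "ready_for_human_send"
    return "hold_pending_service_owner_approval"


holding = Draft(Audience.STAKEHOLDER, "We are investigating remote access latency.")
decision = send_or_hold(holding)
```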
Recommend Workaround Or Containment
Use AI for structured options, not autonomous service-impacting change.
AI Action
Recommend the least-risk workaround using recent incident patterns and known-good recovery paths.
Human Checkpoint
Network or infrastructure lead approves any routing change, failover, or service-impacting workaround.
Write-Back
Recommended action, risk summary, and rollback note appended to the ticket.
Guardrail
AI may recommend changes but may not execute network, firewall, or gateway actions.
Prepare Vendor Escalation Packet
Convert the evidence pack into a usable vendor case quickly and consistently.
AI Action
Assemble timestamps, impacted scope, comparative metrics, and evidence into a vendor-ready case draft.
Human Checkpoint
Incident commander confirms the facts and sends the case.
Write-Back
Vendor packet, case ID placeholder, and next-response expectation logged to the incident.
Guardrail
Keep language factual and evidence-based; do not assign fault or speculate beyond the data.
Verify Recovery And Prepare Closure
Close the loop only after service and governance conditions are both met.
AI Action
Monitor latency recovery, confirm the session success trend, and draft a recovery note and follow-up tasks.
Human Checkpoint
Incident commander confirms service restoration, closure readiness, and post-incident owner.
Write-Back
Recovery confirmation, closure draft, and follow-up work items posted to the incident.
Guardrail
AI may not close the incident or mark service restored without human confirmation.
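The two closure conditions can be expressed as a simple check. The sketch below uses the success criteria from the runbook contract (P95 below 140 ms for 15 minutes, plus incident commander confirmation); the one-sample-per-minute input format and the function names are assumptions.

```python
from typing import List

RECOVERY_P95_MS = 140   # from the contract's success_criteria
RECOVERY_MINUTES = 15   # from the contract's success_criteria


def latency_recovered(p95_per_minute: List[float]) -> bool:
    """True if the most recent 15 one-minute P95 samples are all below 140 ms.

    Assumes one sample per minute, oldest first.
    """
    window = p95_per_minute[-RECOVERY_MINUTES:]
    return (len(window) == RECOVERY_MINUTES
            and all(v < RECOVERY_P95_MS for v in window))


def closure_ready(p95_per_minute: List[float], commander_confirmed: bool) -> bool:
    """Service and governance conditions must both hold before closure is drafted."""
    return latency_recovered(p95_per_minute) and commander_confirmed


samples = [300.0] * 5 + [120.0] * 15
ready_without_confirmation = closure_ready(samples, commander_confirmed=False)
ready_with_confirmation = closure_ready(samples, commander_confirmed=True)
```

Even when `closure_ready` returns true, the result is only evidence for the incident commander; closing the incident remains a human action.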
AI Prompt Pack
These are the governed instructions an AI operator would receive during execution.
System / Control Prompt
Sets the AI role, permissions, and required output format before execution begins.
Role: governed incident runbook executor for ODOE IT.
Runbook: NT-22053-123 Remote Access Latency Response.
Objectives:
- reduce time to reliable diagnosis
- keep communication factual and timely
- produce auditable outputs at each step
Rules:
- never change routing, gateway configuration, firewall policy, or incident status without human approval
- never send leadership or stakeholder communication without human approval
- when evidence is incomplete, say so and request the next human decision
Return format:
1. Situation
2. Evidence
3. Recommended Next Step
4. Ticket Updates
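The return format in the control prompt lends itself to a lightweight check before an AI response is written back to the ticket. This is a sketch under one assumption: that each section appears as a numbered line such as `1. Situation`. The function name is hypothetical.

```python
from typing import List

REQUIRED_SECTIONS = [
    "Situation",
    "Evidence",
    "Recommended Next Step",
    "Ticket Updates",
]


def missing_sections(response: str) -> List[str]:
    """Return required sections that lack a matching numbered heading line."""
    lines = [ln.strip() for ln in response.splitlines()]
    return [
        name
        for number, name in enumerate(REQUIRED_SECTIONS, start=1)
        if not any(ln.startswith(f"{number}. {name}") for ln in lines)
    ]


sample = """1. Situation
P95 latency above threshold.
2. Evidence
Gateway telemetry attached.
4. Ticket Updates
Timeline appended."""
gaps = missing_sections(sample)
```

A response with gaps could be returned to the operator for completion rather than logged as an auditable output.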
Execution Prompt
Provides the live incident context and the exact tasks to complete inside the guardrails.
Incident: INC-2041
Service: Remote access
Current signal:
- latency p95 = 312 ms for 14 minutes
- 41 affected users across 3 divisions
- no approved change window is active
Tasks:
1. Validate runbook trigger conditions.
2. Assemble diagnostics bundle from gateway, broker, and auth telemetry.
3. Produce blast-radius summary.
4. Draft internal support note and stakeholder holding statement.
5. Draft vendor escalation packet.
Do not execute routing, failover, or closure actions.
Vendor Escalation Prompt
Used only after diagnostics are attached and the incident commander approves vendor escalation.
Draft a vendor escalation using the attached evidence pack.
Include:
- incident start time and timeline highlights
- impacted user scope and affected divisions
- current latency and session failure indicators
- comparison to the last-known-good baseline
- current workaround status
- requested vendor action in the next 30 minutes
Keep tone factual. Do not assign blame. Flag any missing evidence explicitly.
AI-Executable Runbook Contract
The same runbook is represented in machine-readable form so an AI operator can follow the rules consistently and write outputs back to the incident.
runbook_id: NT-22053-123
title: Remote Access Latency Response
execution_mode: ai_assisted_with_human_approval
allowed_tools:
  - telemetry.read
  - baseline.compare
  - ticket.update
  - timeline.append
  - communications.draft
  - artifact.attach
  - vendor.case_draft
blocked_tools:
  - network.change
  - gateway.failover
  - firewall.change
  - communications.send
  - incident.close
invoke_when:
  - remote_access_latency_p95 > 280ms for 10m
  - affected_users >= 25
required_context:
  - incident_id
  - incident_commander
  - service_owner
  - last_known_good_baseline
  - current_gateway_metrics
write_back_to_ticket:
  - blast_radius_summary
  - diagnostics_bundle_reference
  - communication_drafts
  - vendor_packet
  - next_human_decision
success_criteria:
  - remote_access_latency_p95 < 140ms for 15m
  - incident_commander_confirms_service_restored = true
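An operator harness might enforce the contract's tool lists and required context before each call. The sketch below copies the lists from the contract; the `authorize` and `missing_context` functions and the default-deny behavior for unlisted tools are assumptions, not something the contract states.

```python
from typing import Dict, List

ALLOWED_TOOLS = frozenset({
    "telemetry.read", "baseline.compare", "ticket.update", "timeline.append",
    "communications.draft", "artifact.attach", "vendor.case_draft",
})
BLOCKED_TOOLS = frozenset({
    "network.change", "gateway.failover", "firewall.change",
    "communications.send", "incident.close",
})
REQUIRED_CONTEXT = (
    "incident_id", "incident_commander", "service_owner",
    "last_known_good_baseline", "current_gateway_metrics",
)


def authorize(tool: str) -> bool:
    """Permit only tools on the allow list; blocked and unknown tools are denied."""
    if tool in BLOCKED_TOOLS:
        return False
    return tool in ALLOWED_TOOLS  # assumption: default deny for unlisted tools


def missing_context(context: Dict[str, object]) -> List[str]:
    """Return required context keys that are absent or empty."""
    return [key for key in REQUIRED_CONTEXT if not context.get(key)]


context = {"incident_id": "INC-2041", "incident_commander": "", "service_owner": None}
gaps = missing_context(context)
```

Checking the block list first means a tool accidentally added to both lists would still be refused.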