ODOE AI Runbook

AI-Enabled Incident Runbook

A governed runbook that shows how ODOE IT could combine human ownership, AI-executable steps, and auditable prompts during incident response. Unlike a formal SOP, it is trigger-based and intended for live execution during active events.

Runbook Governance

Execute repeatable incident response with governed AI support.

This example uses runbook NT-22053-123 Remote Access Latency Response for INC-2041. It shows how ODOE IT could let AI gather evidence, draft communications, and prepare vendor escalation while humans retain control over risk-bearing actions and final decisions.

Runbook ID: NT-22053-123
Incident: INC-2041
Mode: AI-assisted + human approvals

Current build: v3.2 (current runbook version)
Automation ready: 4 / 6 AI-executable steps
Human controlled: 2 approval gates
Service comms: < 10 min target to first status update

Trigger Conditions And Preconditions

Invoke the runbook only when the signal, scope, and ownership conditions are clear enough to support structured execution under the severity framework.

Severity 2+ only

Signal / Check | Threshold or Rule | Source | Runbook Decision
--- | --- | --- | ---
Remote access latency sustained | P95 latency above 280 ms for 10 minutes | Gateway telemetry + synthetic check | Open a shared-service incident, declare severity, and attach the runbook
Cross-division impact | 25+ affected users or 3+ divisions impacted | Ticket correlation + service desk spike | Treat as a shared service issue, not a local workstation issue
No planned change in effect | No active maintenance, routing, or patch window | Change calendar | Prevent false-positive execution
Human ownership established | Incident commander and service owner named before step 3 | On-call roster | Allow AI execution under accountable oversight
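
The trigger conditions above could be evaluated along the following lines. This is a minimal sketch, not ODOE's implementation: the `TriggerInputs` fields, the one-reading-per-minute sampling, and the function names are assumptions made for illustration. Ownership is reported separately because the table requires it before step 3 rather than at invocation.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass
class TriggerInputs:
    # Assumed: one P95 latency reading per minute, most recent last.
    p95_latency_ms: Sequence[float]
    affected_users: int
    affected_divisions: int
    change_window_active: bool
    incident_commander: Optional[str]
    service_owner: Optional[str]


def latency_sustained(samples: Sequence[float],
                      threshold_ms: float = 280.0,
                      minutes: int = 10) -> bool:
    """True when each of the last `minutes` readings exceeds the threshold."""
    recent = list(samples)[-minutes:]
    return len(recent) >= minutes and all(s > threshold_ms for s in recent)


def evaluate_trigger(inputs: TriggerInputs) -> Dict[str, bool]:
    """Evaluate each row of the trigger table and report the results."""
    checks = {
        "latency_sustained": latency_sustained(inputs.p95_latency_ms),
        "cross_division_impact": (inputs.affected_users >= 25
                                  or inputs.affected_divisions >= 3),
        "no_planned_change": not inputs.change_window_active,
    }
    # Signal, scope, and change-calendar checks must all pass to invoke.
    checks["invoke_runbook"] = all(checks.values())
    # Ownership gates step 3 onward, so it is tracked on its own.
    checks["ownership_established"] = bool(
        inputs.incident_commander and inputs.service_owner
    )
    return checks
```

With the INC-2041 figures used later in this example (312 ms for 14 minutes, 41 users, 3 divisions, no change window), `evaluate_trigger` would report `invoke_runbook` as true.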

Execution Sequence

Each step defines what AI may do, what a human must confirm, and what evidence gets written back to the incident.

1 Trigger validation

Detect And Open Incident

Validate shared-service scope and attach the runbook.

AI-executable

AI Action

Correlate gateway latency, session failure rate, and service desk spike; draft the incident summary and severity proposal.

Human Checkpoint

Incident commander confirms severity and accountable owner.

Write-Back

Incident summary, blast-radius estimate, and runbook attachment logged to the ticket.

Guardrail

Stop if the issue is isolated to one user, one endpoint, or a planned maintenance window.

2 Evidence pack

Gather Diagnostics And Compare Baseline

Build a consistent evidence set before deeper action.

AI-executable

AI Action

Pull gateway, broker, and authentication telemetry; compare current state to the last-known-good baseline.

Human Checkpoint

Infrastructure lead validates that the evidence pack is complete enough to support next actions.

Write-Back

Diagnostics bundle, baseline diff, and suspected failure domain attached to the incident.

Guardrail

Escalate to manual triage if telemetry sources are missing, stale, or contradictory.
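
One way the evidence-pack guardrail and baseline comparison might look in code is sketched below. The five-minute freshness limit, the metric names, and both function names are assumptions; the runbook itself does not define them. Detecting contradictory telemetry is left out because it depends on source-specific rules.

```python
from datetime import datetime, timedelta
from typing import Dict, List

REQUIRED_SOURCES = ("gateway", "broker", "auth")  # sources named in the AI Action
MAX_AGE = timedelta(minutes=5)  # assumed freshness limit, not from the runbook


def check_sources(last_seen: Dict[str, datetime], now: datetime) -> List[str]:
    """Return guardrail problems; an empty list means the pack can proceed."""
    problems = []
    for source in REQUIRED_SOURCES:
        if source not in last_seen:
            problems.append(f"{source}: missing")
        elif now - last_seen[source] > MAX_AGE:
            problems.append(f"{source}: stale")
    return problems


def baseline_diff(current: Dict[str, float],
                  baseline: Dict[str, float]) -> Dict[str, float]:
    """Percent change per metric that appears in both snapshots."""
    return {
        name: round((current[name] - baseline[name]) / baseline[name] * 100, 1)
        for name in current
        if name in baseline and baseline[name] != 0
    }
```

Any non-empty result from `check_sources` would route the incident to manual triage rather than letting the AI continue with a partial pack.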

3 Communication drafts

Draft Initial Communications

Prepare clear updates without letting AI publish them autonomously.

Approval required

AI Action

Prepare an internal support note, stakeholder holding statement, and short leadership summary using current facts only.

Human Checkpoint

Service owner approves anything sent beyond the internal support note.

Write-Back

Communication drafts saved with requested send or hold decision.

Guardrail

AI may draft messages, but never send leadership or stakeholder communication autonomously.

4 Containment planning

Recommend Workaround Or Containment

Use AI for structured options, not autonomous service-impacting change.

Human-approved

AI Action

Recommend the least-risk workaround using recent incident patterns and known-good recovery paths.

Human Checkpoint

Network or infrastructure lead approves any routing change, failover, or service-impacting workaround.

Write-Back

Recommended action, risk summary, and rollback note appended to the ticket.

Guardrail

AI may recommend changes but may not execute network, firewall, or gateway actions.

5 Vendor packet

Prepare Vendor Escalation Packet

Convert the evidence pack into a usable vendor case quickly and consistently.

AI-prepared

AI Action

Assemble timestamps, impacted scope, comparative metrics, and evidence into a vendor-ready case draft.

Human Checkpoint

Incident commander confirms the facts and sends the case.

Write-Back

Vendor packet, case ID placeholder, and next-response expectation logged to the incident.

Guardrail

Keep language factual and evidence-based; do not assign fault or speculate beyond the data.

6 Recovery validation

Verify Recovery And Prepare Closure

Close the loop only after service and governance conditions are both met.

Approval required

AI Action

Monitor latency recovery, confirm session success trend, and draft recovery note plus follow-up tasks.

Human Checkpoint

Incident commander confirms service restoration, closure readiness, and post-incident owner.

Write-Back

Recovery confirmation, closure draft, and follow-up work items posted to the incident.

Guardrail

AI may not close the incident or mark service restored without human confirmation.
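
The closure guardrail can be illustrated with a small status function. This sketch borrows the thresholds from the contract's success criteria (P95 below 140 ms for 15 minutes plus incident commander confirmation); the one-reading-per-minute sampling, function names, and status strings are hypothetical.

```python
from typing import Sequence


def latency_recovered(p95_samples_ms: Sequence[float],
                      threshold_ms: float = 140.0,
                      minutes: int = 15) -> bool:
    """True when each of the last `minutes` readings is below the threshold."""
    recent = list(p95_samples_ms)[-minutes:]
    return len(recent) >= minutes and all(s < threshold_ms for s in recent)


def closure_status(p95_samples_ms: Sequence[float],
                   commander_confirmed: bool) -> str:
    """Report closure readiness; the AI never closes the incident itself."""
    if not latency_recovered(p95_samples_ms):
        return "monitoring"
    if not commander_confirmed:
        # Telemetry looks healthy, but a human decision is still outstanding.
        return "awaiting_human_confirmation"
    return "ready_to_close"
```

Keeping the human confirmation as an explicit input means healthy telemetry alone can never produce `ready_to_close`.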

AI Prompt Pack

These are the governed instructions an AI operator would receive during execution.

Prompt-driven

System / Control Prompt

Sets the AI role, permissions, and required output format before execution begins.

Role: governed incident runbook executor for ODOE IT.
Runbook: NT-22053-123 Remote Access Latency Response.

Objectives:
- reduce time to reliable diagnosis
- keep communication factual and timely
- produce auditable outputs at each step

Rules:
- never change routing, gateway configuration, firewall policy, or incident status without human approval
- never send leadership or stakeholder communication without human approval
- when evidence is incomplete, say so and request the next human decision

Return format:
1. Situation
2. Evidence
3. Recommended Next Step
4. Ticket Updates
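
Because the control prompt fixes a four-part return format, each AI response could be checked before it is written back to the ticket. The following is a rough sketch under the assumption that responses use numbered headings exactly as listed above; `missing_sections` is a hypothetical helper.

```python
import re
from typing import List

REQUIRED_SECTIONS = ["Situation", "Evidence", "Recommended Next Step", "Ticket Updates"]


def missing_sections(response: str) -> List[str]:
    """Return required headings that do not appear, in order, as numbered lines."""
    missing = []
    position = 0
    for number, title in enumerate(REQUIRED_SECTIONS, start=1):
        pattern = re.compile(
            rf"^\s*{number}\.\s*{re.escape(title)}\b",
            re.MULTILINE | re.IGNORECASE,
        )
        match = pattern.search(response, position)
        if match is None:
            missing.append(title)
        else:
            # Later headings must appear after this one.
            position = match.end()
    return missing
```

A non-empty result would be a reason to reject the response and re-prompt rather than log an incomplete update.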

Execution Prompt

Provides the live incident context and the exact tasks to complete inside the guardrails.

Incident: INC-2041
Service: Remote access

Current signal:
- latency p95 = 312 ms for 14 minutes
- 41 affected users across 3 divisions
- no approved change window is active

Tasks:
1. Validate runbook trigger conditions.
2. Assemble diagnostics bundle from gateway, broker, and auth telemetry.
3. Produce blast-radius summary.
4. Draft internal support note and stakeholder holding statement.
5. Draft vendor escalation packet.

Do not execute routing, failover, or closure actions.

Vendor Escalation Prompt

Used only after diagnostics are attached and the incident commander approves vendor escalation.

Draft a vendor escalation using the attached evidence pack.

Include:
- incident start time and timeline highlights
- impacted user scope and affected divisions
- current latency and session failure indicators
- comparison to the last-known-good baseline
- current workaround status
- requested vendor action in the next 30 minutes

Keep tone factual. Do not assign blame. Flag any missing evidence explicitly.
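
The instruction to flag missing evidence explicitly could be backed by a simple completeness check before the draft is handed to the incident commander. The field names below are hypothetical labels that mirror the "Include" list; they are not part of the runbook.

```python
from typing import Any, Dict

# Hypothetical field names mirroring the "Include" list in the prompt.
REQUIRED_FIELDS = [
    "incident_start_time",
    "timeline_highlights",
    "impacted_users",
    "affected_divisions",
    "current_latency_p95_ms",
    "session_failure_rate",
    "baseline_comparison",
    "workaround_status",
    "requested_vendor_action",
]


def build_vendor_packet(evidence: Dict[str, Any]) -> Dict[str, Any]:
    """Assemble a draft packet and list any required fields with no evidence."""
    missing = [f for f in REQUIRED_FIELDS if evidence.get(f) in (None, "", [])]
    return {
        "fields": {f: evidence[f] for f in REQUIRED_FIELDS if f not in missing},
        "missing_evidence": missing,
        # The draft is never sent by the AI; the commander confirms and sends.
        "status": "draft_pending_commander_review",
    }
```

Listing gaps in `missing_evidence` instead of filling them keeps the packet factual and leaves the send decision with the incident commander.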

AI-Executable Runbook Contract

The same runbook is represented in machine-readable form so an AI operator can follow the rules consistently and write outputs back to the incident.

Embedded YAML contract
runbook_id: NT-22053-123
title: Remote Access Latency Response
execution_mode: ai_assisted_with_human_approval
allowed_tools:
  - telemetry.read
  - baseline.compare
  - ticket.update
  - timeline.append
  - communications.draft
  - artifact.attach
  - vendor.case_draft
blocked_tools:
  - network.change
  - gateway.failover
  - firewall.change
  - communications.send
  - incident.close
invoke_when:
  - remote_access_latency_p95 > 280ms for 10m
  - affected_users >= 25 or affected_divisions >= 3
  - active_change_window = false
required_context:
  - incident_id
  - incident_commander
  - service_owner
  - last_known_good_baseline
  - current_gateway_metrics
write_back_to_ticket:
  - blast_radius_summary
  - diagnostics_bundle_reference
  - communication_drafts
  - vendor_packet
  - next_human_decision
success_criteria:
  - remote_access_latency_p95 < 140ms for 15m
  - incident_commander_confirms_service_restored = true
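
An orchestration layer could enforce the contract's `allowed_tools` and `blocked_tools` lists before any tool call runs. The sketch below assumes a deny-by-default policy for tools that appear on neither list, which the contract does not state; the `authorize` function and its return values are illustrative.

```python
ALLOWED_TOOLS = {
    "telemetry.read",
    "baseline.compare",
    "ticket.update",
    "timeline.append",
    "communications.draft",
    "artifact.attach",
    "vendor.case_draft",
}

BLOCKED_TOOLS = {
    "network.change",
    "gateway.failover",
    "firewall.change",
    "communications.send",
    "incident.close",
}


def authorize(tool: str) -> str:
    """Classify a requested tool call against the runbook contract."""
    if tool in BLOCKED_TOOLS:
        # Reserved for humans: the AI may recommend but not execute.
        return "blocked"
    if tool in ALLOWED_TOOLS:
        return "allowed"
    # Assumption: anything not listed is denied rather than permitted.
    return "denied_unlisted"
```

For example, `authorize("communications.draft")` returns `"allowed"` while `authorize("communications.send")` returns `"blocked"`, matching the step 3 guardrail that the AI may draft but never send.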