
CoreWeave Incident Troubleshooting Runbook

v20260423
coreweave-incident-runbook
This runbook provides structured steps for responding to critical production incidents on the CoreWeave platform. Use it for GPU workload failures, inference service outages, or general Kubernetes resource issues. It walks through checking pod status, node health, and common model loading errors so service can be restored quickly.
Overview

Triage Steps

# 1. Check pod status
kubectl get pods -l app=inference -o wide

# 2. Check recent events
kubectl get events --sort-by=.lastTimestamp | tail -20

# 3. Check GPU node status (selects nodes carrying the gpu.nvidia.com/class label)
kubectl get nodes -l gpu.nvidia.com/class -o wide

# 4. Check GPU health
kubectl exec -it $(kubectl get pod -l app=inference -o name | head -1) -- nvidia-smi

Common Incidents

Inference Service Down

  1. Check pod status and events
  2. If OOMKilled: reduce batch size or upgrade GPU
  3. If ImagePullBackOff: check registry credentials
  4. If Pending: check GPU quota and availability

GPU Node Failure

  1. Pods will be rescheduled automatically
  2. If no capacity: scale down non-critical workloads
  3. Contact CoreWeave support for extended outages

Model Loading Failure

  1. Check HuggingFace token secret exists
  2. Verify model name spelling
  3. Check PVC has sufficient storage
  4. Review container logs for download errors

Rollback

kubectl rollout undo deployment/inference
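After undoing, it helps to confirm the previous revision actually becomes ready rather than assuming the rollback worked. A minimal follow-up, using the deployment/inference name from the command above:

```shell
# Roll back, then block until the previous revision is ready (or time out),
# and list revision history to confirm which revision is now live.
kubectl rollout undo deployment/inference
kubectl rollout status deployment/inference --timeout=5m
kubectl rollout history deployment/inference
```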

Next Steps

For data handling, see coreweave-data-handling.

Info

Category Engineering
Updated At 2026-04-28