CoreWeave Incident Troubleshooting Runbook

v20260423

coreweave-incident-runbook

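This is a troubleshooting runbook for critical production incidents on the CoreWeave platform. It provides structured steps for responding to GPU workload failures, inference service outages, and Kubernetes resource issues, guiding users through checking pod status, node health, and model loading errors so that service can be restored quickly.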

CoreWeave, incident response, GPU, Kubernetes, inference service troubleshooting


430 downloads

Overview

CoreWeave Incident Runbook

Triage Steps

# 1. Check pod status
kubectl get pods -l app=inference -o wide

# 2. Check recent events
kubectl get events --sort-by=.lastTimestamp | tail -20

# 3. Check node status
kubectl get nodes -l gpu.nvidia.com/class -o wide

# 4. Check GPU health
kubectl exec "$(kubectl get pod -l app=inference -o name | head -n 1)" -- nvidia-smi

Common Incidents

Inference Service Down

Check pod status and events
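If OOMKilled: the container exceeded its memory limit (host RAM, not GPU memory); reduce batch size or raise the pod's memory limit. GPU out-of-memory shows up as a CUDA error in the container logs instead; for that, reduce batch size or move to a GPU with more memory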
If ImagePullBackOff: check registry credentials
If Pending: check GPU quota and availability
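The checklist above can be sketched as two small shell helpers. This is a minimal illustration, not part of the original runbook: the function names `suggest_action` and `pod_reasons` are hypothetical, the hint strings paraphrase the checklist, and the `app=inference` label is assumed from the triage steps.

```shell
# Map a pod status reason to the first action from the checklist.
# (Hypothetical helper; the hint wording is a paraphrase.)
suggest_action() {
  case "$1" in
    OOMKilled)                     echo "reduce batch size or raise the memory limit" ;;
    ImagePullBackOff|ErrImagePull) echo "check registry credentials" ;;
    Pending)                       echo "check GPU quota and availability" ;;
    *)                             echo "review pod events and logs" ;;
  esac
}

# Print one tab-separated line per inference pod:
#   name, phase, current waiting reason, last terminated reason.
# Needs cluster access; empty fields mean the reason is not set.
pod_reasons() {
  kubectl get pods -l app=inference -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
}
```

Running `pod_reasons` first and then passing the reason it reports to `suggest_action` (for example `suggest_action ImagePullBackOff`) gives a quick pointer to the matching checklist item.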

GPU Node Failure

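Pods managed by a Deployment or other controller are rescheduled automatically once the failed node is marked NotReady and its pods are evicted (by default after roughly five minutes)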
If no capacity: scale down non-critical workloads
Contact CoreWeave support for extended outages
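As a rough sketch of the node-failure steps, the helpers below identify unhealthy nodes and free capacity. The function names are hypothetical, and the deployment passed to `scale_down` is whatever non-critical workload your cluster actually runs; nothing here comes from CoreWeave tooling.

```shell
# Read `kubectl get nodes --no-headers` output on stdin and print the names
# of nodes whose STATUS column does not start with "Ready" (e.g. NotReady).
# Cordoned but healthy nodes ("Ready,SchedulingDisabled") are not listed.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Scale a non-critical deployment to zero replicas to free GPU capacity.
# $1 is the deployment name (an assumption; supply your own).
scale_down() {
  kubectl scale "deployment/$1" --replicas=0
}
```

A typical use would be `kubectl get nodes -l gpu.nvidia.com/class --no-headers | not_ready_nodes`, followed by `scale_down <name>` for each workload that can safely be paused.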

Model Loading Failure

Check HuggingFace token secret exists
Verify model name spelling
Check PVC has sufficient storage
Review container logs for download errors
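The model-loading checks above might look like the following in shell. This is a sketch under assumptions: the secret name is passed in by the caller because the runbook does not state it, the helper names are hypothetical, and the log pattern is a heuristic rather than an exhaustive list of failure messages.

```shell
# Succeeds if the named secret exists in the current namespace.
# $1 is the secret name (not given in the runbook; supply your own).
secret_exists() {
  kubectl get secret "$1" >/dev/null 2>&1
}

# List PVCs with their status and capacity. Actual usage has to be checked
# from inside the pod, e.g. with `df -h` on the mount path.
list_pvcs() {
  kubectl get pvc
}

# Read log lines on stdin and keep those that look like download,
# authentication, or disk-space failures. Never fails on "no match".
download_errors() {
  grep -iE '401|403|404|unauthorized|forbidden|not found|no space left|timeout|timed out' || true
}
```

For example, `kubectl logs deployment/inference --tail=200 | download_errors` narrows recent logs to likely causes; a 401 or 403 usually points at the token, a 404 at the model name, and "no space left" at the PVC.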

Rollback

kubectl rollout undo deployment/inference
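A slightly fuller rollback sequence, sketched here as a hypothetical `rollback_inference` helper, records the revision history before undoing and then waits for the rollout to settle. The five-minute timeout is an arbitrary choice, not a value from the runbook.

```shell
# Show revision history, roll back to the previous revision, then block
# until the rollback finishes or the timeout expires.
rollback_inference() {
  kubectl rollout history deployment/inference &&
  kubectl rollout undo deployment/inference &&
  kubectl rollout status deployment/inference --timeout=5m
}
```

To return to a specific revision rather than the previous one, `kubectl rollout undo deployment/inference --to-revision=<n>` takes a revision number from the history output.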

Next Steps

For data handling, see coreweave-data-handling.

Info

Category Hardware Engineering

Name coreweave-incident-runbook

Version v20260423

Size 1.74KB

Source jeremylongshore/claude-code-plugins-plus-skills

Updated 2026-04-28