CoreWeave Production Deployment Checklist

v20260423
coreweave-prod-checklist
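This checklist covers the key steps for moving GPU workloads (such as inference services or model training) from a development environment to CoreWeave production. It addresses autoscaling, resource configuration, security policies, persistent storage, performance monitoring, and rollback, following MLOps and DevOps best practices.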
Overview

CoreWeave Production Checklist

Inference Services

  • GPU type and count validated for model size
  • Autoscaling configured (KServe or HPA)
  • Health and readiness probes set
  • Resource requests AND limits specified
  • Node affinity targeting correct GPU class
  • minReplicas >= 1 for production (no cold starts)
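
Several of the items above can be checked mechanically before a manifest is applied. The following is a minimal sketch, assuming the inference service is a plain Kubernetes Deployment loaded into a Python dict (for example from YAML or from `kubectl get -o json`). The function name `audit_inference_deployment` is hypothetical, the Deployment's `replicas` field stands in for `minReplicas` (which lives on the HPA or KServe object), and `nvidia.com/gpu` is assumed to be the GPU resource name exposed by the cluster.

```python
from __future__ import annotations

from typing import Any


def audit_inference_deployment(manifest: dict[str, Any]) -> list[str]:
    """Return checklist violations found in a Deployment manifest dict."""
    problems: list[str] = []
    spec = manifest.get("spec", {})
    pod_spec = spec.get("template", {}).get("spec", {})
    containers = pod_spec.get("containers", [])

    # Kubernetes defaults an omitted `replicas` field to 1.
    if spec.get("replicas", 1) < 1:
        problems.append("replicas < 1: requests will hit cold starts")

    # Either nodeAffinity or a plain nodeSelector can pin the GPU class.
    affinity = pod_spec.get("affinity", {}).get("nodeAffinity")
    if not affinity and not pod_spec.get("nodeSelector"):
        problems.append("no nodeAffinity or nodeSelector for the GPU class")

    gpu_requested = False
    for container in containers:
        name = container.get("name", "<unnamed>")
        resources = container.get("resources", {})
        for probe in ("livenessProbe", "readinessProbe"):
            if probe not in container:
                problems.append(f"{name}: missing {probe}")
        for kind in ("requests", "limits"):
            if not resources.get(kind):
                problems.append(f"{name}: missing resources.{kind}")
        # Assumed resource name; sidecars are allowed to omit it.
        if "nvidia.com/gpu" in resources.get("limits", {}):
            gpu_requested = True

    if not gpu_requested:
        problems.append("no container sets a nvidia.com/gpu limit")
    return problems
```

An empty list means the manifest passed these checks. It does not validate that the GPU type fits the model or that autoscaling is configured, which still need a manual review.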

Storage

  • Model weights in PVC (not downloaded at startup)
  • Checkpoints saved to persistent storage
  • Storage class appropriate (SSD for inference, HDD for archival)
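
One way to enforce "weights in PVC, not downloaded at startup" is to have the server fail fast when the mounted volume is empty instead of silently falling back to a download. This is a sketch under assumptions: the function name `require_local_weights`, the mount path, and the file names passed in are all hypothetical and depend on how the model is packaged.

```python
from __future__ import annotations

from pathlib import Path


def require_local_weights(model_dir: str, required_files: tuple[str, ...]) -> Path:
    """Fail fast if the mounted volume lacks the expected weight files."""
    root = Path(model_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"model directory not mounted: {root}")

    # A zero-byte file usually means an interrupted copy, so treat it as missing.
    missing = [
        name
        for name in required_files
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]
    if missing:
        raise FileNotFoundError(f"missing weight files in {root}: {', '.join(missing)}")
    return root
```

Calling this at process start, before the readiness probe can succeed, keeps a pod with a misconfigured volume out of rotation.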

Security

  • Secrets for model tokens and registry access
  • Network policies applied
  • Container images from trusted registries
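
The trusted-registry item can be approximated with a simple allowlist check over a pod spec. The sketch below assumes images are always written with an explicit registry host; the helper name `untrusted_images` and the registry names in the usage note are hypothetical. In a real cluster an admission policy would typically enforce this, and this function is only a pre-deploy lint.

```python
from __future__ import annotations

from typing import Any, Iterable


def untrusted_images(pod_spec: dict[str, Any], trusted_registries: Iterable[str]) -> list[str]:
    """Return container images that do not come from an allowlisted registry."""
    # Force a trailing slash so "registry.example.com" cannot match
    # "registry.example.com.evil.test/...".
    prefixes = tuple(r.rstrip("/") + "/" for r in trusted_registries)
    containers = pod_spec.get("initContainers", []) + pod_spec.get("containers", [])
    return [
        c.get("image", "")
        for c in containers
        if not c.get("image", "").startswith(prefixes)
    ]
```

For example, `untrusted_images(pod_spec, ["registry.example.com"])` returns every init or main container image whose reference does not begin with `registry.example.com/`.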

Monitoring

  • GPU utilization metrics collected
  • Inference latency and throughput tracked
  • Alert on pod restarts and OOM events
  • Log aggregation configured
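
Restart and OOM alerts can be derived from the Pod status that Kubernetes already reports. The following sketch assumes a Pod object as a dict (the shape returned by `kubectl get pod -o json`) and reads the standard `restartCount` and `lastState.terminated.reason` fields. The function name `pod_alerts` and the `max_restarts` threshold are hypothetical, and a production setup would more likely express the same rule in its metrics and alerting stack.

```python
from __future__ import annotations

from typing import Any


def pod_alerts(pod: dict[str, Any], max_restarts: int = 0) -> list[str]:
    """Return alert messages for restarts and OOM kills in one Pod object."""
    pod_name = pod.get("metadata", {}).get("name", "<unknown>")
    alerts: list[str] = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        container = status.get("name", "<unnamed>")
        restarts = status.get("restartCount", 0)
        if restarts > max_restarts:
            alerts.append(f"{pod_name}/{container}: {restarts} restarts")
        # lastState.terminated describes the previous container instance.
        terminated = status.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            alerts.append(f"{pod_name}/{container}: last termination was OOMKilled")
    return alerts
```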

Rollback

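# Revert the Deployment to its previous revision
kubectl rollout undo deployment/my-inference
# Wait until the rolled-back revision is fully available
kubectl rollout status deployment/my-inference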
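
If the rollback needs to run from automation, the same two commands can be wrapped in a small script. This is a sketch, assuming `kubectl` is on the PATH and already authenticated against the cluster; the function names, the `default` namespace, and the 120s timeout are illustrative choices, not values from the checklist.

```python
from __future__ import annotations

import subprocess


def rollback_commands(
    deployment: str, namespace: str = "default", timeout: str = "120s"
) -> list[list[str]]:
    """Build the kubectl invocations for an undo followed by a status wait."""
    base = ["kubectl", "--namespace", namespace, "rollout"]
    target = f"deployment/{deployment}"
    return [
        base + ["undo", target],
        # --timeout bounds how long `rollout status` waits for completion.
        base + ["status", target, f"--timeout={timeout}"],
    ]


def rollback(deployment: str, namespace: str = "default") -> None:
    """Run the rollback; raises CalledProcessError if either step fails."""
    for command in rollback_commands(deployment, namespace):
        subprocess.run(command, check=True)
```

Keeping command construction separate from execution makes the sequence easy to log or dry-run before it touches the cluster.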

Next Steps

For upgrades, see coreweave-upgrade-migration.

Information
Category Programming and Development
Name coreweave-prod-checklist
Version v20260423
Size 1.63KB
Updated 2026-04-28