
CoreWeave Production Deployment Checklist

v20260423
coreweave-prod-checklist
A comprehensive checklist for ensuring GPU workloads, such as inference services or model training pipelines, are fully prepared for production deployment on CoreWeave. It covers critical MLOps and DevOps aspects including autoscaling, resource limits, security policies, persistent storage validation, monitoring setup, and rollback procedures.

Inference Services

  • GPU type and count validated for model size
  • Autoscaling configured (KServe or HPA)
  • Health and readiness probes set
  • Resource requests AND limits specified
  • Node affinity targeting correct GPU class
  • minReplicas >= 1 for production (no cold starts)
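The items above can be sketched in a single manifest. This is a minimal, hypothetical Deployment, not a CoreWeave-provided template: the name `my-inference`, the image path, the port, and the GPU class label value are assumptions to adjust for your cluster.

```yaml
# Hypothetical inference Deployment illustrating the checklist items above.
# Names, image, port, and the GPU class label value are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-inference
spec:
  replicas: 1                          # minReplicas >= 1: no cold starts
  selector:
    matchLabels:
      app: my-inference
  template:
    metadata:
      labels:
        app: my-inference
    spec:
      nodeSelector:
        gpu.nvidia.com/class: A100_PCIE_80GB   # target the validated GPU class
      containers:
        - name: server
          image: registry.example.com/my-inference:1.0.0   # pinned tag, trusted registry
          resources:
            requests:
              cpu: "4"
              memory: 32Gi
              nvidia.com/gpu: 1
            limits:
              cpu: "8"
              memory: 32Gi
              nvidia.com/gpu: 1        # GPU request and limit must match
          readinessProbe:
            httpGet: { path: /healthz, port: 8080 }
            initialDelaySeconds: 30
          livenessProbe:
            httpGet: { path: /healthz, port: 8080 }
            periodSeconds: 15
```

With KServe, the equivalent knobs live on the InferenceService spec (`minReplicas`, `scaleTarget`) rather than a raw Deployment.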

Storage

  • Model weights in PVC (not downloaded at startup)
  • Checkpoints saved to persistent storage
  • Storage class appropriate (SSD for inference, HDD for archival)
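A pre-populated PVC keeps weights off the startup path. The sketch below is illustrative; the storage class name `shared-nvme` and the size are assumptions, so substitute the SSD-backed class your CoreWeave region actually offers.

```yaml
# Hypothetical PVC for model weights, populated ahead of deployment so pods
# mount weights read-only instead of downloading them at startup.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights
spec:
  accessModes: ["ReadOnlyMany"]    # many inference replicas share the same weights
  storageClassName: shared-nvme    # assumed SSD-backed class; use HDD classes for archival
  resources:
    requests:
      storage: 100Gi
```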

Security

  • Secrets for model tokens and registry access
  • Network policies applied
  • Container images from trusted registries
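A network policy restricting ingress to the serving port might look like the following sketch. The pod label, namespace name, and port are assumptions.

```yaml
# Hypothetical NetworkPolicy: only pods in the ingress namespace may reach
# the inference pods, and only on the serving port. Labels are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: my-inference-ingress-only
spec:
  podSelector:
    matchLabels:
      app: my-inference
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080
```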

Monitoring

  • GPU utilization metrics collected
  • Inference latency and throughput tracked
  • Alert on pod restarts and OOM events
  • Log aggregation configured
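If the cluster runs the Prometheus Operator, a restart alert can be expressed as a PrometheusRule. This is a sketch under that assumption; the metric comes from kube-state-metrics, and the pod name pattern and thresholds are illustrative.

```yaml
# Hypothetical PrometheusRule (requires the Prometheus Operator) alerting on
# repeated container restarts, a common symptom of OOM kills.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-inference-alerts
spec:
  groups:
    - name: inference
      rules:
        - alert: InferencePodRestarting
          expr: increase(kube_pod_container_status_restarts_total{pod=~"my-inference.*"}[15m]) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "my-inference pods are restarting repeatedly (check for OOMKilled)"
```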

Rollback

# Revert to the previous ReplicaSet revision
kubectl rollout undo deployment/my-inference
# Watch until the rollback has fully completed
kubectl rollout status deployment/my-inference

Next Steps

For upgrades, see coreweave-upgrade-migration.

Info

Category: Development
Name: coreweave-prod-checklist
Version: v20260423
Updated: 2026-04-28