Deploying GPU Inference on CoreWeave

v20260423
coreweave-deploy-integration
This skill provides comprehensive guidance for deploying and managing GPU-accelerated AI inference services on CoreWeave Kubernetes (CKS). It covers best practices including containerization using NVIDIA CUDA base images, configuring specific GPU resource limits (A100/H100), setting up robust health checks, and executing controlled rolling updates. Ideal for managing multi-model inference and scaling demanding AI workloads in a cloud environment.

CoreWeave Deploy Integration

Overview

Deploy GPU-accelerated inference services on CoreWeave Kubernetes (CKS). This skill covers containerizing inference workloads with NVIDIA CUDA base images, configuring GPU resource limits and node affinity for A100/H100 scheduling, setting up health checks that validate GPU availability and model loading, and executing rolling updates that respect GPU node draining. CoreWeave's scheduler requires explicit GPU resource requests to place pods on the correct hardware tier.

Docker Configuration

# Stage 1: CUDA runtime plus Python dependencies (layer is cached until requirements.txt changes)
FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04 AS base
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-pip curl && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt ./
RUN pip3 install --no-cache-dir -r requirements.txt

# Stage 2: application code and model weights, run as a non-root user
FROM base
RUN groupadd -r app && useradd -r -g app app
COPY --chown=app:app src/ ./src/
COPY --chown=app:app models/ ./models/
USER app
EXPOSE 8080
# curl (installed in stage 1) backs the container-level health check
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1
CMD ["python3", "src/server.py"]

Environment Variables

export COREWEAVE_API_KEY="cw_xxxxxxxxxxxx"
export COREWEAVE_NAMESPACE="tenant-my-org"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export GPU_TYPE="A100_PCIE_80GB"
export GPU_COUNT="1"
export LOG_LEVEL="info"
export PORT="8080"
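
The variables above are easier to work with once they are validated at startup. The following is a minimal sketch, assuming a hypothetical `InferenceConfig` shape and hypothetical helper names (`requireVar`, `parsePositiveInt`, `loadConfig`); none of these come from a CoreWeave SDK. It also assumes `LOG_LEVEL` and `PORT` may fall back to the values shown above. `COREWEAVE_API_KEY` is not read here; add it if the server needs it at runtime.

```typescript
// Hypothetical config shape; field names mirror the exported variables above.
interface InferenceConfig {
  namespace: string;
  modelName: string;
  gpuType: string;
  gpuCount: number;
  logLevel: string;
  port: number;
}

type Env = Record<string, string | undefined>;

// Read a required variable or fail fast with a clear message.
function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Parse a positive integer such as GPU_COUNT or PORT.
function parsePositiveInt(raw: string, name: string): number {
  const n = Number(raw);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return n;
}

function loadConfig(env: Env): InferenceConfig {
  return {
    namespace: requireVar(env, "COREWEAVE_NAMESPACE"),
    modelName: requireVar(env, "MODEL_NAME"),
    gpuType: requireVar(env, "GPU_TYPE"),
    gpuCount: parsePositiveInt(requireVar(env, "GPU_COUNT"), "GPU_COUNT"),
    // Assumption: LOG_LEVEL and PORT default to the values shown above.
    logLevel: env["LOG_LEVEL"] ?? "info",
    port: parsePositiveInt(env["PORT"] ?? "8080", "PORT"),
  };
}
```

In a Node.js server this would typically be called once as `loadConfig(process.env)` before the model starts loading, so a missing variable stops the pod immediately instead of failing on the first request.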

Health Check Endpoint

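import express from 'express';
import { execSync } from 'child_process';

const app = express();

// Responds 200 only when the GPU is visible and the model has finished loading.
app.get('/health', (req, res) => {
  try {
    // execSync throws if nvidia-smi is missing, exits non-zero, or exceeds the timeout.
    const gpuInfo = execSync('nvidia-smi --query-gpu=name,memory.used --format=csv,noheader', { timeout: 5000 }).toString().trim();
    // The model loader is expected to set globalThis.modelReady = true once loading completes.
    const modelLoaded = (globalThis as any).modelReady === true;
    if (!modelLoaded) throw new Error('Model not loaded');
    res.json({ status: 'healthy', gpu: gpuInfo, model: process.env.MODEL_NAME, timestamp: new Date().toISOString() });
  } catch (error) {
    res.status(503).json({ status: 'unhealthy', error: (error as Error).message });
  }
});

// Listen on PORT (default 8080) so the endpoint is actually served.
app.listen(Number(process.env.PORT ?? 8080));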
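
The endpoint returns the raw `nvidia-smi` CSV as a single string. If structured values are more useful to monitoring, the output can be parsed first. This is a sketch only: `GpuInfo` and `parseGpuInfo` are hypothetical names, and it assumes one line per GPU in the form `name, <number> MiB`, which is what the query above is expected to emit.

```typescript
interface GpuInfo {
  name: string;
  // null when nvidia-smi reports a non-numeric value for memory.used
  memoryUsedMiB: number | null;
}

// Parse the output of:
//   nvidia-smi --query-gpu=name,memory.used --format=csv,noheader
function parseGpuInfo(csv: string): GpuInfo[] {
  return csv
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => {
      // Assumption: the memory field is last, so split on the final comma.
      const idx = line.lastIndexOf(",");
      if (idx === -1) {
        throw new Error(`Unexpected nvidia-smi line: "${line}"`);
      }
      const name = line.slice(0, idx).trim();
      // parseInt reads the leading digits and ignores the " MiB" suffix.
      const used = Number.parseInt(line.slice(idx + 1).trim(), 10);
      return { name, memoryUsedMiB: Number.isNaN(used) ? null : used };
    });
}
```

An empty result means the command produced no GPU lines; the health handler could treat that as a failure in the same way as a thrown error.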

Deployment Steps

Step 1: Build

docker build -t registry.coreweave.com/my-org/inference-svc:latest .
docker push registry.coreweave.com/my-org/inference-svc:latest

Step 2: Run

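# k8s/deployment.yaml (excerpt)
# Under spec.template.spec.containers[].resources:
resources:
  limits:
    nvidia.com/gpu: 1
    cpu: "4"
    memory: "48Gi"
# Under spec.template.spec (pod level, sibling of containers):
nodeSelector:
  gpu.nvidia.com/class: A100_PCIE_80GB

kubectl apply -f k8s/deployment.yaml -n tenant-my-org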
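
The excerpt above shows only the GPU-specific fields. Kubernetes does not act on the Dockerfile `HEALTHCHECK`, so a full manifest also needs probes that call `/health`. The sketch below generates a complete Deployment as JSON, which `kubectl apply -f` accepts. `buildDeployment` and `DeploymentOptions` are hypothetical names; the replica count and all probe timings are assumptions to tune, including the startup budget of 30 checks at 10 s intervals for model loading. The container name `inference` and label `app=inference-svc` match the commands used in the later steps.

```typescript
interface DeploymentOptions {
  namespace: string;
  image: string;
  gpuClass: string;
  gpuCount: number;
  port: number;
}

function buildDeployment(opts: DeploymentOptions) {
  const labels = { app: "inference-svc" };
  const healthProbe = { httpGet: { path: "/health", port: opts.port } };
  return {
    apiVersion: "apps/v1",
    kind: "Deployment",
    metadata: { name: "inference-svc", namespace: opts.namespace, labels },
    spec: {
      replicas: 1, // assumption: single replica
      selector: { matchLabels: labels },
      template: {
        metadata: { labels },
        spec: {
          nodeSelector: { "gpu.nvidia.com/class": opts.gpuClass },
          containers: [
            {
              name: "inference",
              image: opts.image,
              ports: [{ containerPort: opts.port }],
              resources: {
                limits: { "nvidia.com/gpu": opts.gpuCount, cpu: "4", memory: "48Gi" },
              },
              // Assumed budget: up to 30 checks, 10 s apart, for model loading.
              startupProbe: { ...healthProbe, periodSeconds: 10, failureThreshold: 30 },
              // Gate traffic on GPU visibility and model readiness.
              readinessProbe: { ...healthProbe, periodSeconds: 10, timeoutSeconds: 10 },
              // Restart the container after three consecutive failures.
              livenessProbe: {
                ...healthProbe,
                periodSeconds: 30,
                timeoutSeconds: 10,
                failureThreshold: 3,
              },
            },
          ],
        },
      },
    },
  };
}

const manifest = buildDeployment({
  namespace: "tenant-my-org",
  image: "registry.coreweave.com/my-org/inference-svc:latest",
  gpuClass: "A100_PCIE_80GB",
  gpuCount: 1,
  port: 8080,
});
const manifestJson = JSON.stringify(manifest, null, 2);
```

Writing `manifestJson` to a file such as `k8s/deployment.json` and applying it with `kubectl apply -f` should behave like the YAML form. Generating the manifest from code keeps the GPU class and count in one place alongside the environment variables.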

Step 3: Verify

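# Confirm the pods are Running and Ready
kubectl get pods -n tenant-my-org -l app=inference-svc
# Run from inside the cluster: the .svc.cluster.local name resolves only there,
# and it assumes a Service named inference-svc exposes port 8080
curl -s http://inference-svc.tenant-my-org.svc.cluster.local:8080/health | jq .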
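
A single `curl` can fail while the model is still loading, so automated verification usually polls until the endpoint reports `healthy`. The following is a sketch under stated assumptions: `waitForHealthy`, `isHealthy`, and `HealthFetcher` are hypothetical names, the response shape follows the health endpoint above, and the default of 30 attempts at 10 s intervals is arbitrary. The fetcher is injectable so the loop can be exercised without a cluster.

```typescript
interface HealthResponse {
  status: string;
  gpu?: string;
  model?: string;
  error?: string;
}

type HealthFetcher = (url: string) => Promise<{ ok: boolean; body: HealthResponse }>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Healthy means an HTTP 2xx response whose body says so.
function isHealthy(ok: boolean, body: HealthResponse): boolean {
  return ok && body.status === "healthy";
}

async function waitForHealthy(
  url: string,
  fetchHealth: HealthFetcher,
  attempts = 30,
  delayMs = 10_000,
): Promise<HealthResponse> {
  let lastError = "no attempts made";
  for (let i = 0; i < attempts; i++) {
    try {
      const { ok, body } = await fetchHealth(url);
      if (isHealthy(ok, body)) return body;
      lastError = body.error ?? `status=${body.status}`;
    } catch (err) {
      // Network errors (DNS, connection refused) are retried like 503s.
      lastError = err instanceof Error ? err.message : String(err);
    }
    if (i < attempts - 1) await sleep(delayMs);
  }
  throw new Error(`Service not healthy after ${attempts} attempts: ${lastError}`);
}

// Default fetcher using the global fetch available in Node 18+.
const httpFetcher: HealthFetcher = async (url) => {
  const res = await fetch(url);
  return { ok: res.ok, body: (await res.json()) as HealthResponse };
};
```

From a pod inside the cluster, this would be called as `waitForHealthy("http://inference-svc.tenant-my-org.svc.cluster.local:8080/health", httpFetcher)`.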

Step 4: Rolling Update

kubectl set image deployment/inference-svc \
  inference=registry.coreweave.com/my-org/inference-svc:v2 \
  -n tenant-my-org
kubectl rollout status deployment/inference-svc -n tenant-my-org --timeout=600s
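
With the default rolling-update settings and a single replica, Kubernetes starts the replacement pod before stopping the old one. On a GPU tier with no spare capacity, that replacement can sit in Pending until the rollout times out. One way to handle this is to pick the strategy from the available headroom. This is a sketch, not a CoreWeave recommendation: `chooseStrategy` and `strategyPatch` are hypothetical helpers, and `spareGpus` is an input that must be determined separately, for example from `kubectl describe nodes`.

```typescript
interface RollingUpdateStrategy {
  type: "RollingUpdate";
  rollingUpdate: { maxSurge: number; maxUnavailable: number };
}

// spareGpus: unallocated GPUs of the requested class (hypothetical input).
// gpusPerPod: the nvidia.com/gpu limit of one pod.
function chooseStrategy(spareGpus: number, gpusPerPod: number): RollingUpdateStrategy {
  if (!Number.isInteger(gpusPerPod) || gpusPerPod <= 0) {
    throw new Error("gpusPerPod must be a positive integer");
  }
  const surgePods = Math.floor(Math.max(spareGpus, 0) / gpusPerPod);
  if (surgePods >= 1) {
    // Enough headroom: start new pods first and keep full serving capacity.
    return { type: "RollingUpdate", rollingUpdate: { maxSurge: surgePods, maxUnavailable: 0 } };
  }
  // No headroom: stop one old pod first so its GPU is freed for the replacement.
  return { type: "RollingUpdate", rollingUpdate: { maxSurge: 0, maxUnavailable: 1 } };
}

// JSON body for: kubectl patch deployment inference-svc -n tenant-my-org -p '<patch>'
function strategyPatch(strategy: RollingUpdateStrategy): string {
  return JSON.stringify({ spec: { strategy } });
}
```

The no-headroom branch trades a brief capacity drop for a rollout that can make progress. With a single replica it implies downtime during the swap, so it is best paired with a termination grace period long enough for in-flight requests to finish.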

Error Handling

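| Issue | Cause | Fix |
| --- | --- | --- |
| Pod stuck in Pending | No GPU nodes available for the requested type | Run kubectl describe pod to read the scheduling events and kubectl describe nodes to check allocatable GPUs, or switch GPU tier |
| OOMKilled | Container exceeded its memory limit (host RAM; GPU memory exhaustion surfaces as a CUDA out-of-memory error in the logs instead) | Raise the memory limit; for GPU memory errors, reduce model size, enable quantization, or request a larger GPU |
| nvidia-smi not found | Container started without GPU access, typically because the pod has no nvidia.com/gpu request or the node lacks the NVIDIA device plugin | Confirm the pod requests nvidia.com/gpu and that the target node advertises allocatable GPUs |
| 401 Unauthorized | Invalid API key or expired credentials | Regenerate the key in the CoreWeave dashboard |
| Slow rolling update | GPU nodes take time to drain | Set terminationGracePeriodSeconds: 300 in the pod spec (spec.template.spec) so in-flight requests can finish |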

Next Steps

See coreweave-webhooks-events.

Info
Category: Engineering
Name: coreweave-deploy-integration
Version: v20260423
Size: 4.28KB
Updated At: 2026-04-28