
CoreWeave GPU Performance Tuning Guide

This guide provides expert strategies for optimizing GPU inference performance on CoreWeave infrastructure. It covers GPU selection based on workload (LLM, image generation, training), advanced techniques like continuous batching (using vLLM), autoscaling setup (HPA), and benchmarking data. Use this to maximize GPU utilization, minimize latency, and optimize large-scale AI model deployment.

GPU Selection by Workload

| Workload | Recommended GPU | Why |
| --- | --- | --- |
| LLM inference (7-13B) | A100 80GB | Good balance of memory and cost |
| LLM inference (70B+) | 8x H100 | NVLink for tensor parallelism |
| Image generation | L40 | Good for diffusion models |
| Training (large models) | 8x H100 SXM5 | Fastest interconnect |
| Batch processing | A100 40GB | Cost-effective |
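
On CoreWeave, workloads are typically pinned to a GPU class through node affinity on GPU node labels plus a GPU resource request. Below is a minimal sketch for the A100 80GB case; the gpu.nvidia.com/class label key and the A100_PCIE_80GB value are assumptions, so check the labels actually present on your cluster's nodes.

# Pin an inference Deployment to A100 80GB nodes (sketch; label key/value assumed)
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: gpu.nvidia.com/class
                    operator: In
                    values:
                      - A100_PCIE_80GB
      containers:
        - name: inference-server
          resources:
            limits:
              nvidia.com/gpu: 1   # one GPU of the selected class per pod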

Inference Optimization

# Continuous batching with vLLM
containers:
  - name: vllm
    args:
      - "--model=meta-llama/Llama-3.1-8B-Instruct"
      - "--max-num-batched-tokens=8192"   # cap on tokens processed per batching step
      - "--max-num-seqs=256"              # max concurrent sequences in a batch
      - "--gpu-memory-utilization=0.90"   # fraction of GPU memory for weights + KV cache
      - "--enable-prefix-caching"         # reuse KV cache across shared prompt prefixes
      - "--dtype=float16"

Autoscaling Tuning

# HPA based on GPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: "70"
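
The DCGM_FI_DEV_GPU_UTIL Pods metric is not built into Kubernetes; it normally comes from the NVIDIA DCGM exporter scraped by Prometheus and surfaced to the HPA through prometheus-adapter. A minimal adapter rule sketch, assuming the exporter's series carry exported_namespace/exported_pod labels (the exact label names depend on your scrape config):

# prometheus-adapter rule exposing DCGM GPU utilization as a per-pod custom metric (sketch)
rules:
  - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="",exported_pod!=""}'
    resources:
      overrides:
        exported_namespace: {resource: "namespace"}
        exported_pod: {resource: "pod"}
    name:
      matches: "DCGM_FI_DEV_GPU_UTIL"
      as: "DCGM_FI_DEV_GPU_UTIL"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'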

Performance Benchmarks

| Metric | A100 80GB | H100 80GB |
| --- | --- | --- |
| Llama-8B tokens/sec | ~2,000 | ~4,500 |
| Llama-70B tokens/sec (4-GPU tensor parallel) | ~200 | ~500 |
| Cold start, vLLM | 30-60 s | 20-40 s |

Next Steps

For cost optimization, see coreweave-cost-tuning.
