下载

CoreWeave GPU迁移与升级

v20260423

coreweave-upgrade-migration

该工具用于管理和自动化CoreWeave云基础设施的升级与迁移过程。它支持执行GPU型号升级（如A100到H100）、CUDA版本兼容性检查，并利用Kubernetes API审计部署、检测过时实例，确保集群平稳过渡。

Kubernetes 云原生 DevOps GPU 迁移核心云 CUDA 基础设施

获取技能

157 次下载

概览

CoreWeave Upgrade & Migration

Overview

CoreWeave is a GPU-specialized cloud provider running Kubernetes-native infrastructure. Migrations involve upgrading between GPU instance types (A100 to H100), updating CUDA driver versions, and handling Kubernetes API version changes across namespaces. Tracking API versions is critical because CoreWeave's instance type labels and resource quotas change between platform releases, and deploying to a deprecated instance class will cause scheduling failures.

Version Detection

import { KubeConfig, CoreV1Api } from "@kubernetes/client-node";

async function detectCoreWeaveVersion(): Promise<void> {
  const kc = new KubeConfig();
  kc.loadFromDefault();
  const k8sApi = kc.makeApiClient(CoreV1Api);

  // Check current namespace GPU allocations
  const pods = await k8sApi.listNamespacedPod("my-namespace");
  for (const pod of pods.body.items) {
    const gpuClass = pod.spec?.nodeSelector?.["gpu.nvidia.com/class"];
    const cudaVersion = pod.metadata?.labels?.["cuda-version"];
    console.log(`Pod ${pod.metadata?.name}: GPU=${gpuClass}, CUDA=${cudaVersion}`);
  }

  // Detect deprecated instance types
  const deprecated = ["A100_PCIE_40GB", "V100_PCIE_16GB", "RTX_A5000"];
  const activeGpus = pods.body.items
    .map((p) => p.spec?.nodeSelector?.["gpu.nvidia.com/class"])
    .filter(Boolean);
  const stale = activeGpus.filter((g) => deprecated.includes(g!));
  if (stale.length > 0) console.warn(`Deprecated GPU types in use: ${stale.join(", ")}`);
}

Migration Checklist

Review CoreWeave release notes for deprecated instance types
Audit all deployments for gpu.nvidia.com/class node selectors
Verify CUDA version compatibility with target GPU (see matrix below)
Update container base images to match new CUDA/cuDNN requirements
Test inference latency on new GPU type in staging namespace
Update resource requests (nvidia.com/gpu) for new instance memory
Migrate persistent volumes if switching regions or availability zones
Update Kubernetes API version in manifests (e.g., apps/v1 changes)
Validate HPA scaling behavior on new instance type throughput
Run canary deployment with traffic split before full cutover

Schema Migration

// CoreWeave instance type labels changed in 2025 platform update
// Old: gpu.nvidia.com/class: "A100_PCIE_80GB"
// New: gpu.nvidia.com/class: "H100_SXM5_80GB"

interface DeploymentMigration {
  oldSelector: Record<string, string>;
  newSelector: Record<string, string>;
  cudaMinVersion: string;
}

const GPU_MIGRATIONS: DeploymentMigration[] = [
  {
    oldSelector: { "gpu.nvidia.com/class": "A100_PCIE_80GB" },
    newSelector: { "gpu.nvidia.com/class": "H100_SXM5_80GB" },
    cudaMinVersion: "12.4",
  },
  {
    oldSelector: { "gpu.nvidia.com/class": "A100_SXM4_80GB" },
    newSelector: { "gpu.nvidia.com/class": "H100_SXM5_80GB" },
    cudaMinVersion: "12.4",
  },
];

function migrateNodeSelector(manifest: any, migration: DeploymentMigration): any {
  const selector = manifest.spec?.template?.spec?.nodeSelector;
  if (!selector) return manifest;
  for (const [key, oldVal] of Object.entries(migration.oldSelector)) {
    if (selector[key] === oldVal) {
      selector[key] = migration.newSelector[key];
    }
  }
  return manifest;
}

Rollback Strategy

import { AppsV1Api, KubeConfig } from "@kubernetes/client-node";

async function rollbackDeployment(namespace: string, name: string): Promise<void> {
  const kc = new KubeConfig();
  kc.loadFromDefault();
  const appsApi = kc.makeApiClient(AppsV1Api);

  // Kubernetes rollout undo — reverts to previous revision
  const deployment = await appsApi.readNamespacedDeployment(name, namespace);
  const currentRevision = deployment.body.metadata?.annotations?.["deployment.kubernetes.io/revision"];
  console.log(`Rolling back ${name} from revision ${currentRevision}`);

  // Patch to trigger rollback via revision annotation
  await appsApi.patchNamespacedDeployment(name, namespace, {
    spec: { template: { metadata: { annotations: { "kubectl.kubernetes.io/restartedAt": new Date().toISOString() } } } },
  }, undefined, undefined, undefined, undefined, undefined, { headers: { "Content-Type": "application/strategic-merge-patch+json" } });
  console.log(`Rollback initiated for ${name} in ${namespace}`);
}

Error Handling

Migration Issue	Symptom	Fix
GPU class not schedulable	Pod stuck in `Pending` with `Insufficient nvidia.com/gpu`	Verify instance type exists in target region; check quota
CUDA version mismatch	Container crashes with `CUDA driver version is insufficient`	Rebuild container with CUDA matching target GPU driver
Namespace quota exceeded	`Forbidden: exceeded quota` on deployment	Request quota increase for new instance type via CoreWeave dashboard
PVC migration failure	`VolumeAttachment` timeout on new node	Detach old PVC, recreate in target availability zone
API version deprecated	`no matches for kind "Deployment" in version "extensions/v1beta1"`	Update manifest to `apps/v1` and adjust spec fields

Resources

Next Steps

For CI/CD pipeline integration, see coreweave-ci-integration.

信息

Category 编程开发

Name coreweave-upgrade-migration

版本 v20260423

大小 5.99KB

Source jeremylongshore/claude-code-plugins-plus-skills

更新时间 2026-04-26

语言

简体中文

English