
CoreWeave GPU Workload Event Monitoring

v20260423
coreweave-webhooks-events
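This tool uses webhooks to monitor GPU workload status and lifecycle events on CoreWeave clusters in real time. It tracks Pod readiness, job completion, storage attachment, and node health, and serves as a core building block for automated scaling, alerting, and recovery pipelines for large-scale GPU inference and training workloads.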

CoreWeave Webhooks & Events

Overview

CoreWeave emits Kubernetes-native events and custom status callbacks for GPU workload lifecycle management. Monitor instance readiness, job completion, volume attachment, and node health to build automated scaling, alerting, and recovery pipelines for GPU-accelerated inference and training workloads.

Webhook Registration

const response = await fetch("https://api.coreweave.com/v1/webhooks", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.COREWEAVE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://yourapp.com/webhooks/coreweave",
    events: ["instance.ready", "job.completed", "volume.attached", "node.unhealthy"],
    secret: process.env.COREWEAVE_WEBHOOK_SECRET,
  }),
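});
// Fail loudly if the registration request is rejected instead of silently continuing.
if (!response.ok) {
  throw new Error(`Webhook registration failed with status ${response.status}`);
}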

Signature Verification

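import crypto from "crypto";
import { Request, Response, NextFunction } from "express";

// Express middleware. Requires express.raw() on the route so req.body is the unparsed Buffer.
function verifyCoreWeaveSignature(req: Request, res: Response, next: NextFunction) {
  const signature = req.headers["x-coreweave-signature"];
  const timestamp = req.headers["x-coreweave-timestamp"];
  // Reject requests that lack either header before doing any crypto work.
  if (typeof signature !== "string" || typeof timestamp !== "string") {
    return res.status(401).json({ error: "Missing signature headers" });
  }
  const payload = `${timestamp}.${req.body.toString()}`;
  const expected = crypto.createHmac("sha256", process.env.COREWEAVE_WEBHOOK_SECRET!)
    .update(payload).digest("hex");
  const received = Buffer.from(signature);
  const computed = Buffer.from(expected);
  // timingSafeEqual throws when the buffers differ in length, so compare lengths first.
  if (received.length !== computed.length || !crypto.timingSafeEqual(received, computed)) {
    return res.status(401).json({ error: "Invalid signature" });
  }
  next();
}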
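
To exercise the middleware locally, it can help to generate a signature in the same way the middleware checks it. The sketch below assumes the scheme implied by the verification code above: a hex-encoded HMAC-SHA256 over `${timestamp}.${rawBody}`. The helper names `signPayload` and `isValidSignature` are hypothetical, and the scheme itself should be confirmed against CoreWeave's documentation before relying on it.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical helper: signs a raw body the way the middleware above expects.
// Assumption: hex HMAC-SHA256 over `${timestamp}.${rawBody}`, inferred from the
// verification code rather than from CoreWeave documentation.
function signPayload(secret: string, timestamp: string, rawBody: string): string {
  return createHmac("sha256", secret).update(`${timestamp}.${rawBody}`).digest("hex");
}

// Hypothetical helper: constant-time comparison that also handles length mismatches,
// since timingSafeEqual throws when the two buffers differ in length.
function isValidSignature(secret: string, timestamp: string, rawBody: string, signature: string): boolean {
  const expected = Buffer.from(signPayload(secret, timestamp, rawBody));
  const received = Buffer.from(signature);
  return received.length === expected.length && timingSafeEqual(received, expected);
}

const body = JSON.stringify({ id: "evt_1", type: "instance.ready", data: {} });
const ts = "1700000000";
const sig = signPayload("test-secret", ts, body);

console.log(isValidSignature("test-secret", ts, body, sig));       // true
console.log(isValidSignature("test-secret", ts, body + " ", sig)); // false (body was altered)
```

Signing the raw string rather than a re-serialized object matters here: any change in whitespace or key order produces a different digest.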

Event Handler

import express from "express";
const app = express();

// express.raw() keeps req.body as a Buffer so the signature is checked against the exact bytes sent.
app.post("/webhooks/coreweave", express.raw({ type: "application/json" }), verifyCoreWeaveSignature, (req, res) => {
  const event = JSON.parse(req.body.toString());
  // Acknowledge before doing the work so a slow handler does not hold the request open.
  res.status(200).json({ received: true });

  // registerEndpoint, collectArtifacts, mountStorage, and drainAndReschedule are application-defined.
  switch (event.type) {
    case "instance.ready":
      registerEndpoint(event.data.instance_id, event.data.gpu_type);
      break;
    case "job.completed":
      collectArtifacts(event.data.job_id, event.data.output_path);
      break;
    case "volume.attached":
      mountStorage(event.data.volume_id, event.data.node_name);
      break;
    case "node.unhealthy":
      drainAndReschedule(event.data.node_id, event.data.reason);
      break;
  }
});

Event Types

| Event | Payload Fields | Use Case |
| --- | --- | --- |
| instance.ready | instance_id, gpu_type, ip_address | Register inference endpoint |
| job.completed | job_id, output_path, duration_seconds | Collect training artifacts |
| volume.attached | volume_id, node_name, mount_path | Confirm storage availability |
| node.unhealthy | node_id, reason, gpu_count | Drain node and reschedule pods |
| instance.terminated | instance_id, exit_code, gpu_type | Clean up resources and alert |
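
The event table above can be expressed as a TypeScript discriminated union so that handlers get type-checked access to each payload. This is a sketch only: the field names come from the table, while the field types (string or number), the top-level `id` field (taken from the idempotency example), and the `describeEvent` helper are assumptions made for illustration.

```typescript
// Field names follow the event table; the field types are assumptions.
type CoreWeaveEvent = { id: string } & (
  | { type: "instance.ready"; data: { instance_id: string; gpu_type: string; ip_address: string } }
  | { type: "job.completed"; data: { job_id: string; output_path: string; duration_seconds: number } }
  | { type: "volume.attached"; data: { volume_id: string; node_name: string; mount_path: string } }
  | { type: "node.unhealthy"; data: { node_id: string; reason: string; gpu_count: number } }
  | { type: "instance.terminated"; data: { instance_id: string; exit_code: number; gpu_type: string } }
);

// Hypothetical helper: switching on `type` narrows `data` to the matching shape,
// so a typo such as event.data.job_idd fails at compile time.
function describeEvent(event: CoreWeaveEvent): string {
  switch (event.type) {
    case "instance.ready":
      return `instance ${event.data.instance_id} (${event.data.gpu_type}) ready at ${event.data.ip_address}`;
    case "job.completed":
      return `job ${event.data.job_id} finished in ${event.data.duration_seconds}s, output at ${event.data.output_path}`;
    case "volume.attached":
      return `volume ${event.data.volume_id} attached to ${event.data.node_name} at ${event.data.mount_path}`;
    case "node.unhealthy":
      return `node ${event.data.node_id} unhealthy: ${event.data.reason}`;
    case "instance.terminated":
      return `instance ${event.data.instance_id} terminated with exit code ${event.data.exit_code}`;
  }
}

const example: CoreWeaveEvent = {
  id: "evt_1",
  type: "job.completed",
  data: { job_id: "job-42", output_path: "/out", duration_seconds: 90 },
};
console.log(describeEvent(example)); // job job-42 finished in 90s, output at /out
```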

Retry & Idempotency

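const processed = new Set<string>();
const MAX_TRACKED = 10_000;

// routeEvent is application-defined (for example, the switch in the event handler).
async function handleIdempotent(event: { id: string; type: string; data: any }) {
  if (processed.has(event.id)) return;
  // Claim the ID before awaiting so a concurrent redelivery is not processed twice.
  processed.add(event.id);
  try {
    await routeEvent(event);
  } catch (err) {
    processed.delete(event.id); // let a later retry reprocess the event
    throw err;
  }
  // Sets iterate in insertion order, so the first value is the oldest ID.
  while (processed.size > MAX_TRACKED) {
    processed.delete(processed.values().next().value as string);
  }
}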

Error Handling

| Issue | Cause | Fix |
| --- | --- | --- |
| Signature mismatch | Clock skew between clusters | Validate timestamp within 5-minute window |
| Duplicate instance.ready | Rescheduled pod on same node | Track instance IDs for deduplication |
| Stale node.unhealthy | Transient GPU memory error | Wait for consecutive events before draining |
| Missing output_path | Job failed before writing | Check exit_code before collecting artifacts |
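
Two of the fixes above, the 5-minute timestamp window and waiting for consecutive node.unhealthy events, can be sketched as small helpers. This is a minimal sketch under stated assumptions: the timestamp header is assumed to carry Unix time in seconds, the threshold of three consecutive events is an arbitrary illustrative default, and `isFreshTimestamp` and `UnhealthyTracker` are hypothetical names rather than part of any CoreWeave SDK.

```typescript
// Assumption: the x-coreweave-timestamp header carries Unix time in seconds.
const MAX_SKEW_SECONDS = 5 * 60;

// Returns true when the timestamp is within five minutes of `nowMs` in either direction.
function isFreshTimestamp(header: string, nowMs: number = Date.now()): boolean {
  const sent = Number(header);
  if (!Number.isFinite(sent)) return false;
  return Math.abs(nowMs / 1000 - sent) <= MAX_SKEW_SECONDS;
}

// Counts node.unhealthy events per node so a single transient error does not trigger a drain.
class UnhealthyTracker {
  private counts = new Map<string, number>();
  private threshold: number;

  // The default threshold of 3 is an illustrative choice, not a documented value.
  constructor(threshold: number = 3) {
    this.threshold = threshold;
  }

  // Returns true once the node has reported unhealthy `threshold` times without a reset.
  recordUnhealthy(nodeId: string): boolean {
    const count = (this.counts.get(nodeId) ?? 0) + 1;
    this.counts.set(nodeId, count);
    return count >= this.threshold;
  }

  // Call when the application's own health check sees the node recover.
  reset(nodeId: string): void {
    this.counts.delete(nodeId);
  }
}

const tracker = new UnhealthyTracker();
console.log(tracker.recordUnhealthy("node-a")); // false
console.log(tracker.recordUnhealthy("node-a")); // false
console.log(tracker.recordUnhealthy("node-a")); // true
```

The freshness check belongs before the HMAC comparison in the verification middleware, so stale or replayed requests are rejected without further work.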

Next Steps

See coreweave-security-basics.

Info
Category Hardware Engineering
Name coreweave-webhooks-events
Version v20260423
Size 4.32KB
Updated 2026-04-28