技能 硬件工程 Vast.ai GPU监控与成本追踪

Vast.ai GPU监控与成本追踪

v20260423
vastai-observability
本技能提供Vast.ai GPU实例的全面监控方案,可采集GPU利用率、实例运行状态、温度和成本累积等关键指标。适用于搭建监控仪表板、配置告警机制(如GPU空闲、过热、预算超支)以及自动化云资源使用跟踪,确保资源健康和成本可控。
获取技能
241 次下载
概览

Vast.ai Observability

Overview

Monitor Vast.ai GPU instance health, utilization, and costs. Key metrics: GPU utilization (idle GPUs waste $0.20-$4.00/hr), instance uptime, training progress, cost accumulation, and spot preemption events.

Prerequisites

  • Vast.ai account with active instances
  • vastai CLI installed
  • Optional: Prometheus, Grafana, or Datadog for dashboarding

Instructions

Step 1: Instance Metrics Collector

import subprocess, json, time
from datetime import datetime

class VastMetricsCollector:
    def __init__(self, output_file="vast_metrics.jsonl"):
        self.output_file = output_file

    def collect(self):
        result = subprocess.run(
            ["vastai", "show", "instances", "--raw"],
            capture_output=True, text=True)
        instances = json.loads(result.stdout)

        metrics = {
            "timestamp": datetime.utcnow().isoformat(),
            "total_instances": len(instances),
            "running": 0, "total_hourly_cost": 0,
            "instances": [],
        }

        for inst in instances:
            status = inst.get("actual_status", "unknown")
            dph = inst.get("dph_total", 0)
            if status == "running":
                metrics["running"] += 1
                metrics["total_hourly_cost"] += dph

            metrics["instances"].append({
                "id": inst["id"],
                "gpu": inst.get("gpu_name"),
                "status": status,
                "dph": dph,
                "gpu_util": inst.get("gpu_util", 0),
                "gpu_temp": inst.get("gpu_temp", 0),
            })

        with open(self.output_file, "a") as f:
            f.write(json.dumps(metrics) + "\n")

        return metrics

    def run(self, interval=60):
        while True:
            m = self.collect()
            print(f"[{m['timestamp']}] Running: {m['running']} | "
                  f"Cost: ${m['total_hourly_cost']:.3f}/hr")
            time.sleep(interval)

Step 2: Alert Conditions

def check_alerts(metrics):
    alerts = []

    # Idle GPU alert (running but <10% utilization)
    for inst in metrics["instances"]:
        if inst["status"] == "running" and inst["gpu_util"] < 10:
            alerts.append(f"IDLE: Instance {inst['id']} GPU util={inst['gpu_util']}% "
                         f"(wasting ${inst['dph']:.3f}/hr)")

    # High temperature alert
    for inst in metrics["instances"]:
        if inst.get("gpu_temp", 0) > 85:
            alerts.append(f"HOT: Instance {inst['id']} GPU temp={inst['gpu_temp']}C")

    # Budget alert
    daily_projection = metrics["total_hourly_cost"] * 24
    if daily_projection > 100:
        alerts.append(f"BUDGET: Projected daily cost ${daily_projection:.2f}")

    return alerts

Step 3: Remote GPU Monitoring

# SSH into instance and collect nvidia-smi metrics
ssh -p $PORT root@$HOST "nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw --format=csv,noheader,nounits"
# Output: 95, 20480, 24576, 72, 285

Step 4: Prometheus Exporter (Optional)

from prometheus_client import Gauge, start_http_server

gpu_util = Gauge("vastai_gpu_utilization", "GPU utilization %", ["instance_id", "gpu_name"])
hourly_cost = Gauge("vastai_hourly_cost", "Total hourly cost USD")
instance_count = Gauge("vastai_instance_count", "Running instances")

def export_metrics(metrics):
    instance_count.set(metrics["running"])
    hourly_cost.set(metrics["total_hourly_cost"])
    for inst in metrics["instances"]:
        if inst["status"] == "running":
            gpu_util.labels(inst["id"], inst["gpu"]).set(inst["gpu_util"])

start_http_server(9090)  # Prometheus scrape target

Output

  • Metrics collector with JSONL output
  • Alert conditions (idle GPU, high temp, budget)
  • Remote GPU monitoring via SSH + nvidia-smi
  • Optional Prometheus exporter for Grafana dashboards

Error Handling

Alert Threshold Response
Idle GPU util < 10% for > 10 min Investigate or destroy instance
High temp > 85C sustained Reduce workload or report to host
Budget exceeded Projected daily > $100 Destroy non-critical instances
Instance offline Status changed from running Trigger auto-recovery

Resources

Next Steps

For incident response procedures, see vastai-incident-runbook.

Examples

Quick dashboard: Run VastMetricsCollector().run(interval=30) in tmux on a monitoring server. Pipe alerts to Slack via webhook.

Cost tracking: Parse vast_metrics.jsonl to plot hourly cost over time and identify spending patterns.

信息
Category 硬件工程
Name vastai-observability
版本 v20260423
大小 5.27KB
更新时间 2026-04-28
语言