技能 人工智能 Vast.ai上的机器学习部署指南

Vast.ai上的机器学习部署指南

v20260423
vastai-deploy-integration
本指南提供了一套完整的机器学习工作流部署解决方案,专门用于Vast.ai GPU云平台。内容涵盖了从优化Docker镜像以提高效率,到使用自动化Python脚本进行资源搜索和实例创建,再到数据传输和部署后健康检查的全流程指导,确保ML服务能够稳定、高效地运行。
获取技能
169 次下载
概览

Vast.ai Deploy Integration

Overview

Deploy ML training jobs and inference services on Vast.ai GPU cloud. Covers Docker image optimization, automated provisioning scripts, data transfer strategies, and deployment automation.

Prerequisites

  • Vast.ai CLI authenticated
  • Docker image published to a registry
  • Training/inference code tested locally

Instructions

Step 1: Optimized Docker Image

# Dockerfile.vastai — optimized for fast pulls on Vast.ai
FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime

# Install dependencies in a single layer
COPY requirements.txt /tmp/
RUN pip install --no-cache-dir -r /tmp/requirements.txt && rm /tmp/requirements.txt

# Copy application code
COPY src/ /workspace/src/
COPY scripts/ /workspace/scripts/

WORKDIR /workspace
CMD ["python", "src/train.py"]
# Build and push
docker build -t ghcr.io/yourorg/training:v1 -f Dockerfile.vastai .
docker push ghcr.io/yourorg/training:v1

Step 2: Automated Deployment Script

#!/usr/bin/env python3
"""deploy.py — Automated Vast.ai deployment with monitoring."""
import subprocess, json, time, argparse, sys

def deploy(args):
    # Search for matching offer
    query = (f"num_gpus={args.gpus} gpu_name={args.gpu} "
             f"reliability>{args.reliability} dph_total<={args.max_price} "
             f"disk_space>={args.disk} rentable=true")

    offers = json.loads(subprocess.run(
        ["vastai", "search", "offers", query, "--order", "dph_total",
         "--raw", "--limit", "5"],
        capture_output=True, text=True, check=True).stdout)

    if not offers:
        print(f"ERROR: No offers matching: {query}", file=sys.stderr)
        sys.exit(1)

    offer = offers[0]
    print(f"Selected: {offer['gpu_name']} ${offer['dph_total']:.3f}/hr "
          f"(ID: {offer['id']})")

    # Create instance
    cmd = ["vastai", "create", "instance", str(offer["id"]),
           "--image", args.image, "--disk", str(args.disk)]
    if args.onstart:
        cmd.extend(["--onstart-cmd", args.onstart])

    result = json.loads(subprocess.run(
        cmd, capture_output=True, text=True, check=True).stdout)
    instance_id = result["new_contract"]
    print(f"Instance {instance_id} provisioning...")

    # Wait for running
    for _ in range(30):
        info = json.loads(subprocess.run(
            ["vastai", "show", "instance", str(instance_id), "--raw"],
            capture_output=True, text=True).stdout)
        if info.get("actual_status") == "running":
            print(f"READY: ssh -p {info['ssh_port']} root@{info['ssh_host']}")
            return instance_id, info
        time.sleep(10)

    raise TimeoutError("Instance did not start")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--gpu", default="RTX_4090")
    parser.add_argument("--gpus", type=int, default=1)
    parser.add_argument("--image", required=True)
    parser.add_argument("--disk", type=int, default=50)
    parser.add_argument("--max-price", type=float, default=0.50)
    parser.add_argument("--reliability", type=float, default=0.95)
    parser.add_argument("--onstart", default="")
    deploy(parser.parse_args())

Step 3: Data Transfer Strategies

# Small datasets (<5GB): SCP directly
scp -P $PORT ./data.tar.gz root@$HOST:/workspace/

# Large datasets (>5GB): Use rsync with compression
rsync -avz --progress -e "ssh -p $PORT" ./data/ root@$HOST:/workspace/data/

# Very large datasets: Pre-stage on cloud storage
ssh -p $PORT root@$HOST "wget -q https://storage.example.com/dataset.tar.gz -O /workspace/data.tar.gz"

Step 4: Health Check After Deploy

ssh -p $PORT -o StrictHostKeyChecking=no root@$HOST << 'CHECK'
echo "=== Deploy Health Check ==="
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
df -h /workspace | tail -1
echo "=== Ready ==="
CHECK

Output

  • Optimized Docker image for fast Vast.ai pulls
  • Automated deployment script with GPU/price selection
  • Data transfer patterns (SCP, rsync, cloud storage)
  • Post-deploy health check verification

Error Handling

Error Cause Solution
Docker pull timeout Image too large (>10GB) Use multi-stage builds; minimize image layers
Disk space exhausted Insufficient disk allocation Increase --disk parameter
SSH timeout after deploy Instance still loading image Wait longer or use smaller base image
CUDA version mismatch Image CUDA > host CUDA Filter offers by cuda_max_good

Resources

Next Steps

For event-driven workflows, see vastai-webhooks-events.

Examples

One-command deploy: python deploy.py --gpu A100 --image ghcr.io/org/train:v1 --max-price 2.00 --disk 100

Multi-GPU deploy: Set --gpus 4 and --gpu H100_SXM for distributed training with torchrun.

信息
Category 人工智能
Name vastai-deploy-integration
版本 v20260423
大小 5.58KB
更新时间 2026-04-28
语言