Complete checklist for running production GPU workloads on Vast.ai, covering account setup, instance selection, data safety, monitoring, and cost controls.
>= 0.98 for production jobsinet_down >= 200 for data transferdph_total set in search queries#!/bin/bash
set -euo pipefail
echo "Vast.ai Production Readiness Check"
# 1. Auth
vastai show user --raw | python3 -c "
import sys, json; u=json.load(sys.stdin)
balance = u.get('balance', 0)
print(f' Auth: OK | Balance: \${balance:.2f}')
assert balance >= 10, f'Balance too low: \${balance:.2f}'
" && echo " Balance: PASS" || echo " Balance: FAIL"
# 2. Offer availability
COUNT=$(vastai search offers 'reliability>0.98 num_gpus=1 rentable=true' --raw --limit 1 | python3 -c "import sys,json; print(len(json.load(sys.stdin)))")
echo " Offers available: $COUNT+ | PASS"
# 3. Docker image pullable
docker pull pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime > /dev/null 2>&1 && echo " Docker image: PASS" || echo " Docker image: FAIL"
echo "Pre-flight checks complete."
| Error | Cause | Solution |
|---|---|---|
| Insufficient balance | Credits depleted mid-job | Set up auto-top-up or balance alerts |
| Instance preempted during final epoch | Spot instance reclaimed | Use on-demand for final training stage |
| Checkpoint corrupted | Interrupted mid-save | Implement atomic checkpoint writes (save to temp, rename) |
| GPU utilization drops to 0% | Data pipeline bottleneck | Profile data loading; increase disk I/O |
For version upgrades, see vastai-upgrade-migration.
Pre-launch audit: Run the verification script, check all boxes, confirm Docker image pulls successfully, and verify at least 3 matching offers are available before starting a production training run.
Budget-safe launch: Set max_dph=2.00, auto-destroy timeout of 12 hours, and daily spend alert at $50 to prevent cost overruns.