remote-gpu-trainer
sickn33/antigravity-awesome-skills
This skill manages the entire lifecycle of long-running deep learning training and inference jobs deployed on remote, ephemeral computing environments (e.g., AutoDL, RunPod, bare SSH, Slurm, K8s). It provides critical capabilities like resilient checkpointing, managing spot instance preemption, debugging resource failures (OOM, NaN), and ensuring safe billing teardown, making it essential for professional ML Ops on rented resources.