distributed-llm-pretraining-torchtitan
Orchestra-Research/AI-Research-SKILLs
TorchTitan delivers PyTorch-native large-scale LLM pretraining with composable 4D parallelism (FSDP2, TP, PP, CP), plus Float8 training, torch.compile, and distributed checkpointing, to train models from 8B to 405B+ parameters across 8 to 512+ GPUs, with TensorBoard monitoring and SLURM support.
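
Since the summary highlights composable parallelism, here is a minimal, hypothetical sketch of how two of the four dimensions (TP and FSDP2) compose on a 2D DeviceMesh in raw PyTorch. TorchTitan drives the equivalent wiring from TOML config files rather than code like this; the `ToyMLP` model, the parallel degrees, and the tensor dimensions below are illustrative assumptions, not TorchTitan's API.

```python
# Sketch: composing tensor parallelism (TP) with FSDP2 on a 2D device mesh.
# Launch with: torchrun --nproc_per_node=4 this_file.py
# Module paths may vary across PyTorch releases; shown for torch >= 2.6.
import os

import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard  # FSDP2; older releases: torch.distributed._composable.fsdp
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class ToyMLP(nn.Module):
    """Stand-in for one transformer feed-forward block (assumption, not TorchTitan code)."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.w1 = nn.Linear(dim, 4 * dim, bias=False)
        self.w2 = nn.Linear(4 * dim, dim, bias=False)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))


def main():
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    world_size = int(os.environ["WORLD_SIZE"])
    tp_degree = 2                        # assumption: 2-way tensor parallel
    dp_degree = world_size // tp_degree  # remaining ranks shard via FSDP2

    # 2D mesh: outer dim for data-parallel sharding, inner dim for TP.
    mesh = init_device_mesh(
        "cuda", (dp_degree, tp_degree), mesh_dim_names=("dp", "tp")
    )

    model = ToyMLP().cuda()

    # TP: shard w1 column-wise and w2 row-wise across the "tp" sub-mesh.
    parallelize_module(
        model,
        mesh["tp"],
        {"w1": ColwiseParallel(), "w2": RowwiseParallel()},
    )

    # FSDP2: shard the (already TP-sharded) parameters across the "dp" sub-mesh.
    fully_shard(model, mesh=mesh["dp"])

    # torch.compile layers on top of the parallelized module.
    model = torch.compile(model)

    x = torch.randn(8, 1024, device="cuda")
    model(x).sum().backward()


if __name__ == "__main__":
    main()
```

The ordering is the point of composability: TP sharding is applied to submodules first, then FSDP2 shards what remains along the data-parallel dimension, and torch.compile wraps the result. Pipeline and context parallelism would add further mesh dimensions in the same style.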