distributed-llm-pretraining-torchtitan
Orchestra-Research/AI-Research-SKILLs
TorchTitan delivers PyTorch-native distributed LLM pretraining with composable 4D parallelism (FSDP2 sharded data parallelism plus tensor, pipeline, and context parallelism), Float8 training, torch.compile, and distributed checkpointing, so you can scale Llama 3.1, DeepSeek V3, or custom models from 8 to 512+ GPUs.
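For orientation, below is a minimal sketch of the FSDP2 per-block sharding pattern that TorchTitan composes with torch.compile. This is illustrative plain PyTorch, not TorchTitan's actual training loop; it assumes PyTorch >= 2.6 (where `fully_shard` is exposed under `torch.distributed.fsdp`; older releases keep it in a private module), and the `Block`, `ToyTransformer`, and `shard_and_compile` names are hypothetical.

```python
# Sketch of FSDP2 (fully_shard) composed with torch.compile, the data-parallel
# dimension of the 4D scheme described above. Illustrative only; assumes
# PyTorch >= 2.6 and a CUDA/NCCL setup launched via torchrun.
import os

import torch
import torch.nn as nn
from torch.distributed.fsdp import fully_shard


class Block(nn.Module):
    """A toy transformer block standing in for a real model layer."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)
        x = x + attn_out
        return x + self.mlp(x)


class ToyTransformer(nn.Module):
    def __init__(self, n_layers: int = 4, dim: int = 1024):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(n_layers))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = block(x)
        return x


def shard_and_compile(model: ToyTransformer) -> nn.Module:
    # Shard each block individually so parameter all-gathers overlap with
    # compute, then shard the root module, then hand the result to
    # torch.compile. This per-block composition is the FSDP2 idiom.
    for block in model.blocks:
        fully_shard(block)
    fully_shard(model)
    return torch.compile(model)


if __name__ == "__main__":
    # Launch with: torchrun --nproc_per_node=8 fsdp2_sketch.py
    torch.distributed.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    model = shard_and_compile(ToyTransformer().cuda())
    x = torch.randn(2, 128, 1024, device="cuda")
    loss = model(x).float().mean()
    loss.backward()  # FSDP2 reduce-scatters gradients across ranks here

    torch.distributed.destroy_process_group()
```

In TorchTitan itself you do not write this composition by hand: parallelism degrees and features such as Float8 are selected in a TOML config and training is launched through torchrun. This sketch covers only the FSDP2 dimension; tensor, pipeline, and context parallelism layer on top of it.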