distributed-llm-pretraining-torchtitan
Orchestra-Research/AI-Research-SKILLs
Orchestrates TorchTitan's PyTorch-native 4D parallelism (FSDP2, TP, PP, CP) for large-scale LLM pretraining, scaling from 8 to 512+ GPUs, with Float8 training, torch.compile, and distributed checkpointing to pretrain Llama 3.1, DeepSeek V3, or custom models.
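
A minimal sketch of what such a 4D-parallel run might look like, assuming TorchTitan's TOML config format. Section and key names (e.g. `[parallelism]`, `data_parallel_shard_degree`) vary across TorchTitan versions, and the degrees shown (an illustrative 64-GPU layout) are assumptions, not a tested recipe:

```toml
# Hypothetical config sketch for a 4D-parallel TorchTitan run on 64 GPUs.
# Key names follow recent torchtitan train_configs (e.g. llama3_8b.toml);
# verify them against your checkout before use.

[job]
dump_folder = "./outputs"
description = "Llama 3.1 8B, 4D-parallel pretraining sketch"

[model]
name = "llama3"
flavor = "8B"

[training]
local_batch_size = 1
seq_len = 8192
steps = 1000
compile = true                     # torch.compile on the model (key location varies by version)

[parallelism]
data_parallel_shard_degree = 4     # FSDP2 sharding
tensor_parallel_degree = 8         # TP, typically within a node
pipeline_parallel_degree = 2       # PP across nodes
context_parallel_degree = 1        # CP for long sequences; 4 * 8 * 2 * 1 = 64 GPUs

[checkpoint]
enable_checkpoint = true           # distributed checkpointing (DCP)
interval = 500

[float8]
enable_fsdp_float8_all_gather = true  # Float8 all-gather with FSDP2; some versions also
                                      # require enabling the float8 model converter
```

Launching is typically done via the repo's `run_train.sh` wrapper (or `torchrun` directly) with `CONFIG_FILE` pointing at the TOML above; the exact entry point and flags depend on the TorchTitan checkout.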