moe-training
Orchestra-Research/AI-Research-SKILLs
This skill provides comprehensive guidance on training Mixture of Experts (MoE) models, including architectures such as Mixtral and DeepSeek-V3. It details techniques for achieving large-scale model capacity while activating only a subset of parameters per token, significantly reducing compute costs (e.g., roughly 5x lower than a comparably sized dense model). Topics covered include top-k routing, load balancing, expert parallelism, and integration with frameworks like DeepSpeed and HuggingFace for efficient, resource-constrained training.
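
To illustrate the routing and load-balancing ideas mentioned above, here is a minimal sketch of a top-k token router with a Switch-Transformer-style auxiliary load-balancing loss. It is not taken from the skill itself; the class name, shapes, and hyperparameters (hidden_dim, num_experts, top_k) are illustrative assumptions.

```python
# Minimal sketch: top-k routing with an auxiliary load-balancing loss.
# Names and shapes are assumptions for illustration, not the skill's own API.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKRouter(nn.Module):
    """Routes each token to its top-k experts and returns a load-balancing loss."""

    def __init__(self, hidden_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, hidden_dim)
        logits = self.gate(x)                                   # (num_tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)   # (num_tokens, top_k)
        # Renormalize so the selected experts' weights sum to 1 per token.
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        # Auxiliary load-balancing loss: fraction of tokens dispatched to each
        # expert times the mean router probability for that expert, summed and
        # scaled by the number of experts so a uniform distribution gives ~1.
        dispatch = F.one_hot(topk_idx, self.num_experts).float().sum(dim=1)  # (num_tokens, num_experts)
        tokens_per_expert = dispatch.mean(dim=0)
        router_prob_per_expert = probs.mean(dim=0)
        aux_loss = self.num_experts * (tokens_per_expert * router_prob_per_expert).sum()

        return topk_idx, topk_probs, aux_loss


if __name__ == "__main__":
    router = TopKRouter(hidden_dim=512, num_experts=8, top_k=2)
    tokens = torch.randn(16, 512)
    idx, weights, aux_loss = router(tokens)
    # In training, aux_loss is added to the language-modeling loss with a small
    # coefficient (e.g., 0.01) to keep expert utilization balanced.
    print(idx.shape, weights.shape, aux_loss.item())
```

In practice, frameworks such as DeepSpeed-MoE shard the experts across devices (expert parallelism) and dispatch tokens to the devices holding their selected experts; the router logic above stays conceptually the same.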