Trains large language models (2B-462B parameters) using NVIDIA Megatron-Core with advanced parallelism strategies. Use this skill when training models larger than 1B parameters, when you need maximum GPU efficiency (up to 47% MFU on H100), or when you require tensor, pipeline, sequence, context, or expert parallelism. Production-ready framework used for Nemotron, LLaMA, and DeepSeek.
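For context, a minimal sketch of how Megatron-Core's parallel state is typically initialized when a training script is launched under torchrun. The module path and keyword arguments follow recent megatron.core releases and are assumptions here; exact names may differ across versions.

```python
# Minimal sketch: initializing Megatron-Core's multi-dimensional parallel state.
# Assumes the script is launched with torchrun so RANK/WORLD_SIZE env vars are set;
# argument names follow recent megatron.core releases and may vary by version.
import torch

from megatron.core import parallel_state


def init_parallelism() -> None:
    # torch.distributed must be up before Megatron-Core can build its process groups.
    torch.distributed.init_process_group(backend="nccl")
    torch.cuda.set_device(torch.distributed.get_rank() % torch.cuda.device_count())

    # Example layout for a large dense model on 64 GPUs:
    # TP=8 within a node, PP=4 across nodes, remaining ranks become data parallel.
    parallel_state.initialize_model_parallel(
        tensor_model_parallel_size=8,
        pipeline_model_parallel_size=4,
        context_parallel_size=1,        # >1 splits long sequences across GPUs
        expert_model_parallel_size=1,   # >1 shards MoE experts across GPUs
    )


if __name__ == "__main__":
    init_parallelism()
```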
Rating: 7.6
Installs: 0
Category: AI & LLM
Exceptional skill for training large-scale LLMs using Megatron-Core. The description is comprehensive and clearly articulates when to use this skill (>1B parameters, need for GPU efficiency, specific parallelism strategies). Task knowledge is outstanding, with three complete, actionable workflows covering standard training, MoE models, and performance optimization, plus extensive troubleshooting. The structure is clean, with a well-organized SKILL.md providing practical examples and checklists while deferring deep-dive details to reference files. Novelty is extremely high: training 70B-462B parameter models at 47% MFU with complex multi-dimensional parallelism (TP/PP/DP/CP/EP) would require extensive research and many tokens for a CLI agent to figure out independently. The skill provides production-ready configurations, hardware requirements, and real-world optimization strategies that significantly reduce implementation complexity and cost. One minor improvement: add a decision tree for parallelism selection based on model size and GPU count.
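The decision tree the review suggests could look roughly like the heuristic below; the size thresholds and returned parallelism degrees are illustrative assumptions (roughly sized for 80 GB GPUs), not values taken from the skill itself.

```python
# Illustrative sketch of a parallelism-selection heuristic based on model size and
# GPU count. Thresholds are assumptions, not configurations from the skill.
from dataclasses import dataclass


@dataclass
class ParallelPlan:
    tensor: int
    pipeline: int
    data: int


def choose_parallelism(params_b: float, num_gpus: int, gpus_per_node: int = 8) -> ParallelPlan:
    """Pick TP/PP/DP degrees for a dense model of `params_b` billion parameters."""
    # Tensor parallelism: keep it inside a node, scale with model size.
    if params_b <= 13:
        tp = 1
    elif params_b <= 70:
        tp = min(8, gpus_per_node)
    else:
        tp = gpus_per_node

    # Pipeline parallelism: add stages once a TP shard no longer fits on one GPU.
    if params_b <= 13:
        pp = 1
    elif params_b <= 70:
        pp = 4
    else:
        pp = 8

    model_parallel = tp * pp
    if num_gpus % model_parallel != 0:
        raise ValueError(f"num_gpus={num_gpus} is not divisible by TP*PP={model_parallel}")

    # Whatever GPUs remain after model parallelism become data-parallel replicas.
    return ParallelPlan(tensor=tp, pipeline=pp, data=num_gpus // model_parallel)


# Example: a 70B dense model on 256 GPUs -> TP=8, PP=4, DP=8.
print(choose_parallelism(70, 256))
```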