Optimizes transformer attention with Flash Attention for a 2-4x speedup and 10-20x memory reduction. Use when training or running transformers with long sequences (>512 tokens), when attention causes GPU memory issues, or when you need faster inference. Supports PyTorch native SDPA, the flash-attn library, H100 FP8, and sliding window attention.
Rating: 7.6 · Installs: 0 · Category: AI & LLM
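For orientation, here is a minimal sketch of the PyTorch-native SDPA path the description mentions, pinning attention to the Flash Attention backend. It is not the skill's own workflow: the tensor shapes, sequence length, and dtype are placeholder assumptions, and `torch.nn.attention.sdpa_kernel` requires a recent PyTorch release (2.3+).

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Illustrative shapes: (batch, num_heads, seq_len, head_dim).
# fp16 on CUDA keeps the inputs eligible for the Flash backend.
q = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.float16)

# Restrict SDPA to the Flash Attention backend; PyTorch raises an error
# if the inputs are not supported (e.g. unsupported dtype or head_dim).
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```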
Excellent skill for optimizing transformer attention with Flash Attention. The description clearly articulates when and why to use this skill (long sequences, memory constraints, speedup needs). Task knowledge is comprehensive, with three complete workflows covering PyTorch native integration, flash-attn library usage, and H100 FP8 optimization, each with code, benchmarking, and verification steps. The structure is clean, with a quick start, clear workflow checklists, troubleshooting, and references to external files for advanced topics. Novelty is strong: implementing Flash Attention optimization manually would require a deep understanding of GPU memory hierarchy, CUDA kernels, and attention mechanics, whereas this skill packages it into actionable workflows. Minor room for improvement: the skill could explicitly quantify token/cost savings and add a decision tree for choosing between PyTorch native SDPA and the flash-attn library.
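To complement the review's point about choosing between the PyTorch-native and flash-attn paths, the sketch below shows the direct flash-attn library call with sliding window attention. The layout, window size, and dtype are assumptions for illustration, not the skill's prescribed settings, and `window_size` requires flash-attn 2.3 or later.

```python
import torch
from flash_attn import flash_attn_func

# flash-attn expects (batch, seq_len, num_heads, head_dim) in fp16/bf16.
q = torch.randn(2, 4096, 8, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(2, 4096, 8, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(2, 4096, 8, 64, device="cuda", dtype=torch.bfloat16)

# Causal attention restricted to a 1024-token sliding window;
# window_size=(-1, -1) would mean unrestricted (full) attention.
out = flash_attn_func(q, k, v, causal=True, window_size=(1024, 0))
```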