Quantizes LLMs to 8-bit or 4-bit for a 50-75% memory reduction with minimal accuracy loss. Use it when GPU memory is limited, when you need to fit larger models, or when you want faster inference. Supports INT8, NF4, and FP4 formats, QLoRA training, and 8-bit optimizers. Works with HuggingFace Transformers.
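As a rough sketch of the inference-quantization path the description mentions (the model id and prompt below are placeholders, not taken from the skill itself), 4-bit NF4 loading with Transformers looks roughly like this:

```python
# Minimal, hedged sketch of 4-bit NF4 loading via HuggingFace Transformers.
# The model id is an assumption — substitute whatever checkpoint you actually use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder causal-LM checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit on load
    bnb_4bit_quant_type="nf4",              # NF4 format (alternative: "fp4")
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls at runtime
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                      # let accelerate place layers across devices
)

inputs = tokenizer("Quantization trades memory for", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```

Swapping `load_in_4bit=True` for `load_in_8bit=True` (and dropping the 4-bit-specific fields) selects the INT8 path instead.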
Rating: 8.7
Installs: 0
Category: AI & LLM
Excellent skill for LLM quantization with bitsandbytes. The description clearly conveys when and why to use this skill (GPU memory constraints, model size fitting). Task knowledge is comprehensive with three complete workflows covering inference quantization, QLoRA fine-tuning, and 8-bit optimizers, all with concrete code examples, memory calculations, and troubleshooting. Structure is very clean with a quick start, workflow checklists, comparison tables, and advanced topics properly delegated to reference files. Novelty is strong since quantization configuration involves many interdependent parameters (quant_type, compute_dtype, double_quant, device_map) that would require extensive trial-and-error for a CLI agent, and this skill consolidates best practices efficiently. Minor room for improvement: could add a decision tree for choosing between 4-bit/8-bit/alternatives, and more explicit examples of accuracy measurement post-quantization. Overall, this is a highly practical, well-documented skill that significantly reduces complexity and token cost for a common AI/LLM workflow.
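Of the three workflows the review credits, the 8-bit optimizer one is the easiest to show in isolation. A hedged, self-contained sketch with bitsandbytes (the linear layer and learning rate are stand-ins, not drawn from the skill):

```python
# Hedged sketch: bitsandbytes 8-bit AdamW, which keeps optimizer state in 8-bit
# to cut optimizer memory roughly 4x versus full-precision Adam state.
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4096, 4096).cuda()                    # stand-in for a real LLM
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)  # drop-in AdamW replacement

x = torch.randn(8, 4096, device="cuda")
loss = model(x).pow(2).mean()                                 # dummy loss for illustration
loss.backward()
optimizer.step()
optimizer.zero_grad()
```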
