Post-training 4-bit quantization for LLMs with minimal accuracy loss. Use when deploying large models (70B, 405B) on consumer GPUs, when you need 4× memory reduction with <2% perplexity degradation, or when you want 3-4× faster inference than FP16. Integrates with transformers and PEFT for QLoRA fine-tuning.
Rating: 8.7 · Installs: 0 · Category: AI & LLM
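A minimal sketch of the description's headline workflow — loading a pre-quantized GPTQ checkpoint through transformers. It assumes transformers with a GPTQ backend (auto-gptq or gptqmodel, plus optimum) installed; the checkpoint id and prompt are illustrative, not prescribed by the skill.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example pre-quantized GPTQ checkpoint (an assumption; substitute your own)
model_id = "TheBloke/Llama-2-70B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# GPTQ quantization is detected from the checkpoint's config;
# device_map="auto" spreads layers across the available GPUs
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Explain GPTQ in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```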
Exceptional skill for GPTQ quantization with comprehensive, production-ready guidance. The description clearly specifies use cases (deploying 70B/405B models on consumer GPUs, 4× memory reduction, 3-4× speedup), enabling precise invocation. Task knowledge is outstanding with complete code examples for loading pre-quantized models, custom quantization, QLoRA fine-tuning, multi-GPU deployment, and batch inference. Structure is excellent: concise SKILL.md with clear overview and decision trees, with detailed topics properly delegated to reference files. The skill provides significant value by consolidating complex quantization workflows, kernel selection (ExLlamaV2, Marlin, Triton), configuration trade-offs, and integration patterns that would otherwise require extensive documentation searches and experimentation. Benchmark data and pre-quantized model sources add immediate practical utility. Minor improvement possible: could explicitly mention compute requirements for quantization itself.
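For the custom-quantization path the review mentions, here is a hedged sketch using transformers' GPTQConfig; the base model id, calibration dataset, and output path are assumptions for illustration. Note that calibration itself consumes GPU memory and time, which is the reviewer's point about compute requirements.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

base_model = "meta-llama/Llama-2-7b-hf"  # example base model (assumption)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# 4-bit weights calibrated on C4; group_size trades accuracy against footprint
quant_config = GPTQConfig(bits=4, group_size=128, dataset="c4", tokenizer=tokenizer)

# Quantization runs during loading and needs GPU memory for calibration
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quant_config,
    device_map="auto",
)

model.save_pretrained("llama-2-7b-gptq-4bit")
tokenizer.save_pretrained("llama-2-7b-gptq-4bit")
```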
