Post-training 4-bit quantization for LLMs with minimal accuracy loss. Use it to deploy large models (70B, 405B) on consumer GPUs, when you need 4× memory reduction with <2% perplexity degradation, or when you want faster inference (3–4× speedup over FP16). Integrates with transformers and PEFT for QLoRA fine-tuning.
Rating: 7.6
Installs: 0
Category: AI & LLM
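As a rough sketch of the transformers integration the description mentions: the snippet below quantizes a causal LM to 4-bit GPTQ at load time. The model id, calibration dataset, and group size are placeholder assumptions for illustration, not settings taken from the skill itself (the GPTQ backend additionally requires the optimum and auto-gptq packages).

```python
# Minimal sketch (placeholders, not the skill's own config): quantize a model
# to 4-bit GPTQ through the transformers integration.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit weights, group size 128 (a common GPTQ default), calibrated on C4.
quant_config = GPTQConfig(bits=4, group_size=128, dataset="c4", tokenizer=tokenizer)

# Quantization runs during from_pretrained; device_map="auto" spreads layers
# across the available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quant_config,
)

# Persist the quantized checkpoint for later inference-only loading.
model.save_pretrained("llama-2-7b-gptq-4bit")
tokenizer.save_pretrained("llama-2-7b-gptq-4bit")
```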
Excellent skill for GPTQ quantization with comprehensive coverage. The description clearly articulates when to use GPTQ versus alternatives (AWQ, bitsandbytes), with specific criteria. SKILL.md provides extensive task knowledge, including installation, quantization configs, kernel backends, multi-GPU deployment, and QLoRA integration, with complete working code examples. The structure is well organized, with clear sections and references to separate files for calibration, integration, and troubleshooting.

The skill is also highly novel: implementing 4-bit quantization manually would require deep expertise in Hessian-based optimization, group-wise quantization math, and CUDA kernel integration, easily consuming thousands of tokens for a CLI agent to discover and implement correctly. A minor improvement would be making the trade-off decision trees even more explicit, but overall this is a production-ready, high-value skill that enables deployment of 70B+ models on consumer hardware.
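To illustrate the kind of QLoRA integration the review credits the skill with, here is a hedged sketch of attaching LoRA adapters to an already-quantized GPTQ checkpoint via PEFT. The checkpoint id, LoRA rank, and target modules are illustrative assumptions, not values from the skill.

```python
# Hedged sketch: LoRA fine-tuning on top of a frozen 4-bit GPTQ base model.
# All names and hyperparameters below are placeholders for illustration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load a pre-quantized GPTQ checkpoint (placeholder id).
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",  # placeholder
    device_map="auto",
)

# Enable gradient checkpointing and cast norm layers for stable k-bit training.
model = prepare_model_for_kbit_training(model)

# Train small LoRA adapters while the quantized base weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # a common choice for Llama-style models
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```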