Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.
Rating: 8.1
Installs: 0
Category: AI & LLM
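As a minimal sketch of the kind of inference this skill sets up, here is a hedged example using the llama-cpp-python bindings; the model path and prompt are illustrative and assume a GGUF file has already been downloaded:

```python
# Minimal CPU/GPU inference with llama-cpp-python (pip install llama-cpp-python).
# The model path below is hypothetical; any GGUF-quantized model works.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",  # illustrative local file
    n_ctx=4096,       # context window size
    n_gpu_layers=-1,  # offload all layers if the wheel was built with Metal/CUDA/ROCm
)

out = llm("Q: What is GGUF quantization? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```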
Excellent skill for CPU and non-NVIDIA LLM inference. The description clearly states when to use llama.cpp versus alternatives (TensorRT-LLM, vLLM), making it easy for a CLI agent to decide whether to apply it. Task knowledge is comprehensive, with concrete installation, quantization, and hardware-acceleration commands. Structure is good, with a well-organized main file and logical separation of reference material. Novelty is solid: while a CLI agent could eventually work out llama.cpp setup on its own, this skill consolidates platform-specific commands (Metal/CUDA/ROCm), quantization trade-offs, and performance benchmarks that would otherwise take significant exploration and token usage to discover. A minor deduction on novelty: basic llama.cpp usage is relatively straightforward once discovered.
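To illustrate the platform-specific offload decision the review credits to the skill, a hedged sketch follows. The backend install commands in the comments reflect llama-cpp-python's documented CMAKE_ARGS build flags, but exact flag names vary by version; the helper function and paths are hypothetical:

```python
# Sketch: pick a GPU offload setting per platform, mirroring the skill's
# Metal/CUDA/ROCm guidance. Assumes llama-cpp-python was installed with a
# matching backend, e.g. (flags are version-dependent):
#   CMAKE_ARGS="-DGGML_METAL=on"   pip install llama-cpp-python  # Apple Silicon
#   CMAKE_ARGS="-DGGML_CUDA=on"    pip install llama-cpp-python  # NVIDIA
#   CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python  # AMD ROCm
import platform
from llama_cpp import Llama

def load_model(path: str) -> Llama:
    # On Apple Silicon, offloading all layers to Metal is nearly always a win;
    # elsewhere, offload only if an accelerated wheel is actually installed.
    on_apple_silicon = platform.system() == "Darwin" and platform.machine() == "arm64"
    return Llama(model_path=path, n_ctx=2048,
                 n_gpu_layers=-1 if on_apple_silicon else 0)

llm = load_model("./models/mistral-7b-instruct.Q5_K_M.gguf")  # hypothetical path
```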