Optimizes LLM inference with NVIDIA TensorRT-LLM for maximum throughput and minimal latency. Use it for production deployment on NVIDIA GPUs (A100/H100), when you need inference far faster than baseline PyTorch (NVIDIA cites speedups in the 10-100x range in favorable cases), or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
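As a minimal sketch of what basic generation looks like, assuming the high-level `LLM` API shipped in recent TensorRT-LLM releases (the model ID and sampling settings below are placeholder assumptions, not part of the skill):

```python
# Minimal sketch: single-GPU generation via TensorRT-LLM's high-level LLM API.
# Assumes `pip install tensorrt-llm` on a machine with a supported NVIDIA GPU;
# the model ID is a placeholder fetched from Hugging Face.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # builds/loads the engine

params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
outputs = llm.generate(["Explain in-flight batching in one sentence."], params)

for out in outputs:
    print(out.outputs[0].text)
```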
Rating: 8.1
Installs: 0
Category: AI & LLM
Excellent skill for TensorRT-LLM optimization, with a clear description and comprehensive task knowledge covering installation, inference patterns, and serving examples. Well structured, with a concise main file and referenced guides for advanced topics. The description accurately captures when to use this versus alternatives (vLLM, llama.cpp), and the skill covers the key features (quantization, multi-GPU, batching) with working code examples. The novelty score is moderate: while TensorRT-LLM setup requires expertise, many deployment scenarios can be handled by simpler tools or direct CLI usage, so the skill adds value primarily for the complex multi-GPU and quantization workflows that would otherwise require extensive documentation review (see the sketch below).
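For a sense of the multi-GPU and quantization path the review calls out, here is a hedged sketch using the `QuantConfig`/`QuantAlgo` and `tensor_parallel_size` names from recent TensorRT-LLM `LLM` API examples; exact names may differ across versions, and the model ID is again a placeholder:

```python
# Sketch: FP8 quantization plus tensor parallelism via the LLM API.
# Treat as illustrative under the stated version assumptions, not a pinned recipe.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import QuantAlgo, QuantConfig

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",           # placeholder model ID
    tensor_parallel_size=4,                              # shard across 4 GPUs
    quant_config=QuantConfig(quant_algo=QuantAlgo.FP8),  # FP8 quantization
)

outputs = llm.generate(
    ["Summarize the benefits of FP8 quantization."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```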