Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.
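For context, a minimal sketch of the kind of invocation the underlying lm-evaluation-harness CLI documents for a HuggingFace-backed model; the model name, task list, and output path below are illustrative placeholders, and flags should be checked against the installed version:

    # Evaluate a HuggingFace model on two benchmarks and write JSON results
    lm_eval --model hf \
        --model_args pretrained=EleutherAI/pythia-1.4b \
        --tasks hellaswag,gsm8k \
        --device cuda:0 \
        --batch_size 8 \
        --output_path results/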
Rating: 7.6
Installs: 0
Category: AI & LLM
Excellent skill documentation for LLM evaluation with lm-evaluation-harness. The description clearly articulates when to use the skill (benchmarking, model comparison, tracking training progress) and names specific benchmarks that help CLI agents understand the context. The task knowledge is comprehensive, providing complete command-line examples for all major workflows: standard evaluation, training-progress tracking, model comparison, and vLLM optimization. The structure is well organized, with numbered workflows, clear checklists, and proper separation of concerns (advanced topics referenced in separate files). The skill shows high novelty, since running 60+ academic benchmarks with correct prompts and metrics would otherwise demand significant token usage and domain expertise from a CLI agent. Minor room for improvement: the description could state the output format (JSON results) more explicitly to set invocation expectations, and a few additional troubleshooting edge cases would improve completeness. Overall, this is a production-ready skill that meaningfully reduces the complexity and cost of LLM evaluation tasks.
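As an illustration of the vLLM-optimized workflow the review mentions, here is a hedged sketch using the harness's vllm backend; the model name and parallelism settings are placeholders rather than values taken from the skill itself:

    # Faster evaluation via the vLLM backend; results are written as JSON
    lm_eval --model vllm \
        --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,tensor_parallel_size=1,gpu_memory_utilization=0.8 \
        --tasks mmlu \
        --batch_size auto \
        --output_path results/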