nemo-evaluator-sdk

7.6

by zechenzhangAGI

94 Favorites · 170 Upvotes · 0 Downvotes

Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when you need scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform, with a container-first architecture for reproducible benchmarking.
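
To make the multi-backend idea concrete, here is a minimal sketch of what a containerized evaluation launch could look like. Every name in it (EvalRun, launch, the backend strings) is a hypothetical placeholder for illustration, not the actual nemo-evaluator-sdk API; consult the GitHub repository for the real interface.

```python
# Hypothetical sketch of a multi-backend evaluation launch.
# EvalRun and launch() are illustrative placeholders, NOT the
# confirmed nemo-evaluator-sdk API.
from dataclasses import dataclass, field


@dataclass
class EvalRun:
    """Placeholder for a containerized evaluation job."""
    model_endpoint: str                 # OpenAI-compatible endpoint to score
    benchmarks: list[str] = field(default_factory=lambda: ["mmlu", "gsm8k"])
    backend: str = "docker"             # "docker" | "slurm" | "cloud"

    def launch(self) -> None:
        # A real launcher would pull the per-harness container image,
        # mount result volumes, and stream metrics back; this stub only
        # shows the shape of the call.
        print(f"[{self.backend}] evaluating {self.model_endpoint} "
              f"on {', '.join(self.benchmarks)}")


if __name__ == "__main__":
    EvalRun(
        model_endpoint="http://localhost:8000/v1",
        benchmarks=["mmlu", "humaneval", "gsm8k"],
        backend="docker",
    ).launch()
```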

Tag: evaluation

Rating: 7.6
Installs: 0
Category: AI & LLM

Quick Review

Excellent skill with comprehensive workflows for enterprise LLM evaluation. The description clearly covers multi-backend execution across 100+ benchmarks, and a CLI agent can confidently invoke this skill for evaluation tasks. Task knowledge is outstanding with 4 detailed workflows covering standard benchmarks, HPC deployment, model comparison, and safety/VLM evaluation—complete with config examples, CLI commands, and Python API usage. Structure is very clean with a concise main document and references for advanced topics. Novelty is high: orchestrating containerized evaluation across multiple backends (Docker/Slurm/cloud) with 18+ harnesses would require significant tokens and expertise for a CLI agent to accomplish independently. Minor improvement possible: could slightly expand the description to mention safety/VLM capabilities explicitly for better discoverability.
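
The model-comparison workflow the review mentions can be pictured as running one benchmark suite against two endpoints and diffing the scores. The sketch below assumes only that shape; score_model is a stand-in placeholder rather than a nemo-evaluator-sdk function, and the numbers it produces are synthetic.

```python
# Hypothetical sketch of a model-comparison run: same benchmark
# suite against two endpoints, then a per-benchmark score delta.
# score_model() is a placeholder, NOT a nemo-evaluator-sdk function.
import random

BENCHMARKS = ["mmlu", "humaneval", "gsm8k"]


def score_model(endpoint: str, benchmark: str) -> float:
    """Stand-in scorer: a real run would launch the harness container
    against `endpoint` and parse its reported metrics."""
    rng = random.Random(f"{endpoint}:{benchmark}")  # deterministic fake score
    return round(rng.uniform(0.3, 0.9), 3)


def compare(endpoint_a: str, endpoint_b: str) -> None:
    print(f"{'benchmark':<14}{'A':>8}{'B':>8}{'delta':>8}")
    for bench in BENCHMARKS:
        a = score_model(endpoint_a, bench)
        b = score_model(endpoint_b, bench)
        print(f"{bench:<14}{a:>8.3f}{b:>8.3f}{b - a:>+8.3f}")


if __name__ == "__main__":
    compare("http://localhost:8000/v1", "http://localhost:8001/v1")
```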

LLM Signals

Description coverage: 9
Task knowledge: 10
Structure: 9
Novelty: 9

GitHub Signals

891 · 74 · 19 · 2 · Last commit: 0 days ago

Publisher

zechenzhangAGI (Skill Author)

Related Skills

rag-architect · by Jeffallan · 7.0
prompt-engineer · by Jeffallan · 7.0
fine-tuning-expert · by Jeffallan · 6.4
mcp-developer · by Jeffallan · 6.4