GPU-accelerated data curation for LLM training. Supports text/image/video/audio. Features fuzzy deduplication (16× faster), quality filtering (30+ heuristics), semantic deduplication, PII redaction, NSFW detection. Scales across GPUs with RAPIDS. Use for preparing high-quality training datasets, cleaning web data, or deduplicating large corpora.
Rating: 7.6
Installs: 0
Category: AI & LLM
Excellent skill documentation for GPU-accelerated LLM data curation. The description is comprehensive and actionable for CLI agents, clearly specifying use cases (web scrapes, deduplication, multi-modal datasets) and when to use alternatives. Task knowledge is outstanding, with complete code examples for all major operations (quality filtering, deduplication variants, PII redaction, multi-modal processing), performance benchmarks, and cost analysis. The structure is well organized, with clear sections, a quick start, common patterns, and references to external guides. The skill addresses a genuinely complex domain (GPU-accelerated data curation at TB scale) that would require significant token usage and expertise for a CLI agent to replicate, making it highly novel. The 16× speedup claims, concrete benchmarks, and production use cases (Nemotron-4) establish strong value. One minor improvement would be a more explicit function/module index, given the breadth of capabilities, although the current logical flow is already very clear.