Provides guidance for training and analyzing Sparse Autoencoders (SAEs) using SAELens to decompose neural network activations into interpretable features. Use when discovering features, analyzing superposition, or studying monosemantic representations in language models.
Rating: 7.6 · Installs: 0 · Category: Machine Learning
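To illustrate the pre-trained-SAE workflow described above, here is a minimal sketch using SAELens and TransformerLens. The release name `gpt2-small-res-jb`, the hook point `blocks.8.hook_resid_pre`, and the three-value return of `SAE.from_pretrained` follow common SAELens tutorial usage and may differ across library versions; treat them as assumptions to verify against the installed release.

```python
# Minimal sketch: load a pre-trained SAE and inspect feature activations.
# Assumes the SAELens SAE.from_pretrained interface and TransformerLens for
# activations; release/sae_id names and the tuple return are taken from
# public SAELens tutorial usage and may vary by version.
import torch
from sae_lens import SAE
from transformer_lens import HookedTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load GPT-2 small and a residual-stream SAE trained on layer 8.
model = HookedTransformer.from_pretrained("gpt2", device=device)
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",          # assumed registry release name
    sae_id="blocks.8.hook_resid_pre",     # assumed hook point / SAE id
    device=device,
)

# Run the model and cache activations at the SAE's hook point.
tokens = model.to_tokens("The Golden Gate Bridge spans the bay.")
_, cache = model.run_with_cache(tokens)
acts = cache[sae.cfg.hook_name]           # shape: [batch, seq, d_model]

# Encode into the sparse feature space and check which features fire.
feature_acts = sae.encode(acts)
recon = sae.decode(feature_acts)
top_vals, top_idx = feature_acts[0, -1].topk(5)
print("top features at final token:", top_idx.tolist(), top_vals.tolist())
print("reconstruction MSE:", (recon - acts).pow(2).mean().item())
```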
Excellent skill for sparse autoencoder training and analysis. The description clearly identifies when to use SAELens (feature discovery, superposition analysis, interpretability), enabling proper invocation. Task knowledge is comprehensive with three complete workflows (loading pre-trained SAEs, training custom SAEs, feature analysis/steering), detailed code examples, hyperparameter guidance, evaluation metrics, and troubleshooting. Structure is logical with clear workflow separation, summary tables, and references to supplementary files for API details. The skill addresses a genuinely novel/complex task—training and analyzing sparse autoencoders for mechanistic interpretability—that would require extensive tokens and domain expertise for a CLI agent to accomplish independently. The skill meaningfully reduces cost by packaging specialized knowledge about Anthropic's monosemanticity research, SAELens library nuances, hyperparameter tuning, and integration patterns. Minor opportunity: could slightly expand the description to mention specific use cases like 'safety analysis' or 'model steering' for even clearer invocation guidance.
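The review above mentions a feature analysis/steering workflow. A rough sketch of one common steering pattern (adding a scaled SAE decoder direction into the residual stream via a TransformerLens hook) follows; it reuses `model`, `tokens`, and `sae` from the previous sketch, and the feature index and coefficient are hypothetical placeholders rather than values taken from the skill.

```python
# Sketch of feature steering: push the residual stream along one SAE feature
# direction during a forward pass. Continues from the loading sketch above.
from functools import partial

STEER_FEATURE = 1234       # hypothetical feature index
STEER_COEFF = 8.0          # hypothetical steering strength

def steer_hook(resid, hook, sae, feature_idx, coeff):
    # Rows of W_dec are feature directions in model space: [d_sae, d_model].
    direction = sae.W_dec[feature_idx]
    return resid + coeff * direction

logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(
        sae.cfg.hook_name,
        partial(steer_hook, sae=sae, feature_idx=STEER_FEATURE, coeff=STEER_COEFF),
    )],
)
next_token = logits[0, -1].argmax().item()
print("steered next-token prediction:", model.tokenizer.decode(next_token))
```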