kitft/natural_language_autoencoders
Natural Language Autoencoders (NLA)
Open-source library accompanying the Anthropic Transformer Circuits post Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations.
Blog post · Video walkthrough · Try the released NLAs on Neuronpedia
A Natural Language Autoencoder is a pair of fine-tuned LMs that map residual-stream activation vectors to natural language and back:
| | direction | mechanism |
|---|---|---|
| AV (activation verbalizer) | vector → text | inject the vector as a single token embedding into a fixed prompt, autoregress a description |
| AR (activation reconstructor) | text → vector | truncated K+1-layer LM + Linear(d, d) head, extract at the final token |
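The AV injection step in the table can be sketched as follows. This is an illustrative toy, not the repo's implementation: the shapes, the placeholder position, and the omitted model call are all assumptions.

```python
import numpy as np

# Sketch of AV injection: the activation vector replaces the embedding at a
# designated placeholder position in a fixed prompt, and the LM then decodes
# a description conditioned on it.
d_model = 3584                            # e.g. Qwen2.5-7B-Instruct
prompt_embeds = np.zeros((12, d_model))   # token embeddings of the fixed prompt (toy values)
activation = np.ones(d_model)             # residual-stream vector to explain (toy values)

inject_pos = 5                            # hypothetical placeholder token position
prompt_embeds[inject_pos] = activation    # inject as a single token embedding
# generate(inputs_embeds=prompt_embeds) -> description (model call omitted)
```

The key design point is that the vector enters through the embedding sequence rather than through text, so the AV sees the raw direction directly.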
Both vectors are L2-normalised before comparison, so the round-trip
MSE(reconstructed, original) = 2(1 − cos) measures direction agreement only.
Low MSE means the AR could recover the original direction from the AV's words
alone, which implies the explanation captures the information in the vector.
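The identity above follows from expanding the squared distance of unit vectors: ||u − v||² = ||u||² + ||v||² − 2u·v = 2(1 − cos). A quick numeric check (dimensions chosen to match Qwen2.5-7B, purely illustrative):

```python
import numpy as np

# For L2-normalised u, v: squared L2 distance == 2 * (1 - cosine similarity),
# so the metric depends only on direction, not magnitude.
rng = np.random.default_rng(0)
original = rng.standard_normal(3584)
reconstructed = rng.standard_normal(3584)

u = original / np.linalg.norm(original)
v = reconstructed / np.linalg.norm(reconstructed)

sq_dist = float(np.sum((u - v) ** 2))
cos = float(u @ v)
assert abs(sq_dist - 2 * (1 - cos)) < 1e-9
```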
This is the full training repo: data generation, SFT, GRPO RL, and
checkpoint conversion. For a lightweight inference-only package (just
NLAClient + NLACritic, no training deps), see
kitft/nla-inference.
A note on naming. Public-facing names are AV / AR. Inside the
nla/ package you will see actor / critic: those are the same two models, named to map directly onto Miles' RL primitives (the AV is the policy actor; the AR is the value critic). The codebase keeps actor/critic so the Miles extension points read naturally; everywhere user-facing we use AV/AR.
Released checkpoints
All eight checkpoints are gathered in the
kitft/nla-models collection
on the HF Hub: four base-model families, each with an AV and an AR. We extract
from a layer roughly two-thirds of the way through the model in each case,
deep enough that the residual stream carries rich semantic content yet shallow
enough that it hasn't collapsed toward the unembedding.
| base model | layer | d_model | AV | AR |
|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 20 / 28 | 3584 | kitft/nla-qwen2.5-7b-L20-av | kitft/nla-qwen2.5-7b-L20-ar |
| Gemma-3-12B-IT | 32 / 48 | 3840 | kitft/nla-gemma3-12b-L32-av | kitft/nla-gemma3-12b-L32-ar |
| Gemma-3-27B-IT | 41 / 62 | 5376 | kitft/nla-gemma3-27b-L41-av | kitft/nla-gemma3-27b-L41-ar |
| Llama-3.3-70B-Instruct | 53 / 80 | 8192 | kitft/Llama-3.3-70B-NLA-L53-av | kitft/Llama-3.3-70B-NLA-L53-ar |
Each checkpoint ships an nla_meta.yaml sidecar with the prompt template,
injection token IDs, and scale factors that the model was trained with; load
those, never hardcode them.
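Loading the sidecar might look like the sketch below. The field names (`prompt_template`, `injection_token_ids`, `scale_factor`) and values are illustrative assumptions, not the released schema; read whatever keys the shipped file actually contains.

```python
import yaml

# Hypothetical sidecar contents, inlined for illustration; in practice this
# comes from the nla_meta.yaml file next to the checkpoint.
example_sidecar = """
prompt_template: "What does <ACT> mean?"
injection_token_ids: [151665]
scale_factor: 7.5
"""

meta = yaml.safe_load(example_sidecar)
prompt_template = meta["prompt_template"]         # fixed prompt the AV was trained with
injection_token_ids = meta["injection_token_ids"] # where to inject the activation
scale_factor = meta["scale_factor"]               # scaling applied before injection
```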
How it fits together
NLA training is built as a thin extension on top of two open-source projects:
- Miles: Ray-orchestrated RL training (FSDP2 / Megatron backends, GRPO, async rollout). We used the FSDP backend