
Natural Language Autoencoders (NLA)

Open-source library accompanying the Anthropic Transformer Circuits post Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations.

📄 Blog post · ▶ Video walkthrough · 🔬 Try the released NLAs on Neuronpedia

A Natural Language Autoencoder is a pair of fine-tuned LMs that map residual-stream activation vectors to natural language and back:

| | direction | mechanism |
|---|---|---|
| AV (activation verbalizer) | vector → text | inject the vector as a single token embedding into a fixed prompt, autoregress a description |
| AR (activation reconstructor) | text → vector | truncated K+1-layer LM + Linear(d, d) head, extract at the final token |
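
As a sketch of the AV's injection step (the function and argument names here are illustrative, not the repo's API; the real placeholder position and scaling come from the nla_meta.yaml sidecar described below):

```python
import torch

def build_av_inputs(prompt_embeds: torch.Tensor,
                    activation: torch.Tensor,
                    inject_pos: int) -> torch.Tensor:
    """Splice an activation vector into the AV's fixed prompt.

    prompt_embeds: (seq_len, d_model) token embeddings of the fixed prompt
    activation:    (d_model,) residual-stream vector to verbalize
    inject_pos:    index of the placeholder token to overwrite (hypothetical;
                   read the real value from the checkpoint's sidecar)
    """
    embeds = prompt_embeds.clone()
    embeds[inject_pos] = activation  # the vector enters as a single "token"
    return embeds  # pass via inputs_embeds=..., then sample a description
```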

Both vectors are L2-normalised before comparison, so the round-trip MSE(reconstructed, original) = 2(1 − cos) measures direction agreement only. Low MSE means the AR could recover the original direction from the AV's words alone, which implies the explanation captures the information in the vector.
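
For unit vectors this is just the expansion ‖a − b‖² = 2 − 2⟨a, b⟩, easy to verify numerically (standalone sketch, not repo code):

```python
import torch

a = torch.randn(3584, dtype=torch.float64)  # 3584 = d_model of the Qwen checkpoints
b = torch.randn(3584, dtype=torch.float64)
a, b = a / a.norm(), b / b.norm()           # L2-normalize both vectors

sq_err = ((a - b) ** 2).sum()            # squared error between unit vectors
identity = 2 * (1 - torch.dot(a, b))     # 2(1 - cos), since |a| = |b| = 1
assert torch.isclose(sq_err, identity)
```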

This is the full training repo: data generation, SFT, GRPO RL, and checkpoint conversion. For a lightweight inference-only package (just NLAClient + NLACritic, no training deps), see kitft/nla-inference.
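
A rough usage sketch of that package; the import path and method names below are assumptions for illustration, not the package's documented API:

```python
import torch
from nla_inference import NLAClient, NLACritic  # module path assumed

av = NLAClient("kitft/nla-qwen2.5-7b-L20-av")   # AV: vector -> text
ar = NLACritic("kitft/nla-qwen2.5-7b-L20-ar")   # AR: text -> vector

activation = torch.randn(3584)                  # stand-in residual-stream vector
description = av.describe(activation)           # hypothetical method name
reconstruction = ar.reconstruct(description)    # hypothetical method name
```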

A note on naming. Public-facing names are AV / AR. Inside the nla/ package you will see actor / critic; those are the same two models, named to map directly onto Miles' RL primitives (the AV is the policy actor; the AR is the value critic). The codebase keeps actor/critic so the Miles extension points read naturally; everywhere user-facing we use AV/AR.

Released checkpoints

All eight checkpoints are gathered in the kitft/nla-models collection on the HF Hub: four base-model families, each with an AV and an AR. In each case we extract from a layer roughly two-thirds of the way through the model, deep enough that the residual stream carries rich semantic content but shallow enough that it hasn't yet collapsed toward the unembedding.

| base model | layer | d_model | AV | AR |
|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 20 / 28 | 3584 | kitft/nla-qwen2.5-7b-L20-av | kitft/nla-qwen2.5-7b-L20-ar |
| Gemma-3-12B-IT | 32 / 48 | 3840 | kitft/nla-gemma3-12b-L32-av | kitft/nla-gemma3-12b-L32-ar |
| Gemma-3-27B-IT | 41 / 62 | 5376 | kitft/nla-gemma3-27b-L41-av | kitft/nla-gemma3-27b-L41-ar |
| Llama-3.3-70B-Instruct | 53 / 80 | 8192 | kitft/Llama-3.3-70B-NLA-L53-av | kitft/Llama-3.3-70B-NLA-L53-ar |
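
The checkpoints are standard Hub repos, so a pair can be fetched with huggingface_hub, e.g.:

```python
from huggingface_hub import snapshot_download

# Download one AV/AR pair from the collection into the local HF cache.
av_dir = snapshot_download("kitft/nla-qwen2.5-7b-L20-av")
ar_dir = snapshot_download("kitft/nla-qwen2.5-7b-L20-ar")
```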

Each checkpoint ships an nla_meta.yaml sidecar with the prompt template, injection token IDs, and scale factors that the model was trained with. Load those; never hardcode them.
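
For example, reading the sidecar for one checkpoint (the key names below are assumptions; inspect the shipped file for the actual schema):

```python
import yaml  # PyYAML
from huggingface_hub import hf_hub_download

# Fetch just the sidecar and read the training-time settings from it.
path = hf_hub_download("kitft/nla-qwen2.5-7b-L20-av", "nla_meta.yaml")
with open(path) as f:
    meta = yaml.safe_load(f)

prompt_template = meta["prompt_template"]          # assumed key name
injection_token_ids = meta["injection_token_ids"]  # assumed key name
scale_factors = meta["scale_factors"]              # assumed key name
```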

How it fits together

NLA training is built as a thin extension on top of two open-source projects:

  • Miles: Ray-orchestrated RL training (FSDP2 / Megatron backends, GRPO, async rollout). We used the FSDP backend.