nanowhale 🐳

A ~110M parameter language model trained from scratch using the DeepSeek-V4 architecture. This repo contains all the code, configs, and tokenizer used to pretrain and fine-tune the model.

Models

| Model | Description | Link |
|---|---|---|
| nanowhale-100m-base | Pretrained base model (5K steps on FineWeb-Edu) | 🤗 Hub |
| nanowhale-100m | SFT chat model (3K steps on SmolTalk) | 🤗 Hub |
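
Both checkpoints should load with standard transformers tooling. A minimal sketch for the base model; the Hub repo id below and the need for trust_remote_code (to pick up the custom modeling file) are assumptions, not documented values:

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "huggingface/nanowhale-100m-base"  # hypothetical Hub id for the base checkpoint
tok = AutoTokenizer.from_pretrained(repo_id)
# trust_remote_code assumes the custom DeepSeek-V4 modeling code ships with the checkpoint
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

ids = tok("The blue whale is", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=32)[0], skip_special_tokens=True))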

Architecture

The model implements the full DeepSeek-V4 feature set at miniature scale:

  • Multi-Head Latent Attention (MLA) — 8 heads, 1 KV head (MQA), head_dim=96 (32 RoPE + 64 NoPE), q_lora_rank=160
  • Mixture-of-Experts (MoE) — 4 routed + 1 shared expert, top-2 routing, SwiGLU FFN (dim 640)
  • Hyper-Connections — hc_mult=4, Sinkhorn routing (2 iterations)
  • Multi-Token Prediction (MTP) — 1 next-token prediction layer

| Parameter | Value |
|---|---|
| Total params | ~110M (41M embeddings + 69M non-embedding) |
| Hidden size | 320 |
| Layers | 8 |
| Vocab size | 129,280 (DeepSeek-V4 tokenizer) |
| Context length | 2,048 tokens |
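
A quick sanity check on the embedding share of that total; the tied input/output embedding assumption is mine, not stated above:

# Rough parameter accounting from the table above.
vocab_size, hidden_size = 129_280, 320
embedding_params = vocab_size * hidden_size   # ~41.4M, assuming tied input/output embeddings
non_embedding_params = 69_000_000             # from the table
print(f"total ~= {(embedding_params + non_embedding_params) / 1e6:.0f}M")  # ~110M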

Repo Structure

├── modeling_deepseek_v4.py         # DeepSeek-V4 model implementation
├── configuration_deepseek_v4.py    # Model config class
├── requirements.txt
├── configs/
│   ├── main_100m.yaml              # Training hyperparameters (100M model)
│   ├── debug.yaml                  # Quick debug config (50 steps)
│   └── fallback_under_1b.yaml      # Alternative config
├── scripts/
│   ├── train_pretrain.py           # Pretraining (SFTTrainer on FineWeb-Edu)
│   ├── train_sft.py                # SFT fine-tuning (SFTTrainer on SmolTalk)
│   ├── eval_smoke.py               # Perplexity evaluation & generation
│   ├── chat.py                     # Interactive chat
│   ├── upload_to_hub.py            # Hub upload utility
│   ├── count_params.py             # Parameter counting
│   ├── prepare_data.py             # Data preparation
│   └── inspect_deepseek_v4.py      # Architecture inspection
└── tokenizer/
    ├── tokenizer.json
    └── tokenizer_config.json

Quick Start

Install

pip install -r requirements.txt

Pretraining

python scripts/train_pretrain.py --config configs/main_100m.yaml

SFT

python scripts/train_sft.py

Chat

python scripts/chat.py
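
A single chat turn can also be reproduced with plain transformers. This is a hedged sketch, assuming the tokenizer ships a chat template and the SFT checkpoint lives on the Hub under the (hypothetical) id below; fp32 sidesteps the bf16 issue noted under Known Issues:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "huggingface/nanowhale-100m"  # hypothetical Hub id for the SFT chat model
tok = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, trust_remote_code=True, torch_dtype=torch.float32
)
messages = [{"role": "user", "content": "What do whales eat?"}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
out = model.generate(inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True))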

Evaluation

python scripts/eval_smoke.py

Training Results

Pretraining (5,000 steps on FineWeb-Edu)

| Metric | Value |
|---|---|
| Tokens seen | ~2.6B |
| Final loss | ~5.3 |
| Token accuracy | 33.8% |
| Hardware | 1× H100 80GB, bf16 |
| Step time | 72 ms/step (with torch.compile) |
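
The table does not list a global batch size; a back-of-the-envelope from the figures it does give (my inference, not a reported number):

# ~2.6B tokens over 5,000 steps at a 2,048-token context
tokens_seen, steps, seq_len = 2.6e9, 5_000, 2_048
print(tokens_seen / steps / seq_len)  # ~254 sequences per optimizer step, i.e. a global batch of roughly 256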

SFT (3,000 steps on SmolTalk)

| Metric | Start | End |
|---|---|---|
| Train loss | 15.41 | 10.22 |
| Eval loss | 2.873 | 2.607 |
| Token accuracy | 36.2% | 48.5% |

Perplexity (held-out English text)

| Model | Perplexity |
|---|---|
| Pretrained | 13.62 |
| SFT | 12.90 |
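
For reference, perplexity over a held-out string can be estimated in a few lines. This is a hedged sketch, not eval_smoke.py's actual protocol (dataset, stride, and context handling may differ), and the Hub id is an assumption:

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "huggingface/nanowhale-100m"  # hypothetical Hub id
tok = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True).eval()

text = "Whales are large marine mammals found in every ocean."  # stand-in held-out text
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    loss = model(ids, labels=ids).loss  # mean next-token cross-entropy
print(f"perplexity ~= {math.exp(loss.item()):.2f}")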

Known Issues

  • bf16 NaN: The model produces