antirez/ds4

DeepSeek 4 Flash local inference engine for Metal


ds4.c

ds4.c is a small native inference engine for DeepSeek V4 Flash. It is intentionally narrow: not a generic GGUF runner, not a wrapper around another runtime, and not a framework. The main path is a DeepSeek V4 Flash-specific Metal graph executor with DS4-specific loading, prompt rendering, KV state, and server API glue.

This project would not exist without llama.cpp and GGML. Make sure to read the acknowledgements section: a big thank you to Georgi Gerganov and all the other contributors.

Now, back to this project. Why do we believe DeepSeek v4 Flash is a special model that deserves a standalone engine? Because after comparing it with powerful smaller dense models, we can report that:

  1. DeepSeek v4 Flash is faster because it has fewer active parameters.
  2. In thinking mode, if you avoid max thinking, it produces a thinking section that is a lot shorter than other models', often 1/5 as long, and crucially, the thinking section length is proportional to the problem complexity. This makes DeepSeek v4 Flash usable with thinking enabled in conditions where other models are practically unusable.
  3. The model features a context window of 1 million tokens.
  4. Being so large, it knows more things if you go sampling at the edge of its knowledge. For instance, asking about Italian shows or political questions soon reveals that 284B parameters are a lot more than 27B or 35B parameters.
  5. It writes much better English and Italian. It feels like a quasi-frontier model.
  6. The KV cache is incredibly compact, allowing long-context inference on local computers and on-disk KV cache persistence.
  7. It works well with 2-bit quantization, if quantized in a special way (read later). This makes it possible to run it on MacBooks with 128GB of RAM.
  8. We expect DeepSeek to release updated versions of v4 Flash in the future, even better than the current one.

That said, a few important things about this project:

  • The local inference landscape contains many excellent projects, but new models are released continuously, and attention immediately gets captured by the next model to implement. This project takes a deliberately narrow bet: one model at a time, official-vector validation (logits obtained with the official implementation), long-context tests, and enough agent integration to know if it really works. The exact model may change as the landscape evolves, but the constraint remains: credible local inference on high-end personal machines or Mac Studios, starting from 128GB of memory.
  • This software is developed with strong assistance from GPT 5.5 and with humans leading the ideas, testing, and debugging. We say this openly because it shaped how the project was built. If you are not happy with AI-developed code, this software is not for you. The acknowledgement below is equally important: this would not exist without llama.cpp and GGML, largely written by hand.
  • This implementation is based on the idea