Apriel-H1: The Surprising Key to Distilling Efficient Reasoning Models
We converted our 15B reasoning model to a Mamba hybrid, achieving 2.1x the throughput with minimal quality loss. The key? A non-obvious insight about what data to distill on, and why intuition fails here.
When MiniMax published their M2 post-mortem in October explaining why they abandoned efficient attention at 230B scale, the narrative briefly became "efficient attention is dead." Within days, Kimi Linear proved otherwise. The real lesson: whether efficient attention pays off depends on your constraints.
Our constraint was simple: we had a strong 15B reasoning model and needed to make it efficient without starting over. No infinite compute for