Mixture of Experts (MoEs) in Transformers
Over the past few years, scaling dense language models has driven most of the progress in LLMs. From early models like the original ULMFiT (~30M parameters) and GPT-2 (1.5B parameters, considered “too dangerous to release” at the time 🧌), to today’s hundred-billion–parameter systems, the recipe was simple:
More data + more parameters = better performance.
Scaling laws reinforced this trend, but dense scaling runs into practical limits:
- Training becomes increasingly expensive.
- Inference latency grows.
- Deployment requires significant memory and hardware.
This is where Mixture-of-Experts (MoE) models enter the picture.