Mixture of Experts (MoEs) in Transformers
Over the past few years, scaling dense language models has driven most of the progress in LLMs. From early models like the original ULMFiT (~30M parameters) and GPT-2 (1.5B parameters, considered “too dangerous to release” at the time 🧌), to today’s hundred-billion–parameter systems, the recipe was simple:
More data + more parameters = better performance.
Scaling laws reinforced this trend, but dense scaling runs into practical limits:
- Training becomes increasingly expensive.
- Inference latency grows.
- Deployment requires significant memory and hardware.
This is where Mixture-of-Experts (MoE) models enter the picture.