Mixture of Experts (MoEs) in Transformers

Over the past few years, scaling dense language models has driven most progress in LLMs. From early models like the original ULMFiT (~30M parameters) or GPT-2 (1.5B parameters, which at the time was considered “too dangerous to release”) to today’s hundred-billion–parameter systems, the recipe was simple:

More data + more parameters = better performance.

Scaling laws reinforced this trend, but dense scaling has practical limits:

  • Training becomes increasingly expensive.
  • Inference latency grows.
  • Deployment requires significant memory and hardware.

This is where Mixture of Experts (MoEs) enter the picture.
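The core idea behind an MoE layer is that a small gating network routes each token to only a few of many expert feed-forward networks, so total parameters grow without a matching growth in per-token compute. A minimal sketch of top-k routing (all weight shapes, the renormalisation over selected experts, and the use of plain matrices as "experts" are illustrative assumptions, not any particular model's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

# Hypothetical parameters for illustration: one gating matrix,
# and one weight matrix standing in for each expert's FFN.
W_gate = rng.normal(size=(d_model, n_experts))
W_experts = rng.normal(size=(n_experts, d_model, d_model))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x):
    """Route each token to its top-k experts and mix their outputs."""
    probs = softmax(x @ W_gate)                   # (tokens, n_experts)
    top = np.argsort(probs, axis=-1)[:, -top_k:]  # top-k expert indices per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gate = probs[t, top[t]]
        gate = gate / gate.sum()                  # renormalise over chosen experts
        for w, e in zip(gate, top[t]):
            # Only top_k of the n_experts matrices are touched per token.
            out[t] += w * (x[t] @ W_experts[e])
    return out

tokens = rng.normal(size=(3, d_model))
print(moe_layer(tokens).shape)  # (3, 8)
```

Each token here activates only 2 of the 4 experts, which is the sparsity that lets MoE models hold far more parameters than a dense model of the same inference cost.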
