Mixture of Experts (MoE) in Transformers
Over the past few years, scaling dense language models has driven most of the progress in LLMs. From early models like ULMFiT (~30M parameters) and GPT-2 (1.5B parameters, at the time considered “too dangerous to release” 🧌) to today’s hundred-billion–parameter systems, the recipe was simple: more data plus more parameters yields better performance. Scaling laws reinforced this trend, but dense scaling has practical limits: training becomes increasingly expensive, inference latency grows, and deployment requires significant memory and […]