Announcing NeurIPS 2025 E2LM Competition: Early Training Evaluation of Language Models
Join us in building benchmarks that capture early-stage reasoning & scientific knowledge in LLMs!
The development of Large Language Models (LLMs) typically begins with a series of ablation experiments, in which various model architectures, data mixtures, and training hyperparameters are systematically evaluated. This phase is commonly referred to as the early stage of training. During this period, researchers primarily monitor two key metrics: the training loss curve and evaluation scores. However, existing evaluation benchmarks often fail to provide meaningful or discriminative signals during these initial stages, where LLMs have been trained on relatively few tokens (around 200B), making it difficult to compare candidate configurations.