QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard
QIMMA validates benchmarks before evaluating models, ensuring reported scores reflect genuine Arabic language capability in LLMs.
If you’ve been tracking Arabic LLM evaluation, you’ve probably noticed a growing tension: the number of benchmarks and leaderboards is expanding rapidly, but are we actually measuring what we think we’re measuring?
We built QIMMA قِمّة (Arabic for “summit”) to answer that question systematically. Instead of aggregating existing Arabic benchmarks as-is and running models on them, we applied a rigorous quality validation pipeline before any evaluation took place. What we found was sobering: even widely-used, well-regarded Arabic benchmarks contain systematic quality issues that can quietly corrupt evaluation results.
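To make “quality issues” concrete, here is a minimal sketch of the kind of per-item check such a pipeline might run. This is an illustrative example, not QIMMA’s actual validation code; the field names (`question`, `options`, `answer`) and thresholds are assumptions for the sketch.

```python
import re

# Illustrative item schema: question text, answer options, gold label.
# These field names are hypothetical, not QIMMA's real schema.
ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")

def arabic_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Arabic Unicode block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(bool(ARABIC_CHAR.match(c)) for c in chars) / len(chars)

def validate_item(item: dict) -> list[str]:
    """Return quality flags for one benchmark item; empty list means it passes."""
    flags = []
    question = item.get("question", "")
    options = item.get("options", [])
    if not question.strip():
        flags.append("empty_question")
    if arabic_ratio(question) < 0.5:  # e.g. untranslated English leakage
        flags.append("mostly_non_arabic")
    if len(set(options)) < len(options):
        flags.append("duplicate_options")
    if item.get("answer") not in options:
        flags.append("gold_label_not_in_options")
    return flags

# Usage: keep only items that raise no flags.
benchmark = [
    {"question": "ما هي عاصمة مصر؟", "options": ["القاهرة", "الرياض"], "answer": "القاهرة"},
    {"question": "What is 2+2?", "options": ["3", "4"], "answer": "4"},
]
clean = [ex for ex in benchmark if not validate_item(ex)]
```

Checks like these are cheap to run per item, which is what makes validating a benchmark *before* spending GPU hours on evaluation practical.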
This post walks through what QIMMA is, how we built it, what problems we found, and what the model rankings look like once you clean things up.
🔍 The Problem: Arabic NLP Evaluation Is Fragmented and Unvalidated
Arabic is spoken by over 400 million people across diverse dialects and cultural contexts, yet the Arabic NLP evaluation landscape remains fragmented. A few
