QIMMA validates benchmarks before evaluating models, ensuring reported scores reflect genuine Arabic language capability in LLMs.
If you've been tracking Arabic LLM evaluation, you've probably noticed a growing tension: the number of benchmarks and leaderboards is expanding rapidly, but are we actually measuring what we think we're measuring?
We built QIMMA (Arabic: قمة, "summit") to answer that question systematically. Instead of aggregating existing Arabic benchmarks as-is and running models on them, we applied a rigorous quality validation pipeline before any evaluation took place. What we found was sobering: even widely used, well-regarded Arabic benchmarks contain systematic quality issues that can quietly corrupt evaluation results.
This post walks through what QIMMA is, how we built it, what problems we found, and what the model rankings look like once you clean things up.
Arabic is spoken by over 400 million people across diverse dialects and cultural contexts, yet the Arabic NLP evaluation landscape remains fragmented. A few key pain points have motivated this work:
Translation issues. Many Arabic benchmarks are translated from English, which introduces distributional shift: questions that feel natural in English become awkward or culturally misaligned in Arabic, making the benchmark data less representative of how Arabic is actually used.
Absent quality validation. Even native Arabic benchmarks are often released without rigorous quality checks. Annotation inconsistencies, incorrect gold answers, encoding errors, and cultural bias in ground-truth labels have all been documented in established resources.
Reproducibility gaps. Evaluation scripts and per-sample outputs are rarely released publicly, making it hard to audit results or build on prior work.
Coverage fragmentation. Existing leaderboards cover isolated tasks and narrow domains, making holistic model assessment difficult.
To illustrate where QIMMA sits relative to existing platforms:
| Leaderboard | Open Source | Native Arabic | Quality Validation | Coding Eval | Public Outputs |
|---|---|---|---|---|---|
| OALL v1 | ✅ | Mixed | ❌ | ❌ | ❌ |
| OALL v2 | ✅ | Mostly | ❌ | ❌ | ❌ |
| BALSAM | Partial | 50% | ❌ | ❌ | ❌ |
| AraGen | ❌ | 100% | ❌ | ❌ | ❌ |
| SILMA ABL | ❌ | 100% | ❌ | ❌ | ❌ |
| ILMAAM | Partial | 100% | ❌ | ❌ | ❌ |
| HELM Arabic | ✅ | Mixed | ❌ | ❌ | ✅ |
| ⛰️ QIMMA | ✅ | 99% | ✅ | ✅ | ✅ |
QIMMA is the only platform combining all five properties: open source, predominantly native Arabic content, systematic quality validation, code evaluation, and public per-sample inference outputs.
QIMMA consolidates 109 subsets from 14 source benchmarks into a unified evaluation suite of over 52,000 samples, spanning 7 domains:
| Domain | Benchmarks | Task Types |
|---|---|---|
| Cultural | AraDiCE-Culture, ArabCulture, PalmX | MCQ |
| STEM | ArabicMMLU, GAT, 3LM STEM | MCQ |
| Legal | ArabLegalQA, MizanQA | MCQ, QA |
| Medical | MedArabiQ, MedAraBench | MCQ, QA |
| Safety | AraTrust | MCQ |
| Poetry & Literature | FannOrFlop | QA |
| Coding | 3LM HumanEval+, 3LM MBPP+ | Code |
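Consolidating 109 subsets from 14 heterogeneous sources means normalizing every sample into a single record format. The sketch below illustrates what such a unified schema could look like; the field names are our assumptions for illustration, not QIMMA's released data format.

```python
# Hypothetical unified sample record; field names are illustrative
# assumptions, not QIMMA's actual schema.
from dataclasses import dataclass, field

@dataclass
class QimmaSample:
    benchmark: str                 # source benchmark, e.g. "ArabicMMLU"
    subset: str                    # one of the 109 subsets
    domain: str                    # e.g. "Cultural", "STEM", "Legal", "Coding"
    task_type: str                 # "MCQ", "QA", or "Code"
    question: str                  # Arabic question or problem statement
    choices: list[str] = field(default_factory=list)  # empty for QA and Code
    gold: int | str = ""           # gold choice index (MCQ) or reference answer (QA)
```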
Several aspects of this design stand out; the most important is the quality validation pipeline, the methodological heart of QIMMA. Before running a single model, we applied a multi-stage validation pipeline to every sample in every benchmark.
Each sample was independently evaluated by two state-of-the-art LLM judges.
We chose two models with strong Arabic capability but different training data compositions, so that their combined judgment is more robust than either alone.
Each model scores a sample against a 10-point rubric, assigning a binary score (0 or 1) per criterion.

A sample is flagged if either model scores it below 7/10. Samples where both models agree on elimination are dropped immediately; where only one model flags a sample, it proceeds to human review in Stage 2.
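As a rough illustration, this Stage-1 routing reduces to a few lines. The sketch below is ours, under the assumption that each judge returns ten binary criterion scores; it is not QIMMA's actual pipeline code.

```python
# Stage-1 routing sketch. Assumes each judge returns ten binary
# criterion scores; names and structure are illustrative.
THRESHOLD = 7  # a sample needs at least 7/10 from a judge to pass

def route_sample(judge_a: list[int], judge_b: list[int]) -> str:
    """Return 'keep', 'drop', or 'human_review' for one sample."""
    score_a, score_b = sum(judge_a), sum(judge_b)
    flags = (score_a < THRESHOLD) + (score_b < THRESHOLD)
    if flags == 2:       # both judges agree the sample is bad
        return "drop"
    if flags == 1:       # judges disagree: escalate to native-speaker review
        return "human_review"
    return "keep"        # both judges pass the sample

# Example: judge A passes (8/10), judge B fails (5/10) -> human review
assert route_sample([1] * 8 + [0] * 2, [1] * 5 + [0] * 5) == "human_review"
```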
Flagged samples are reviewed by native Arabic speakers with cultural and dialectal familiarity, and these annotators make the final call on each sample.
For culturally sensitive content, multiple perspectives are considered, since "correctness" can genuinely vary across Arab regions.
The pipeline revealed recurring quality issues across benchmarks: not isolated errors, but systematic patterns reflecting gaps in how benchmarks were originally constructed.
| Benchmark | Total Samples | Discarded | Discard Rate |
|---|---|---|---|
| ArabicMMLU | 14,163 | 436 | 3.1% |
| MizanQA | 1,769 | 41 | 2.3% |
| PalmX | 3,001 | 25 | 0.8% |
| MedAraBench | 4,960 | 33 | 0.7% |
| FannOrFlop | 6,984 | 43 | 0.6% |
| ArabCulture | 3,482 | 7 | 0.2% |
| MedArabiQ | 499 | 1 | 0.2% |
| GAT | 13,986 | 1 | ~0.0% |
| 3LM STEM | 2,609 | 1 | ~0.0% |
| AraDiCE-Culture | 180 | 0 | 0.0% |
| ArabLegalQA | 79 | 0 | 0.0% |
| AraTrust | 522 | 0 | 0.0% |
✔️ Answer Quality
False or mismatched gold indices, factually wrong answers, and missing or raw-text answers.
📝 Text & Formatting Quality
Corrupt or illegible text, spelling and grammar errors, and duplicate samples.
💬 Cultural Sensitivity
Stereotype reinforcement and monolithic generalizations about diverse communities.
🤝 Gold Answer Compliance
Misalignment of gold answers with evaluation protocols.
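Some of these issues lend themselves to cheap automated screens before any LLM judging. The sketch below shows two such checks under assumed field names (`question`, `choices`, integer `gold` for MCQ); it is our illustration of the issue categories, not part of QIMMA's released code.

```python
# Illustrative pre-screening checks; field names are assumptions.
def find_duplicates(samples: list[dict]) -> list[int]:
    """Indices of samples whose question text exactly repeats an earlier one."""
    seen, dupes = set(), []
    for i, s in enumerate(samples):
        key = s["question"].strip()
        if key in seen:
            dupes.append(i)
        seen.add(key)
    return dupes

def find_bad_gold_indices(samples: list[dict]) -> list[int]:
    """Indices of MCQ samples whose gold index falls outside the choice list."""
    return [
        i for i, s in enumerate(samples)
        if s.get("choices") and not (0 <= s["gold"] < len(s["choices"]))
    ]
```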
Code benchmarks required a different intervention. Rather than discarding samples, we refined the Arabic problem statements in 3LM's Arabic adaptations of HumanEval+ and MBPP+, leaving task identifiers, reference solutions, and test suites completely unchanged.
The modification rates were striking:
| Benchmark | Total Prompts | Modified | Unchanged | Modification Rate |
|---|---|---|---|---|
| 3LM HumanEval+ | 164 | 145 | 19 | 88% |
| 3LM MBPP+ | 378 | 308 | 70 | 81% |
These modifications fell into five recurring categories of problem-statement issues.
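Because only the natural-language statement was allowed to change, the refinement is easy to audit mechanically. Here is a minimal sketch of such an audit, assuming both versions are lists of dicts with `task_id`, `prompt`, `canonical_solution`, and `test` fields (our assumed names, in the spirit of the HumanEval/MBPP format, not the 3LM datasets' exact schema):

```python
# Audit sketch: verify that prompt refinement touched ONLY the problem
# statements. Field names follow the HumanEval/MBPP convention and are
# assumptions about the underlying data format.
def audit_refinement(original: list[dict], refined: list[dict]) -> int:
    assert len(original) == len(refined), "sample count must be preserved"
    n_modified = 0
    for orig, new in zip(original, refined):
        # Task identity, reference solution, and test suite must be untouched.
        assert orig["task_id"] == new["task_id"]
        assert orig["canonical_solution"] == new["canonical_solution"]
        assert orig["test"] == new["test"]
        if orig["prompt"] != new["prompt"]:
            n_modified += 1
    return n_modified  # e.g. 145 of 164 for 3LM HumanEval+
```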
QIMMA's evaluation stack combines LightEval, EvalPlus (for the code benchmarks), and FannOrFlop's own evaluation pipeline, chosen for consistency, multilingual community adoption, and reproducibility.
| Task Type | Metric | Benchmarks |
|---|---|---|
| MCQ | Normalized Log-Likelihood Accuracy | AraDiCE-Culture, ArabicMMLU, ArabCulture, PalmX, 3LM STEM, MedArabiQ, GAT, MedAraBench, AraTrust |
| Multi-select MCQ | Probability Mass on Gold Choices | MizanQA |
| Generative QA | F1 BERTScore (AraBERT v02) | MedArabiQ, ArabLegalQA, FannOrFlop |
| Code | Pass@1 | 3LM HumanEval+, 3LM MBPP+ |
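For the MCQ metric, "normalized log-likelihood accuracy" means the model's answer is whichever option has the highest average per-token log-probability given the prompt; normalizing by token count keeps longer options from being unfairly penalized. Below is a minimal sketch in the spirit of LightEval's length-normalized log-likelihood accuracy, not QIMMA's exact implementation; the model name is a small stand-in chosen for illustration.

```python
# Sketch of length-normalized log-likelihood MCQ scoring.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # stand-in model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def choice_logprob(prompt: str, choice: str) -> float:
    """Average log-prob of the choice tokens given the prompt.

    Note: re-tokenizing prompt + choice can shift the boundary token;
    a production harness handles this more carefully.
    """
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits          # (1, seq_len, vocab)
    logprobs = torch.log_softmax(logits, dim=-1)
    n_prompt = prompt_ids.shape[1]
    total = 0.0
    for pos in range(n_prompt, full_ids.shape[1]):
        token_id = full_ids[0, pos]
        # Logits at position pos-1 predict the token at position pos.
        total += logprobs[0, pos - 1, token_id].item()
    n_choice = full_ids.shape[1] - n_prompt
    return total / max(n_choice, 1)              # length normalization

def answer_mcq(prompt: str, choices: list[str]) -> int:
    """Index of the choice with the highest normalized log-likelihood."""
    scores = [choice_logprob(prompt, c) for c in choices]
    return max(range(len(choices)), key=scores.__getitem__)
```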
QIMMA standardizes prompting by question format, with six template types:

- MCQ: generic multiple choice
- MCQ-C: multiple choice with a context passage
- MCQ-I: multiple choice with specific instructions (GAT analogy/completion)
- QA: generic open-ended QA
- QA-C: QA with context
- QA-F: fill-in-the-blank QA
All prompts are in Arabic. For MizanQA and ArabCulture, benchmark-specific system prompts from the original papers are preserved.
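To make the template taxonomy concrete, here is a hypothetical rendering of two of the six formats. The Arabic wording is ours for illustration; QIMMA's exact prompts live in the released evaluation code.

```python
# Hypothetical prompt templates for two of the six formats (MCQ and QA-C).
# The Arabic phrasing is illustrative, not QIMMA's exact wording.
TEMPLATES = {
    # Generic multiple choice: question, lettered choices, answer slot.
    "MCQ": "السؤال: {question}\n{choices}\nالإجابة:",
    # QA with context: passage first, then an open-ended question.
    "QA-C": "النص: {context}\nالسؤال: {question}\nالإجابة:",
}

def render_mcq(question: str, choices: list[str]) -> str:
    # Label options with Arabic letters, as is common in Arabic MCQ benchmarks.
    letters = ["أ", "ب", "ج", "د", "هـ"]
    lines = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    return TEMPLATES["MCQ"].format(question=question, choices=lines)
```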
Results as of April 2026, covering the top 10 evaluated models. Visit the live leaderboard for current rankings.
| Rank | Model | AVERAGE | AraDiCE-Culture | ArabicMMLU | ArabCulture | PalmX | 3LM STEM | AraTrust | MizanQA | MedArabiQ | ArabLegalQA | GAT | MedAraBench | HumanEval+ | MBPP+ | FannOrFlop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 1 | Qwen/Qwen3.5-397B-A17B-FP8 | 68.06 | 82.78 | 77.54 | 61.75 | 83.91 | 88.67 | 90.04 | 73.36 | 47.30 | 54.94 | 55.89 | 47.97 | 67.68 | 76.72 | 44.33 |
| 🥈 2 | Applied-Innovation-Center/Karnak | 66.20 | 73.33 | 80.94 | 53.49 | 81.40 | 93.10 | 89.08 | 55.92 | 55.78 | 71.58 | 61.06 | 54.19 | 33.54 | 64.55 | 58.91 |
| 🥉 3 | inceptionai/Jais-2-70B-Chat | 65.81 | 78.89 | 81.29 | 83.24 | 83.73 | 87.96 | 90.23 | 71.78 | 52.79 | 69.60 | 51.67 | 50.89 | 19.51 | 43.65 | 56.13 |
| #4 | Qwen/Qwen2.5-72B-Instruct | 65.75 | 77.22 | 73.78 | 63.83 | 77.77 | 87.55 | 88.51 | 63.49 | 50.06 | 70.74 | 55.90 | 44.19 | 37.20 | 72.75 | 57.51 |
| #5 | Applied-Innovation-Center/AIC-1 | 65.37 | 73.33 | 72.02 | 77.52 | 76.11 | 88.13 | 90.61 | 56.36 | 53.75 | 68.96 | 62.11 | 50.78 | 28.05 | 69.58 | 47.83 |
| #6 | Qwen/Qwen3.5-122B-A10B | 64.84 | 74.44 | 73.17 | 37.78 | 81.46 | 86.18 | 86.97 | 64.01 | 47.04 | 55.11 | 50.90 | 52.49 | 65.24 | 72.43 | 60.54 |
| #7 | Sakalti/Ultiima-72B | 64.49 | 78.33 | 72.28 | 68.79 | 76.75 | 83.70 | 89.08 | 60.44 | 44.58 | 69.12 | 46.91 | 42.25 | 39.02 | 74.07 | 57.56 |
| #8 | meta-llama/Llama-3.3-70B-Instruct | 63.96 | 77.22 | 71.57 | 78.05 | 77.95 | 88.28 | 85.63 | 67.44 | 56.25 | 64.00 | 51.13 | 54.86 | 27.44 | 71.16 | 24.43 |
| #9 | Qwen/Qwen2.5-32B-Instruct | 63.26 | 70.56 | 68.76 | 75.80 | 72.07 | 81.03 | 85.82 | 53.78 | 48.08 | 69.27 | 56.94 | 36.51 | 34.15 | 72.75 | 93.10 |
| #10 | FreedomIntelligence/AceGPT-v2-32B-Chat | 61.14 | 76.67 | 70.62 | 79.79 | 74.46 | 84.88 | 86.97 | 63.89 | 49.96 | 71.46 | 56.04 | 47.32 | 23.78 | 54.50 | 15.56 |
Across the full leaderboard (46 models), a clear but imperfect size-performance correlation emerges, with interesting exceptions: two 32B models (Qwen2.5-32B-Instruct and AceGPT-v2-32B-Chat) crack the top 10 alongside much larger systems.
To summarize the distinctive properties of QIMMA:
| Property | Details |
|---|---|
| Quality-first philosophy | Validation runs before evaluation, not as an afterthought |
| Multi-model validation | Two LLMs with different training + human review for flagged cases |
| 99% native Arabic | Avoids translation artifacts almost entirely |
| Multi-domain, multi-task | 7 domains, 3 task types (MCQ, QA, code), 109 subsets |
| Code evaluation | First Arabic leaderboard to include code generation |
| Full transparency | Per-sample inference outputs publicly released, not just aggregate scores |
| LightEval-based | Unified, reproducible evaluation codebase |
| Dialectal awareness | Explicit handling of MSA vs. dialectal variation in prompts and rubrics |
@misc{alqadi2026arabicbenchmarksreliableqimmas,
title={Are Arabic Benchmarks Reliable? QIMMA's Quality-First Approach to LLM Evaluation},
author={Leen AlQadi and Ahmed Alzubaidi and Mohammed Alyafeai and Hamza Alobeidli and Maitha Alhammadi and Shaikha Alsuwaidi and Omar Alkaabi and Basma El Amel Boussaha and Hakim Hacid},
year={2026},
eprint={2604.03395},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2604.03395},
}