Arabic Quality Benchmark

arabench · عربنش

artok shows Arabic costs more. arabench shows Arabic gets worse.

37 prompts · 8 categories · 17 providers
$ pip install arabench
The Problem
LLMs underperform on Arabic.
Grammar breaks. Dialects get ignored. Cultural context vanishes. Diacritization is an afterthought.

Most LLMs are trained primarily on English data. When asked to work in Arabic, quality degrades across every dimension — translation loses nuance, grammar rules get broken, dialect distinctions disappear, and cultural knowledge becomes shallow.


Nobody benchmarks this systematically. arabench does. 37 carefully crafted prompts across 8 categories, scored 0–100, testing what actually matters for Arabic speakers.

Benchmark
8 categories. One score.
Each category is scored 0–100. The overall score is a weighted average of the eight category scores.
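As a rough illustration of how a weighted average over category scores works, here is a minimal sketch. The function name `overall_score`, the category keys, and the weights are all hypothetical; the actual weights arabench uses are not stated here, so the sketch defaults to equal weights.

```python
def overall_score(scores, weights=None):
    """Weighted average of per-category scores, each on a 0-100 scale.

    `scores` maps category name -> score. `weights` maps category
    name -> non-negative weight; equal weights are assumed when omitted
    (an assumption, not arabench's documented behaviour).
    """
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight


# Hypothetical scores for two of the eight categories.
example = {"translation": 80.0, "grammar": 60.0}
print(overall_score(example))                                    # equal weights
print(overall_score(example, {"translation": 3, "grammar": 1}))  # translation counts 3x
```

Because the weights are normalised by their sum, they do not need to add up to 1.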
🌐

Translation

English-Arabic and Arabic-English accuracy. Nuance, idiom, register preservation.

📝

Grammar

Morphological correctness, agreement, case endings. The hard stuff Arabic demands.

🗺️

Dialect Detection

Distinguishing Modern Standard Arabic (MSA) from Levantine, Gulf, Egyptian, and Maghrebi Arabic.

🏛️

Cultural Knowledge

Proverbs, history, customs, regional context. Does the model actually know Arabic culture?

📄

Summarization

Condensing Arabic text without losing meaning. Coherence and faithfulness.

🔄

Code-switching

Handling mixed Arabic-English input naturally. Real-world conversational patterns.

🔤

Diacritization

Correct placement of tashkeel (diacritical marks) on Arabic text. A uniquely Arabic challenge.

⌨️

Instruction Following

Following complex prompts written entirely in Arabic. Format, constraints, tone.
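For the Diacritization category above, one common way to measure tashkeel accuracy is a diacritic error rate: the share of letters whose marks differ from a reference. The sketch below illustrates that general technique only; whether arabench scores diacritization this way is not stated, and the helper names are hypothetical. It covers the eight core marks in U+064B to U+0652.

```python
# Fathatan (U+064B) through sukun (U+0652): tanween, short vowels, shadda, sukun.
TASHKEEL = frozenset(chr(cp) for cp in range(0x064B, 0x0653))


def split_marks(text):
    """Split text into (base_char, set_of_marks) pairs."""
    units = []
    for ch in text:
        if ch in TASHKEEL:
            if units:  # a mark with no preceding base character is ignored
                base, marks = units[-1]
                units[-1] = (base, marks | {ch})
        else:
            units.append((ch, frozenset()))
    return units


def diacritic_error_rate(reference, hypothesis):
    """Fraction of letters whose diacritics differ from the reference."""
    ref = split_marks(reference)
    hyp = split_marks(hypothesis)
    if [base for base, _ in ref] != [base for base, _ in hyp]:
        raise ValueError("reference and hypothesis must share the same base characters")
    # Only letters are counted; spaces, digits and punctuation are skipped.
    letters = [(r, h) for (base, r), (_, h) in zip(ref, hyp) if base.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for r, h in letters if r != h) / len(letters)


# kaf+fatha, ta+fatha, ba+fatha  versus  kaf+fatha, ta+kasra, ba+fatha
reference = "\u0643\u064e\u062a\u064e\u0628\u064e"
hypothesis = "\u0643\u064e\u062a\u0650\u0628\u064e"
print(diacritic_error_rate(reference, hypothesis))  # 1 of 3 letters differs
```

Marks are compared as sets, so a shadda followed by a fatha counts the same as a fatha followed by a shadda.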

Providers
17 providers. Every major LLM.
Configure API keys with arabench config. Swap models anytime.
Provider      Default Model
OpenAI        gpt-5.4
Anthropic     claude-opus-4-6
Google        gemini-3.1-pro-preview
DeepSeek      deepseek-chat (V3.2)
Mistral       mistral-large-latest
Cohere        command-a-03-2025
Groq          llama-3.3-70b-versatile
Qwen          qwen3.5-plus
xAI           grok-4-1-fast
AI21          jamba-large
Zhipu AI      glm-5
MiniMax       MiniMax-M2.7
Cerebras      llama-3.3-70b
SambaNova     DeepSeek-R1
Together      Qwen3.5-72B-Instruct
Fireworks     Qwen3.5-72B
Perplexity    sonar-pro
CLI
6 commands. Zero config.
Everything you need to benchmark Arabic quality.

run

Full benchmark across all configured providers. 37 prompts, 8 categories, full scoring.

quick

Quick single-provider benchmark with a subset of prompts. Fast iteration.

compare

Side-by-side comparison of two providers. See exactly where one beats the other.

leaderboard

Ranked results from all previous runs. Track progress over time.

explain

Scoring methodology for any category. Understand what the numbers mean.

config

Set up API keys and preferences. Provider selection, model overrides, output format.

Get Started
Five commands to benchmark Arabic.
# Install
$ pip install arabench

# Set up your API keys
$ arabench config

# Quick benchmark against one provider
$ arabench quick openai

# Run the full benchmark
$ arabench run

# Compare two providers head-to-head
$ arabench compare openai anthropic

# See the leaderboard
$ arabench leaderboard

# JSON output for CI pipelines
$ arabench run --json
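To show how JSON output could gate a CI pipeline, here is a minimal sketch. The report shape is an assumption: the structure that `arabench run --json` actually emits is not documented here, so the `overall` and `categories` keys, the function name `failed_checks`, and the thresholds are all hypothetical and would need adapting to the real output.

```python
import json


def failed_checks(report_text, min_overall=70.0, min_category=50.0):
    """Return the names of scores that fall below their threshold.

    Assumes (hypothetically) a report shaped like:
    {"overall": 72.5, "categories": {"grammar": 81.0, ...}}
    """
    report = json.loads(report_text)
    failed = []
    if report["overall"] < min_overall:
        failed.append("overall")
    failed.extend(
        sorted(
            name
            for name, score in report["categories"].items()
            if score < min_category
        )
    )
    return failed


# Hand-written sample in the assumed shape, not real arabench output.
sample = '{"overall": 72.5, "categories": {"grammar": 81.0, "diacritization": 44.0}}'
print(failed_checks(sample))
```

In a CI job, a small script could read the report from standard input, call `failed_checks`, and exit with a non-zero status when the returned list is not empty.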