Arabic Quality Benchmark

arabench عربنش

artok shows Arabic costs more. arabench shows Arabic gets worse.

37 prompts · 8 categories · 17 providers

$ pip install arabench

The Problem

LLMs underperform on Arabic.

Grammar breaks. Dialects get ignored. Cultural context vanishes. Diacritization is an afterthought.

Most LLMs are trained primarily on English data. When asked to work in Arabic, quality degrades across every dimension — translation loses nuance, grammar rules get broken, dialect distinctions disappear, and cultural knowledge becomes shallow.

Nobody benchmarks this systematically. arabench does. 37 carefully crafted prompts across 8 categories, scored 0–100, testing what actually matters for Arabic speakers.

Benchmark

8 categories. One score.

Each category is scored 0–100. The overall score is a weighted average of the eight category scores.

🌐

Translation

English-Arabic and Arabic-English accuracy. Nuance, idiom, register preservation.

📝

Grammar

Morphological correctness, agreement, case endings. The hard stuff Arabic demands.

🗺️

Dialect Detection

Distinguishing MSA from Levantine, Gulf, Egyptian, and Maghrebi Arabic.

🏛️

Cultural Knowledge

Proverbs, history, customs, regional context. Does the model actually know Arabic culture?

📄

Summarization

Condensing Arabic text without losing meaning. Coherence and faithfulness.

🔄

Code-switching

Handling mixed Arabic-English input naturally. Real-world conversational patterns.

🔤

Diacritization

Correct placement of tashkeel on Arabic text. A uniquely Arabic challenge.

⌨️

Instruction Following

Following complex prompts written entirely in Arabic. Format, constraints, tone.
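How eight category scores could roll up into one number, as a minimal sketch. The weights arabench actually uses are not stated here, so the equal weights, the category keys, and the function name overall_score below are all assumptions for illustration, and the scores are made up.

```python
from typing import Mapping

# Category keys are hypothetical identifiers for the eight categories above.
CATEGORIES = [
    "translation", "grammar", "dialect_detection", "cultural_knowledge",
    "summarization", "code_switching", "diacritization", "instruction_following",
]


def overall_score(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted average of per-category scores, each on a 0-100 scale."""
    total_weight = sum(weights[c] for c in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[c] * weights[c] for c in scores) / total_weight


# Assumption: equal weights. The real weighting is not documented here.
equal_weights = {c: 1.0 for c in CATEGORIES}

# Made-up scores: solid everywhere except diacritization.
example_scores = {c: 70.0 for c in CATEGORIES}
example_scores["diacritization"] = 30.0

print(overall_score(example_scores, equal_weights))  # 65.0
```

With equal weights, one weak category pulls the overall score down by an eighth of its gap. A different weighting would change that, which is why the weights matter when reading a single number.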

Providers

17 providers. Every major LLM.

Configure API keys with arabench config. Swap models anytime.

Provider	Default Model
OpenAI	gpt-5.4
Anthropic	claude-opus-4-6
Google	gemini-3.1-pro-preview
DeepSeek	deepseek-chat (V3.2)
Mistral	mistral-large-latest
Cohere	command-a-03-2025
Groq	llama-3.3-70b-versatile
Qwen	qwen3.5-plus
xAI	grok-4-1-fast
AI21	jamba-large
Zhipu AI	glm-5
MiniMax	MiniMax-M2.7
Cerebras	llama-3.3-70b
SambaNova	DeepSeek-R1
Together	Qwen3.5-72B-Instruct
Fireworks	Qwen3.5-72B
Perplexity	sonar-pro
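
One way to picture "default model, swap anytime" is a lookup with optional overrides. This is a sketch of the idea only, not arabench's code: the function resolve_model, the lowercase provider keys, and the override mapping are assumptions. The three default models are copied from the table above.

```python
from typing import Mapping, Optional

# A few defaults copied from the table above; keys are hypothetical identifiers.
DEFAULT_MODELS = {
    "openai": "gpt-5.4",
    "anthropic": "claude-opus-4-6",
    "mistral": "mistral-large-latest",
}


def resolve_model(provider: str, overrides: Optional[Mapping[str, str]] = None) -> str:
    """Return the override for a provider if one is set, else its default model."""
    key = provider.strip().lower()
    if overrides and key in overrides:
        return overrides[key]
    if key not in DEFAULT_MODELS:
        raise KeyError(f"unknown provider: {provider}")
    return DEFAULT_MODELS[key]


print(resolve_model("OpenAI"))
print(resolve_model("mistral", {"mistral": "my-custom-model"}))  # hypothetical override
```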

CLI

6 commands. Zero config.

Everything you need to benchmark Arabic quality.

run

Full benchmark across all configured providers. 37 prompts, 8 categories, full scoring.

quick

Quick single-provider benchmark with a subset of prompts. Fast iteration.

compare

Side-by-side comparison of two providers. See exactly where one beats the other.

leaderboard

Ranked results from all previous runs. Track progress over time.

explain

Scoring methodology for any category. Understand what the numbers mean.

config

Set up API keys and preferences. Provider selection, model overrides, output format.
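
To make the compare idea concrete: a side-by-side comparison boils down to per-category differences. The sketch below is an illustration under assumptions, not the tool's implementation. The function compare_scores and the score dictionaries are hypothetical, and the numbers are made up.

```python
from typing import Dict, Mapping


def compare_scores(a: Mapping[str, float], b: Mapping[str, float]) -> Dict[str, float]:
    """Per-category difference (a minus b) for shared categories, largest gap first."""
    shared = sorted(set(a) & set(b))
    deltas = {category: a[category] - b[category] for category in shared}
    return dict(sorted(deltas.items(), key=lambda item: abs(item[1]), reverse=True))


# Made-up scores for two unnamed providers.
provider_a = {"translation": 82.0, "grammar": 74.0, "diacritization": 41.0}
provider_b = {"translation": 80.0, "grammar": 79.0, "diacritization": 63.0}

for category, delta in compare_scores(provider_a, provider_b).items():
    leader = "A" if delta > 0 else "B" if delta < 0 else "tie"
    print(f"{category}: {delta:+.1f} ({leader})")
```

Sorting by absolute gap puts the categories where the two providers differ most at the top, which is usually what a head-to-head reader wants first.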

Get Started

Five commands to benchmark Arabic.

# Install
$ pip install arabench

# Set up your API keys
$ arabench config

# Quick benchmark against one provider
$ arabench quick openai

# Run the full benchmark
$ arabench run

# Compare two providers head-to-head
$ arabench compare openai anthropic

# See the leaderboard
$ arabench leaderboard

# JSON output for CI pipelines
$ arabench run --json
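
In a CI pipeline, the JSON output could feed a simple quality gate. The sketch below assumes a report shaped like {"provider": {"overall": score}}. That schema is a guess, since the real output format of arabench run --json is not documented here, so check the actual output and adjust the keys. The function failing_providers and the threshold are also hypothetical.

```python
import json
from typing import List


def failing_providers(report_json: str, threshold: float) -> List[str]:
    """Names of providers whose overall score falls below the threshold.

    Assumes a hypothetical schema: {"<provider>": {"overall": <0-100 score>}}.
    """
    report = json.loads(report_json)
    return sorted(
        name for name, result in report.items() if result["overall"] < threshold
    )


# Made-up report standing in for the output of `arabench run --json`.
sample_report = json.dumps({
    "openai": {"overall": 81.5},
    "groq": {"overall": 58.0},
})

failed = failing_providers(sample_report, threshold=60.0)
print(failed)  # ['groq']
```

In a real job, the script would read the report from stdin or a file and exit with a non-zero status when the list is non-empty, which fails the build.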