artok shows Arabic costs more. arabench shows Arabic gets worse.
pip install arabench
Most LLMs are trained primarily on English data. When asked to work in Arabic, quality degrades across every dimension: translation loses nuance, grammar rules get broken, dialect distinctions disappear, and cultural knowledge becomes shallow.
Nobody benchmarks this systematically. arabench does. 37 carefully crafted prompts across 8 categories, scored 0–100, testing what actually matters for Arabic speakers.
English-Arabic and Arabic-English accuracy. Nuance, idiom, register preservation.
Morphological correctness, agreement, case endings. The hard stuff Arabic demands.
Distinguishing MSA from Levantine, Gulf, Egyptian, and Maghrebi Arabic.
Proverbs, history, customs, regional context. Does the model actually know Arabic culture?
Condensing Arabic text without losing meaning. Coherence and faithfulness.
Handling mixed Arabic-English input naturally. Real-world conversational patterns.
Correct placement of tashkeel on Arabic text. A uniquely Arabic challenge.
Following complex prompts written entirely in Arabic. Format, constraints, tone.
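How per-prompt scores roll up into a category score and an overall 0–100 score is not spelled out here. The following is a minimal sketch, assuming each prompt receives a 0–100 score and that categories are averaged with equal weight. Both are assumptions, and the category keys and numbers are placeholders rather than arabench output.

```python
from statistics import mean

# Hypothetical per-prompt scores (0-100), grouped by category.
# Keys and values are illustrative placeholders, not real benchmark results.
prompt_scores = {
    "translation": [80, 70, 90],
    "grammar": [60, 50],
    "diacritization": [40, 60, 50, 50],
}


def category_scores(scores_by_category):
    """Average the prompt scores within each category."""
    return {name: mean(scores) for name, scores in scores_by_category.items()}


def overall_score(scores_by_category):
    """Macro-average: every category counts equally, regardless of prompt count."""
    return mean(list(category_scores(scores_by_category).values()))


per_category = category_scores(prompt_scores)
overall = overall_score(prompt_scores)

print(per_category)
print(round(overall, 1))
```

A macro-average keeps a category with many prompts from dominating the headline number. A micro-average over all prompts would be an equally plausible choice and would give a different result.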
arabench config. Swap models anytime.

| Provider | Default Model |
|---|---|
| OpenAI | gpt-5.4 |
| Anthropic | claude-opus-4-6 |
| Google | gemini-3.1-pro-preview |
| DeepSeek | deepseek-chat (V3.2) |
| Mistral | mistral-large-latest |
| Cohere | command-a-03-2025 |
| Groq | llama-3.3-70b-versatile |
| Qwen | qwen3.5-plus |
| xAI | grok-4-1-fast |
| AI21 | jamba-large |
| Zhipu AI | glm-5 |
| MiniMax | MiniMax-M2.7 |
| Cerebras | llama-3.3-70b |
| SambaNova | DeepSeek-R1 |
| Together | Qwen3.5-72B-Instruct |
| Fireworks | Qwen3.5-72B |
| Perplexity | sonar-pro |
Full benchmark across all configured providers. 37 prompts, 8 categories, full scoring.
Quick single-provider benchmark with a subset of prompts. Fast iteration.
Side-by-side comparison of two providers. See exactly where one beats the other.
Ranked results from all previous runs. Track progress over time.
Scoring methodology for any category. Understand what the numbers mean.
Set up API keys and preferences. Provider selection, model overrides, output format.
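The side-by-side comparison described above comes down to per-category differences between two sets of scores. The following is a rough sketch of that idea in plain Python. The `compare` helper, the provider names, and the scores are all hypothetical, and nothing here reflects arabench's actual output format or internals.

```python
# Hypothetical per-category scores (0-100) for two providers.
# Placeholder numbers for illustration only, not real benchmark results.
provider_a = {"translation": 82, "grammar": 64, "dialects": 58}
provider_b = {"translation": 75, "grammar": 70, "dialects": 40}


def compare(a, b):
    """Return (category, a_minus_b) pairs, largest absolute gap first.

    Only categories present in both result sets are compared.
    """
    shared = sorted(a.keys() & b.keys())
    deltas = [(cat, a[cat] - b[cat]) for cat in shared]
    return sorted(deltas, key=lambda pair: abs(pair[1]), reverse=True)


gaps = compare(provider_a, provider_b)
for category, delta in gaps:
    leader = "A" if delta > 0 else "B" if delta < 0 else "tie"
    print(f"{category:12s} {delta:+4d}  ({leader})")
```

Sorting by absolute gap puts the categories where the two providers differ most at the top, which is usually the part of a comparison worth reading first.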