artok shows Arabic costs more. arabench shows Arabic gets worse.
pip install arabench
Most LLMs are trained primarily on English data. When asked to work in Arabic, quality degrades across every dimension: translation loses nuance, grammar rules get broken, dialect distinctions disappear, and cultural knowledge becomes shallow.
Nobody benchmarks this systematically. arabench does. 37 carefully crafted prompts across 8 categories, scored 0–100, testing what actually matters for Arabic speakers.
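To make the 0–100 scoring concrete, here is a minimal sketch of how per-prompt scores could be rolled up into per-category and overall numbers. It is not arabench's actual scoring code: the category names, the sample scores, and the choice of an unweighted mean are all assumptions made for illustration.

```python
from statistics import mean

# Hypothetical per-prompt results: (category, score on the 0-100 scale).
# Category names and scores are invented for illustration only.
results = [
    ("translation", 82),
    ("translation", 74),
    ("grammar", 61),
    ("dialect", 45),
    ("dialect", 55),
]

def category_averages(rows):
    """Group scores by category and return the mean of each group."""
    grouped = {}
    for category, score in rows:
        if not 0 <= score <= 100:
            raise ValueError(f"score out of range: {score}")
        grouped.setdefault(category, []).append(score)
    return {category: mean(scores) for category, scores in grouped.items()}

def overall_score(rows):
    """Unweighted mean of the category averages (an assumed aggregation),
    so a category with many prompts does not dominate the total."""
    return mean(list(category_averages(rows).values()))

if __name__ == "__main__":
    print(category_averages(results))
    print(overall_score(results))
```

Averaging per category first, then across categories, keeps a category with many prompts from outweighing one with few. A real harness might weight categories differently.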
arabench config. Swap models anytime.

| Provider | Default Model |
|---|---|
| OpenAI | gpt-5.4 |
| Anthropic | claude-opus-4-6 |
| Google | gemini-3.1-pro-preview |
| DeepSeek | deepseek-chat (V3.2) |
| Mistral | mistral-large-latest |
| Cohere | command-a-03-2025 |
| Groq | llama-3.3-70b-versatile |
| Qwen | qwen3.5-plus |
| xAI | grok-4-1-fast |
| AI21 | jamba-large |
| Zhipu AI | glm-5 |
| MiniMax | MiniMax-M2.7 |
| Cerebras | llama-3.3-70b |
| SambaNova | DeepSeek-R1 |
| Together | Qwen3.5-72B-Instruct |
| Fireworks | Qwen3.5-72B |
| Perplexity | sonar-pro |
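The idea of a per-provider default that can be swapped at any time can be sketched as a simple lookup with overrides. This is an illustrative sketch only: `DEFAULT_MODELS`, `resolve_model`, and the override value are hypothetical names and do not reflect arabench's real configuration format or API. The defaults shown are a subset copied from the table above.

```python
# Subset of the provider defaults listed in the table above.
DEFAULT_MODELS = {
    "OpenAI": "gpt-5.4",
    "Anthropic": "claude-opus-4-6",
    "Mistral": "mistral-large-latest",
}

def resolve_model(provider, overrides=None):
    """Return the model to use for a provider.

    A user-supplied override wins; otherwise fall back to the default.
    Hypothetical helper, not part of arabench.
    """
    overrides = overrides or {}
    if provider in overrides:
        return overrides[provider]
    if provider not in DEFAULT_MODELS:
        raise KeyError(f"unknown provider: {provider}")
    return DEFAULT_MODELS[provider]

if __name__ == "__main__":
    # No override: the table default is used.
    print(resolve_model("OpenAI"))
    # With an override: the user's choice replaces the default.
    # "my-custom-model" is a placeholder, not a real model name.
    print(resolve_model("OpenAI", {"OpenAI": "my-custom-model"}))
```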