arabench عربنش

artok shows Arabic costs more. arabench shows Arabic gets worse.

37 prompts · 8 categories · 17 providers
One score that tells you how well a model actually handles Arabic. Grammar, dialects, culture, diacritization — everything that matters.
$ pip install arabench

Chapter I
The Problem
LLMs underperform on Arabic.

Most LLMs are trained primarily on English data. When asked to work in Arabic, quality degrades across every dimension — translation loses nuance, grammar rules get broken, dialect distinctions disappear, and cultural knowledge becomes shallow.


Nobody benchmarks this systematically. arabench does. 37 carefully crafted prompts across 8 categories, scored 0–100, testing what actually matters for Arabic speakers.


Chapter II
8 Categories
Each scored 0–100. Overall score is a weighted average.
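As a rough illustration of how a weighted average turns per-category scores into one overall score, here is a minimal Python sketch. The category names and weights are placeholders invented for the example; this page does not list arabench's actual categories or weights, and the function name `overall_score` is hypothetical.

```python
def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-category scores, each on a 0-100 scale."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same categories")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    for name, value in scores.items():
        if not 0 <= value <= 100:
            raise ValueError(f"{name} score {value} is outside 0-100")
    # Dividing by the total weight keeps the result on the same 0-100 scale.
    return sum(scores[c] * weights[c] for c in scores) / total_weight


# Hypothetical categories and weights, for illustration only.
example_scores = {"grammar": 80.0, "dialects": 60.0, "culture": 70.0, "diacritization": 50.0}
example_weights = {"grammar": 2.0, "dialects": 1.0, "culture": 1.0, "diacritization": 1.0}

print(overall_score(example_scores, example_weights))  # prints 68.0
```

Because the sum is divided by the total weight, the weights do not need to add up to 1; only their ratios matter.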

Chapter III
17 Providers
Configure keys with arabench config. Swap models anytime.
Provider      Default Model
OpenAI        gpt-5.4
Anthropic     claude-opus-4-6
Google        gemini-3.1-pro-preview
DeepSeek      deepseek-chat (V3.2)
Mistral       mistral-large-latest
Cohere        command-a-03-2025
Groq          llama-3.3-70b-versatile
Qwen          qwen3.5-plus
xAI           grok-4-1-fast
AI21          jamba-large
Zhipu AI      glm-5
MiniMax       MiniMax-M2.7
Cerebras      llama-3.3-70b
SambaNova     DeepSeek-R1
Together      Qwen3.5-72B-Instruct
Fireworks     Qwen3.5-72B
Perplexity    sonar-pro

Chapter IV
Commands
6 commands. Zero config.

Chapter V
Get Started
# Install
$ pip install arabench

# Set up your API keys
$ arabench config

# Quick benchmark against one provider
$ arabench quick openai

# Run the full benchmark
$ arabench run

# Compare two providers head-to-head
$ arabench compare openai anthropic

# See the leaderboard
$ arabench leaderboard

# JSON output for CI pipelines
$ arabench run --json
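
One way a CI pipeline might use the JSON output is to fail the build when the overall score drops below a threshold. The sketch below is only an assumption about how that could look: the schema of `arabench run --json` is not described on this page, so the `overall` field name, the sample payload, and the `passes_gate` helper are all hypothetical.

```python
import json


def passes_gate(report_text: str, minimum: float, key: str = "overall") -> bool:
    """Return True if the score stored under `key` meets the minimum.

    `key` defaults to "overall", an assumed field name; adjust it to
    whatever the real JSON report uses.
    """
    report = json.loads(report_text)
    return float(report[key]) >= minimum


# Hand-written sample payload, not real arabench output.
sample_report = '{"overall": 72.5}'

print(passes_gate(sample_report, minimum=70))  # prints True
print(passes_gate(sample_report, minimum=80))  # prints False
```

In a pipeline, a wrapper script could read the report from standard input and exit with a non-zero status when `passes_gate` returns False.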