arabench عربنش

artok shows Arabic costs more. arabench shows Arabic gets worse.

37 prompts · 8 categories · 17 providers
One score that tells you how well a model actually handles Arabic. Grammar, dialects, culture, diacritization — everything that matters.

$ pip install arabench

Chapter I

The Problem

LLMs underperform on Arabic.

Most LLMs are trained primarily on English data. When asked to work in Arabic, quality degrades across every dimension — translation loses nuance, grammar rules get broken, dialect distinctions disappear, and cultural knowledge becomes shallow.

Nobody benchmarks this systematically. arabench does. 37 carefully crafted prompts across 8 categories, scored 0–100, testing what actually matters for Arabic speakers.

Chapter II

8 Categories

Each category is scored 0–100. The overall score is a weighted average of the eight category scores.

Translation
English-Arabic and Arabic-English accuracy, nuance, idiom preservation
Grammar
Morphological correctness, agreement, case endings
Dialect Detection
Distinguishing MSA from Levantine, Gulf, Egyptian, Maghrebi
Cultural Knowledge
Proverbs, history, customs, regional context
Summarization
Condensing Arabic text without losing meaning
Code-switching
Handling mixed Arabic-English input naturally
Diacritization
Correct placement of tashkeel on Arabic text
Instruction Following
Following complex prompts written entirely in Arabic
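
To make the weighted average concrete, here is a minimal sketch in Python. The weights are an assumption (equal weights for all eight categories), since this page does not state the weights arabench uses, and `overall_score` is an illustrative name rather than part of the arabench API.

```python
# ASSUMPTION: equal weights. arabench's real weights are not given on this page.
CATEGORY_WEIGHTS = {
    "translation": 1.0,
    "grammar": 1.0,
    "dialect_detection": 1.0,
    "cultural_knowledge": 1.0,
    "summarization": 1.0,
    "code_switching": 1.0,
    "diacritization": 1.0,
    "instruction_following": 1.0,
}


def overall_score(category_scores, weights=CATEGORY_WEIGHTS):
    """Return the weighted average of per-category scores (each 0-100)."""
    missing = set(weights) - set(category_scores)
    if missing:
        raise ValueError(f"missing category scores: {sorted(missing)}")
    for name, score in category_scores.items():
        if not 0 <= score <= 100:
            raise ValueError(f"{name} score {score} is outside 0-100")
    total_weight = sum(weights.values())
    # Sum of (score x weight), normalized by the total weight.
    weighted_sum = sum(category_scores[name] * w for name, w in weights.items())
    return weighted_sum / total_weight
```

Under these equal weights, a model that scores 80 in seven categories and 40 in diacritization gets an overall score of 75.0. Raising the weight of a category makes a weak result there pull the overall score down further.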

Chapter III

17 Providers

Configure keys with arabench config. Swap models anytime.

Provider	Default Model
OpenAI	gpt-5.4
Anthropic	claude-opus-4-6
Google	gemini-3.1-pro-preview
DeepSeek	deepseek-chat (V3.2)
Mistral	mistral-large-latest
Cohere	command-a-03-2025
Groq	llama-3.3-70b-versatile
Qwen	qwen3.5-plus
xAI	grok-4-1-fast
AI21	jamba-large
Zhipu AI	glm-5
MiniMax	MiniMax-M2.7
Cerebras	llama-3.3-70b
SambaNova	DeepSeek-R1
Together	Qwen3.5-72B-Instruct
Fireworks	Qwen3.5-72B
Perplexity	sonar-pro

Chapter IV

Commands

6 commands. Zero config.

run
Full benchmark across all configured providers
quick <provider>
Quick single-provider benchmark, subset of prompts
compare <a> <b>
Side-by-side comparison of two providers
leaderboard
Ranked results from all previous runs
explain <category>
Scoring methodology for a category
config
Set up API keys and preferences

Chapter V

Get Started

# Install
$ pip install arabench

# Set up your API keys
$ arabench config

# Quick benchmark against one provider
$ arabench quick openai

# Run the full benchmark
$ arabench run

# Compare two providers head-to-head
$ arabench compare openai anthropic

# See the leaderboard
$ arabench leaderboard

# JSON output for CI pipelines
$ arabench run --json
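
In a CI pipeline, the JSON output could feed a small gate script that fails the job when a score drops too low. The sketch below is only an illustration: it assumes the JSON goes to stdout as a list of objects with `provider` and `overall` keys. That shape is a guess, not arabench's documented schema, so inspect the real output first. `run_benchmark`, `failing_providers`, and `main` are hypothetical helper names.

```python
import json
import subprocess


def run_benchmark():
    """Run `arabench run --json` and return whatever it prints to stdout."""
    completed = subprocess.run(
        ["arabench", "run", "--json"],
        capture_output=True,
        text=True,
        check=True,  # raise if the command exits with a non-zero status
    )
    return completed.stdout


def failing_providers(report_json, minimum):
    """Return the sorted names of providers scoring below `minimum`.

    ASSUMPTION: the report is a JSON list such as
    [{"provider": "openai", "overall": 82.5}, ...].
    This schema is hypothetical.
    """
    report = json.loads(report_json)
    return sorted(
        entry["provider"] for entry in report if entry["overall"] < minimum
    )


def main(minimum=70.0):
    """Return 1 if any provider is below the threshold, otherwise 0."""
    failures = failing_providers(run_benchmark(), minimum)
    if failures:
        print("below threshold:", ", ".join(failures))
        return 1
    return 0
```

Ending the script with `sys.exit(main())` (after `import sys`) turns the return value into the process exit code, which is what most CI systems use to mark a step as failed.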