Scan your training data for Arabic-specific quality issues before they wreck your model.
```shell
pip install majal
```
Mojibake corrupts 3% of examples. Invisible chars break word boundaries. Romanized Arabic leaks into training. You don't know until eval scores tank.
One command finds every issue. Auto-fix cleans what it can. You ship clean data, your model actually learns Arabic.
**Encoding**

- Garbled Arabic from encoding mismatches (UTF-8 read as Windows-1256)
- Different encodings mixed within the same example
- Byte Order Mark (BOM) remnants in text
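The classic UTF-8-read-as-Windows-1256 mismatch can often be repaired by reversing the mistake. A minimal illustrative round-trip (not majal's actual implementation):

```python
# UTF-8 bytes mis-decoded as Windows-1256 produce mojibake; because
# cp1256 is a single-byte codec, re-encoding the garbled text as
# cp1256 recovers the original bytes, which then decode as UTF-8.
raw = "مرحبا".encode("utf-8")          # correct UTF-8 bytes
garbled = raw.decode("cp1256")          # what a cp1256 reader shows
repaired = garbled.encode("cp1256").decode("utf-8")
assert repaired == "مرحبا"
```

This only works when every garbled character still maps back to its original byte, which is why automated fixes should preview a diff before writing.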
**Invisible characters**

- Zero-width spaces breaking Arabic word boundaries
- Bidirectional override characters corrupting text flow
- Control characters that silently survive copy-paste
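Stripping these characters is a small regex job. A hypothetical cleaner (a sketch, not majal's code) that removes zero-width spaces, BOM remnants, bidi controls, and C0 control characters, while deliberately keeping ZWNJ/ZWJ, which can be meaningful in Arabic-script text:

```python
import re

# ZWSP, BOM, bidi embeddings/overrides/isolates, and C0 controls
# (tab, newline, and carriage return are intentionally excluded).
INVISIBLES = re.compile(
    "[\u200b\ufeff\u202a-\u202e\u2066-\u2069"
    "\x00-\x08\x0b\x0c\x0e-\x1f\x7f]"
)

def strip_invisibles(text: str) -> str:
    return INVISIBLES.sub("", text)
```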
**Content quality**

- Empty or whitespace-only response fields
- Text cut off mid-sentence or mid-word
- Near-duplicate training examples (fuzzy match)
- Very short, repetitive, or boilerplate responses
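Fuzzy near-duplicate detection can be sketched with the standard library. This pairwise version is quadratic and illustrative only; production tools typically use MinHash or SimHash to scale:

```python
from difflib import SequenceMatcher

def near_duplicates(examples, threshold=0.9):
    """Return (i, j, ratio) for pairs whose similarity >= threshold."""
    pairs = []
    for i in range(len(examples)):
        for j in range(i + 1, len(examples)):
            ratio = SequenceMatcher(None, examples[i], examples[j]).ratio()
            if ratio >= threshold:
                pairs.append((i, j, ratio))
    return pairs
```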
**Arabic language**

- MSA and dialect mixed within the same example
- Romanized Arabic (e.g. "7abibi") in fields that should be Arabic
- Inconsistent diacritics across the dataset
- Arabic letters not joining correctly (presentation forms)
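Two of these checks are easy to illustrate (sketches under assumptions, not majal's actual rules): presentation forms fold back to base letters under NFKC, and romanized Arabic (Arabizi) is often flagged by digits standing in for Arabic sounds inside Latin words:

```python
import re
import unicodedata

def fix_presentation_forms(text: str) -> str:
    # Arabic presentation forms (U+FB50-U+FDFF, U+FE70-U+FEFF) are
    # positional glyph variants; NFKC folds them to base letters.
    # Note: NFKC also normalizes other compatibility characters.
    return unicodedata.normalize("NFKC", text)

# Crude Arabizi heuristic: 2=ء, 3=ع, 7=ح, etc. embedded in a Latin
# word. Expect false positives; this is a rough flag, not a classifier.
ARABIZI = re.compile(r"[a-zA-Z]*[23579][a-zA-Z]{2,}")

def looks_romanized(text: str) -> bool:
    return bool(ARABIZI.search(text))
```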
**Structure**

- Missing or unexpected fields in structured data
- Broken JSON escaping corrupting Arabic text
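A minimal structural scan over JSONL lines might look like this (illustrative; the required field names here are assumptions, not majal's schema):

```python
import json

def scan_jsonl_lines(lines, required=("instruction", "response")):
    """Flag unparseable lines and missing required fields."""
    issues = []
    for lineno, line in enumerate(lines, 1):
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            issues.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        for field in required:
            if field not in obj:
                issues.append((lineno, f"missing field: {field}"))
    return issues
```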
- **Scan**: find quality issues, with rich table output showing severity, line number, and context.
- **Stats**: dataset statistics, including language breakdown, token counts, and field analysis.
- **Fix**: auto-fix issues, with a diff preview before writing; pass `--yes` to skip the prompt.
- **Explain**: learn about each check, with visual examples and the rationale for its severity.
- **Sample**: draw a random sample from the dataset, with quality annotations per example.
Fields are auto-detected, or set explicitly with `--field`. majal handles instruction/response pairs, chat messages, and raw text.

| Format | Auto-detect | Description |
|---|---|---|
| JSONL | instruction, response, text, messages | One JSON object per line |
| CSV | Text columns by heuristic | Comma or tab separated |
| TXT | Full line content | One example per line or paragraph-separated |
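Extension-based format detection could be sketched as follows (an assumed mapping for illustration; majal's own detection may also sniff file contents):

```python
from pathlib import Path

def detect_format(path: str) -> str:
    # Map file extension to a reader; fall back to plain text.
    suffix = Path(path).suffix.lower()
    return {".jsonl": "jsonl", ".csv": "csv",
            ".tsv": "csv", ".txt": "txt"}.get(suffix, "txt")
```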