majal
مجال
Scan your training data for Arabic-specific quality issues before they wreck your model.
15+ checks across 5 categories, with support for JSONL, CSV, and TXT input. Built from experience training Arabic LLMs.
Chapter I
The Problem
Bad data = bad models.
Without majal
Mojibake corrupts 3% of examples. Invisible chars break word boundaries. Romanized Arabic leaks into training. You don't know until eval scores tank.
With majal
One command finds every issue. Auto-fix cleans what it can. You ship clean data, and your model actually learns Arabic.
Chapter II
15+ Checks
Each targets a real problem from training Arabic LLMs.
Encoding
- mojibake ERROR
Garbled Arabic from encoding mismatches
- mixed-encoding ERROR
Different encodings within the same example
- bom-artifact WARN
Byte Order Mark remnants in text
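To make the encoding checks concrete, here is a minimal sketch of how mojibake and BOM remnants could be detected. It is not majal's implementation: the function names are hypothetical, and it assumes the most common failure mode, UTF-8 Arabic wrongly decoded as Latin-1 across the whole string.

```python
def looks_like_mojibake(text: str) -> bool:
    """Return True if text looks like UTF-8 Arabic that was decoded as Latin-1."""
    try:
        # Undo the suspected wrong decode, then read the bytes as UTF-8.
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Clean Arabic cannot be encoded as Latin-1, so it lands here.
        return False
    # Only flag the text if the round trip produced Arabic letters.
    return any("\u0600" <= ch <= "\u06ff" for ch in repaired)


def has_bom_artifact(text: str) -> bool:
    """Return True if a Byte Order Mark (U+FEFF) survives anywhere in the text."""
    return "\ufeff" in text


garbled = "مرحبا".encode("utf-8").decode("latin-1")
print(looks_like_mojibake(garbled))   # True
print(looks_like_mojibake("مرحبا"))   # False
```

A string that mixes garbled and clean Arabic fails the Latin-1 encode step and is not flagged by this sketch; a real mixed-encoding check would need to work on smaller spans.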
Invisible
- zwsp-injection ERROR
Zero-width spaces breaking Arabic word boundaries
- bidi-override ERROR
Bidirectional override characters corrupting text flow
- invisible-control WARN
Control characters that survive copy-paste silently
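The invisible-character checks can be approximated with the standard library alone. The sketch below is an illustration under assumptions, not majal's code: the character sets and the `find_invisible` helper are hypothetical choices.

```python
import unicodedata

# ZWNJ (U+200C) and ZWJ (U+200D) are left out on purpose: they are
# legitimate in Arabic-script text.
ZERO_WIDTH = {"\u200b", "\u2060", "\ufeff"}
# Embeddings, overrides, and the pop character (U+202A..U+202E).
BIDI_CONTROLS = {"\u202a", "\u202b", "\u202c", "\u202d", "\u202e"}
ALLOWED_CONTROLS = {"\n", "\r", "\t"}


def find_invisible(text):
    """Return (index, kind, name) for each suspicious invisible character."""
    hits = []
    for i, ch in enumerate(text):
        if ch in ZERO_WIDTH:
            kind = "zero-width"
        elif ch in BIDI_CONTROLS:
            kind = "bidi"
        elif unicodedata.category(ch) == "Cc" and ch not in ALLOWED_CONTROLS:
            kind = "control"
        else:
            continue
        # Control characters have no Unicode name, so fall back to U+XXXX.
        hits.append((i, kind, unicodedata.name(ch, f"U+{ord(ch):04X}")))
    return hits


for hit in find_invisible("مر\u200bحبا\u202e"):
    print(hit)
```

Reporting the index alongside the character name makes it possible to show the offending position in a diff before anything is stripped.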
Content
- empty-response ERROR
Empty or whitespace-only response fields
- truncated-text WARN
Text cut off mid-sentence or mid-word
- duplicate-example WARN
Near-duplicate training examples (fuzzy match)
- low-quality INFO
Very short, repetitive, or boilerplate responses
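As a rough illustration of the content checks, the sketch below flags whitespace-only responses and finds near-duplicates with a fuzzy ratio. The helper names and the 0.9 threshold are assumptions for the example; majal's actual matching strategy is not documented here.

```python
from difflib import SequenceMatcher


def is_empty(response: str) -> bool:
    """True for empty or whitespace-only responses."""
    return not response.strip()


def near_duplicates(texts, threshold=0.9):
    """Return index pairs whose similarity ratio meets the threshold.

    Pairwise comparison is O(n^2), so this only suits small samples.
    """
    pairs = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            ratio = SequenceMatcher(None, texts[i], texts[j]).ratio()
            if ratio >= threshold:
                pairs.append((i, j))
    return pairs


examples = ["ما هي عاصمة مصر؟", "ما هي عاصمة مصر ؟", "كيف حالك اليوم"]
print(near_duplicates(examples))  # [(0, 1)]
```

For full datasets, a hashing scheme such as MinHash would be a more practical substitute for the pairwise loop.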
Arabic
- dialect-mixing WARN
MSA and dialect mixed within the same example
- transliteration-leak ERROR
Romanized Arabic in Arabic-expected fields
- tashkeel-inconsistency INFO
Inconsistent diacritics within the dataset
- broken-shaping ERROR
Arabic letters not joining correctly (presentation forms)
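The Arabic-specific checks mostly come down to Unicode ranges. Below is a minimal sketch with hypothetical helper names; the ranges are standard Unicode blocks, but how majal thresholds or combines them is an assumption left open.

```python
import unicodedata


def has_presentation_forms(text):
    """True if text contains Arabic Presentation Forms-A or -B code points."""
    return any(
        "\ufb50" <= ch <= "\ufdff" or "\ufe70" <= ch <= "\ufefc" for ch in text
    )


def normalize_shaping(text):
    """Fold presentation forms back to base letters via NFKC."""
    return unicodedata.normalize("NFKC", text)


def arabic_letter_ratio(text):
    """Fraction of alphabetic characters that fall in the Arabic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0600" <= ch <= "\u06ff" for ch in letters) / len(letters)


def has_tashkeel(text):
    """True if text carries any of the common diacritics U+064B..U+0652."""
    return any("\u064b" <= ch <= "\u0652" for ch in text)


print(arabic_letter_ratio("ana b7ebak"))  # 0.0, a likely transliteration leak
print(arabic_letter_ratio("مرحبا"))       # 1.0
```

NFKC also rewrites other compatibility characters (ligatures, full-width forms), so it is worth previewing its changes before applying it to a whole dataset.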
Format
- field-mismatch ERROR
Missing or unexpected fields in structured data
- json-escape-error ERROR
Broken JSON escaping corrupting Arabic text
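For the format checks, a per-line validator for JSONL might look like the sketch below. The field names `prompt` and `response` and the message strings are hypothetical; only the standard `json` module is used.

```python
import json


def check_jsonl_line(line, expected_fields):
    """Return a list of problems found in one JSONL line (empty if clean)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        # Broken escapes such as a truncated \uXXXX sequence end up here.
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    keys = set(record)
    problems = [f"missing field: {name}" for name in sorted(expected_fields - keys)]
    problems += [f"unexpected field: {name}" for name in sorted(keys - expected_fields)]
    return problems


expected = {"prompt", "response"}  # hypothetical schema
print(check_jsonl_line('{"prompt": "مرحبا", "response": "أهلا"}', expected))  # []
print(check_jsonl_line('{"prompt": "مرحبا"}', expected))  # ['missing field: response']
```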
Chapter III
Commands
5 commands. Zero config.
- scan
Find quality issues with Rich table output
- stats
Dataset statistics: language, tokens, fields
- fix
Auto-fix issues with diff preview
- explain
Learn about each check visually
- sample
Random sample with quality annotations
Chapter V
Get Started
# Install
$ pip install majal
# Scan for issues
$ majal scan data.jsonl
# Get dataset stats
$ majal stats data.jsonl
# Auto-fix
$ majal fix data.jsonl
# Learn about a check
$ majal explain mojibake
# Random sample
$ majal sample data.jsonl --n 10
Chapter VI
As a Library
>>> from majal import scan_dataset
>>> results = scan_dataset("data.jsonl")
>>> for issue in results.issues:
...     print(f"[{issue.severity}] {issue.check}")
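Once issues are in hand, a per-check tally is often more useful than a flat listing. The sketch below relies only on the `severity` and `check` attributes shown above and assumes they hold plain strings; the `summarize` helper is hypothetical, and stand-in objects replace real scan results so it runs without majal installed.

```python
from collections import Counter
from types import SimpleNamespace


def summarize(issues):
    """Count issues per (severity, check) pair."""
    return Counter((issue.severity, issue.check) for issue in issues)


# Stand-in objects; with majal installed, pass results.issues instead.
fake_issues = [
    SimpleNamespace(severity="ERROR", check="mojibake"),
    SimpleNamespace(severity="ERROR", check="mojibake"),
    SimpleNamespace(severity="WARN", check="bom-artifact"),
]
for (severity, check), count in summarize(fake_issues).most_common():
    print(f"[{severity}] {check}: {count}")
```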