majal

مجال

Scan your training data for Arabic-specific quality issues before they wreck your model.

15+ checks across 5 categories — supports JSONL, CSV, and TXT. Built from experience training Arabic LLMs.

$ pip install majal

Chapter I

The Problem

Bad data = bad models.

Without majal

Mojibake corrupts 3% of examples. Invisible chars break word boundaries. Romanized Arabic leaks into training. You don't know until eval scores tank.

With majal

One command finds every issue. Auto-fix cleans what it can. You ship clean data, and your model actually learns Arabic.

Chapter II

15+ Checks

Each targets a real problem from training Arabic LLMs.

Encoding

mojibake ERROR
Garbled Arabic from encoding mismatches
mixed-encoding ERROR
Different encodings within the same example
bom-artifact WARN
Byte Order Mark remnants in text
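
To make the mojibake case concrete, here is a minimal sketch of one common heuristic: if a string can be re-encoded as Latin-1 and the resulting bytes decode cleanly as UTF-8 into Arabic letters, it was probably UTF-8 Arabic read with the wrong codec. This is an illustration only, not majal's implementation; the function name is hypothetical, and it only covers the Latin-1 case (Windows-1252 mix-ups would need extra handling).

```python
def looks_like_mojibake(text: str) -> bool:
    """Heuristic: does `text` look like UTF-8 Arabic mis-decoded as Latin-1?"""
    try:
        # Undo the wrong decode: recover the raw bytes, then read them as UTF-8.
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or the bytes are not valid UTF-8.
        return False
    # Only report when the repaired string actually contains Arabic letters.
    return any("\u0600" <= ch <= "\u06ff" for ch in repaired)


original = "مرحبا"
garbled = original.encode("utf-8").decode("latin-1")  # simulate the bad decode
print(looks_like_mojibake(garbled))   # True
print(looks_like_mojibake(original))  # False
```

Clean Arabic text fails the Latin-1 encode step and plain accented Latin text fails the UTF-8 decode step, so both are left alone.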

Invisible

zwsp-injection ERROR
Zero-width spaces breaking Arabic word boundaries
bidi-override ERROR
Bidirectional override characters corrupting text flow
invisible-control WARN
Control characters that survive copy-paste silently
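
The characters in this category can be located with the standard library alone. The sketch below is an assumption about how such a scan could work, not majal's code: the function name is hypothetical and the labels are simply borrowed from the check names above. It deliberately ignores ZWNJ (U+200C) and ZWJ (U+200D), which have legitimate uses in Arabic-script text.

```python
import unicodedata

ZERO_WIDTH_SPACE = "\u200b"
BIDI_OVERRIDES = {"\u202d", "\u202e"}  # LEFT-TO-RIGHT / RIGHT-TO-LEFT OVERRIDE
ALLOWED_CONTROLS = {"\n", "\r", "\t"}  # ordinary whitespace controls


def find_invisible(text: str) -> list[tuple[int, str]]:
    """Return (index, label) for each suspicious invisible character."""
    hits = []
    for i, ch in enumerate(text):
        if ch == ZERO_WIDTH_SPACE:
            hits.append((i, "zwsp-injection"))
        elif ch in BIDI_OVERRIDES:
            hits.append((i, "bidi-override"))
        elif unicodedata.category(ch) == "Cc" and ch not in ALLOWED_CONTROLS:
            # Category "Cc" covers the C0/C1 control characters.
            hits.append((i, "invisible-control"))
    return hits


print(find_invisible("كت\u200bاب"))  # [(2, 'zwsp-injection')]
```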

Content

empty-response ERROR
Empty or whitespace-only response fields
truncated-text WARN
Text cut off mid-sentence or mid-word
duplicate-example WARN
Near-duplicate training examples (fuzzy match)
low-quality INFO
Very short, repetitive, or boilerplate responses
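
As an illustration of what a fuzzy match for near-duplicates can mean, the sketch below compares two examples with `difflib.SequenceMatcher` from the standard library. The matching method majal actually uses is not stated here, so treat the function name and the 0.9 threshold as assumptions.

```python
from difflib import SequenceMatcher


def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """True when the two strings are at least `threshold` similar (0.0 to 1.0)."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


# The two questions differ only by a space before the question mark.
print(is_near_duplicate("ما هي عاصمة مصر؟", "ما هي عاصمة مصر ؟"))  # True
```

Comparing every pair this way is quadratic in dataset size, so a real scan over a large dataset would likely need hashing or blocking first.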

Arabic

dialect-mixing WARN
MSA and dialect mixed within the same example
transliteration-leak ERROR
Romanized Arabic in Arabic-expected fields
tashkeel-inconsistency INFO
Inconsistent diacritics within the dataset
broken-shaping ERROR
Arabic letters not joining correctly (presentation forms)
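
The presentation-forms problem is easy to show in code. Text extracted from PDFs or old systems often stores pre-shaped glyphs from the Arabic Presentation Forms blocks (U+FB50 to U+FDFF and U+FE70 to U+FEFF) instead of the base letters. The sketch below detects them by code point range and maps them back with NFKC normalization. It is a coarse illustration under those assumptions, not majal's implementation, and the function names are hypothetical.

```python
import unicodedata


def has_presentation_forms(text: str) -> bool:
    """True if any character falls in Arabic Presentation Forms-A or -B."""
    for ch in text:
        cp = ord(ch)
        in_forms_a = 0xFB50 <= cp <= 0xFDFF
        in_forms_b = 0xFE70 <= cp <= 0xFEFE  # stop before U+FEFF (the BOM)
        if in_forms_a or in_forms_b:
            return True
    return False


def to_base_letters(text: str) -> str:
    """Map presentation forms back to base Arabic letters via NFKC."""
    return unicodedata.normalize("NFKC", text)


shaped = "\ufe91\ufe8e\ufe8f"  # BEH initial, ALEF final, BEH isolated
print(has_presentation_forms(shaped))      # True
print(to_base_letters(shaped) == "باب")    # True
```

NFKC also rewrites other compatibility characters (full-width digits, for example), so it changes more than shaping; check the diff before applying it to a whole dataset.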

Format

field-mismatch ERROR
Missing or unexpected fields in structured data
json-escape-error ERROR
Broken JSON escaping corrupting Arabic text

Chapter III

Commands

5 commands. Zero config.

scan
Find quality issues with Rich table output
stats
Dataset statistics — language, tokens, fields
fix
Auto-fix issues with diff preview
explain
Learn about each check visually
sample
Random sample with quality annotations

Chapter IV

Formats

Auto-detects text fields. Override with --field.

Format	Auto-detected text fields	Description
JSONL	instruction, response, text, messages	One JSON object per line
CSV	Text columns by heuristic	Comma or tab separated
TXT	Full line content	One example per line or paragraph-separated
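
For readers unfamiliar with JSONL, the sketch below shows what picking text fields out of a single line could look like, using the field names from the table. It is an assumption-laden illustration rather than majal's detection logic: the function name is hypothetical, and `messages` is left out because its structure is not described here.

```python
import json

# Field names taken from the JSONL row of the table above.
TEXT_FIELDS = ("instruction", "response", "text")


def text_fields(line: str) -> dict[str, str]:
    """Parse one JSONL line and keep only the known string-valued text fields."""
    record = json.loads(line)
    return {
        key: value
        for key, value in record.items()
        if key in TEXT_FIELDS and isinstance(value, str)
    }


line = '{"id": 7, "instruction": "ترجم إلى الإنجليزية", "response": "ok"}'
print(text_fields(line))
```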

Chapter V

Get Started

# Install
$ pip install majal

# Scan for issues
$ majal scan data.jsonl

# Get dataset stats
$ majal stats data.jsonl

# Auto-fix
$ majal fix data.jsonl

# Learn about a check
$ majal explain mojibake

# Random sample
$ majal sample data.jsonl --n 10

Chapter VI

As a Library

>>> from majal import scan_dataset
>>> results = scan_dataset("data.jsonl")
>>> for issue in results.issues:
...     print(f"[{issue.severity}] {issue.check}")
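
A natural next step is to summarize the issues by severity. The sketch below assumes only what the snippet above shows, namely that each issue exposes `.severity` and `.check`. The helper name is hypothetical, and the demo uses stand-in objects with string severities (an assumption about the real type) so it runs without majal installed.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_severity(issues) -> Counter:
    """Tally issues by their `.severity` attribute."""
    return Counter(issue.severity for issue in issues)


# Stand-in objects; with majal you would pass `results.issues` instead.
demo_issues = [
    SimpleNamespace(severity="ERROR", check="mojibake"),
    SimpleNamespace(severity="ERROR", check="zwsp-injection"),
    SimpleNamespace(severity="WARN", check="dialect-mixing"),
]

for severity, count in count_by_severity(demo_issues).most_common():
    print(f"{severity}: {count}")
```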