Data Quality Tool

majal

مجال

Scan your training data for Arabic-specific quality issues before they wreck your model.

15+ checks · 5 categories · 3 formats
$ pip install majal
The Problem
Bad data = bad models.
Built from experience training Arabic LLMs. Encoding issues, invisible characters, dialect mixing, and transliteration leaks are hard-to-spot problems that silently degrade your model.
Without majal

Mojibake corrupts 3% of examples. Invisible chars break word boundaries. Romanized Arabic leaks into training. You don't know until eval scores tank.

With majal

One command finds every issue. Auto-fix cleans what it can. You ship clean data, and your model actually learns Arabic.

Detection
15+ checks across 5 categories.
Each check targets a real problem we hit training Arabic LLMs. Severity levels: ERROR, WARN, INFO.
Encoding

mojibake ERROR

Garbled Arabic from encoding mismatches (UTF-8 read as Windows-1256)

mixed-encoding ERROR

Different encodings within the same example

bom-artifact WARN

Byte Order Mark remnants in text
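
The mojibake case above (UTF-8 bytes decoded as Windows-1256) can be reproduced and reversed with the Python standard library. The sketch below only illustrates the failure mode; it is not majal's actual detector. `repair_mojibake` is a hypothetical helper name, and the approach assumes the text was mis-decoded exactly once.

```python
from typing import Optional


def repair_mojibake(text: str) -> Optional[str]:
    """Return the repaired string if `text` looks like UTF-8 that was
    decoded as Windows-1256, otherwise None. (Hypothetical helper.)"""
    try:
        # Undo the wrong decode, then read the same bytes as UTF-8.
        repaired = text.encode("cp1256").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in cp1256, or the bytes are not valid UTF-8.
        # This is the usual outcome for clean Arabic text.
        return None
    # Pure ASCII survives the round trip unchanged, so it is not mojibake.
    return repaired if repaired != text else None


clean = "\u0645\u0631\u062d\u0628\u0627"  # the word "marhaba" in Arabic script
garbled = clean.encode("utf-8").decode("cp1256")  # simulate the bad read
print(garbled)
print(repair_mojibake(garbled) == clean)
```

This is a heuristic: a string that happens to round-trip cleanly would be reported as mojibake, so a real check would likely add further signals before raising an ERROR.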

Invisible

zwsp-injection ERROR

Zero-width spaces breaking Arabic word boundaries

bidi-override ERROR

Bidirectional override characters corrupting text flow

invisible-control WARN

Control characters that survive copy-paste silently
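
As a rough illustration of the three invisible-character checks, the sketch below walks a string and labels suspicious code points. The function name and the exact character sets are assumptions made for this example, not majal's implementation. In particular, it treats only U+202D and U+202E as overrides, and it deliberately ignores ZWNJ (U+200C) and ZWJ (U+200D), which have legitimate uses in Arabic-script text.

```python
import unicodedata

ZWSP = "\u200b"                       # ZERO WIDTH SPACE
BIDI_OVERRIDES = {"\u202d", "\u202e"}  # LEFT-TO-RIGHT / RIGHT-TO-LEFT OVERRIDE
ALLOWED_CONTROLS = {"\n", "\r", "\t"}  # ordinary whitespace controls


def find_invisible(text: str) -> list:
    """Return (index, check_name) pairs for invisible characters.
    (Hypothetical helper; check names are borrowed from the list above.)"""
    hits = []
    for i, ch in enumerate(text):
        if ch == ZWSP:
            hits.append((i, "zwsp-injection"))
        elif ch in BIDI_OVERRIDES:
            hits.append((i, "bidi-override"))
        elif unicodedata.category(ch) == "Cc" and ch not in ALLOWED_CONTROLS:
            # Category "Cc" covers the C0/C1 control characters.
            hits.append((i, "invisible-control"))
    return hits


sample = "\u0645\u0631\u200b\u062d\u0628\u0627 \u202eabc\x07"
for index, check in find_invisible(sample):
    print(index, check, f"U+{ord(sample[index]):04X}")
```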

Content

empty-response ERROR

Empty or whitespace-only response fields

truncated-text WARN

Text cut off mid-sentence or mid-word

duplicate-example WARN

Near-duplicate training examples (fuzzy match)

low-quality INFO

Very short, repetitive, or boilerplate responses
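
Two of the content checks are easy to picture with the standard library. The sketch below is one possible approach, not necessarily the one majal uses: it flags whitespace-only responses and compares two examples with `difflib.SequenceMatcher`. The helper names and the 0.9 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse runs of whitespace so spacing differences do not matter."""
    return " ".join(text.split())


def is_empty_response(response: str) -> bool:
    """True for empty or whitespace-only responses."""
    return not response.strip()


def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """True when the two strings are at least `threshold` similar.
    The threshold is an illustrative default, not a recommended value."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold


q1 = "ما هي عاصمة مصر؟"
q2 = "ما هي  عاصمة مصر ؟"  # same question, different spacing
print(is_near_duplicate(q1, q2))
print(is_empty_response("  \n\t"))
```

Pairwise comparison is quadratic in the number of examples, so on a large dataset a real tool would probably bucket candidates first (for example by hashing) before running a fuzzy match.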

Arabic

dialect-mixing WARN

MSA and dialect mixed within the same example

transliteration-leak ERROR

Romanized Arabic (e.g. "7abibi") in Arabic-expected fields

tashkeel-inconsistency INFO

Inconsistent diacritics within the dataset

broken-shaping ERROR

Arabic letters not joining correctly (presentation forms)
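
To make two of the Arabic checks concrete, here is a minimal sketch, assuming simple heuristics rather than majal's real logic. It finds Arabizi-style tokens (Latin letters mixed with digits commonly used as letter stand-ins, as in "7abibi") and detects Arabic presentation-form code points, which NFKC normalization maps back to the base letters. The regex, the digit set, and the helper names are all assumptions.

```python
import re
import unicodedata

# ASCII tokens containing at least one letter and one digit used as an
# Arabic letter stand-in. Heuristic only: "mp3" would also match.
ARABIZI_TOKEN = re.compile(
    r"\b(?=[A-Za-z0-9]*[A-Za-z])(?=[A-Za-z0-9]*[2356789])[A-Za-z0-9]+\b"
)


def find_arabizi_tokens(text: str) -> list:
    """Return tokens that look like romanized Arabic."""
    return ARABIZI_TOKEN.findall(text)


def has_presentation_forms(text: str) -> bool:
    """True if text contains Arabic Presentation Forms-A/B code points
    (U+FB50..U+FDFF, U+FE70..U+FEFC)."""
    return any(
        "\ufb50" <= ch <= "\ufdff" or "\ufe70" <= ch <= "\ufefc" for ch in text
    )


print(find_arabizi_tokens("قال 7abibi ya 3omri"))

shaped = "\ufee3\ufeae\ufea3\ufe92\ufe8e"  # pre-shaped glyph forms
print(has_presentation_forms(shaped))
# NFKC folds presentation forms back to ordinary Arabic letters.
print(unicodedata.normalize("NFKC", shaped))
```

NFKC also rewrites other compatibility characters (ligatures, full-width forms), so applying it as an automatic fix changes more than shaping and is worth previewing first.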

Format

field-mismatch ERROR

Missing or unexpected fields in structured data

json-escape-error ERROR

Broken JSON escaping corrupting Arabic text
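
The two format checks can be pictured as a per-line validation of a JSONL file. The sketch below is an assumption about how such a check might look, not majal's code: `check_jsonl_line` is a hypothetical helper, the expected field names are examples, and any JSON parse failure is reported under the `json-escape-error` label even though `json.JSONDecodeError` covers more than escaping problems.

```python
import json


def check_jsonl_line(line: str, expected_fields=("instruction", "response")) -> list:
    """Return a list of check names that fail for one JSONL line."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        # Any parse failure lands here, including invalid backslash escapes.
        return ["json-escape-error"]
    if not isinstance(record, dict):
        return ["field-mismatch"]
    if set(record) != set(expected_fields):
        # Missing or unexpected keys.
        return ["field-mismatch"]
    return []


# ensure_ascii=False keeps Arabic readable instead of \uXXXX escapes.
good = json.dumps({"instruction": "ترجم", "response": "حسنا"}, ensure_ascii=False)
print(check_jsonl_line(good))
print(check_jsonl_line('{"instruction": "a\\qb", "response": "x"}'))
print(check_jsonl_line('{"instruction": "a"}'))
```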

Commands
5 commands. Zero config.

scan

Find quality issues. Rich table output with severity, line number, and context.

stats

Dataset statistics: language breakdown, token counts, field analysis.

fix

Auto-fix issues. Shows diff preview before writing. Use --yes to skip prompt.

explain

Learn about each check. Visual examples, severity rationale.

sample

Random sample from dataset with quality annotations per example.
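
The fix command is described as showing a diff preview before writing. As a rough idea of what such a preview involves, the sketch below builds a unified diff with the standard library's `difflib`. It is an illustration under assumptions, not majal's output format; `preview_fix` is a hypothetical helper, and the "fix" applied here is simply stripping zero-width spaces.

```python
import difflib


def preview_fix(original: list, fixed: list) -> str:
    """Return a unified diff between two lists of lines.
    An empty string means nothing would change."""
    diff = difflib.unified_diff(
        original, fixed, fromfile="original", tofile="fixed", lineterm=""
    )
    return "\n".join(diff)


original = ["hello\u200bworld", "ok"]
fixed = [line.replace("\u200b", "") for line in original]
print(preview_fix(original, fixed))
```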

Formats
JSONL. CSV. TXT.
Auto-detects text fields. Override with --field. Handles instruction/response pairs, chat messages, and raw text.
Format  Auto-detect                            Description
JSONL   instruction, response, text, messages  One JSON object per line
CSV     Text columns by heuristic              Comma or tab separated
TXT     Full line content                      One example per line or paragraph-separated
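
For JSONL, field auto-detection with an explicit override could look roughly like the sketch below. This is a guess at the behavior described above, not majal's implementation: `extract_text_fields` is a hypothetical helper, its `field` argument stands in for the --field option, and the chat format is assumed to be a list of objects with a string `content` key.

```python
TEXT_FIELDS = ("instruction", "response", "text")  # names from the table above


def extract_text_fields(record: dict, field: str = None) -> dict:
    """Map field paths to the text found in one JSONL record."""
    if field is not None:
        # Explicit override: look only at the requested field.
        return {field: record[field]}
    found = {k: record[k] for k in TEXT_FIELDS if isinstance(record.get(k), str)}
    messages = record.get("messages")
    if isinstance(messages, list):
        # Assumed chat shape: [{"role": ..., "content": "..."}, ...]
        for i, msg in enumerate(messages):
            if isinstance(msg, dict) and isinstance(msg.get("content"), str):
                found[f"messages[{i}].content"] = msg["content"]
    return found


print(extract_text_fields({"id": 7, "instruction": "a", "response": "b"}))
print(extract_text_fields({"prompt": "p", "text": "t"}, field="prompt"))
```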
Get Started
Install, scan, fix: three commands to clean your data.
# Install
$ pip install majal

# Scan for issues
$ majal scan data.jsonl

# Get dataset stats
$ majal stats data.jsonl

# Auto-fix what can be fixed
$ majal fix data.jsonl

# Learn about a check
$ majal explain mojibake

# Random sample with annotations
$ majal sample data.jsonl --n 10
Library
Use it from Python.
>>> from majal import scan_dataset
>>> results = scan_dataset("data.jsonl")
>>> for issue in results.issues:
...     print(f"[{issue.severity}] {issue.check}")