majal — Arabic Dataset Inspector

The Problem

Bad data = bad models.

Built from experience training Arabic LLMs. Encoding issues, invisible characters, dialect mixing, transliteration leaks — invisible problems that silently degrade your model.

Without majal

Mojibake corrupts 3% of examples. Invisible chars break word boundaries. Romanized Arabic leaks into training. You don't know until eval scores tank.

With majal

One command finds every issue. Auto-fix cleans what it can. You ship clean data, your model actually learns Arabic.

Detection

15+ checks across 5 categories.

Each check targets a real problem we hit training Arabic LLMs. Severity levels: ERROR, WARN, INFO.

Encoding

mojibake ERROR

Garbled Arabic from encoding mismatches (UTF-8 read as Windows-1256)

mixed-encoding ERROR

Different encodings within the same example

bom-artifact WARN

Byte Order Mark remnants in text

Invisible

zwsp-injection ERROR

Zero-width spaces breaking Arabic word boundaries

bidi-override ERROR

Bidirectional override characters corrupting text flow

invisible-control WARN

Control characters that survive copy-paste silently

Content

empty-response ERROR

Empty or whitespace-only response fields

truncated-text WARN

Text cut off mid-sentence or mid-word

duplicate-example WARN

Near-duplicate training examples (fuzzy match)

low-quality INFO

Very short, repetitive, or boilerplate responses

Arabic

dialect-mixing WARN

MSA and dialect mixed within the same example

transliteration-leak ERROR

Romanized Arabic (e.g. "7abibi") in Arabic-expected fields

tashkeel-inconsistency INFO

Inconsistent diacritics within the dataset

broken-shaping ERROR

Arabic letters not joining correctly (presentation forms)

Format

field-mismatch ERROR

Missing or unexpected fields in structured data

json-escape-error ERROR

Broken JSON escaping corrupting Arabic text

Commands

5 commands. Zero config.

scan

Find quality issues. Rich table output with severity, line number, and context.

stats

Dataset statistics — language breakdown, token counts, field analysis.

fix

Auto-fix issues. Shows diff preview before writing. Use --yes to skip prompt.

explain

Learn about each check. Visual examples, severity rationale.

sample

Random sample from dataset with quality annotations per example.

Formats

JSONL. CSV. TXT.

Auto-detects text fields. Override with --field. Handles instruction/response pairs, chat messages, and raw text.

Format	Auto-detect	Description
JSONL	instruction, response, text, messages	One JSON object per line
CSV	Text columns by heuristic	Comma or tab separated
TXT	Full line content	One example per line or paragraph-separated

Get Started

Three commands to clean your data.

# Install $ pip install majal # Scan for issues $ majal scan data.jsonl # Get dataset stats $ majal stats data.jsonl # Auto-fix what can be fixed $ majal fix data.jsonl # Learn about a check $ majal explain mojibake # Random sample with annotations $ majal sample data.jsonl --n 10

Library

Use it from Python.

>>> from majal import scan_dataset >>> results = scan_dataset("data.jsonl") >>> for issue in results.issues: ... print(f"[{issue.severity}] {issue.check}")