majal

مجال

Scan your training data for Arabic-specific quality issues before they wreck your model.

15+ checks across 5 categories, with support for JSONL, CSV, and TXT. Built from experience training Arabic LLMs.
$ pip install majal

Chapter I
The Problem
Bad data = bad models.
Without majal

Mojibake corrupts 3% of examples. Invisible chars break word boundaries. Romanized Arabic leaks into training. You don't know until eval scores tank.

With majal

One command finds every issue. Auto-fix cleans what it can. You ship clean data, and your model actually learns Arabic.


Chapter II
15+ Checks
Each targets a real problem from training Arabic LLMs.
Encoding
  • mojibake ERROR
    Garbled Arabic from encoding mismatches
  • mixed-encoding ERROR
    Different encodings within the same example
  • bom-artifact WARN
    Byte Order Mark remnants in text
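To make the encoding checks concrete, here is a minimal sketch of one way to spot and repair the most common kind of mojibake: UTF-8 Arabic that was decoded as Latin-1. This is not majal's implementation. The names `repair_mojibake` and `strip_bom` are hypothetical, and text garbled through Windows-1252 or other codecs would need additional passes.

```python
import re
from typing import Optional

# Any character in the main Arabic Unicode block.
ARABIC = re.compile(r"[\u0600-\u06FF]")


def repair_mojibake(text: str) -> Optional[str]:
    """Return the repaired string if `text` looks like UTF-8 Arabic that was
    decoded as Latin-1, otherwise None. (Hypothetical helper, not majal's API.)"""
    try:
        # Undo the wrong decode: map each character back to its original byte.
        raw = text.encode("latin-1")
        # Redo the decode with the codec the bytes were actually written in.
        repaired = raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Real Arabic cannot be encoded as Latin-1, and genuine Latin-1 text
        # with accents is rarely valid UTF-8, so neither is this kind of mojibake.
        return None
    if repaired != text and ARABIC.search(repaired):
        return repaired
    return None


def strip_bom(text: str) -> str:
    """Drop U+FEFF remnants left behind by BOM-prefixed files."""
    return text.replace("\ufeff", "")
```

Round-tripping through the wrong codec is cheap enough to run on every example, and requiring Arabic in the result keeps plain ASCII text from being flagged.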
Invisible
  • zwsp-injection ERROR
    Zero-width spaces breaking Arabic word boundaries
  • bidi-override ERROR
    Bidirectional override characters corrupting text flow
  • invisible-control WARN
    Control characters that survive copy-paste silently
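The invisible-character checks can be approximated with a single pass over the text. The sketch below is an illustration, not majal's code: `find_invisible` is a hypothetical name, and the choice of which code points count as suspicious is an assumption. ZWNJ (U+200C) and ZWJ (U+200D) are deliberately left out because they can be legitimate in Arabic-script text.

```python
import unicodedata

# Code points assumed here to be suspicious in training text.
ZERO_WIDTH = {"\u200b", "\u2060", "\ufeff"}
# U+202A..U+202E (embeddings and overrides) and U+2066..U+2069 (isolates).
BIDI_CONTROLS = {chr(cp) for cp in range(0x202A, 0x202F)} | {
    chr(cp) for cp in range(0x2066, 0x206A)
}


def find_invisible(text):
    """Yield (index, code point label, kind) for each suspicious invisible character."""
    for i, ch in enumerate(text):
        if ch in ZERO_WIDTH:
            kind = "zero-width"
        elif ch in BIDI_CONTROLS:
            kind = "bidi"
        elif unicodedata.category(ch) == "Cc" and ch not in "\n\r\t":
            # C0/C1 control characters other than ordinary whitespace.
            kind = "control"
        else:
            continue
        yield i, f"U+{ord(ch):04X}", kind
```

Reporting the index and code point, rather than just a yes/no flag, makes it possible to show exactly where a word boundary was broken.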
Content
  • empty-response ERROR
    Empty or whitespace-only response fields
  • truncated-text WARN
    Text cut off mid-sentence or mid-word
  • duplicate-example WARN
    Near-duplicate training examples (fuzzy match)
  • low-quality INFO
    Very short, repetitive, or boilerplate responses
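As a rough illustration of the content checks, the sketch below flags empty responses and finds near-duplicates with the standard library's `difflib`. The function names and the 0.9 threshold are assumptions made for this example. majal's actual fuzzy-matching method is not described here and may differ.

```python
from difflib import SequenceMatcher


def is_empty(response):
    """True for None, empty, or whitespace-only responses."""
    return response is None or not response.strip()


def near_duplicates(texts, threshold=0.9):
    """Return index pairs (i, j), i < j, whose similarity ratio meets the threshold.

    Compares every pair, so it is O(n^2): fine for a sample, too slow for a full corpus.
    """
    # Collapse runs of whitespace so spacing differences do not hide duplicates.
    normalized = [" ".join(t.split()) for t in texts]
    pairs = []
    for i in range(len(normalized)):
        for j in range(i + 1, len(normalized)):
            ratio = SequenceMatcher(None, normalized[i], normalized[j]).ratio()
            if ratio >= threshold:
                pairs.append((i, j))
    return pairs
```

For large datasets, a hashing approach such as MinHash would scale better than pairwise comparison. The pairwise version is shown only because it is the easiest to read.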
Arabic
  • dialect-mixing WARN
    MSA and dialect mixed within the same example
  • transliteration-leak ERROR
    Romanized Arabic in Arabic-expected fields
  • tashkeel-inconsistency INFO
    Inconsistent diacritics within the dataset
  • broken-shaping ERROR
    Arabic letters not joining correctly (presentation forms)
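Two of the Arabic checks lend themselves to a short sketch: broken shaping, where text is stored as positional presentation-form glyphs instead of base letters, and transliteration leaks, where Latin letters dominate a field that should be Arabic. The helpers below are hypothetical and use only the standard library. They are not majal's implementation, and any cutoff applied to `latin_ratio` would be a judgment call.

```python
import re
import unicodedata

# Arabic Presentation Forms-A and -B: positional glyphs that should not appear
# in stored text. The range stops at U+FEFC so that U+FEFF (the BOM) is excluded.
PRESENTATION_FORMS = re.compile(r"[\uFB50-\uFDFF\uFE70-\uFEFC]")
ARABIC_LETTERS = re.compile(r"[\u0621-\u064A]")
LATIN_LETTERS = re.compile(r"[A-Za-z]")


def has_presentation_forms(text):
    """True if the text contains any Arabic presentation-form code point."""
    return bool(PRESENTATION_FORMS.search(text))


def normalize_shaping(text):
    """Fold presentation forms back to base letters via NFKC normalization.

    NFKC is broader than shaping alone: it also rewrites other compatibility
    characters, so apply it only where that is acceptable.
    """
    return unicodedata.normalize("NFKC", text)


def latin_ratio(text):
    """Share of letters that are Latin; a high value in an Arabic field hints at romanization."""
    latin = len(LATIN_LETTERS.findall(text))
    arabic = len(ARABIC_LETTERS.findall(text))
    total = latin + arabic
    return latin / total if total else 0.0
```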
Format
  • field-mismatch ERROR
    Missing or unexpected fields in structured data
  • json-escape-error ERROR
    Broken JSON escaping corrupting Arabic text
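The format checks amount to parsing each line and comparing its keys against what the dataset is supposed to contain. The following is a minimal sketch under that assumption. `check_fields` and its message strings are invented for illustration and are not majal's output format.

```python
import json


def check_fields(lines, expected):
    """Yield (line_number, problem) for JSONL lines that fail to parse or whose
    keys differ from `expected`."""
    expected = set(expected)
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            # Covers broken escapes as well as any other syntax error.
            yield n, f"invalid JSON: {exc.msg}"
            continue
        if not isinstance(record, dict):
            yield n, "not a JSON object"
            continue
        keys = set(record)
        missing = expected - keys
        extra = keys - expected
        if missing:
            yield n, f"missing: {sorted(missing)}"
        if extra:
            yield n, f"unexpected: {sorted(extra)}"
```

Passing an open file object as `lines` works as well as a list, since the function only iterates over it once.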

Chapter III
Commands
5 commands. Zero config.

Chapter IV
Formats
Auto-detects text fields. Override with --field.
Format | Auto-detect | Description
JSONL | instruction, response, text, messages | One JSON object per line
CSV | Text columns by heuristic | Comma or tab separated
TXT | Full line content | One example per line or paragraph-separated
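For JSONL, auto-detection can be pictured as looking up the listed field names in each record. The sketch below is a guess at the general idea, not majal's logic. In particular, treating `messages` as a list of `{"role": ..., "content": ...}` dicts is an assumption about the chat layout, and `extract_text` is a hypothetical helper.

```python
# Field names the table lists for JSONL auto-detection.
TEXT_FIELDS = ("instruction", "response", "text", "messages")


def extract_text(record, fields=TEXT_FIELDS):
    """Return {field_path: text} for the text-bearing fields present in one record."""
    found = {}
    for name in fields:
        value = record.get(name)
        if isinstance(value, str):
            found[name] = value
        elif isinstance(value, list):
            # Assumed chat layout: a list of {"role": ..., "content": ...} dicts.
            for i, msg in enumerate(value):
                if isinstance(msg, dict) and isinstance(msg.get("content"), str):
                    found[f"{name}[{i}].content"] = msg["content"]
    return found
```

Passing a narrower `fields` tuple plays the same role as an explicit field override: only the named fields are examined.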

Chapter V
Get Started
# Install
$ pip install majal

# Scan for issues
$ majal scan data.jsonl

# Get dataset stats
$ majal stats data.jsonl

# Auto-fix
$ majal fix data.jsonl

# Learn about a check
$ majal explain mojibake

# Random sample
$ majal sample data.jsonl --n 10

Chapter VI
As a Library
>>> from majal import scan_dataset
>>> results = scan_dataset("data.jsonl")
>>> for issue in results.issues:
...     print(f"[{issue.severity}] {issue.check}")
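Printing every issue gets noisy on a large dataset, so a per-check tally is often more useful. The sketch below relies only on the `severity` and `check` attributes shown above. `summarize` is a hypothetical helper, and the `demo` objects are stand-ins so the example runs without majal installed.

```python
from collections import Counter
from types import SimpleNamespace


def summarize(issues):
    """Count issues per (severity, check) pair, most frequent first."""
    counts = Counter((issue.severity, issue.check) for issue in issues)
    return counts.most_common()


# Stand-in objects with the same two attributes; real code would pass results.issues.
demo = [
    SimpleNamespace(severity="ERROR", check="mojibake"),
    SimpleNamespace(severity="WARN", check="bom-artifact"),
    SimpleNamespace(severity="ERROR", check="mojibake"),
]

for (severity, check), count in summarize(demo):
    print(f"[{severity}] {check}: {count}")
```

With a real scan, `summarize(results.issues)` would take the place of `summarize(demo)`, assuming the severity and check values are hashable.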