Different Strokes for Different Folks: Writer Identification for Historical Arabic Manuscripts
Hamza A. Abushahla, Ariel Justine N. Panopio, Layth Al-Khairulla, Mohamed I. AlHajri
TLDR
This paper introduces a CNN-based model with attention for closed-set writer identification in historical Arabic manuscripts, expands the writer labels of the Muharaf dataset, and reports baselines under both line-level and page-disjoint evaluation protocols.
Key contributions
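- Manually verified and expanded writer labels in the public portion of the Muharaf dataset from 6,858 lines (28.00%) to 21,249 lines (86.75%) of 24,495 line images; 18,987 lines (77.51%) were retained after further filtering.
- Proposed a CNN with attention for closed-set writer identification, modeling rare two-writer lines as composite writer-pair classes.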
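- Reported what the authors believe to be the first writer identification baselines on Muharaf under both line-level and page-disjoint protocols.
- Achieved 99.05% Top-1 accuracy under the line-level protocol with a fine-tuned DenseNet201 with attention, and a best observed 78.61% Top-1 accuracy under the page-disjoint protocol.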
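The Top-1 and Top-5 figures above are Top-k accuracies: a prediction counts as correct when the true writer is among the k highest-scoring classes. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code; the function name `top_k_accuracy` and the toy scores are illustrative assumptions.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true class index is among the k highest scores.

    scores: list of per-class score lists, one per sample (hypothetical layout).
    labels: list of true class indices, one per sample.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score, truncated to k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy example with three writers (classes 0, 1, 2) and three line images.
scores = [
    [0.7, 0.2, 0.1],  # true class 0: ranked first
    [0.1, 0.3, 0.6],  # true class 1: ranked second
    [0.4, 0.5, 0.1],  # true class 0: ranked second
]
labels = [0, 1, 0]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

In this toy case only the first sample is correct at k=1, while all three are correct at k=2, which is why Top-5 accuracy is never lower than Top-1 accuracy.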
Why it matters
Writer identification supports provenance research, authenticity verification, and historical analysis of Arabic manuscripts. The expanded writer labels make Muharaf a more usable resource for historians and linguists, and reporting both protocols shows how much line-level accuracy depends on page-level cues: Top-1 accuracy falls from 99.05% to 78.61% when all lines of a test page are unseen during training. Together, the two protocols give future work a clearer benchmark.
Original Abstract
Handwritten Arabic manuscripts preserve the Arab world's intellectual and cultural heritage, and writer identification supports provenance, authenticity verification, and historical analysis. Using the Muharaf dataset of historical Arabic manuscripts, we evaluate writer identification from individual line images and, to the best of our knowledge, provide the first baselines reported under both line-level and page-disjoint evaluation protocols. Since the dataset is only partially labeled for writer identification, we manually verified and expanded writer labels in the public portion from 6,858 (28.00%) to 21,249 lines (86.75%) out of 24,495 line images, correcting inconsistencies and removing non-handwritten text. After further filtering, we retained 18,987 lines (77.51%). We propose a Convolutional Neural Network (CNN)-based model with attention mechanisms for closed-set writer identification, including rare two-writer lines modeled as composite writer-pair classes. We benchmark fourteen configurations and conduct ablations across different feature extractors and training regimes. To assess generalization to unseen pages, the page-disjoint protocol assigns all lines from each page to a single split. Under the line-level protocol, a fine-tuned DenseNet201 with attention achieves 99.05% Top-1 accuracy, 99.73% Top-5 accuracy, and 97.44% F1-score. Under the more challenging page-disjoint protocol, the best observed results are 78.61% Top-1 accuracy, 87.79% Top-5 accuracy, and 66.55% F1-score, thus quantifying the impact of page-level cues. By expanding the Muharaf dataset's labeled subset and reporting both protocols, we provide a clearer benchmark and a practical resource for historians and linguists engaged with culturally and historically significant documents. The code and implementation details are available on GitHub.
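The page-disjoint protocol and the composite writer-pair classes described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `(line_id, page_id, writers)` record layout, the 80/20 page split, the `"+"` separator, and the function names are all hypothetical. The sketch also does not stratify by writer, so it does not guarantee that every test writer appears in the training split.

```python
import random
from collections import defaultdict


def composite_label(writers):
    """Map a line's writer(s) to one class name.

    A two-writer line becomes a single writer-pair class; sorting makes the
    name independent of the order in which the writers are listed.
    """
    return "+".join(sorted(writers))


def page_disjoint_split(records, train_frac=0.8, seed=0):
    """Split line records so that all lines from a page land in one split.

    records: list of (line_id, page_id, writers) tuples (hypothetical schema).
    """
    by_page = defaultdict(list)
    for record in records:
        by_page[record[1]].append(record)

    # Shuffle pages, not lines, so no page contributes to both splits.
    pages = sorted(by_page)
    random.Random(seed).shuffle(pages)
    n_train = int(round(train_frac * len(pages)))
    train_pages = set(pages[:n_train])

    train = [r for p in pages if p in train_pages for r in by_page[p]]
    test = [r for p in pages if p not in train_pages for r in by_page[p]]
    return train, test


# Toy data: 10 pages with 5 lines each; page 3 holds two-writer lines.
records = [
    (f"page{p}_line{i}", f"page{p}", ["writer_a", "writer_b"] if p == 3 else ["writer_a"])
    for p in range(10)
    for i in range(5)
]
train, test = page_disjoint_split(records)
train_page_ids = {r[1] for r in train}
test_page_ids = {r[1] for r in test}
pair_class = composite_label(["writer_b", "writer_a"])
```

A line-level split would instead shuffle individual lines, so lines from the same page could appear in both training and test sets; the gap between the two reported accuracies reflects that difference.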