Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering
TLDR
This paper introduces a method for zero-shot morphological discovery in low-resource Bantu languages that combines cross-lingual transfer from Swahili with unsupervised clustering, demonstrated on Giriama.
Key contributions
- Combines cross-lingual transfer from Swahili with unsupervised clustering for morphological discovery.
- Discovered noun class assignments for 2,455 Giriama words with only 91 labeled paradigms.
- Identified two previously undocumented Giriama morphological patterns: an a- prefix variant for Class 2 (vowel coalescence of wa-) and a contracted k'- prefix.
- Achieved 97.3% segmentation and 86.7% lemmatization rates on an expanded Giriama corpus of 19,624 words (9,014 unique lemmas).
Why it matters
This research provides a pipeline for uncovering morphological structure in under-resourced languages, which matters for digital preservation and NLP development. Because the code and lexicons are released, the work directly supports further documentation and research on Bantu languages.
Original Abstract
We present a method for discovering morphological features in low-resource Bantu languages by combining cross-lingual transfer learning with unsupervised clustering. Applied to Giriama (nyf), a language with only 91 labeled paradigms, our pipeline discovers noun class assignments for 2,455 words and identifies two previously undocumented morphological patterns: an a- prefix variant for Class 2 (vowel coalescence - the merger of two adjacent vowels - of wa-, 95.1% consistency) and a contracted k'- prefix (98.5% consistency). External validation on 444 known Giriama verb paradigms confirms 78.2% lemmatization accuracy, while a v3 corpus expansion to 19,624 words (9,014 unique lemmas) achieves 97.3% segmentation and 86.7% lemmatization rates across all major word classes. Our ensemble of transfer learning from Swahili and unsupervised clustering, combined via weighted voting, exploits complementary strengths: transfer excels at cognate detection (leveraging ~60% vocabulary overlap) while clustering discovers language-specific innovations invisible to transfer. We release all code and discovered lexicons to support morphological documentation for low-resource Bantu languages.
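The abstract describes combining the transfer and clustering components by weighted voting but does not give the weights or the scoring scheme. The following is a minimal sketch of weighted voting in general, not the authors' implementation: the function name `weighted_vote`, the source names, the class labels, and all scores and weights are hypothetical values chosen for illustration.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Pick the label with the highest weighted score across sources.

    predictions: {source_name: {label: score}}  (hypothetical format)
    weights:     {source_name: weight}          (hypothetical values)
    Returns the winning label, or None if no source proposed a label.
    """
    totals = defaultdict(float)
    for source, label_scores in predictions.items():
        weight = weights.get(source, 0.0)  # unknown sources contribute nothing
        for label, score in label_scores.items():
            totals[label] += weight * score
    if not totals:
        return None
    return max(totals, key=totals.get)


# Illustrative scores for a single word; these numbers are made up.
example_predictions = {
    "transfer": {"class_2": 0.6, "class_1": 0.4},
    "clustering": {"class_2": 0.3, "class_6": 0.7},
}

# With more weight on transfer, the label both sources support wins.
transfer_heavy = weighted_vote(
    example_predictions, {"transfer": 0.6, "clustering": 0.4}
)

# With more weight on clustering, its preferred label wins instead.
clustering_heavy = weighted_vote(
    example_predictions, {"transfer": 0.2, "clustering": 0.8}
)
```

In this sketch, shifting weight between the two sources changes which noun class is chosen, which is one way an ensemble could let clustering override transfer for language-specific forms while transfer dominates for likely cognates.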