Neural Recovery of Historical Lexical Structure in Bantu Languages from Modern Data
TLDR
Neural models trained only on modern morphological data recover cross-lingual lexical structure and cognate candidates in Bantu languages that are consistent with established Proto-Bantu reconstructions.
Key contributions
- Neural models (BantuMorph v7) recover historical lexical structure from modern Bantu data.
- 10 of the top 11 noun cognate candidates (90.9%) align with established Proto-Bantu reconstructions, and 12 verb cognates align with reconstructed Proto-Bantu roots.
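- Both BantuMorph v7 and an independent translation model (NLLB-600M) recover cognate clusters and phylogenetic groupings consistent with established Guthrie-zone classifications (p < 0.01).
- All 13 productive noun classes maintain >0.83 cosine similarity across languages, with within-class similarity exceeding between-class similarity (p < 10^-9).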
Why it matters
This research suggests that neural models trained on modern data alone can recover lexical structure consistent with historical reconstruction. It offers a data-driven approach to historical linguistics and new tools for studying language evolution and relationships. Because the dataset covers only Eastern and Southern Bantu, the results do not definitively separate Proto-Bantu retentions from later regional innovations.
Original Abstract
We investigate whether neural models trained exclusively on modern morphological data can recover cross-lingual lexical structure consistent with historical reconstruction. Using BantuMorph v7, a transformer over Bantu morphological paradigms, we analyze 14 Eastern and Southern Bantu languages, extract encoder embeddings for their noun and verb lemmas, and identify 728 noun and 1,525 verb cognate candidates shared across 5+ languages. Evaluating these candidates against established historical resources (the Bantu Lexical Reconstructions database, BLR3, with 4,786 reconstructed Proto-Bantu forms, and the ASJP basic vocabulary), we confirm that 10 of the top 11 noun candidates (90.9%) align with previously reconstructed Proto-Bantu forms, including *-ntU 'person' (8 languages), *gombe 'cow' (9 languages), and *mUn (9 languages). Extending to verbs, 12 verb cognates align with reconstructed Proto-Bantu roots, including *-bon- 'see' and *-jIm- 'stand', each attested across wide geographic ranges. Cross-model validation using an independent translation model (NLLB-600M) confirms these patterns: both models recover cognate clusters and phylogenetic groupings consistent with established Guthrie-zone classifications (p < 0.01). Cross-lingual noun class analysis reveals that all 13 productive classes maintain >0.83 cosine similarity across languages (within-class > between-class, p < 10^-9). Our dataset is restricted to Eastern and Southern Bantu, so we interpret these results as recovering shared Bantu lexical structure consistent with Proto-Bantu rather than definitively distinguishing Proto-Bantu retentions from later regional innovations.
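The within-class versus between-class comparison in the noun class analysis can be illustrated with a small sketch. This is not the authors' code: the function names, the choice of one centroid vector per (language, noun class) pair, the language codes, and the 3-dimensional toy vectors are all assumptions made for illustration, and the abstract does not say how the embeddings were pooled.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_lingual_class_similarity(centroids):
    """Mean cosine similarity for cross-lingual pairs of class centroids.

    centroids maps (language, noun_class) -> vector (hypothetical layout).
    Returns (within_class_mean, between_class_mean), where "within" pairs
    share a noun class and "between" pairs do not. Pairs from the same
    language are skipped so that only cross-lingual structure is measured.
    """
    within, between = [], []
    for (key1, vec1), (key2, vec2) in combinations(centroids.items(), 2):
        (lang1, cls1), (lang2, cls2) = key1, key2
        if lang1 == lang2:
            continue
        bucket = within if cls1 == cls2 else between
        bucket.append(cosine(vec1, vec2))
    return sum(within) / len(within), sum(between) / len(between)


# Toy centroids, invented for illustration only (not data from the paper).
toy_centroids = {
    ("lang_a", 1): [1.0, 0.1, 0.0],
    ("lang_b", 1): [0.9, 0.2, 0.0],
    ("lang_a", 9): [0.0, 0.1, 1.0],
    ("lang_b", 9): [0.1, 0.0, 0.9],
}

within_mean, between_mean = cross_lingual_class_similarity(toy_centroids)
print(f"within-class mean:  {within_mean:.3f}")
print(f"between-class mean: {between_mean:.3f}")
```

On real embeddings, a significance test over the two sets of pairwise similarities would be needed to support a claim like the reported p < 10^-9. The sketch only shows how the two means are formed.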