AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages
Lilian Wanzare, Cynthia Amol, Ezekiel Maina, Nelson Odhiambo, Hope Kerubo + 12 more
TLDR
AfriVoices-KE is a 3,000-hour multilingual speech dataset for five Kenyan languages, addressing underrepresentation in speech technology.
Key contributions
- Introduces AfriVoices-KE, a 3,000-hour multilingual speech dataset for five underrepresented Kenyan languages.
- Comprises 750 hours of scripted and 2,250 hours of spontaneous speech from 4,777 diverse native speakers.
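- Uses a dual collection methodology through a custom mobile app: scripted recordings drawn from prepared text, and spontaneous speech elicited with text and image prompts to capture dialectal variation.
- Mitigates low-resource challenges (unreliable infrastructure, device compatibility issues, community trust barriers) through local mobilizers and stakeholder partnerships, providing a foundational resource for ASR/TTS.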
Why it matters
This paper directly tackles the severe lack of speech data for African languages, a major barrier to developing inclusive AI. By providing a high-quality, diverse dataset, AfriVoices-KE enables essential ASR and TTS systems, while preserving Kenya's linguistic heritage. It sets a precedent for data collection in low-resource contexts.
Original Abstract
AfriVoices-KE is a large-scale multilingual speech dataset comprising approximately 3,000 hours of audio across five Kenyan languages: Dholuo, Kikuyu, Kalenjin, Maasai, and Somali. The dataset includes 750 hours of scripted speech and 2,250 hours of spontaneous speech, collected from 4,777 native speakers across diverse regions and demographics. This work addresses the critical underrepresentation of African languages in speech technology by providing a high-quality, linguistically diverse resource. Data collection followed a dual methodology: scripted recordings drew from compiled text corpora, translations, and domain-specific generated sentences spanning eleven domains relevant to the Kenyan context, while unscripted speech was elicited through textual and image prompts to capture natural linguistic variation and dialectal nuances. A customized mobile application enabled contributors to record using smartphones. Quality assurance operated at multiple layers, encompassing automated signal-to-noise ratio validation prior to recording and human review for content accuracy. Though the project encountered challenges common to low-resource settings, including unreliable infrastructure, device compatibility issues, and community trust barriers, these were mitigated through local mobilizers, stakeholder partnerships, and adaptive training protocols. AfriVoices-KE provides a foundational resource for developing inclusive automatic speech recognition and text-to-speech systems, while advancing the digital preservation of Kenya's linguistic heritage.
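The abstract mentions automated signal-to-noise ratio validation before recording but gives no implementation details. The following is a minimal sketch of what such a gate could look like, assuming the app captures a short ambient-noise sample and a short speech sample as lists of amplitudes. The function names, the 20 dB threshold, and the two-sample setup are all hypothetical and are not taken from the paper.

```python
import math

# Hypothetical threshold; the paper does not state the value it used.
MIN_SNR_DB = 20.0


def mean_power(samples):
    """Average power of a signal: the mean of its squared amplitudes."""
    if not samples:
        raise ValueError("samples must be non-empty")
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_speech / P_noise)."""
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_noise == 0:
        return math.inf   # perfectly silent background
    if p_speech == 0:
        return -math.inf  # no speech energy at all
    return 10.0 * math.log10(p_speech / p_noise)


def passes_snr_check(speech, noise, min_snr_db=MIN_SNR_DB):
    """Return True if the environment is quiet enough to start recording."""
    return snr_db(speech, noise) >= min_snr_db


# Toy example with synthetic square-wave samples (assumed data, not real audio).
speech = [1.0, -1.0] * 100
quiet_room = [0.01, -0.01] * 100   # about 40 dB below the speech level
noisy_room = [0.5, -0.5] * 100     # about 6 dB below the speech level

print(passes_snr_check(speech, quiet_room))  # quiet background is accepted
print(passes_snr_check(speech, noisy_room))  # noisy background is rejected
```

A real pipeline would read samples from the device microphone and might estimate noise from the quietest frames of a single clip instead of a separate ambient sample; this sketch only illustrates the decibel comparison itself.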