BhashaSutra: A Task-Centric Unified Survey of Indian NLP Datasets, Corpora, and Resources
Raghvendra Kumar, Devankar Raj, Sriparna Saha
TLDR
BhashaSutra is presented as the first unified survey of Indian NLP resources, covering 200+ datasets, 50+ benchmarks, and 100+ models, tools, and systems across text, speech, multimodal, and culturally grounded tasks.
Key contributions
- First unified survey of 200+ Indian NLP datasets, 50+ benchmarks, and 100+ models, tools, and systems.
- Covers text, speech, multimodal, and culturally grounded tasks across 22 scheduled languages.
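- Organizes resources by linguistic phenomena, domains, and modalities, and analyzes trends in annotation, evaluation, and model design.
- Identifies persistent challenges such as data sparsity, uneven language coverage, script diversity, and limited cultural and domain generalization.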
Why it matters
Indian NLP resources are scattered, and existing reviews either focus on a few high-resource languages or fold Indian languages into broader multilingual settings, leaving low-resource and culturally diverse varieties thinly covered. By consolidating these resources in one place, the survey offers a foundation for equitable, culturally grounded, and scalable NLP research across India's diverse linguistic landscape.
Original Abstract
India's linguistic landscape, spanning 22 scheduled languages and hundreds of marginalized dialects, has driven rapid growth in NLP datasets, benchmarks, and pretrained models. However, no dedicated survey consolidates resources developed specifically for Indian languages. Existing reviews either focus on a few high-resource languages or subsume Indian languages within broader multilingual settings, limiting coverage of low-resource and culturally diverse varieties. To address this gap, we present the first unified survey of Indian NLP resources, covering 200+ datasets, 50+ benchmarks, and 100+ models, tools, and systems across text, speech, multimodal, and culturally grounded tasks. We organize resources by linguistic phenomena, domains, and modalities; analyze trends in annotation, evaluation, and model design; and identify persistent challenges such as data sparsity, uneven language coverage, script diversity, and limited cultural and domain generalization. This survey offers a consolidated foundation for equitable, culturally grounded, and scalable NLP research in the Indian linguistic ecosystem.