ArXiv TLDR

BhashaSutra: A Task-Centric Unified Survey of Indian NLP Datasets, Corpora, and Resources

2604.18423

Raghvendra Kumar, Devankar Raj, Sriparna Saha

cs.CL

TLDR

BhashaSutra is the first unified survey of Indian NLP resources, covering 200+ datasets, 50+ benchmarks, and 100+ models, tools, and systems across text, speech, multimodal, and culturally grounded tasks.

Key contributions

  • First unified survey of 200+ Indian NLP datasets, 50+ benchmarks, and 100+ models, tools, and systems.
  • Covers text, speech, multimodal, and culturally grounded tasks across 22 scheduled languages.
  • Organizes resources by linguistic phenomena, domains, and modalities for easy access.
  • Identifies key challenges like data sparsity, uneven language coverage, and script diversity.

Why it matters

This survey fills a gap left by earlier reviews, which either focused on a few high-resource languages or folded Indian languages into broader multilingual settings, leaving low-resource and culturally diverse varieties thinly covered. By consolidating these scattered resources, it provides a foundation for equitable, culturally grounded, and scalable NLP research across India's diverse linguistic landscape.

Original Abstract

India's linguistic landscape, spanning 22 scheduled languages and hundreds of marginalized dialects, has driven rapid growth in NLP datasets, benchmarks, and pretrained models. However, no dedicated survey consolidates resources developed specifically for Indian languages. Existing reviews either focus on a few high-resource languages or subsume Indian languages within broader multilingual settings, limiting coverage of low-resource and culturally diverse varieties. To address this gap, we present the first unified survey of Indian NLP resources, covering 200+ datasets, 50+ benchmarks, and 100+ models, tools, and systems across text, speech, multimodal, and culturally grounded tasks. We organize resources by linguistic phenomena, domains, and modalities; analyze trends in annotation, evaluation, and model design; and identify persistent challenges such as data sparsity, uneven language coverage, script diversity, and limited cultural and domain generalization. This survey offers a consolidated foundation for equitable, culturally grounded, and scalable NLP research in the Indian linguistic ecosystem.
