WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data
Ziheng Zhang, Yunzhong Hou, Naijing Liu, Liang Zheng
TLDR
WARDEN is a two-stage system that transcribes the endangered Australian Indigenous language Wardaman and translates it into English using only 6 hours of annotated audio, outperforming larger open-source and proprietary models.
Key contributions
- Introduces WARDEN, a two-stage system for endangered Wardaman language transcription and translation.
- Tackles extreme low-resource settings (6 hours of audio) by separating transcription and translation models.
- Enhances transcription by initializing Wardaman token embeddings from Sundanese, a language with similar phonemes, to accelerate fine-tuning (see the sketch after this list).
- Improves translation by providing an expert-compiled Wardaman-English dictionary to an LLM.
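As referenced above, the embedding warm-start can be sketched as follows. This is a minimal illustration, not the authors' released code: the checkpoint name, the phoneme symbols, and the Wardaman-to-Sundanese mapping are placeholder assumptions, and it presumes a CTC-style phoneme recognizer from HuggingFace transformers.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2CTCTokenizer

SRC = "your-org/wav2vec2-ctc-sundanese-phonemes"  # hypothetical checkpoint

tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(SRC)
model = Wav2Vec2ForCTC.from_pretrained(SRC)

# Hypothetical mapping: each new Wardaman phoneme token is paired with the
# Sundanese phoneme it most resembles (placeholder symbols, not real data;
# assumes the Sundanese symbols already exist in the checkpoint's vocab).
warda_to_sunda = {"ward_ph1": "a", "ward_ph2": "ŋ"}
tokenizer.add_tokens(list(warda_to_sunda))  # new ids start after the old vocab

# Grow the CTC output head: copy the old rows verbatim, then warm-start each
# new Wardaman row from its matched Sundanese row.
old_head = model.lm_head  # Linear(hidden_size, old_vocab_size)
new_head = torch.nn.Linear(old_head.in_features, len(tokenizer))
with torch.no_grad():
    new_head.weight[: old_head.out_features] = old_head.weight
    new_head.bias[: old_head.out_features] = old_head.bias
    for warda, sunda in warda_to_sunda.items():
        w_id = tokenizer.convert_tokens_to_ids(warda)
        s_id = tokenizer.convert_tokens_to_ids(sunda)
        new_head.weight[w_id] = old_head.weight[s_id]
        new_head.bias[w_id] = old_head.bias[s_id]
model.lm_head = new_head
model.config.vocab_size = len(tokenizer)
# Fine-tune on the 6 hours of annotated Wardaman audio from this warm start.
```

Starting the new rows near phonetically similar Sundanese rows means early gradients refine an already-plausible classifier rather than training one from random initialization, which matters when data is this scarce.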
Why it matters
This paper offers a practical approach to preserving endangered languages, showing that strong transcription and translation performance is achievable with minimal data. It establishes a strong baseline for extremely low-resource language processing, enabling broader linguistic documentation efforts.
Original Abstract
This paper introduces WARDEN, an early language model system capable of transcribing and translating Wardaman, an endangered Australian Indigenous language, into English. The significant challenge we face is the lack of large-scale training data: in fact, we only have 6 hours of annotated audio. Therefore, while it is common practice to train a single model for transcription and translation using large datasets (like English to French), this practice is no longer viable in the Wardaman to English context. To tackle the low-resource challenge, we design WARDEN to have separate transcription and translation models: WARDEN first turns a Wardaman audio input into a phonemic transcription, and then the transcription into an English translation. Further, we propose two useful techniques to enhance performance. For transcription, we initialize the Wardaman tokens from Sundanese, a language that shares similar phonemes with Wardaman, to accelerate fine-tuning of the transcription model. For translation, we compile a Wardaman-English dictionary from expert annotations, and provide this domain-specific knowledge to a large language model (LLM) to reason and decide the final output. We empirically demonstrate that this two-stage design works better than data-hungry unified approaches in extremely low data settings. Using a mere 6 hours of annotated data, WARDEN outperforms larger open-source and proprietary models and establishes a strong baseline. Data and code are available.
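To make the dictionary-as-context step concrete, here is a rough sketch of the translation stage, assuming an OpenAI-compatible chat API. The model name, prompt wording, and dictionary entries are illustrative placeholders, not the expert-compiled Wardaman lexicon or the authors' actual prompt.

```python
from openai import OpenAI

client = OpenAI()

# Placeholder entries standing in for the expert-compiled dictionary.
dictionary = {
    "word1": "English gloss for word1",
    "word2": "English gloss for word2",
}

def translate(transcription: str) -> str:
    """Translate a Wardaman phonemic transcription via a dictionary-augmented LLM."""
    # Keep only entries for words that actually occur in the transcription,
    # so the prompt stays short and on-topic.
    words = set(transcription.split())
    glosses = [f"{w}: {g}" for w, g in dictionary.items() if w in words]
    prompt = (
        "Translate the following Wardaman phonemic transcription into English.\n"
        f"Transcription: {transcription}\n"
        "Relevant dictionary entries:\n" + "\n".join(glosses) + "\n"
        "Reason over the entries, then output the final English translation."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # any capable instruction-following LLM
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

Splitting the pipeline this way lets the acoustic model focus on phoneme recognition while the LLM contributes the world knowledge and reasoning that 6 hours of parallel data cannot teach.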