HeceTokenizer: A Syllable-Based Tokenization Approach for Turkish Retrieval
TLDR
HeceTokenizer is a syllable-based tokenizer for Turkish that exploits the language's phonological structure to let a tiny BERT model outperform a far larger morphology-driven baseline on retrieval.
Key contributions
- Introduces HeceTokenizer, a syllable-based tokenization approach for Turkish.
- Leverages Turkish's six-pattern syllable phonology to build a closed, OOV-free vocabulary of roughly 8,000 syllables (see the sketch after this list).
- A BERT-tiny model (1.5M params) achieves 50.3% Recall@5 on TQuAD, surpassing a morphology-driven baseline (46.92%) that uses a 200× larger model.
- Demonstrates that Turkish syllable phonology provides a strong, resource-light inductive bias for retrieval.
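The paper does not publish its syllabifier, but standard Turkish syllabification is deterministic and yields exactly the six syllable patterns (V, VC, VCC, CV, CVC, CVCC) the abstract refers to. A minimal sketch of that rule, assuming lowercased input; the function name and example word are illustrative, not from the paper:

```python
TURKISH_VOWELS = set("aeıioöuü")

def syllabify(word: str) -> list[str]:
    """Split a lowercase Turkish word into syllables.

    Standard rule: every syllable has exactly one vowel; a single
    consonant between two vowels opens the next syllable; in a
    consonant cluster, only the last consonant opens the next syllable.
    """
    vowels = [i for i, ch in enumerate(word) if ch in TURKISH_VOWELS]
    if not vowels:
        return [word]  # no vowel (abbreviations, noise): keep as one piece
    boundaries = [0]
    for v1, v2 in zip(vowels, vowels[1:]):
        # max() handles adjacent vowels (loanwords like "saat") as well
        # as single consonants and longer clusters between the vowels
        boundaries.append(max(v1 + 1, v2 - 1))
    boundaries.append(len(word))
    return [word[a:b] for a, b in zip(boundaries, boundaries[1:])]

print(syllabify("kitaplarımızdan"))  # ['ki', 'tap', 'la', 'rı', 'mız', 'dan']
```

Because every Turkish word decomposes into these closed patterns, the syllable inventory is finite, which is what makes the ~8k vocabulary OOV-free by construction.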
Why it matters
This paper introduces an efficient syllable-based tokenization for Turkish that achieves strong retrieval with a tiny BERT model. It demonstrates that leveraging a language's inherent phonological structure can yield better performance with significantly fewer resources, offering a resource-light alternative to much larger morphology-driven models.
Original Abstract
HeceTokenizer is a syllable-based tokenizer for Turkish that exploits the deterministic six-pattern phonological structure of the language to construct a closed, out-of-vocabulary (OOV)-free vocabulary of approximately 8,000 unique syllable types. A BERT-tiny encoder (1.5M parameters) is trained from scratch on a subset of Turkish Wikipedia using a masked language modeling objective and evaluated on the TQuAD retrieval benchmark using Recall@5. Combined with a fine-grained chunk-based retrieval strategy, HeceTokenizer achieves 50.3% Recall@5, surpassing the 46.92% reported by a morphology-driven baseline that uses a 200 times larger model. These results suggest that the phonological regularity of Turkish syllables provides a strong and resource-light inductive bias for retrieval tasks.
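The abstract does not spell out the scoring protocol beyond "Recall@5 with fine-grained chunk-based retrieval", so the following is a hedged sketch of one plausible reading: passages are split into chunks, each query retrieves its top-5 chunks by embedding similarity, and a query counts as a hit if any retrieved chunk comes from its gold passage. All names (`recall_at_5`, `chunk_to_passage`) are assumptions for illustration, and the paper's exact protocol may differ.

```python
import numpy as np

def recall_at_5(query_embs, chunk_embs, chunk_to_passage, gold_passages, k=5):
    """Fraction of queries whose gold passage owns at least one of the
    top-k retrieved chunks. Embeddings are assumed L2-normalized, so the
    dot product is cosine similarity."""
    sims = query_embs @ chunk_embs.T          # (n_queries, n_chunks)
    topk = np.argsort(-sims, axis=1)[:, :k]   # top-k chunk indices per query
    hits = sum(
        gold in {chunk_to_passage[c] for c in row}
        for row, gold in zip(topk, gold_passages)
    )
    return hits / len(gold_passages)
```

Under this reading, chunking lets a small encoder score short, focused spans instead of whole passages, which is one plausible reason the fine-grained strategy helps a 1.5M-parameter model.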