ArXiv TLDR

Optimizing Korean-Centric LLMs via Token Pruning

arXiv: 2604.16235

Hoyeol Kim, Hyeonwoo Kim

cs.CL

TLDR

This paper benchmarks token pruning as a way to adapt multilingual LLMs to Korean-centric tasks, showing improved generation stability and stronger machine-translation performance while sharply reducing vocabulary size.

Key contributions

  • Benchmarks token pruning on state-of-the-art multilingual LLMs (Qwen3, Gemma-3, Llama-3, Aya) for Korean NLP.
  • Shows token pruning significantly improves generation stability by reducing language confusion.
  • Enhances performance on Korean-specific machine translation tasks.
  • Validates token pruning as an effective optimization for memory-constrained, domain-specific LLM deployments.

Why it matters

This paper offers a practical method, token pruning, for specializing multilingual LLMs to a single target language such as Korean. It addresses the challenge of deploying powerful models in resource-constrained environments, making them more efficient and stable for targeted applications.

Original Abstract

This paper presents a systematic benchmark of state-of-the-art multilingual large language models (LLMs) adapted via token pruning - a compression technique that eliminates tokens and embedding parameters corresponding to languages irrelevant to the target application. Focusing on Korean-centric natural language processing (NLP) tasks, we evaluate architectures including Qwen3, Gemma-3, Llama-3, and Aya across three vocabulary configurations: Original, English-Korean (EnKo), and English-Korean-Chinese (EnKoZh). Performance is assessed using established benchmarks for general aptitude, cultural literacy, instruction following, and machine translation. Our findings indicate that token pruning significantly improves generation stability by eliminating language confusion, and in the case of machine translation, frequently enhances performance on Korean-specific tasks. While instruction-following capabilities display architecture-dependent variance linked to latent cross-lingual representations, the significant reduction in vocabulary size validates token pruning as a highly effective optimization strategy for memory-constrained, domain-specific deployments, despite modest gains in inference latency.
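To make the idea concrete, here is a minimal, illustrative sketch of the kind of vocabulary pruning the abstract describes: keep only the token IDs observed when tokenizing an English-Korean reference corpus, then slice the input embedding matrix down to those rows. This is not the paper's exact procedure; the checkpoint name, the tiny two-sentence corpus, the stand-in hidden size, and the `keep_ids`/`old_to_new` names are placeholders for illustration.

```python
# Illustrative sketch of vocabulary/embedding pruning (not the paper's exact method).
import torch
import torch.nn as nn
from transformers import AutoTokenizer

# Example multilingual checkpoint; any Hugging Face tokenizer works here.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

# Placeholder English-Korean reference corpus used to decide which tokens to keep.
enko_corpus = [
    "Token pruning removes vocabulary entries the target domain never uses.",
    "토큰 프루닝은 대상 도메인에서 쓰이지 않는 어휘를 제거한다.",
]

# 1) Collect the token IDs actually observed on the reference corpus,
#    and build a mapping from old (full-vocabulary) IDs to new (pruned) IDs.
keep_ids = sorted({i for text in enko_corpus for i in tok(text)["input_ids"]})
old_to_new = {old: new for new, old in enumerate(keep_ids)}

# 2) Slice the input embedding matrix down to the kept rows.
hidden = 64  # tiny stand-in dimension to keep the demo light
full_emb = nn.Embedding(tok.vocab_size, hidden)          # stand-in for the model's embeddings
pruned_emb = nn.Embedding(len(keep_ids), hidden)
with torch.no_grad():
    pruned_emb.weight.copy_(full_emb.weight[keep_ids])

print(f"vocabulary: {tok.vocab_size} -> {len(keep_ids)} kept tokens")
```

A full pruning pass would also remap token IDs inside the tokenizer and shrink the (possibly tied) output LM-head projection in the same way, which is where the vocabulary-related memory savings for large multilingual models come from.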
