ArXiv TLDR

Micro Language Models Enable Instant Responses

2604.19642

Wen Cheng, Tuochao Chen, Karim Helwani, Sriram Srinivasan, Luke Zettlemoyer + 1 more

cs.CL

TLDR

Micro LMs (8M-30M params) enable instant, contextually grounded responses on edge devices by initiating replies while cloud models complete them.

Key contributions

  • Introduces micro LMs (μLMs, 8M-30M params) that instantly initiate responses on edge devices.
  • Proposes a collaborative framework where μLMs start responses and cloud models complete them.
  • Achieves seamless mid-sentence handoffs and structured error recovery for robust operation.
  • Demonstrates μLMs match larger 70M-256M-class models despite orders-of-magnitude fewer parameters.
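The handoff described above can be pictured as a two-stage pipeline: the on-device μLM emits the first few words immediately, and the cloud model is prompted to continue that prefix rather than answer from scratch. The following is a minimal sketch of that flow under stated assumptions; the function names, the word cap, and the single recovery rule are illustrative stand-ins, not the paper's actual implementation (which uses three error-correction methods).

```python
def micro_lm_opener(prompt: str) -> str:
    """Stand-in for the on-device 8M-30M μLM: produces a short,
    contextually grounded opener with negligible latency."""
    return "Sure, the weather today is"

def cloud_continuation(prompt: str, opener: str) -> str:
    """Stand-in for the cloud model reframed as a *continuator*:
    it receives the local prefix and completes the sentence
    instead of generating a fresh response."""
    return opener + " sunny with a light breeze."

def respond(prompt: str, max_opener_words: int = 8) -> str:
    # 1) Emit the first 4-8 words instantly on-device to mask cloud latency.
    opener = " ".join(micro_lm_opener(prompt).split()[:max_opener_words])
    # 2) Hand off mid-sentence: the cloud continues from the local prefix.
    full = cloud_continuation(prompt, opener)
    # 3) One illustrative recovery rule: if the cloud rejects the opener
    #    (its output does not extend the prefix), restart without it.
    if not full.startswith(opener):
        full = cloud_continuation(prompt, opener="")
    return full
```

In a real deployment the opener would be streamed to the user before the cloud reply arrives; here both stages are synchronous stubs so the control flow is easy to follow.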

Why it matters

This paper addresses the critical latency problem for AI assistants on extremely resource-constrained edge devices. By generating the first words of a response locally, it masks multi-second cloud latency while staying within the power and compute budgets of smartwatches and smart glasses, making responsive hybrid AI practical on hardware that cannot run even 100M-parameter models continuously.

Original Abstract

Edge devices such as smartwatches and smart glasses cannot continuously run even the smallest 100M-1B parameter language models due to power and compute constraints, yet cloud inference introduces multi-second latencies that break the illusion of a responsive assistant. We introduce micro language models (μLMs): ultra-compact models (8M-30M parameters) that instantly generate the first 4-8 words of a contextually grounded response on-device, while a cloud model completes it; thus, masking the cloud latency. We show that useful language generation survives at this extreme scale with our models matching several 70M-256M-class existing models. We design a collaborative generation framework that reframes the cloud model as a continuator rather than a respondent, achieving seamless mid-sentence handoffs and structured graceful recovery via three error correction methods when the local opener goes wrong. Empirical results show that μLMs can initiate responses that larger models complete seamlessly, demonstrating that orders-of-magnitude asymmetric collaboration is achievable and unlocking responsive AI for extremely resource-constrained devices. The model checkpoint and demo are available at https://github.com/Sensente/micro_language_model_swen_project.

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.