Fast Byte Latent Transformer
Julie Kallini, Artidoro Pagnoni, Tomasz Limisiewicz, Gargi Ghosh, Luke Zettlemoyer, and 3 more authors
TLDR
This work introduces new training and generation techniques that significantly speed up generation in the Byte Latent Transformer (BLT), a byte-level language model.
Key contributions
- BLT Diffusion (BLT-D) enables parallel byte generation using a block-wise diffusion objective for faster inference (see the sketch after this list).
- BLT Self-speculation (BLT-S) drafts bytes with BLT's local decoder and verifies them in a single full-model pass, trading some speed for quality (sketched after the abstract below).
- BLT Diffusion+Verification (BLT-DV) combines diffusion-based generation with an autoregressive verification step.
- All three variants can achieve an estimated memory-bandwidth cost over 50% lower than the original BLT on generation tasks.
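
To make the parallel-generation idea concrete, here is a minimal, self-contained sketch of block-wise iterative denoising in the spirit of BLT-D. Everything in it is an illustrative assumption: `toy_denoiser`, the random unmasking schedule, and the block size are hypothetical stand-ins, not the paper's architecture or training objective.

```python
import random

MASK = -1  # placeholder for a byte position that has not been generated yet

def toy_denoiser(prefix, block):
    """Hypothetical stand-in for one parallel forward pass: propose a byte
    for every masked position, conditioned on prefix + committed bytes."""
    ctx = prefix + [b for b in block if b != MASK]
    return [(sum(ctx) * 31 + i * 13 + 7) % 256 if b == MASK else b
            for i, b in enumerate(block)]

def generate_block(prefix, block_size=16, steps=4, seed=0):
    """Fill one block in `steps` parallel passes instead of `block_size`
    sequential ones, unmasking a slice of positions after each pass
    (confidence-ordered in real schedulers; random here for simplicity)."""
    rng = random.Random(seed)
    block = [MASK] * block_size
    masked = list(range(block_size))
    for step in range(steps):
        proposal = toy_denoiser(prefix, block)     # one parallel forward pass
        k = max(1, len(masked) // (steps - step))  # positions to commit now
        for pos in rng.sample(masked, k):
            block[pos] = proposal[pos]             # commit the proposed byte
            masked.remove(pos)
    proposal = toy_denoiser(prefix, block)         # resolve any stragglers
    return [proposal[i] if b == MASK else b for i, b in enumerate(block)]

if __name__ == "__main__":
    print(generate_block(prefix=[72, 105]))  # 16 bytes from ~4 passes, not 16
```

The bandwidth intuition: if each decoding step must stream the full model weights through memory, producing a 16-byte block in roughly 4 passes instead of 16 cuts weight reads per byte by about 4x, which is one way a reduction of over 50% in estimated memory-bandwidth cost becomes plausible.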
Why it matters
This paper addresses the key bottleneck of slow autoregressive generation in byte-level LMs. By introducing faster variants, it makes byte-level models more practical and efficient, removing barriers to their widespread adoption.
Original Abstract
Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generation techniques. First, we introduce BLT Diffusion (BLT-D), a new model and our fastest BLT variant, trained with an auxiliary block-wise diffusion objective alongside the standard next-byte prediction loss. This enables an inference procedure that generates multiple bytes in parallel per decoding step, substantially reducing the number of forward passes required to generate a sequence. Second, we propose two extensions inspired by speculative decoding that trade some of this speed for higher generation quality: BLT Self-speculation (BLT-S), in which BLT's local decoder continues generating past its normal patch boundaries to draft bytes, which are then verified with a single full-model forward pass; and BLT Diffusion+Verification (BLT-DV), which augments BLT-D with an autoregressive verification step after diffusion-based generation. All methods may achieve an estimated memory-bandwidth cost over 50% lower than BLT on generation tasks. Each approach offers its own unique advantages, together removing key barriers to the practical use of byte-level LMs.
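
The draft-then-verify pattern behind BLT-S (and the verification step of BLT-DV) can be sketched with toy stand-ins. `toy_local_decoder` and `toy_full_model` below are hypothetical placeholders for BLT's cheap local decoder and expensive full model; real verification checks a drafted run with one batched full-model pass, while the per-byte loop here is unrolled for clarity.

```python
from typing import List

def toy_full_model(prefix: List[int]) -> int:
    """Stand-in for one expensive full-model pass: deterministic next byte."""
    return (sum(prefix) * 31 + 7) % 256

def toy_local_decoder(prefix: List[int]) -> int:
    """Stand-in for the cheap local draft model; it only sees a short
    window, so it imperfectly mimics the full model."""
    return (sum(prefix[-4:]) * 31 + 7) % 256

def speculative_step(prefix: List[int], draft_len: int = 8) -> List[int]:
    """Draft `draft_len` bytes cheaply, then keep the longest prefix the
    full model agrees with; at least one byte is always produced."""
    draft, ctx = [], list(prefix)
    for _ in range(draft_len):                 # cheap drafting loop
        b = toy_local_decoder(ctx)
        draft.append(b)
        ctx.append(b)

    accepted, ctx = [], list(prefix)
    for b in draft:                            # verification (conceptually one
        target = toy_full_model(ctx)           # batched full-model pass)
        if b == target:
            accepted.append(b)                 # draft matches: accept it
            ctx.append(b)
        else:
            accepted.append(target)            # mismatch: take the full
            break                              # model's byte, drop the rest
    return accepted

if __name__ == "__main__":
    seq = [72, 105]                            # arbitrary starting bytes
    while len(seq) < 32:
        seq.extend(speculative_step(seq))
    print(seq)
```

Because the first mismatching position always receives the full model's byte, output at that position matches ordinary autoregressive decoding; the speedup comes from accepting multi-byte drafted runs whenever the cheap decoder agrees with the full model.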