ArXiv TLDR

UIPress: Bringing Optical Token Compression to UI-to-Code Generation

arXiv: 2604.09442

Dasen Dai, Shuoqi Li, Ronghao Chen, Huacan Wang, Biao Wu + 1 more

cs.CL

TLDR

UIPress introduces the first encoder-side learned optical compression module for UI-to-Code generation, delivering large gains in both inference speed and generation quality.

Key contributions

  • UIPress is a lightweight learned compression module for UI-to-Code VLMs, inserted between the frozen ViT encoder and the LLM decoder of Qwen3-VL-8B.
  • Compresses ~6,700 visual tokens to a fixed budget of 256 using depthwise-separable convolutions, element-guided spatial reweighting, and Transformer refinement.
  • Achieves a 9.1× time-to-first-token speedup and a +7.5% CLIP score improvement over the uncompressed baseline on Design2Code.
  • To the authors' knowledge, it is the first encoder-side learned compression method for the UI-to-Code task.
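To make the fixed-budget idea concrete, here is a toy sketch in plain Python. It is not the UIPress module: where UIPress uses learned depthwise-separable convolutions, element-guided spatial reweighting, and Transformer refinement, this sketch just mean-pools contiguous buckets of tokens down to exactly 256 vectors.

```python
# Toy sketch (NOT the UIPress module): map a variable-length visual token
# sequence to a fixed budget of 256 tokens via adaptive mean pooling.
# UIPress learns this mapping; here the "compression" is a simple average.

def adaptive_pool_tokens(tokens, budget=256):
    """Pool a list of feature vectors down to exactly `budget` vectors."""
    n = len(tokens)
    pooled = []
    for i in range(budget):
        # Partition the sequence into `budget` contiguous buckets.
        start = i * n // budget
        end = max(start + 1, (i + 1) * n // budget)
        bucket = tokens[start:end]
        dim = len(bucket[0])
        pooled.append([sum(v[d] for v in bucket) / len(bucket)
                       for d in range(dim)])
    return pooled

# ~6,700 visual tokens with toy 4-dim features -> fixed budget of 256
visual_tokens = [[float(i)] * 4 for i in range(6700)]
compressed = adaptive_pool_tokens(visual_tokens)
print(len(compressed))       # 256
print(round(6700 / 256, 1))  # 26.2  (~26x sequence-length reduction)
```

Because the output length is constant regardless of input length, the LLM decoder always prefills over 256 visual tokens, which is what makes the latency reduction possible.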

Why it matters

UI-to-Code generation is bottlenecked by the thousands of visual tokens a single screenshot produces, leading to high prefill latency. Existing compression methods either select tokens with task-agnostic heuristics or zero out low-attention features without actually shortening the sequence. UIPress introduces an effective encoder-side learned (optical) compression paradigm that improves both speed and accuracy, making UI-to-Code generation more practical for real-world applications.

Original Abstract

UI-to-Code generation requires vision-language models (VLMs) to produce thousands of tokens of structured HTML/CSS from a single screenshot, making visual token efficiency critical. Existing compression methods either select tokens at inference time using task-agnostic heuristics, or zero out low-attention features without actually shortening the sequence -- neither truly reduces prefill latency or adapts to the non-uniform information density of UI screenshots. Meanwhile, optical (encoder-side learned) compression has shown strong results for document OCR, yet no prior work has adapted this paradigm to UI-to-Code generation. We propose UIPress, a lightweight learned compression module inserted between the frozen ViT encoder and the LLM decoder of Qwen3-VL-8B. UIPress combines depthwise-separable convolutions, element-guided spatial reweighting, and Transformer refinement to compress ~6,700 visual tokens to a fixed budget of 256. Together with Low-Rank Adaptation (LoRA) on the decoder to bridge the representation gap, the entire system adds only ~21.7M trainable parameters (0.26% of the 8B base model). Under a fair comparison on the same base model against four baselines on Design2Code, UIPress at 256 tokens achieves a CLIP score of 0.8127, outperforming the uncompressed baseline by +7.5% and the strongest inference-time method by +4.6%, while delivering 9.1× time-to-first-token speedup. To the best of our knowledge, UIPress is the first encoder-side learned compression method for the UI-to-Code task.
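The abstract's parameter-overhead figures are easy to sanity-check with back-of-the-envelope arithmetic. The exact Qwen3-VL-8B parameter count is an assumption here; the point is that 21.7M trainable parameters against an ~8B-class base model is on the order of a quarter of a percent, consistent with the quoted 0.26%.

```python
# Back-of-the-envelope check of the "0.26% of the 8B base model" claim.
# The base-model size of ~8.35B parameters is an assumption, not a figure
# taken from the paper.
trainable = 21.7e6   # trainable parameters added by UIPress + LoRA
base = 8.35e9        # assumed total parameters of the 8B-class base model

overhead_pct = 100 * trainable / base
print(round(overhead_pct, 2))  # 0.26
```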
