ArXiv TLDR

A Vision-Language-Action Model for Adaptive Ultrasound-Guided Needle Insertion and Needle Tracking

arXiv:2604.20347

Yuelin Zhang, Qingpeng Ding, Longxiang Tang, Chengyu Fang, Shing Shin Cheng

cs.RO, cs.AI

TLDR

A VLA model unifies needle tracking and insertion control for adaptive, automated ultrasound-guided procedures, outperforming state-of-the-art trackers and manual operation in tracking accuracy, insertion success rate, and procedure time.

Key contributions

  • Proposes a Vision-Language-Action (VLA) model for adaptive, automated US-guided needle insertion and tracking.
  • Introduces a Cross-Depth Fusion (CDF) tracking head for real-time, end-to-end needle tracking.
  • Uses a Tracking-Conditioning (TraCon) register for efficient adaptation of vision backbones for tracking.
  • Employs an uncertainty-aware control policy for adaptive and safe needle insertion control.

Why it matters

This paper introduces a unified VLA model that significantly improves the safety and efficiency of ultrasound-guided needle insertions. By enabling real-time adaptive control, it addresses critical challenges in dynamic medical procedures. This advancement could lead to better patient outcomes and reduced procedure times in robotic surgery.

Original Abstract

Ultrasound (US)-guided needle insertion is a critical yet challenging procedure due to dynamic imaging conditions and difficulties in needle visualization. Many methods have been proposed for automated needle insertion, but they often rely on hand-crafted pipelines with modular controllers, whose performance degrades in challenging cases. In this paper, a Vision-Language-Action (VLA) model is proposed for adaptive and automated US-guided needle insertion and tracking on a robotic ultrasound (RUS) system. This framework provides a unified approach to needle tracking and needle insertion control, enabling real-time, dynamically adaptive adjustment of insertion based on the obtained needle position and environment awareness. To achieve real-time and end-to-end tracking, a Cross-Depth Fusion (CDF) tracking head is proposed, integrating shallow positional and deep semantic features from the large-scale vision backbone. To adapt the pretrained vision backbone for tracking tasks, a Tracking-Conditioning (TraCon) register is introduced for parameter-efficient feature conditioning. After needle tracking, an uncertainty-aware control policy and an asynchronous VLA pipeline are presented for adaptive needle insertion control, ensuring timely decision-making for improved safety and outcomes. Extensive experiments on both needle tracking and insertion show that our method consistently outperforms state-of-the-art trackers and manual operation, achieving higher tracking accuracy, improved insertion success rates, and reduced procedure time, highlighting promising directions for RUS-based intelligent intervention.
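The abstract's uncertainty-aware control policy can be illustrated with a minimal sketch: insertion advances only while tracking confidence is high. All names, thresholds, and the proportional step-scaling rule below are hypothetical illustrations, not the paper's actual learned policy.

```python
# Minimal sketch of uncertainty-gated insertion control.
# Hypothetical logic: the paper conditions a learned VLA controller
# on tracked needle position and uncertainty; here we just gate and
# scale a fixed-size advance step.

def insertion_step(target_depth_mm: float,
                   current_depth_mm: float,
                   tracking_uncertainty: float,
                   pause_threshold: float = 0.5,
                   max_step_mm: float = 1.0) -> float:
    """Return the next insertion increment in millimeters.

    High tracking uncertainty pauses the advance (returns 0.0);
    moderate uncertainty shrinks the step proportionally.
    """
    if tracking_uncertainty >= pause_threshold:
        return 0.0  # hold position until the needle is re-localized
    remaining = max(0.0, target_depth_mm - current_depth_mm)
    # Scale the step down as uncertainty approaches the pause threshold.
    scale = 1.0 - tracking_uncertainty / pause_threshold
    return min(max_step_mm * scale, remaining)

if __name__ == "__main__":
    print(insertion_step(20.0, 10.0, 0.0))  # confident: full 1.0 mm step
    print(insertion_step(20.0, 10.0, 0.6))  # uncertain: hold, 0.0 mm
```

The gating captures the safety intuition from the abstract: when needle visualization degrades, the controller stops advancing rather than acting on a stale estimate.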
