SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection

April 20, 2026 | arXiv:2604.18476

Hao Vo, Khoa Vo, Thinh Phan, Ngo Xuan Cuong, Gianfranco Doretto + 3 more

cs.CV

TLDR

SemLT3D improves camera-only 3D object detection for rare categories by using semantic-guided expert distillation and CLIP-informed features.

Key contributions

Tackles long-tail imbalance in camera-only 3D object detection for rare, safety-critical categories.
Employs a language-guided mixture-of-experts for semantic routing and tail distribution specialization.
Utilizes semantic projection distillation to align 3D queries with CLIP-informed 2D features.
Improves robustness against diverse appearance variations and challenging corner cases.
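
The semantic routing idea in the second contribution can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the router, so the cosine-similarity scoring, the softmax temperature, the top-k selection, and every name below (route_query, expert_prototypes) are assumptions made for illustration. Each expert is represented by a hypothetical text-embedding prototype, and a query is sent to the experts whose prototypes it most resembles.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_query(query, expert_prototypes, top_k=1, temperature=0.1):
    """Return [(expert_index, weight), ...] for the top_k experts.

    Assumption: affinity is cosine similarity between the query and a
    per-expert semantic prototype (e.g. derived from text embeddings).
    Weights are renormalized over the selected experts so they sum to 1.
    """
    sims = [cosine(query, p) for p in expert_prototypes]
    weights = softmax(sims, temperature)
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(weights[i] for i in chosen)
    return [(i, weights[i] / norm) for i in chosen]

# Toy 2-D prototypes standing in for real semantic embeddings (hypothetical).
prototypes = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
assignment = route_query([1.0, 0.0], prototypes, top_k=2)
```

In this sketch a lower temperature makes routing closer to a hard assignment, while a larger top_k lets several experts share a query.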

Why it matters

Existing camera-only 3D detectors struggle with rare, safety-critical objects because real-world datasets are heavily long-tailed. SemLT3D addresses this by using semantic priors to enrich the representations of underrepresented classes, a step toward more reliable camera-only perception for autonomous driving.

Original Abstract

Camera-only 3D object detection has emerged as a cost-effective and scalable alternative to LiDAR for autonomous driving, yet existing methods primarily prioritize overall performance while overlooking the severe long-tail imbalance inherent in real-world datasets. In practice, many rare but safety-critical categories such as children, strollers, or emergency vehicles are heavily underrepresented, leading to biased learning and degraded performance. This challenge is further exacerbated by pronounced inter-class ambiguity (e.g., visually similar subclasses) and substantial intra-class diversity (e.g., objects varying widely in appearance, scale, pose, or context), which together hinder reliable long-tail recognition. In this work, we introduce SemLT3D, a Semantic-Guided Expert Distillation framework designed to enrich the representation space for underrepresented classes through semantic priors. SemLT3D consists of: (1) a language-guided mixture-of-experts module that routes 3D queries to specialized experts according to their semantic affinity, enabling the model to better disentangle confusing classes and specialize on tail distributions; and (2) a semantic projection distillation pipeline that aligns 3D queries with CLIP-informed 2D semantics, producing more coherent and discriminative features across diverse visual manifestations. Although motivated by long-tail imbalance, the semantically structured learning in SemLT3D also improves robustness under broader appearance variations and challenging corner cases, offering a principled step toward more reliable camera-only 3D perception.
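
As a rough illustration of component (2), the sketch below computes a cosine-based alignment loss between projected 3D query features and 2D teacher features. The abstract does not state the loss or the projection used, so the linear projection, the "1 minus cosine similarity" objective, and all names here (project, cosine_distill_loss, weight) are assumptions rather than the paper's method. The teacher vectors stand in for CLIP-informed 2D features.

```python
import math

def project(query, weight):
    """Apply a linear projection; each row of `weight` yields one output dim."""
    return [sum(w * q for w, q in zip(row, query)) for row in weight]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def cosine_distill_loss(queries, teacher_feats, weight):
    """Mean (1 - cosine) between projected queries and teacher features.

    Assumption: each 3D query is already matched to one 2D teacher feature.
    The loss is 0 when directions agree and 2 when they are opposite.
    """
    total = 0.0
    for q, t in zip(queries, teacher_feats):
        total += 1.0 - cosine(project(q, weight), t)
    return total / len(queries)

# Toy example: identity projection, one aligned pair and one orthogonal pair.
identity = [[1.0, 0.0], [0.0, 1.0]]
loss = cosine_distill_loss(
    [[1.0, 2.0], [1.0, 0.0]],   # hypothetical 3D query features
    [[2.0, 4.0], [0.0, 1.0]],   # hypothetical 2D teacher features
    identity,
)
```

In a real training setup the projection weights would be learned and the teacher features kept frozen, which is the usual arrangement for feature distillation.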
