REALM: An RGB and Event Aligned Latent Manifold for Cross-Modal Perception
Vincenzo Polizzi, David B. Lindell, Jonathan Kelly
TLDR
REALM uses LoRA to align event camera data with frozen RGB foundation models, enabling zero-shot transfer of image-trained decoders to events and state-of-the-art wide-baseline feature matching.
Key contributions
- REALM projects event representations into pretrained RGB foundation model latent spaces.
- Uses low-rank adaptation (LoRA) to bridge the modality gap, unlocking the geometric and semantic priors of frozen RGB backbones for event streams (see the sketch after this list).
- Enables zero-shot application of complex, frozen image-trained decoders to raw event data.
- Achieves state-of-the-art performance in wide-baseline feature matching.
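The paper's code is not yet released, so the following is only a minimal PyTorch sketch of the general LoRA mechanism the contributions refer to, not REALM's actual implementation: a frozen linear layer from an RGB backbone is augmented with a trainable low-rank residual. The names `LoRALinear` and `inject_lora` and the rank/scale values are hypothetical.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank residual:
    y = W x + (alpha / r) * B A x. Only A and B receive gradients."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # keep the RGB backbone frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # start as a zero update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

def inject_lora(module: nn.Module, rank: int = 8) -> None:
    """Recursively swap every nn.Linear for a LoRA-wrapped version."""
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Linear):
            setattr(module, name, LoRALinear(child, rank=rank))
        else:
            inject_lora(child, rank=rank)

# Example: adapt a tiny stand-in "backbone"; only LoRA weights stay trainable.
backbone = nn.Sequential(nn.Linear(32, 64), nn.GELU(), nn.Linear(64, 32))
inject_lora(backbone, rank=4)
print([n for n, p in backbone.named_parameters() if p.requires_grad])
```

Because the low-rank matrices are the only trainable parameters, adapting the backbone to event data touches a small fraction of its weights while the RGB priors stay intact.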
Why it matters
Event cameras offer unique advantages, but learning-based event processing has been siloed into task-specific pipelines. REALM addresses this by creating a unified latent space, allowing event data to leverage powerful pretrained RGB models. This significantly expands the utility of event cameras for diverse vision tasks.
Original Abstract
Event cameras provide several unique advantages over standard frame-based sensors, including high temporal resolution, low latency, and robustness to extreme lighting. However, existing learning-based approaches for event processing are typically confined to narrow, task-specific silos and lack the ability to generalize across modalities. We address this gap with REALM, a cross-modal framework that learns an RGB and Event Aligned Latent Manifold by projecting event representations into the pretrained latent space of RGB foundation models. Instead of task-specific training, we leverage low-rank adaptation (LoRA) to bridge the modality gap, effectively unlocking the geometric and semantic priors of frozen RGB backbones for asynchronous event streams. We demonstrate that REALM effectively maps events into the ViT-based foundation latent space. Our method allows us to perform downstream tasks like depth estimation and semantic segmentation by simply transferring linear heads trained on the RGB teacher. Most significantly, REALM enables the direct, zero-shot application of complex, frozen image-trained decoders, such as MASt3R, to raw event data. We demonstrate state-of-the-art performance in wide-baseline feature matching, significantly outperforming specialized architectures. Code and models are available upon acceptance.
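To make the head-transfer idea from the abstract concrete, here is a minimal, hypothetical PyTorch sketch of the zero-shot pattern it describes: once event tokens are aligned with the RGB latent space, a linear head trained only on the RGB teacher's features can be applied unchanged. `event_encoder`, `linear_head`, and all dimensions are placeholders, not REALM's released components.

```python
import torch
import torch.nn as nn

embed_dim, num_classes = 768, 19            # e.g. ViT-B width, 19 semantic classes

# Placeholders: `event_encoder` stands in for the LoRA-adapted ViT that maps
# event representations into the RGB latent manifold; `linear_head` stands in
# for a segmentation head trained purely on the frozen RGB teacher's features.
event_encoder = nn.Linear(512, embed_dim)
linear_head = nn.Linear(embed_dim, num_classes).eval()
for p in linear_head.parameters():
    p.requires_grad = False                  # reused as-is: no event-domain finetuning

event_feats = torch.randn(1, 196, 512)       # dummy patchified event representation
with torch.no_grad():
    tokens = event_encoder(event_feats)      # tokens aligned with the RGB space
    logits = linear_head(tokens)             # per-token zero-shot prediction
print(logits.shape)                          # torch.Size([1, 196, 19])
```

The same pattern extends to heavier frozen decoders such as MASt3R: because the aligned event tokens mimic RGB features, an image-trained decoder can consume them without any event-specific retraining.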