GMGaze: MoE-Based Context-Aware Gaze Estimation with CLIP and Multiscale Transformer
Xinyuan Zhao, Yihang Wu, Ahmad Chaddad, Sarah A. Alkhodair, Reem Kateb
TLDR
GMGaze is a MoE-based, context-aware gaze estimation model using CLIP and a multiscale transformer for improved accuracy and cross-domain performance.
Key contributions
- Uses semantic prototype conditioning with four learned prototype banks (illumination, background, head pose, appearance) to inject context into the CLIP global embedding (see the sketch after this list).
- Fuses CLIP global, CLIP patch, and CNN tokens at the first transformer layer; this early unified fusion prevents the information loss common in late-stage merging.
- Integrates sparse Mixture-of-Experts (MoE) modules for conditional computational capacity without uniformly increasing the dense parameter count (a routing sketch also follows below).
- Applies adversarial domain adaptation with a feature separation loss for state-of-the-art cross-domain results (a candidate loss is sketched after the abstract).
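How the prototype banks condition the CLIP embedding is not spelled out in this summary, so the following is a minimal PyTorch sketch under stated assumptions: each bank is a small table of learned vectors, the global embedding soft-attends over every bank, and two linear heads emit the two complementary context-biased global tokens. The class name, bank size, attention read-out, and two-head projection are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeConditioning(nn.Module):
    """Hypothetical sketch of semantic prototype conditioning.

    Four learned prototype banks (illumination, background, head pose,
    appearance) modulate the CLIP global image embedding into two
    complementary context-biased global tokens.
    """

    def __init__(self, dim: int = 512, protos_per_bank: int = 16):
        super().__init__()
        # One learnable prototype bank per semantic factor (sizes assumed).
        self.banks = nn.ParameterDict({
            name: nn.Parameter(torch.randn(protos_per_bank, dim) * 0.02)
            for name in ("illumination", "background", "head_pose", "appearance")
        })
        # Two projections produce the two complementary global tokens.
        self.to_token_a = nn.Linear(2 * dim, dim)
        self.to_token_b = nn.Linear(2 * dim, dim)

    def forward(self, g: torch.Tensor):
        # g: (B, dim) CLIP global image embedding.
        readouts = []
        for bank in self.banks.values():
            # Scaled soft attention of the global embedding over the bank.
            attn = F.softmax(g @ bank.t() / g.shape[-1] ** 0.5, dim=-1)  # (B, P)
            readouts.append(attn @ bank)                                 # (B, dim)
        ctx = torch.stack(readouts, dim=1).mean(dim=1)  # average the four banks
        fused = torch.cat([g, ctx], dim=-1)             # (B, 2 * dim)
        return self.to_token_a(fused), self.to_token_b(fused)
```

In the full model, the two returned tokens would be concatenated with the CLIP patch tokens and CNN tokens before the first transformer layer, which is where the early unified fusion happens.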
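The MoE bullet corresponds to standard token-wise top-k routing; below is a hedged sketch of such a block. The expert count, k, and FFN width are illustrative assumptions, and every expert is evaluated densely here for readability, whereas an efficient implementation would dispatch only the tokens routed to each expert.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Hypothetical sketch of a sparse Mixture-of-Experts FFN block.

    Capacity scales with the number of experts while per-token compute
    stays near that of a single dense FFN, since each token only uses
    its top-k experts.
    """

    def __init__(self, dim: int = 512, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) token sequence.
        scores = self.gate(x)                       # (B, N, E) router logits
        weights, idx = scores.topk(self.k, dim=-1)  # keep top-k experts per token
        weights = F.softmax(weights, dim=-1)        # normalize over the chosen k
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e                                     # (B, N, k) routed here?
            if mask.any():
                w = (weights * mask).sum(dim=-1, keepdim=True)  # per-token weight
                out = out + w * expert(x)  # dense evaluation for clarity only
        return out
```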
Why it matters
GMGaze targets three recurring weaknesses in gaze estimation pipelines: late fusion of image features, the absence of factor-aware conditioning, and inefficient capacity scaling. Addressing all three in one architecture yields state-of-the-art angular error in both within-domain and cross-domain settings across four public benchmarks.
Original Abstract
Gaze estimation methods commonly use facial appearances to predict the direction of a person's gaze. However, previous studies show three major challenges with convolutional neural network (CNN)-based, transformer-based, and contrastive language-image pre-training (CLIP)-based methods, including late fusion of image features, lack of factor-aware conditioning, and impractical capacity scaling. To address these challenges, we propose Globally-conditioned Multi-scale Gaze estimation (GMGaze), which leverages a multi-scale transformer architecture. Specifically, the model first introduces semantic prototype conditioning, which modulates the CLIP global image embedding using four learned prototype banks (i.e., illumination, background, head pose, and appearance) to generate two complementary context-biased global tokens. These tokens, along with the CLIP patch and CNN tokens, are fused at the first layer. This early unified fusion prevents information loss common in late-stage merging. Finally, each token passes through sparse Mixture-of-Experts modules, providing conditional computational capacity without uniformly increasing dense parameters. For cross-domain adaptation, we incorporate an adversarial domain adaptation technique with a feature separation loss that encourages the two global tokens to remain de-correlated. Experiments using four public benchmarks (MPIIFaceGaze, EYEDIAP, Gaze360, and ETH-XGaze) show that GMGaze achieves mean angular errors of 2.49$^\circ$, 3.22$^\circ$, 10.16$^\circ$, and 1.44$^\circ$, respectively, outperforming previous baselines in all within-domain settings. In cross-domain evaluations, it provides state-of-the-art (SOTA) results on two standard transfer routes.
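The abstract names a feature separation loss that keeps the two global tokens de-correlated but does not give its form. One plausible instantiation, offered here as an assumption rather than the paper's definition, penalizes the squared batch cross-covariance between the two token streams.

```python
import torch

def feature_separation_loss(t_a: torch.Tensor, t_b: torch.Tensor) -> torch.Tensor:
    """Assumed de-correlation penalty for the two global tokens.

    t_a, t_b: (B, dim) batches of the two context-biased global tokens.
    Returns the mean squared entry of their batch cross-covariance, which
    is zero exactly when the centered features are uncorrelated.
    """
    t_a = t_a - t_a.mean(dim=0, keepdim=True)
    t_b = t_b - t_b.mean(dim=0, keepdim=True)
    cov = t_a.t() @ t_b / max(t_a.shape[0] - 1, 1)  # (dim, dim) cross-covariance
    return cov.pow(2).mean()
```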