Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback
Edoardo Bianchi, Antonio Liotta
TLDR
This paper presents parameter-efficient multi-view methods (SkillFormer, PATS, ProfVLM) for estimating human action proficiency and generating expert feedback.
Key contributions
- SkillFormer: Parameter-efficient discriminative architecture for selective multi-view fusion.
- PATS: Improves temporal sampling by preserving locally dense excerpts of fundamental movements.
- ProfVLM: Generates proficiency labels and expert feedback via conditional language generation.
- Achieves state-of-the-art accuracy on Ego-Exo4D with up to 20x fewer trainable parameters and up to 3x fewer training epochs than video-transformer baselines.
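The gated cross-view fusion idea behind these methods can be illustrated with a minimal sketch. Everything here is an assumption for illustration (function names, the sigmoid gating form, the feature dimensions), not the papers' actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_cross_view_fusion(ego, exo, w, b):
    """Fuse egocentric and exocentric feature vectors with a learned gate.

    Hypothetical sketch: gate = sigmoid(W [ego; exo] + b), then
    fused = gate * ego + (1 - gate) * exo, a per-dimension convex
    combination that lets the model weight each view selectively.
    """
    gate = sigmoid(np.concatenate([ego, exo]) @ w + b)
    return gate * ego + (1.0 - gate) * exo

# Toy example: 8-dim features from one ego and one exo camera.
d = 8
ego = rng.standard_normal(d)
exo = rng.standard_normal(d)
w = rng.standard_normal((2 * d, d)) * 0.1  # illustrative gate weights
b = np.zeros(d)
fused = gated_cross_view_fusion(ego, exo, w, b)
```

Because the gate lies in (0, 1) per dimension, the fused feature always interpolates between the two views, which is one simple way a model can learn to suppress an uninformative camera.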
Why it matters
This paper addresses the critical task of estimating human action proficiency, vital for coaching and rehabilitation. It introduces efficient, multi-view systems that provide interpretable, generative feedback, moving beyond simple classification. This makes proficiency estimation more practical and actionable.
Original Abstract
Estimating how well a person performs an action, rather than which action is performed, is central to coaching, rehabilitation, and talent identification. This task is challenging because proficiency is encoded in subtle differences in timing, balance, body mechanics, and execution, often distributed across multiple views and short temporal events. We discuss three recent contributions to multi-view proficiency estimation on Ego-Exo4D. SkillFormer introduces a parameter-efficient discriminative architecture for selective multi-view fusion; PATS improves temporal sampling by preserving locally dense excerpts of fundamental movements; and ProfVLM reformulates proficiency estimation as conditional language generation, producing both a proficiency label and expert-style feedback through a gated cross-view projector and a compact language backbone. Together, these methods achieve state-of-the-art accuracy on Ego-Exo4D with up to 20x fewer trainable parameters and up to 3x fewer training epochs than video-transformer baselines, while moving from closed-set classification toward interpretable feedback generation. These results highlight a shift toward efficient, multi-view systems that combine selective fusion, proficiency-aware sampling, and actionable generative feedback.
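The abstract's "locally dense excerpts" sampling idea can be sketched as follows: instead of spacing frames uniformly across the clip, sample short runs of consecutive frames so brief, proficiency-relevant events survive. The function name and parameterization are assumptions for illustration, not PATS's actual algorithm:

```python
def dense_segment_sampling(num_frames, num_segments, frames_per_segment):
    """Return frame indices as locally dense windows, one per segment.

    Hypothetical sketch: split the clip into equal segments, then take a
    centered run of consecutive frames inside each segment. This preserves
    short temporal events that uniform striding would dilute.
    """
    seg_len = num_frames // num_segments
    indices = []
    for s in range(num_segments):
        # Center the dense window within the segment.
        start = s * seg_len + max(0, (seg_len - frames_per_segment) // 2)
        indices.extend(range(start, min(start + frames_per_segment, num_frames)))
    return indices

# Toy example: a 120-frame clip, 4 segments, 8 consecutive frames each.
idx = dense_segment_sampling(120, 4, 8)
```

Within each window the indices are consecutive, so fast movements (a swing, a step) are seen at full temporal resolution, while the segments still cover the whole clip.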