Action Motifs: Self-Supervised Hierarchical Representation of Human Body Movements
Genki Kinoshita, Shu Nakamura, Ryo Kawahara, Shohei Nobuhara, Yasutomo Kawanishi + 1 more
TLDR
This paper introduces A4Mer, a self-supervised hierarchical model that learns Action Atoms and Motifs for robust human movement representation.
Key contributions
- Proposes Action Motifs, a hierarchical representation of human movement using Action Atoms and their temporal compositions.
- Introduces A4Mer, a self-supervised nested latent Transformer to learn these hierarchical representations.
- Develops a unified masked token prediction pretext task for learning Action Atoms and Motifs.
- Presents Action Motif Dataset (AMD), a large-scale multi-view dataset whose frame-wise annotations are obtained via a novel use of foot-mounted cameras.
Why it matters
This paper tackles compositional human movement modeling, crucial for understanding complex behaviors. Learning reusable Action Motifs significantly boosts performance in tasks like action recognition and motion prediction. The new AMD dataset also provides valuable resources for future research.
Original Abstract
Effective human behavior modeling requires a representation of the human body movement that capitalizes on its compositionality. We propose a hierarchical representation consisting of Action Atoms that capture the atomic joint movements and Action Motifs that are formed by their temporal compositions and encode similar body movements found across different overall human actions. We derive A4Mer, a nested latent Transformer to learn this hierarchical representation from human pose data in a fully self-supervised manner. A4Mer splits a 3D pose sequence into variable-length segments and represents each segment as a single latent token (Action Atoms). Through bottom-up representation learning, temporal patterns composed of these Action Atoms, which capture meaningful temporal spans of reusable, semantic segments of body movements, naturally emerge (Action Motifs). A4Mer achieves this with a unified pretext task of masked token prediction in their respective latent spaces. We also introduce Action Motif Dataset (AMD), a large-scale dataset of multi-view human behavior videos with full SMPL annotations. We introduce a novel use of cameras by mounting them on the feet to achieve their frame-wise annotations despite frequent and heavy body occlusions. Experimental results demonstrate the effectiveness of A4Mer for extracting meaningful Action Motifs, which significantly benefit human behavior modeling tasks including action recognition, motion prediction, and motion interpolation.
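The pipeline the abstract describes (split a 3D pose sequence into variable-length segments, encode each segment as a single latent token, then train with masked token prediction in latent space) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the fixed segment boundaries, mean-pooling encoder, zero mask embedding, and mask ratio are all placeholders for components A4Mer learns or specifies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3D pose sequence: T frames x J joints x 3 coordinates.
T, J = 120, 24
poses = rng.standard_normal((T, J, 3))

# 1) Split the sequence into variable-length segments.
#    (Illustrative fixed boundaries; A4Mer infers segmentation itself.)
boundaries = [0, 15, 40, 70, 90, T]
segments = [poses[s:e] for s, e in zip(boundaries[:-1], boundaries[1:])]

# 2) Encode each segment as a single latent token ("Action Atom").
#    Mean-pooling over frames stands in for the learned segment encoder.
tokens = np.stack([seg.reshape(len(seg), -1).mean(axis=0) for seg in segments])

# 3) Masked token prediction: hide a subset of tokens; the pretext task
#    is to reconstruct them from the visible context tokens.
mask_ratio = 0.4
n_tokens = len(tokens)
n_masked = max(1, int(mask_ratio * n_tokens))
masked_idx = rng.choice(n_tokens, size=n_masked, replace=False)

inputs = tokens.copy()
inputs[masked_idx] = 0.0      # replaced by a mask embedding (zeros here)
targets = tokens[masked_idx]  # prediction targets in the latent space

print(tokens.shape, n_masked)
```

In the paper, temporal patterns over these Atom tokens (the Action Motifs) emerge bottom-up from the same unified masked-prediction objective; this sketch only shows the lowest level of that hierarchy.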