MotionBind: Multi-Modal Human Motion Alignment for Retrieval, Recognition, and Generation

University of Pennsylvania

Abstract

Recent advances in multi-modal representation learning have led to unified embedding spaces that align modalities such as images, text, audio, and video. However, human motion, a modality fundamental to understanding dynamic human activities, remains largely unrepresented in these frameworks. Semantic understanding of actions requires multi-modal grounding: text conveys descriptive semantics, video provides visual context, and audio supplies environmental cues.

To bridge this gap, we propose MotionBind, a novel architecture that extends the LanguageBind embedding space to incorporate human motion. MotionBind has two major components. The first is a Multi-Scale Temporal Motion Transformer (MuTMoT) that maps motion sequences to semantically meaningful embeddings. Multi-modal alignment is achieved via diverse cross-modal supervision, including motion-text pairs from HumanML3D and KIT-ML, motion-video pairs rendered from AMASS, and motion-video-audio triplets from AIST++. The second is a Retrieval-Augmented Latent diffusion Model (REALM) that generates motion sequences conditioned on any of these modalities.
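To make the alignment objective concrete, the sketch below pairs motion embeddings with embeddings of a paired modality (text, video, or audio) using a symmetric CLIP-style InfoNCE loss. The infonce_alignment helper, the temperature value, and the assumption of a frozen LanguageBind encoder providing the paired embeddings are illustrative choices, not the paper's exact training recipe.

import torch
import torch.nn.functional as F

def infonce_alignment(motion_emb, other_emb, logit_scale):
    """Symmetric contrastive loss between motion embeddings and the paired
    modality's embeddings (text, video, or audio) in the shared space."""
    motion_emb = F.normalize(motion_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)
    logits = logit_scale.exp() * motion_emb @ other_emb.t()   # (B, B) pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; contrast in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Example: align a batch of 32 motion embeddings with their paired text embeddings.
loss = infonce_alignment(torch.randn(32, 512), torch.randn(32, 512),
                         logit_scale=torch.tensor(1 / 0.07).log())  # CLIP-style temperature (assumed)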

MotionBind achieves state-of-the-art or competitive performance across motion reconstruction, cross-modal retrieval, zero-shot action recognition, and text-to-motion generation benchmarks.

Method Overview

MotionBind consists of two main components: MuTMoT (Multi-Scale Temporal Motion Transformer) for multi-modal representation learning and REALM (Retrieval-Augmented Latent diffusion Model) for motion generation.

MuTMoT Architecture


MuTMoT is a transformer-based hierarchical encoder-decoder architecture that encodes motion sequences into compact embeddings aligned with a shared multi-modal space. It captures motion dynamics at multiple temporal resolutions, producing representations that reflect both fine-grained pose transitions and high-level action semantics. The architecture extends LanguageBind to include human motion, enabling joint cross-modal reasoning and retrieval across text, vision, audio, and motion.
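The sketch below illustrates one way such a multi-scale temporal encoder could be structured in PyTorch. The pose dimensionality, the number of scales, strided temporal downsampling, and mean pooling are illustrative assumptions rather than MuTMoT's actual configuration.

import torch
import torch.nn as nn

class MultiScaleMotionEncoder(nn.Module):
    def __init__(self, pose_dim=263, d_model=512, n_heads=8, n_layers=4,
                 scales=(1, 2, 4), embed_dim=512):
        super().__init__()
        self.proj = nn.Linear(pose_dim, d_model)                  # per-frame pose features
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # One temporal transformer per scale; coarser scales see downsampled sequences.
        self.encoders = nn.ModuleList(
            nn.TransformerEncoder(layer, n_layers) for _ in scales)
        self.scales = scales
        self.head = nn.Linear(d_model * len(scales), embed_dim)   # fuse scales into one embedding

    def forward(self, motion):                                    # motion: (B, T, pose_dim)
        x = self.proj(motion)
        feats = []
        for enc, s in zip(self.encoders, self.scales):
            xs = x[:, ::s]                                        # temporal downsampling by stride s
            feats.append(enc(xs).mean(dim=1))                     # pooled summary at this scale
        return self.head(torch.cat(feats, dim=-1))                # embedding in the shared space

# Example: encode a batch of 2 motion clips of 64 frames (pose_dim=263 is assumed).
emb = MultiScaleMotionEncoder()(torch.randn(2, 64, 263))          # -> (2, 512)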

REALM Architecture


REALM is a retrieval-augmented latent diffusion model that generates motion sequences conditioned on any modality (text, video, or audio). It operates in the compact latent space defined by the MuTMoT encoder and uses temporal conditioning with learnable frame tokens that dynamically attend to conditioning context throughout the denoising process. By retrieving semantically similar motions from a large database, REALM enhances realism and controllability during generation.
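The sketch below illustrates how retrieval-augmented conditioning could enter a latent-space denoising loop. The nearest-neighbour retrieval over embeddings, the denoiser interface, and the simplified update rule are illustrative assumptions rather than REALM's exact sampler.

import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve(query_emb, db_embs, db_latents, k=4):
    """Return the k motion latents whose embeddings are most similar to the query."""
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(db_embs, dim=-1).t()
    idx = sims.topk(k, dim=-1).indices                            # (B, k) nearest neighbours
    return db_latents[idx]                                        # (B, k, latent_dim)

@torch.no_grad()
def sample(denoiser, cond_emb, db_embs, db_latents, latent_dim=256, steps=50):
    """Simplified reverse diffusion in the motion latent space.
    Assumes retrieved latents and condition embeddings share one dimensionality."""
    B = cond_emb.size(0)
    z = torch.randn(B, latent_dim)                                # start from Gaussian noise
    context = torch.cat([cond_emb.unsqueeze(1),
                         retrieve(cond_emb, db_embs, db_latents)], dim=1)
    for t in reversed(range(steps)):
        t_batch = torch.full((B,), t, dtype=torch.long)
        eps = denoiser(z, t_batch, context)                       # predict noise given condition + retrieved motions
        z = z - eps / steps                                       # placeholder update; a real sampler follows the noise schedule
    return z                                                      # decoded into motion by the MuTMoT decoder downstream

# Example with a stand-in denoiser (a real model would cross-attend to `context`).
denoiser = lambda z, t, ctx: z - ctx.mean(dim=1)
motion_latents = sample(denoiser, torch.randn(2, 256),
                        db_embs=torch.randn(100, 256), db_latents=torch.randn(100, 256))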

BibTeX

@inproceedings{kinfu2025motionbind,
  title     = {MotionBind: Multi-Modal Human Motion Alignment for Retrieval, Recognition, and Generation},
  author    = {Kaleab A Kinfu and Rene Vidal},
  booktitle = {The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year      = {2025},
  url       = {https://openreview.net/forum?id=sUjwDdyspc}
}