We present StyleMotif, a novel Stylized Motion Latent Diffusion model that generates motion conditioned on both content and style from multiple modalities. Unlike existing approaches that either focus on generating diverse motion content or transferring style from sequences, StyleMotif seamlessly synthesizes motion across a wide range of content while incorporating stylistic cues from multi-modal inputs, including motion, text, image, video, and audio. To achieve this, we introduce a style-content cross fusion mechanism and align a style encoder with a pre-trained multi-modal model, ensuring that the generated motion accurately captures the reference style while preserving realism. Extensive experiments demonstrate that our framework surpasses existing methods in stylized motion generation and exhibits emergent capabilities for multi-modal motion stylization, enabling more nuanced motion synthesis. Source code and pre-trained models will be released upon acceptance.
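To make the idea of style-content cross fusion concrete, the sketch below shows one way a style embedding could be injected into the content stream of a latent-diffusion denoiser via cross-attention. The module name, dimensions, and residual placement are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a style-content cross-fusion block (illustrative only;
# layer choices and dimensions are assumptions, not the paper's architecture).
import torch
import torch.nn as nn


class StyleContentCrossFusion(nn.Module):
    """Fuses a style embedding into content features via cross-attention."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, content_tokens: torch.Tensor, style_emb: torch.Tensor) -> torch.Tensor:
        # content_tokens: (B, T, dim) latent motion features inside the denoiser
        # style_emb:      (B, dim)    style embedding from the aligned style encoder
        style_kv = style_emb.unsqueeze(1)  # (B, 1, dim) used as key/value
        fused, _ = self.cross_attn(self.norm(content_tokens), style_kv, style_kv)
        return content_tokens + fused      # residual fusion keeps content intact
```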
Thanks to the aligned multi-modal space, our model supports stylization guided by a variety of modalities, including motion, text, image, video, and audio. It generates stylized motion that incorporates the style of the input modality while preserving the content specified by the text prompt.
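As a rough illustration of this pipeline, the sketch below maps a style input of any supported modality into the shared embedding space before conditioning generation. Here `multimodal_encoder`, `encode_style`, and `model.sample` are hypothetical placeholders for the pre-trained multi-modal encoder and the sampling loop, not the paper's actual interfaces.

```python
# Hypothetical sketch of multi-modal style conditioning: any supported modality
# is embedded into the shared space before guiding the latent diffusion model.
import torch
import torch.nn.functional as F


def encode_style(style_input, modality: str, multimodal_encoder) -> torch.Tensor:
    """Return a unit-norm style embedding in the aligned multi-modal space."""
    assert modality in {"motion", "text", "image", "video", "audio"}
    emb = multimodal_encoder(style_input, modality=modality)  # (B, dim), assumed API
    return F.normalize(emb, dim=-1)


def stylize(model, content_text: str, style_input, modality: str, encoder):
    """Generate motion for `content_text` in the style of `style_input`."""
    style_emb = encode_style(style_input, modality, encoder)
    # `model.sample` is a placeholder for the latent-diffusion sampling procedure.
    return model.sample(content=content_text, style=style_emb)
```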
We present qualitative results of motion-guided stylization from our model and the baseline, SMooDi. Our model produces more cohesive stylized motions, with better alignment between style and content.
We showcase qualitative results for text-guided stylization, where our model likewise demonstrates a strong ability to harmonize style and content, generating high-quality, visually coherent results.
Leveraging the aligned multi-modal space, our model enables text-guided style interpolation. Given one content text and at least two style texts, our model generates a motion that combines the characteristics of all input styles.
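A minimal sketch of how such interpolation can be realized in the aligned space is shown below, assuming the style texts are first embedded with a shared encoder and then blended by a weighted average before conditioning generation; all function and argument names are illustrative assumptions.

```python
# Illustrative sketch of text-guided style interpolation: style-text embeddings
# from the aligned space are blended with user-chosen weights.
import torch
import torch.nn.functional as F


def interpolate_styles(style_embs: torch.Tensor, weights=None) -> torch.Tensor:
    """Blend N style embeddings of shape (N, dim) into one conditioning vector."""
    n = style_embs.shape[0]
    if weights is None:
        weights = torch.full((n,), 1.0 / n, dtype=style_embs.dtype)
    else:
        weights = torch.as_tensor(weights, dtype=style_embs.dtype)
    blended = (weights.unsqueeze(-1) * style_embs).sum(dim=0)
    return F.normalize(blended, dim=-1)  # keep the blend on the unit sphere

# Example usage (hypothetical names, reusing the `encode_style` sketch above):
# style_embs = torch.stack([encode_style(t, "text", enc) for t in ["old", "proud"]])
# motion = model.sample(content="a person walks", style=interpolate_styles(style_embs))
```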