StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation

Fudan University¹, Microsoft Research Asia², Xi'an Jiaotong University³, Hunyuan, Tencent Inc.⁴

Audio-driven Avatar Video Results (Please turn on the sound)

All animations are synthesized directly by StableAvatar without any post-processing tools, such as the face-swapping tool FaceFusion or face restoration models like GFP-GAN and CodeFormer. We present StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length, high-quality, audio-driven avatar videos without any post-processing, conditioned on a reference image and audio.

Additional Audio-driven Avatar Video Results (Please turn on the sound)

Comparisons with SOTA methods (Please turn on the sound)

Video Presentation

Abstract

Current diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length, high-quality videos without post-processing. Conditioned on a reference image and audio, StableAvatar integrates tailored training and inference modules to enable infinite-length video generation. We observe that the main reason existing models fail to generate long videos lies in their audio modeling. They typically rely on third-party off-the-shelf extractors to obtain audio embeddings, which are then injected directly into the diffusion model via cross-attention. Since current diffusion backbones lack any audio-related priors, this approach causes severe latent-distribution error accumulation across video clips, so the latent distribution of subsequent segments gradually drifts away from the optimal distribution. To address this, StableAvatar introduces a novel Time-step-aware Audio Adapter that prevents error accumulation via time-step-aware modulation. During inference, we propose a novel Audio Native Guidance Mechanism that further enhances audio synchronization by leveraging the diffusion model's own evolving joint audio-latent prediction as a dynamic guidance signal. To enhance the smoothness of infinite-length videos, we introduce a Dynamic Weighted Sliding-window Strategy that fuses latents over time. Experiments on benchmarks demonstrate the effectiveness of StableAvatar both qualitatively and quantitatively.
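To illustrate the sliding-window idea mentioned above, below is a minimal sketch of fusing overlapping latent windows with position-dependent weights. This is not the released StableAvatar implementation; the tensor shapes, window and overlap sizes, and the linear cross-fade weighting are assumptions made purely for illustration.

```python
import torch

def fuse_sliding_windows(window_latents, window_len, overlap):
    """Blend temporally overlapping latent windows into one long latent sequence.

    window_latents: list of tensors of shape (C, window_len, H, W) produced for
    consecutive windows that overlap by `overlap` frames. The linear ramp weights
    below are an illustrative choice, not the paper's exact weighting.
    """
    stride = window_len - overlap
    total = stride * (len(window_latents) - 1) + window_len
    c, _, h, w = window_latents[0].shape
    fused = torch.zeros(c, total, h, w)
    weight_sum = torch.zeros(1, total, 1, 1)

    for i, lat in enumerate(window_latents):
        # Full weight inside the window; ramp on edges shared with a neighbor.
        wgt = torch.ones(window_len)
        if i > 0:
            wgt[:overlap] = torch.linspace(0.0, 1.0, overlap)
        if i < len(window_latents) - 1:
            wgt[-overlap:] = torch.linspace(1.0, 0.0, overlap)
        wgt = wgt.view(1, window_len, 1, 1)

        start = i * stride
        fused[:, start:start + window_len] += lat * wgt
        weight_sum[:, start:start + window_len] += wgt

    # Normalize so overlapping regions are a weighted average of both windows.
    return fused / weight_sum.clamp_min(1e-6)
```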


Architecture of StableAvatar. (a) shows the structure of the Audio Adapter. Embeddings from the Image Encoder and Text Encoder are injected into each block of the DiT. Given the audio, we extract audio embeddings using Wav2Vec. To model joint audio-latent representations, the audio embeddings are fed into the Audio Adapter, and its outputs are injected into the DiT via cross-attention.
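As a rough sketch of the audio path described in the caption, the snippet below extracts Wav2Vec2 features with Hugging Face transformers and injects them into a DiT-style block through cross-attention. The adapter internals, dimensions, and checkpoint name are simplified placeholders, not the actual StableAvatar code.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

# --- Audio embedding extraction (assumes 16 kHz mono waveform) ---
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

waveform = torch.randn(16000 * 5)  # placeholder: 5 seconds of audio
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    audio_emb = wav2vec(inputs.input_values).last_hidden_state  # (1, T_audio, 768)

# --- Cross-attention injection into a DiT block (toy stand-in for the Audio Adapter path) ---
class AudioCrossAttention(nn.Module):
    """Video latent tokens attend to audio tokens and add the result residually."""
    def __init__(self, latent_dim=1152, audio_dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, heads, kdim=audio_dim,
                                          vdim=audio_dim, batch_first=True)
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, latent_tokens, audio_tokens):
        out, _ = self.attn(self.norm(latent_tokens), audio_tokens, audio_tokens)
        return latent_tokens + out  # residual injection into the DiT block

latent_tokens = torch.randn(1, 4096, 1152)  # placeholder video latent tokens
block = AudioCrossAttention()
fused_tokens = block(latent_tokens, audio_emb)
```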

If you have any suggestions or find our work helpful, feel free to contact me. Email: francisshuyuan@gmail.com