StableAnimator++: Overcoming Pose Misalignment and Face Distortion for Human Image Animation

1Fudan University, 2Microsoft Research Asia, 3Hunyuan, Tencent Inc, 4University of Washington


All animations are synthesized directly by StableAnimator++ without any face-related post-processing tools, such as the face-swapping tool FaceFusion or face restoration models like GFP-GAN and CodeFormer. Note that the presented skeleton poses are significantly misaligned with the reference images in terms of body size and position.

We present StableAnimator++, the first ID-preserving video diffusion framework with learnable pose alignment, capable of generating high-quality videos conditioned on a reference image and a pose sequence without any post-processing.

Human Image Animation Results


Abstract

Current diffusion models for human image animation often struggle to maintain identity (ID) consistency, especially when the reference image and the driving video differ significantly in body size or position. We introduce StableAnimator++, the first ID-preserving video diffusion framework with learnable pose alignment, capable of generating high-quality videos conditioned on a reference image and a pose sequence without any post-processing. Building on a video diffusion model, StableAnimator++ contains carefully designed modules for both training and inference that strive for identity consistency. In particular, StableAnimator++ first uses learnable layers to predict the similarity transformation matrices between the reference image and the driving poses by injecting guidance from Singular Value Decomposition (SVD). These matrices align the driving poses with the reference image, largely mitigating the misalignment. StableAnimator++ then computes image and face embeddings with off-the-shelf encoders, refining the face embeddings via a global content-aware Face Encoder. To further preserve ID, we introduce a distribution-aware ID Adapter that counteracts interference from the temporal layers while maintaining ID through distribution alignment. During inference, we propose a novel Hamilton-Jacobi-Bellman (HJB) based face optimization integrated into the denoising process, guiding the diffusion trajectory toward higher facial fidelity. Experiments on benchmarks demonstrate the effectiveness of StableAnimator++ both qualitatively and quantitatively.
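To make the pose-alignment idea concrete, the sketch below shows a classical SVD-based (Umeyama/Procrustes) similarity transform that rescales and repositions driving-pose keypoints to match the reference skeleton. StableAnimator++ predicts such transformation matrices with learnable layers under SVD guidance; the closed-form version here is only an illustration of what the transform does, and all function and variable names are ours rather than from the released code.

```python
import numpy as np

def similarity_transform(src_pts: np.ndarray, dst_pts: np.ndarray):
    """Estimate scale s, rotation R, translation t such that s * R @ src + t ≈ dst.

    src_pts, dst_pts: (N, 2) arrays of corresponding 2D keypoints,
    e.g. driving-pose joints and reference-image joints.
    """
    mu_src, mu_dst = src_pts.mean(axis=0), dst_pts.mean(axis=0)
    src_c, dst_c = src_pts - mu_src, dst_pts - mu_dst

    # Cross-covariance between the two point sets; its SVD yields the rotation.
    cov = dst_c.T @ src_c / len(src_pts)
    U, S, Vt = np.linalg.svd(cov)
    d = 1.0 if np.linalg.det(U @ Vt) >= 0 else -1.0   # guard against reflections
    D = np.diag([1.0, d])
    R = U @ D @ Vt

    scale = np.trace(np.diag(S) @ D) / src_c.var(axis=0).sum()
    t = mu_dst - scale * R @ mu_src
    return scale, R, t

def align_pose_sequence(pose_seq, ref_pose):
    """Align every frame's skeleton to the reference before motion modeling."""
    aligned = []
    for pose in pose_seq:                              # pose: (N, 2) keypoints
        s, R, t = similarity_transform(pose, ref_pose)
        aligned.append(pose @ (s * R).T + t)
    return aligned
```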

Demo Presentation

Framework


Architecture of StableAnimator++. (a) and (b) show the structures of the Face Encoder and of each U-Net block, respectively. We first apply our learnable alignment to the driving pose sequence and feed the aligned results into the PoseNet for motion modeling. Given the reference image, we extract image embeddings and face embeddings using the Image Encoder and ArcFace, and the face embeddings are then refined by the Face Encoder to enhance ID. Finally, the image embeddings and refined face embeddings are injected into each U-Net block through the ID Adapter to ensure ID consistency.
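The snippet below sketches how a distribution-aware ID Adapter of the kind described above could fuse image and face embeddings into a U-Net block: a face cross-attention runs in parallel with the image cross-attention, and the face features are shifted and rescaled so their statistics match the image features before fusion. This is a minimal sketch under our own assumptions (module names, residual fusion by addition, per-token mean/std matching); it is not the released StableAnimator++ implementation.

```python
import torch

def distribution_align(face_feats, img_feats, eps=1e-6):
    """Shift/scale face features so their per-token mean/std match the image features."""
    f_mean, f_std = face_feats.mean(-1, keepdim=True), face_feats.std(-1, keepdim=True)
    i_mean, i_std = img_feats.mean(-1, keepdim=True), img_feats.std(-1, keepdim=True)
    return (face_feats - f_mean) / (f_std + eps) * i_std + i_mean

class IDAdapter(torch.nn.Module):
    """Toy ID Adapter: parallel image/face cross-attention with distribution alignment."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.img_attn = torch.nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.face_attn = torch.nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, hidden, img_emb, face_emb):
        u_img, _ = self.img_attn(hidden, img_emb, img_emb)      # image cross-attention
        u_face, _ = self.face_attn(hidden, face_emb, face_emb)  # face cross-attention
        u_face = distribution_align(u_face, u_img)              # align feature statistics
        return hidden + u_img + u_face                          # residual fusion into the block

# Example with random tensors: 64 spatial tokens of width 320, 1 image token, 4 face tokens.
adapter = IDAdapter(dim=320)
out = adapter(torch.randn(2, 64, 320), torch.randn(2, 1, 320), torch.randn(2, 4, 320))
```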

If you have any suggestions or find our work helpful, feel free to contact me. Email: francisshuyuan@gmail.com