AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars

Abstract

Whole-body audio-driven avatar generation is a crucial task for creating lifelike digital humans and enhancing interactive virtual agents, with applications in virtual reality, digital entertainment, and remote communication. Current approaches typically generate audio-driven facial expressions and gestures separately, which leads to a critical limitation: the lack of seamless coordination between facial and gestural elements, resulting in less natural animations. To address this, we propose AsynFusion, a novel framework that leverages asynchronous latent consistency models to harmonize expression and gesture synthesis. Our method is built upon a dual-branch DiT architecture that enables parallel generation of expressions and gestures, with cross-attention mechanisms facilitating bidirectional feature interaction. Additionally, we introduce an asynchronous sampling strategy based on Latent Consistency Models (LCM) to significantly reduce computational overhead while maintaining high-quality outputs. Extensive experiments demonstrate that AsynFusion achieves state-of-the-art performance in generating real-time, synchronized whole-body animations, outperforming existing methods in both quantitative metrics and qualitative evaluations.
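
To make the dual-branch idea concrete, the minimal PyTorch sketch below shows one way an expression stream and a gesture stream could exchange features through bidirectional cross-attention. It is an illustrative assumption, not the authors' implementation: the class name DualBranchBlock, the layer sizes, and the normalization layout are invented for this example, and the LCM-based asynchronous sampler and audio conditioning are omitted.

```python
# Minimal sketch (not the paper's code): a dual-branch block with
# bidirectional cross-attention between expression and gesture latents.
import torch
import torch.nn as nn

class DualBranchBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Self-attention within each branch
        self.expr_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gest_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Bidirectional cross-attention: each branch queries the other
        self.expr_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gest_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.expr_ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.gest_ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(6)])

    def forward(self, expr, gest):
        # expr, gest: (batch, frames, dim) latent sequences for the two branches
        e = self.norms[0](expr); expr = expr + self.expr_self(e, e, e)[0]
        g = self.norms[1](gest); gest = gest + self.gest_self(g, g, g)[0]
        # Expression queries gesture features and vice versa
        e, g = self.norms[2](expr), self.norms[3](gest)
        expr = expr + self.expr_cross(e, g, g)[0]
        gest = gest + self.gest_cross(g, e, e)[0]
        expr = expr + self.expr_ff(self.norms[4](expr))
        gest = gest + self.gest_ff(self.norms[5](gest))
        return expr, gest

if __name__ == "__main__":
    block = DualBranchBlock()
    expr = torch.randn(2, 64, 256)  # hypothetical expression latents
    gest = torch.randn(2, 64, 256)  # hypothetical gesture latents
    expr_out, gest_out = block(expr, gest)
    print(expr_out.shape, gest_out.shape)  # both (2, 64, 256)
```

In the paper's full framework, blocks like this would be stacked inside the dual-branch DiT and conditioned on audio features, with the asynchronous LCM sampler denoising the two latent streams in a small number of steps.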

Video Comparison

Side-by-side comparisons of EMAGE, DiffSHEG, and AsynFusion across three video sequences.

BibTeX

BibTeX code here