InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions

Zhenzhi Wang*1, Jiaqi Yang*2, Jianwen Jiang*2, Chao Liang2, Gaojie Lin2, Zerong Zheng2, Ceyuan Yang2, Dahua Lin1
1CUHK MMLab, 2ByteDance
*Equal contribution, Project lead

TL;DR: InterActHuman is a diffusion transformer (DiT) based framework for multi-concept, audio-driven human video generation that overcomes the traditional single-entity limitation by localizing and aligning multi-modal inputs for each distinct subject. Instead of fusing all conditions globally, the method uses an iterative, in-network mask predictor to infer fine-grained spatio-temporal layouts for each identity, enabling precise injection of local cues, such as audio for accurate lip synchronization, into the corresponding regions during video synthesis. Leveraging the multi-step denoising process of the DiT backbone, the framework dynamically refines its masks so that predictions from one step guide local condition injection in the next, ensuring that each audio signal is correctly associated with its individual (represented by a reference image). Extensive evaluations demonstrate state-of-the-art performance in lip-sync accuracy, subject consistency, and overall video quality, while a scalable data pipeline with over 2.6 million annotated video-entity pairs supports training. Our method supports applications spanning audio-driven multi-person video generation and multi-concept video customization such as human-object interaction.

* Note that to generate all results on this page, only text prompts and N paired {reference image, audio segment} inputs are required. Lip-sync is natively supported by the DiT layers via audio cross-attention, and no post-processing is needed. In our paper, a concept is an appearance represented by a reference image, which can be a human, animal, background, or object. An identity is a specific person to whom an audio condition is applied, and is a subset of concept.
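As a rough illustration of the layout-aligned audio injection described above, the sketch below gates each identity's audio cross-attention output with that identity's predicted soft mask inside a DiT block. This is a minimal PyTorch sketch under our own assumptions; the tensor shapes, projection names, and additive gating are illustrative and do not reflect the released implementation.

```python
import torch
import torch.nn.functional as F

def masked_audio_cross_attention(video_tokens, audio_tokens, masks,
                                 to_q, to_k, to_v, to_out):
    """Inject each identity's audio features only into that identity's
    predicted spatio-temporal region (hypothetical sketch).

    video_tokens: (B, L, D)     flattened video latent tokens
    audio_tokens: (B, N, T, D)  audio features for N identities
    masks:        (B, N, L)     soft layout masks predicted in-network,
                                one per identity, values in [0, 1]
    to_q/to_k/to_v/to_out:      linear projections of the DiT block
    """
    out = torch.zeros_like(video_tokens)
    q = to_q(video_tokens)                              # (B, L, D)
    for i in range(audio_tokens.shape[1]):
        k = to_k(audio_tokens[:, i])                    # (B, T, D)
        v = to_v(audio_tokens[:, i])                    # (B, T, D)
        attn = F.scaled_dot_product_attention(q, k, v)  # (B, L, D)
        # Gate the audio contribution by identity i's layout mask, so the
        # audio cue only affects tokens inside that identity's region.
        out = out + masks[:, i].unsqueeze(-1) * attn
    return video_tokens + to_out(out)
```

In a multi-step denoising loop, the masks used here would come from the previous step's in-network prediction, so layout estimation and local injection refine each other.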

*Disclaimer: This project is intended for research purposes only, and the demos above are solely for demonstrating the model's functionality. If you believe any of the above materials infringe your rights, please contact us and we will remove them.

*Thank you to the Phantom team for providing the evaluation data and tools.

Dialogue Videos

InterActHuman supports multi-person dialogue video generation, where each person is represented by a reference image (e.g., a head image). Each audio segment can be assigned to a specific person as the user wishes, and multiple audio segments can be assigned to the same person across different frames of a video. For example, given 5 audio segments of different durations and two speakers, the assignment could be 1-2-1-2-1, where 1 denotes speaker 1 and 2 denotes speaker 2 (see the sketch below). For some cases, the reference appearance or audio is cropped or trimmed from publicly available videos generated by Veo3.
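To make the 1-2-1-2-1 assignment concrete, the snippet below sketches one way a user could lay out audio segments on a frame timeline per speaker. The helper name, tuple format, and frame rate are hypothetical conveniences, not part of the released interface.

```python
# Hypothetical assignment: 5 audio segments, 2 speakers, order 1-2-1-2-1.
audio_segments = ["seg_0.wav", "seg_1.wav", "seg_2.wav", "seg_3.wav", "seg_4.wav"]
speaker_order = [1, 2, 1, 2, 1]   # which reference identity speaks each segment

def build_audio_timeline(audio_segments, speaker_order, durations_s, fps=25):
    """Map each audio segment onto consecutive video frames and record which
    speaker it drives. Returns (speaker_id, start_frame, end_frame, path) tuples
    that could then be turned into per-identity audio conditions."""
    timeline, frame = [], 0
    for path, speaker, dur in zip(audio_segments, speaker_order, durations_s):
        n_frames = round(dur * fps)
        timeline.append((speaker, frame, frame + n_frames, path))
        frame += n_frames
    return timeline

timeline = build_audio_timeline(audio_segments, speaker_order,
                                durations_s=[2.0, 1.5, 2.5, 1.0, 2.0])
```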

Human-Object Interaction Videos with Audio

InterActHuman supports single-person talking video generation with human-object interaction, where each person and object is represented by a reference image. Notably, the affordance of the object is implicitly learned from data and controlled by the user's text prompt. In our cases the audio condition is applied only to the person, yet audio could also be applied to human-like objects such as a cookie or a cup. It is also possible to apply HOI to multi-person talking video generation; we do not show such results here because preparing the input conditions is more involved.

The input references are one object image and one portrait image.
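For concreteness, the conditions for a single-person HOI case like this might be organized as follows; the field names and file names are illustrative assumptions rather than the actual input format.

```python
# Hypothetical input specification for a single-person HOI case:
# one object reference, one portrait reference, one audio track, one prompt.
hoi_case = {
    "text_prompt": "A person picks up the guitar and plays it while singing.",
    "concepts": [
        {"type": "object", "reference_image": "guitar.png"},           # no audio
        {"type": "identity", "reference_image": "portrait.png",
         "audio": "vocals.wav"},                                        # lip-synced
    ],
}
```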

Domain Diversity

In terms of input domain diversity, InterActHuman supports cartoons, artificial objects, and animals. The videos shown here are generated without an audio condition and can be regarded as showcases of the multi-concept video generation capability of InterActHuman.