* Note that to generate all results on this page, only a text prompt and N paired {reference image, audio segment} conditions are required. Lip-sync is natively supported by the DiT layers via audio cross-attention, and no post-processing is needed. In our paper, a concept means an appearance represented by a reference image, which could be a human, animal, background, or object. An identity means a specific person to whom an audio condition is applied; identities are a subset of concepts.
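To make the required inputs concrete, below is a minimal Python sketch of how a text prompt plus N paired {reference image, audio segment} conditions could be organized. The class and field names (`Concept`, `GenerationRequest`, `audio_segments`) are hypothetical illustrations, not the actual API of InterActHuman.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Concept:
    """A concept: an appearance given by one reference image
    (human, animal, background, or object)."""
    reference_image: str                                      # path to the reference image
    audio_segments: List[str] = field(default_factory=list)  # non-empty only for identities

    @property
    def is_identity(self) -> bool:
        # An identity is a concept (a specific person) that carries an audio condition.
        return len(self.audio_segments) > 0

@dataclass
class GenerationRequest:
    """All inputs needed for one video: a text prompt and
    N paired {reference image, audio segment} conditions."""
    text_prompt: str
    concepts: List[Concept]
```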
*Disclaimer: This project is intended for research purposes, and the above demo is solely for demonstrating the model's functionality. If you believe any of the above materials infringe your rights, please contact us for deletion.
*Thank you to the Phantom team for providing the evaluation data and tools.
InterActHuman supports multi-person dialogue video generation, where each person is represented by a reference image (e.g., a head image). Each specific audio segment can be assigned to a specific person according to the user's wishes. It is also possible to assign multiple audio segments to the same person in different frames of a video. For example, given 5 audio segments with different durations and two speakers, the audio assignment could be 1-2-1-2-1, where 1 means speaker 1 and 2 means speaker 2. In some cases, the appearance or audio is cropped or trimmed from publicly available videos generated by Veo3.
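As a hedged illustration of the 1-2-1-2-1 assignment above, the sketch below lays five audio segments back to back along the video timeline and derives the frame ranges during which each speaker's audio condition is active; the frame durations and names are made up for this example and do not come from the paper.

```python
# Illustrative only: 5 audio segments assigned to two speakers in the
# order 1-2-1-2-1; durations (in frames) are invented for this example.
segments = [
    {"speaker": 1, "frames": 48},
    {"speaker": 2, "frames": 36},
    {"speaker": 1, "frames": 60},
    {"speaker": 2, "frames": 24},
    {"speaker": 1, "frames": 40},
]

def frame_ranges_per_speaker(segments):
    """Return {speaker_id: [(start_frame, end_frame), ...]} by laying the
    segments out consecutively along the video timeline."""
    ranges, cursor = {}, 0
    for seg in segments:
        start, end = cursor, cursor + seg["frames"]
        ranges.setdefault(seg["speaker"], []).append((start, end))
        cursor = end
    return ranges

print(frame_ranges_per_speaker(segments))
# {1: [(0, 48), (84, 144), (168, 208)], 2: [(48, 84), (144, 168)]}
```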
InterActHuman supports single-person talking video generation with human-object interaction (HOI), where each person and object is represented by a reference image. It is worth noting that the affordance of the object is implicitly learned from data and controlled by the user's text prompt. In our cases, the audio condition is applied only to the person, yet audio could also be applied to human-like objects such as a cookie or a cup. It is also possible to apply HOI to multi-person talking video generation; we do not show results here due to the complexity of preparing the input conditions.
The input references are one object image and one portrait image.
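For the HOI case, a minimal sketch reusing the hypothetical `Concept`/`GenerationRequest` classes from the earlier example could look like the following: the object concept carries no audio, the person identity carries one audio segment, and the object's affordance is steered only through the text prompt. File names and prompt text are placeholders, not actual assets.

```python
request = GenerationRequest(
    text_prompt="A woman talks to the camera while holding and presenting the bottle.",
    concepts=[
        Concept(reference_image="object.png"),       # object concept: no audio
        Concept(reference_image="portrait.png",
                audio_segments=["speech.wav"]),      # person identity: audio applied
    ],
)
```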
In terms of input-domain diversity, InterActHuman supports cartoons, artificial objects, and animals. The videos provided here are generated without the audio condition and can be regarded as showcases of the multi-concept video generation capability of InterActHuman.