TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation

Zhenzhi Wang¹, Jian Wang², Ke Ma², Dahua Lin¹, Bing Zhou²
¹CUHK MMLab, ²Snap Research

TL;DR: TalkVerse is a large-scale, open corpus for single-person, audio-driven talking video generation (2.3M clips, 6.3k hours), together with a 5B-parameter diffusion transformer (DiT) baseline that sustains minute-long generation with low drift at roughly 10x lower inference cost than larger commercial models.

TalkVerse teaser examples
TalkVerse data curation pipeline

The corpus is curated from more than 60k hours of general video via a transparent, reproducible pipeline comprising scene-cut detection, human and subtitle screening, aesthetic and quality assessment, audio-visual synchronization, and language annotation. On top of TalkVerse, we train a 5B DiT model with a higher VAE downsampling ratio that maintains strong lip-sync and long-horizon video quality and generalizes zero-shot to video dubbing via controlled latent noise.
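To make the curation stages above concrete, below is a minimal Python sketch of the screening and thresholding logic, assuming scene-cut detection and the individual detectors (person counting, subtitle detection, aesthetic scoring, AV-sync scoring, language identification) have already produced per-clip metadata. The field names and thresholds are illustrative placeholders, not the settings used to build TalkVerse.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ClipMeta:
    """Per-clip metadata produced by upstream detectors (fields are illustrative)."""
    path: str
    num_people: int          # from human screening
    has_subtitles: bool      # from subtitle screening
    aesthetic_score: float   # from an aesthetic predictor
    sync_confidence: float   # e.g. a SyncNet-style audio-visual sync score
    language: str            # from language annotation

def filter_clips(clips: List[ClipMeta],
                 min_aesthetic: float = 0.5,
                 min_sync: float = 0.6) -> List[ClipMeta]:
    """Keep single-person, subtitle-free clips that pass quality and sync thresholds.
    Thresholds are placeholders, not the values used to build TalkVerse."""
    kept = []
    for clip in clips:
        if clip.num_people != 1:                 # single-person constraint
            continue
        if clip.has_subtitles:                   # drop clips with burned-in subtitles
            continue
        if clip.aesthetic_score < min_aesthetic: # aesthetic / quality screening
            continue
        if clip.sync_confidence < min_sync:      # audio-visual synchronization screening
            continue
        kept.append(clip)
    return kept
```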

We aim to release the curated dataset, training and inference code and recipes, and the 5B checkpoints to establish a common yardstick and lower compute barriers for research in audio-driven human video and related tasks such as pose-driven generation and joint audio-video synthesis.

Dataset Sample Visualizations

In these examples, clips longer than the context frames plus the denoised video frames are used to train the FramePack module in our model; clips shorter than that but at least as long as the denoised video frames are used to train the model; clips shorter than the denoised video frames are discarded. Orange boxes highlight the context frames and green boxes highlight the denoised video frames. Within the denoised frames, we also visualize the body and face bounding boxes, where we apply higher loss weights during training. The visualized videos cover English, Chinese, Spanish, and other languages.
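The length rule and region-weighted loss described above can be summarized in a short sketch. The frame counts, box format, and weight values below are placeholders for illustration, not the actual TalkVerse training configuration.

```python
import numpy as np

def route_clip(num_frames: int, context_frames: int, denoised_frames: int) -> str:
    """Decide how a clip is used, following the length rule described above."""
    if num_frames >= context_frames + denoised_frames:
        return "framepack"   # fills context + denoised window: trains the FramePack module
    if num_frames >= denoised_frames:
        return "base"        # fills only the denoised window: trains the model
    return "discard"         # too short to fill the denoised window

def loss_weight_map(height, width, body_box, face_box,
                    body_weight=2.0, face_weight=4.0):
    """Per-pixel loss weights up-weighting the body and face boxes (weights are placeholders)."""
    weights = np.ones((height, width), dtype=np.float32)
    x0, y0, x1, y1 = body_box            # boxes given as (x0, y0, x1, y1) in pixels
    weights[y0:y1, x0:x1] = body_weight
    x0, y0, x1, y1 = face_box            # face box lies inside the body box, so it overrides it
    weights[y0:y1, x0:x1] = face_weight
    return weights
```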

Model Outputs

Model Architecture

TalkVerse model architecture

English Speech

Single-speaker English talking videos generated from our model, demonstrating minute-scale lip-sync and identity consistency under diverse audio segments.

Chinese Speech

Single-speaker Chinese talking videos illustrating cross-lingual audio-driven generation with the same unified model. Because the model is still under-trained and Chinese has the second-largest share of data in our dataset, we choose Chinese among the non-English languages for this demonstration. We will continue to improve the model on other languages in the future.

Singing and Musical Performances

Audio-driven videos conditioned on singing and instrumental audio, illustrating how the model handles long-range rhythm and expressiveness beyond standard speech.

Cartoon / Anime Domain

A stylized example showing that TalkVerse can be applied to non-photorealistic, cartoon-like identities while preserving coarse lip-sync and motion patterns.

Long-Form Videos

Longer examples demonstrating that our model can sustain minute-long, audio-driven talking video generation with low drift and stable identity and lip synchronization.
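One plausible way to realize such minute-long generation, consistent with the context/denoised-frame split shown in the dataset section, is a rolling-context loop: each chunk is denoised conditioned on its audio window and the last few frames of the previous chunk. The sketch below is an assumption-laden illustration; `denoise_chunk`, the chunk sizes, and the frame-aligned audio features are hypothetical, not the released API.

```python
import torch

def generate_long_video(model, reference_image, audio_features,
                        context_frames=9, denoised_frames=24):
    """Rolling-context generation sketch: each chunk of `denoised_frames` frames is
    denoised conditioned on its audio window and the previous chunk's tail frames.
    `model.denoise_chunk` is a hypothetical API, not the released interface."""
    # Bootstrap the context by repeating the reference image (C, H, W).
    context = reference_image.unsqueeze(0).repeat(context_frames, 1, 1, 1)
    frames = []
    num_chunks = audio_features.shape[0] // denoised_frames  # audio features assumed per-frame
    for i in range(num_chunks):
        audio_window = audio_features[i * denoised_frames:(i + 1) * denoised_frames]
        chunk = model.denoise_chunk(context=context, audio=audio_window)  # (denoised_frames, C, H, W)
        frames.append(chunk)
        context = chunk[-context_frames:]  # slide the context window forward to limit drift
    return torch.cat(frames, dim=0)
```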

Video Dubbing

A video dubbing example where an English source clip (the first video) is used as input and our model generates synchronized talking videos in multiple target languages while preserving the original identity and motion. The other three videos are the corresponding dubbed outputs in Chinese, Japanese, and Spanish.

We also provide additional multi-scene dubbing examples generated by the same model. The first video is the original English clip; the other three are dubbed outputs with different speech content and emotions.
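The zero-shot dubbing described in the overview is achieved via controlled latent noise. The sketch below shows one plausible reading of that idea, under stated assumptions: the source clip is encoded, partially re-noised, and then denoised conditioned on the target-language audio, so coarse motion and identity carry over while lip motion follows the new speech. `add_noise`, `denoise`, `num_timesteps`, and the noise strength are hypothetical names and values, not the model's actual interface.

```python
import torch

def dub_video(model, vae, source_video, target_audio_features, noise_strength=0.6):
    """Zero-shot dubbing sketch (assumed mechanism): partially re-noise the source
    latents, then denoise them conditioned on the target-language audio.
    `add_noise`, `denoise`, and `num_timesteps` are hypothetical attributes."""
    latents = vae.encode(source_video)                # encode the source clip to latents
    t = int(noise_strength * model.num_timesteps)     # how much structure to discard
    noisy = model.add_noise(latents,                  # keep coarse motion and identity,
                            noise=torch.randn_like(latents),
                            timestep=t)               # while freeing the lips to change
    dubbed = model.denoise(noisy, start_timestep=t,
                           audio=target_audio_features)  # condition on the new speech
    return vae.decode(dubbed)
```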

EMTD Test Set

Clips from EMTD Test Set with varied speakers and audio conditions.