TL;DR: TalkVerse is a large-scale, open corpus for single-person, audio-driven talking video generation (2.3M clips, 6.3k hours), together with a 5B-parameter diffusion transformer (DiT) baseline that sustains minute-long generation with low drift at roughly 10x lower inference cost than larger commercial models.
The corpus is curated from more than 60k hours of general video via a transparent, reproducible pipeline comprising scene-cut detection, human and subtitle screening, aesthetic and quality assessment, audio-visual synchronization, and language annotation. On top of TalkVerse, we train a 5B DiT with a higher VAE downsampling ratio that maintains strong lip-sync and long-horizon video quality, and that generalizes zero-shot to video dubbing via controlled latent noise.
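To make the curation stages concrete, here is a minimal Python sketch of a TalkVerse-style filtering pass. The stage names mirror the steps listed above, but every function body, score, and threshold below is an illustrative placeholder assumption, not the released pipeline.

```python
# Illustrative sketch of a TalkVerse-style curation pass; all stubs and
# thresholds are placeholder assumptions, not the released implementation.
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Clip:
    path: str
    start: float          # clip start time in seconds
    end: float            # clip end time in seconds
    language: str = ""    # filled in by the language-annotation stage


# --- placeholder stage implementations (stubs for illustration only) ---------

def cut_scenes(video_path: str) -> List[Clip]:
    """Scene-cut detection: split a raw video at shot boundaries."""
    return [Clip(path=video_path, start=0.0, end=10.0)]  # stub


def single_speaker_no_subtitles(clip: Clip) -> bool:
    """Human + subtitle screening: exactly one visible speaker, no burned-in text."""
    return True  # stub


def aesthetic_score(clip: Clip) -> float:
    """Aesthetic assessment, e.g. a learned aesthetic predictor."""
    return 5.0  # stub


def quality_score(clip: Clip) -> float:
    """Technical quality assessment (resolution, blur, compression artifacts)."""
    return 0.8  # stub


def av_sync_confidence(clip: Clip) -> float:
    """Audio-visual synchronization confidence, e.g. a SyncNet-style score."""
    return 4.0  # stub


def detect_language(clip: Clip) -> str:
    """Language annotation from the audio track."""
    return "en"  # stub


# --- the curation pass itself -------------------------------------------------

def curate(raw_videos: Iterable[str]) -> List[Clip]:
    """Cut each source video into clips, keep only clips that pass every gate,
    then annotate the spoken language of the survivors."""
    kept: List[Clip] = []
    for path in raw_videos:
        for clip in cut_scenes(path):
            if (
                single_speaker_no_subtitles(clip)
                and aesthetic_score(clip) > 4.5       # threshold is illustrative
                and quality_score(clip) > 0.6         # threshold is illustrative
                and av_sync_confidence(clip) > 3.0    # threshold is illustrative
            ):
                clip.language = detect_language(clip)
                kept.append(clip)
    return kept
```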
We aim to release the curated dataset, training and inference code and recipes, and the 5B checkpoints to establish a common yardstick and lower compute barriers for research in audio-driven human video and related tasks such as pose-driven generation and joint audio-video synthesis.
Single-speaker English talking videos generated by our model, demonstrating minute-scale lip-sync and identity consistency under diverse audio segments.
Single-speaker Chinese talking videos illustrating cross-lingual audio-driven generation with the same unified model. Because the model is still under-trained and Chinese is the second most represented language in our dataset, we choose Chinese over other languages for this audio-driven demonstration. We will continue to improve the model on other languages.
Audio-driven videos conditioned on singing and instrumental audio, illustrating how the model handles long-range rhythm and expressiveness beyond standard speech.
A stylized example showing that TalkVerse can be applied to non-photorealistic, cartoon-like identities while preserving coarse lip-sync and motion patterns.
Longer examples demonstrating that our model can sustain minute-long, audio-driven talking video generation with low drift and stable identity and lip synchronization.
A video dubbing example where an English source clip (the first video) is used as input and our model generates synchronized talking videos in multiple target languages while preserving the original identity and motion. The other three videos are the corresponding dubbed outputs in Chinese, Japanese, and Spanish.
We also provide additional multi-scene dubbing examples generated by the same model. The first video is the original English video; the other three are the corresponding dubbed outputs with different speech content and emotions.
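As a rough illustration of the "controlled latent noise" mechanism mentioned above, the sketch below shows one plausible SDEdit-style realization: the source video's latents are partially re-noised and then denoised under the new target-language audio, which preserves identity and motion while regenerating lip movements. The denoiser signature, the flow-matching Euler sampler, and the noise_strength parameter are assumptions for illustration, not the released implementation.

```python
# Hypothetical sketch of dubbing via controlled latent noise (SDEdit-style);
# not the released method, just one plausible reading of the description.
import torch


@torch.no_grad()
def dub_video(denoiser, source_latents, target_audio_emb, noise_strength=0.6, steps=30):
    """source_latents: VAE latents of the original clip, shape (B, C, T, H, W).
    target_audio_emb: embedding of the new speech track driving lip motion.
    noise_strength in (0, 1]: how far the source structure is re-noised;
    lower values preserve more of the original identity and motion."""
    # Re-noise the source latents to an intermediate time t = noise_strength.
    t_start = noise_strength
    noise = torch.randn_like(source_latents)
    x = (1.0 - t_start) * source_latents + t_start * noise  # rectified-flow interpolation

    # Denoise from t_start back to 0, conditioned on the target-language audio.
    ts = torch.linspace(t_start, 0.0, steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v = denoiser(x, t_cur.expand(x.shape[0]), target_audio_emb)  # predicted velocity
        x = x + (t_next - t_cur) * v  # Euler step toward t = 0
    return x  # decode with the VAE to obtain the dubbed video
```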
Clips from the EMTD test set with varied speakers and audio conditions.