Baton: Explicit Semantic Blueprints for Joint Video-Audio Generation

¹Fudan University   ²Hunyuan Foundation Model Team, Tencent Inc.

Abstract

Current open-source diffusion models struggle to generate stable and synchronized audio-visual content, particularly in scenarios demanding complex semantic reasoning. The root cause is that existing methods rely on coarse text embeddings from off-the-shelf encoders to guide audio-video denoising, which discards fine-grained semantics and, critically, lacks a shared long-horizon plan, leading to uncoordinated denoising trajectories and fragile cross-modal alignment. We propose Baton, the first framework that introduces explicit semantic planning into joint video-audio generation. Our key insight is that complementing coarse text guidance with semantically rich, modality-aware planned tokens, jointly reasoned and mutually aligned before denoising, can simultaneously restore fine-grained semantic detail and establish a shared blueprint that coordinates both audio and video denoising trajectories. Concretely, Baton first introduces the VA-Planner, a multimodal language model equipped with dual semantic alignment towers, where learnable queries cross-attend to both video and audio features to produce a pair of semantically aligned video and audio planned tokens as keyframe-level blueprints. These planned tokens are injected into the diffusion backbone via cross-attention layers, providing temporally grounded guidance complementary to coarse text embeddings. Since planned tokens do not share one-to-one spatial-temporal correspondence with diffusion latents, we further propose Relative Semantic RoPE, a relative positional encoding that maps planned tokens and latents into a shared spatial-temporal coordinate frame, enabling each latent to accurately attend to its positionally corresponding semantic cues. Experiments on benchmarks show the effectiveness of Baton both qualitatively and quantitatively.
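The abstract does not give the equations for Relative Semantic RoPE, so the following is only a minimal sketch of the general idea it describes: placing a short sequence of planned tokens and a longer sequence of diffusion latents on one shared temporal axis, then applying a standard 1D rotary embedding so that attention scores depend on the relative offset between a latent and a planned token. The function names (`shared_positions`, `rope_rotate`, `score`), the even spacing of token centers, and the restriction to a single temporal axis are all assumptions made for illustration, not the authors' formulation.

```python
import math

def shared_positions(num_tokens, span):
    """Place num_tokens evenly on a shared axis of length `span`.

    Assumption for illustration: each token sits at the center of its
    equal-width segment, so sequences of different lengths share one frame.
    """
    step = span / num_tokens
    return [(i + 0.5) * step for i in range(num_tokens)]

def rope_rotate(vec, pos, base=10000.0):
    """Apply a standard 1D rotary embedding to an even-length vector."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def score(q, q_pos, k, k_pos):
    """Dot product of a rotated query (latent) and rotated key (planned token)."""
    rq = rope_rotate(q, q_pos)
    rk = rope_rotate(k, k_pos)
    return sum(a * b for a, b in zip(rq, rk))

# Hypothetical sizes: 16 latent frames and 4 keyframe-level planned tokens.
latent_pos = shared_positions(16, 16.0)
plan_pos = shared_positions(4, 16.0)

q = [1.0, 0.0, 1.0, 0.0]  # toy query for the latent at frame index 5
k = [1.0, 0.0, 1.0, 0.0]  # toy key shared by all planned tokens
scores = [score(q, latent_pos[5], k, p) for p in plan_pos]
```

Because rotary embeddings encode position as a rotation, `score` depends only on the difference `q_pos - k_pos`, which is what lets tokens of different sequence lengths be compared once they are expressed in the same coordinate frame.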

Qualitative Results

Ablation Study

Compared variants (video panels): w/o VA-Planner, w/ PE, w/ Frozen LLM, w/o Learnable Query, w/o Tower, w/o RoPE in Tower, w/ TA-Tok+WavTokenizer, w/ DINOv3, w/ Beats, w/ Unified Tokens, Baton

Video Prompt: A light-skinned Latin American man wearing sunglasses, a black t-shirt, beige shorts, and a large backpack, speaking while cycling. Then, a young woman in a teal-blue camisole is seen riding her bicycle just behind him. As they continue forward, they move steadily along a wide dirt road through a rural area with white-walled houses, trees, and power lines under an overcast sky. A local man stands in the doorway of one of the white-walled houses. Along the way, a local woman passes by on one side, reinforcing the sense of ongoing movement.

Audio Prompt: Set in an outdoor environment, a young man [Speaker A] speaks in a steady, conversational tone: "to buy drinks, tip our tour guide, tip the guy who played the guitar for us. So, now we're we're leaving the fields and we're heading back into town to to get some money"

Compared variants (video panels): w/o RS-RoPE, w/ Temporal RoPE, Baton

Video Prompt: A young Caucasian man stands at an outdoor shooting range, holding a scoped AR-15 rifle, he fires several shots at a nearby pine tree, then reloads.

Audio Prompt: In a quiet, open outdoor environment, a sharp gunshot rings out, followed by a male voice [Speaker A] saying "Ah" in a neutral tone. Immediately after, another gunshot is fired. After a brief pause, a mechanical click is heard, as if a weapon is being reloaded.

BibTeX


@article{tu2026baton,
  title={Baton: Explicit Semantic Blueprints for Joint Video-Audio Generation},
  author={Tu, Shuyuan and Tian, Qi and Yang, Zihan and Wu, Yue and Han, Xintong and Kong, Weijie and Xiong, Jiangfeng and Zhang, Jian-Wei and Zhong, Zhao and Bo, Liefeng and Wu, Zuxuan and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2605.25195},
  year={2026}
}