MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

By Karan Kumawat (220150005)
May 7, 2025

Motivation

Ever faced the problem of having a video but no audio? Generating realistic, synchronized audio from video is a challenging task that has significant implications in various fields, including film production, gaming, and virtual reality. The ability to synthesize high-quality audio that matches the visual content can enhance user experience and immersion. Existing methods either rely on limited video-audio datasets or bolt "control modules" onto text-to-audio models, which can lead to suboptimal results. This is where the concept of multimodal joint training comes into play.

Compare the two versions below:

Video without Sound

The video without audio: notice what's missing

Video with Sound

The same video with sound: the full audiovisual experience

The Innovation: MMAudio proposes a novel approach to Foley, i.e., video-to-audio synthesis, built on multimodal joint training. By integrating video, text, and audio data into a single training framework, the model learns to generate high-quality audio that is not only synchronized with the visual content but also semantically relevant. This approach allows for better generalization across domains and improves the overall quality of the synthesized audio.

MMAudio Overview Diagram

MMAudio performs multimodal joint training with high-quality, abundant audio-text data, which enables effective data scaling. At inference, MMAudio generates audio aligned with the given video and/or text conditions.

Relevance Today

The Evolution of Multimodal Learning for Audio-Visual Synthesis

Multimodal learning has evolved significantly over the past few years, driven by advancements in deep learning and the availability of large-scale datasets. Early approaches focused on simple feature concatenation or shallow fusion techniques, which often failed to capture the complex relationships between different modalities.

Semantic Alignment has been a foundational concept in multimodal learning, focusing on establishing meaningful connections between different modalities at a conceptual level.

Temporal Alignment represents another critical dimension in multimodal learning, particularly for time-based media like video and audio.

Multimodal Conditioning approaches have transformed how models integrate information from multiple sources.

This historical trajectory shows a clear progression from isolated single-modality approaches toward more integrated systems that can understand and generate content across the full spectrum of human sensory experience, with each advancement building upon previous innovations to create increasingly sophisticated and capable models.

Key Learnings

Through my exploration of multimodal learning for video-audio synthesis, I've gained several important insights:

MMAudio Architecture Diagram

The architecture of MMAudio. It consists of a series of multimodal transformer blocks with visual/text/audio branches, followed by a series of audio-only transformer blocks. For synchrony, the authors devised a conditional synchronization module that extracts high-frame-rate synchronization features and integrates them into the generation process for temporal alignment.
Note: The figure is a conceptual representation and may not reflect the exact architecture used in the paper.
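To make the hybrid stack concrete, below is a minimal, self-contained sketch of the block layout under simplifying assumptions: the separate per-modality branches are collapsed into plain self-attention over concatenated tokens, and standard PyTorch encoder layers stand in for the paper's blocks. The class name MMAudioBackboneSketch and all sizes are hypothetical, not the released implementation.

import torch
import torch.nn as nn

class MMAudioBackboneSketch(nn.Module):
    """Hypothetical sketch: N1 joint multimodal blocks followed by N2
    audio-only blocks (N1=4, N2=8 is the paper's default ratio)."""
    def __init__(self, dim=1024, heads=16, n_multimodal=4, n_audio_only=8):
        super().__init__()
        # Joint blocks: audio, visual, and text tokens attend to each other
        # (the paper keeps per-modality branches; concatenation is a simplification)
        self.mm_blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True)
             for _ in range(n_multimodal)]
        )
        # Audio-only blocks: refine the audio latent stream after fusion
        self.audio_blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True)
             for _ in range(n_audio_only)]
        )

    def forward(self, audio_tokens, video_tokens, text_tokens):
        n_audio = audio_tokens.size(1)
        x = torch.cat([audio_tokens, video_tokens, text_tokens], dim=1)
        for block in self.mm_blocks:
            x = block(x)
        audio = x[:, :n_audio]          # keep only the audio tokens
        for block in self.audio_blocks:
            audio = block(audio)
        return audio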

Code and Demonstrations

Feature Extraction Pipeline

# Simplified audio/text feature extraction (illustrative pseudocode;
# load_clip_model, load_vae_model, and MelConverter are placeholders)
class AudioFeatureExtractor:
    def __init__(self):
        self.clip_model = load_clip_model()   # text encoder (CLIP-style)
        self.vae = load_vae_model()           # pretrained audio VAE
        self.mel_converter = MelConverter()   # waveform -> mel spectrogram

    def process(self, waveform, caption=None):
        mel = self.mel_converter(waveform)    # (batch, n_mels, frames)
        latents = self.vae.encode(mel)        # compressed audio latents
        # Text features come from the caption, not from the audio latents
        text_features = None
        if caption is not None:
            text_features = self.clip_model.encode_text(caption)
        return latents, text_features

# Simplified video feature extraction
class VideoFeatureExtractor:
    def __init__(self):
        self.synchformer = load_synchronization_model()  # high-frame-rate sync cues
        self.clip_encoder = load_video_encoder()          # semantic (CLIP) features

    def process(self, video_frames):
        clip_features = self.clip_encoder(video_frames)   # per-frame semantics
        sync_features = self.synchformer(video_frames)    # fine-grained timing cues
        return clip_features, sync_features

Multimodal Fusion Architecture

import torch
import torch.nn as nn

class MultimodalTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=1024, num_heads=16, batch_first=True
        )
        self.temporal_conv = nn.Conv1d(1024, 1024, kernel_size=3, padding=1)
        # The paper uses adaptive layer norm (adaLN) conditioned on a global
        # embedding; a plain LayerNorm stands in for it in this sketch
        self.adaLN = nn.LayerNorm(1024)

    def forward(self, video_feats, audio_feats, text_feats):
        # Joint attention: audio tokens attend to concatenated video+text tokens
        context = torch.cat([video_feats, text_feats], dim=1)
        combined, _ = self.cross_attn(query=audio_feats, key=context, value=context)
        # Temporal processing along the audio time axis (Conv1d expects (B, C, T))
        aligned = self.temporal_conv(combined.transpose(1, 2)).transpose(1, 2)
        # Normalization (adaptive/conditional in the paper)
        output = self.adaLN(aligned)
        return output
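
A quick shape check with dummy features, continuing from the block above (the batch size, token counts, and 1024-dim width are illustrative, not the paper's exact values):

# Dummy multimodal features: (batch, tokens, dim)
audio = torch.randn(2, 250, 1024)   # audio latent frames
video = torch.randn(2, 64, 1024)    # visual tokens
text = torch.randn(2, 16, 1024)     # text tokens

block = MultimodalTransformer()
out = block(video_feats=video, audio_feats=audio, text_feats=text)
print(out.shape)                    # torch.Size([2, 250, 1024])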

Training and Evaluation

import torch
import torch.nn.functional as F

def train_step(batch, models):
    # Encode the target audio into latents (x1) and extract conditioning features
    audio_latents = models.audio_extractor(batch['waveform'])
    video_feats = models.video_extractor(batch['video_frames'])
    text_feats = models.text_encoder(batch['text'])

    # Flow matching objective: interpolate between noise (x0) and data (x1)
    noise = torch.randn_like(audio_latents)
    t = torch.rand(audio_latents.size(0), device=audio_latents.device)
    x_t = t[:, None, None] * audio_latents + (1 - t[:, None, None]) * noise

    # The multimodal transformer acts as the velocity network: the noised audio
    # latents attend to the video/text features and predict the velocity x1 - x0
    pred_velocity = models.velocity_net(x_t, video_feats, text_feats, t)
    loss = F.mse_loss(pred_velocity, audio_latents - noise)

    return loss
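
At inference, flow matching turns the learned velocity field into audio by integrating an ODE from noise toward the data distribution. Below is a minimal Euler-integration sketch under the same simplified interfaces as the training snippet above (models.velocity_net, the feature extractors, models.vae, and the latent shape are placeholders, not the released API):

import torch

@torch.no_grad()
def sample(models, video_frames, text, num_steps=25, latent_shape=(1, 250, 1024)):
    # Conditioning features from the (silent) video and the optional text prompt
    video_feats = models.video_extractor(video_frames)
    text_feats = models.text_encoder(text)

    # Start from Gaussian noise (x0) and integrate dx/dt = v(x_t, t) up to t = 1
    x = torch.randn(latent_shape)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = torch.full((latent_shape[0],), step * dt)
        velocity = models.velocity_net(x, video_feats, text_feats, t)
        x = x + dt * velocity        # Euler step toward the data distribution

    # Decode the latents back into a mel spectrogram (a vocoder then gives a waveform)
    return models.vae.decode(x)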

Demo

Check out the demo of MMAudio in action. This was run in a Colab environment:

# Colab demo for MMAudio
# Check GPU availability
!nvidia-smi
import torch
if torch.cuda.is_available():
    print('GPU is available')
    device = 'cuda'
else:
    print('GPU not available, using CPU')    
    device = 'cpu'
    
# Install dependencies
!pip install torch torchvision torchaudio transformers
!git clone https://github.com/hkchengrex/MMAudio.git
%cd MMAudio
!pip install -e .
    
# Example data:
%cd /content/MMAudio
!curl -L https://i.imgur.com/8xHJTzI.mp4 -o video.mp4  # save as video.mp4 for the cells below
from IPython.display import HTML
from base64 import b64encode
data_url = "data:video/mp4;base64," + b64encode(open('video.mp4', 'rb').read()).decode()
HTML('<video width="400" height="300" controls><source src="' + data_url + '" type="video/mp4"></video>')    
    
# Run the MMAudio demo on the example video
!python demo.py --duration=10 --video=video.mp4 --prompt "waves and seagulls"
data_url = "data:video/mp4;base64," + b64encode(open('./output/video.mp4', 'rb').read()).decode()
HTML('<video width="400" height="300" controls><source src="' + data_url + '" type="video/mp4"></video>')    
        

Experiments

Metrics

We evaluate generation quality across four dimensions:

  • Distribution matching: Measures similarity between the ground-truth and generated audio distributions using Fréchet Distance (FD) with PaSST, PANNs, and VGGish embeddings, and Kullback-Leibler (KL) divergence with PANNs and PaSST classifiers (a generic FD computation sketch follows this list).
  • Audio quality: Assessed using Inception Score with PANNs classifier.
  • Semantic alignment: Measured via ImageBind cosine similarity (IB-score) between visual and audio features.
  • Temporal alignment: Quantified by Synchformer's desynchronization score (DeSync), measuring audio-visual misalignment in seconds.
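
To make the distribution-matching metric concrete, here is a generic sketch of the Fréchet Distance between two sets of audio embeddings (e.g., PaSST or VGGish features of real and generated clips). This is the standard FID-style formula, not the paper's exact evaluation code:

import numpy as np
from scipy import linalg

def frechet_distance(real_embeds, gen_embeds):
    """Fréchet distance between two embedding sets of shape (N, D)."""
    mu_r, mu_g = real_embeds.mean(axis=0), gen_embeds.mean(axis=0)
    sigma_r = np.cov(real_embeds, rowvar=False)
    sigma_g = np.cov(gen_embeds, rowvar=False)

    # Matrix square root of the product of the two covariance matrices
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))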

Main Results

Video-to-audio

Our smallest model (157M) outperforms prior methods on the VGGSound test set (~15K videos) across most metrics while being computationally efficient. Larger models show improved FDPaSST and IB-scores, though with diminishing returns likely due to data quality limitations. We evaluate 8-second generations following Wang et al.

Table 2. Onset accuracy, average precision (AP), and F1-score on Greatest Hits, with DeSync on VGGSound for reference.
Method Acc. ↑ AP ↑ F1↑ DeSync↓
Frieren [67] 0.6949 0.7846 0.6550 0.851
V-AURA [65] 0.5852 0.8567 0.6441 0.654
FoleyCrafter [73] 0.4533 0.6939 0.4319 1.225
Seeing&Hearing [69] 0.1156 0.8342 0.1591 1.204
MMAudio-S-16kHz 0.7637 0.9010 0.7928 0.483
MMAudio-S-44.1kHz 0.7150 0.9097 0.7666 0.444
MMAudio-M-44.1kHz 0.7226 0.9054 0.7620 0.443
MMAudio-L-44.1kHz 0.7158 0.9064 0.7535 0.442

Text-to-audio

Without fine-tuning, our multimodal framework demonstrates state-of-the-art semantic alignment (CLAP) and audio quality (IS) on the AudioCaps test set, despite not being primarily designed for this task.

Table 3. Text-to-audio results on the AudioCaps test set. For a fair comparison, we follow the evaluation protocol of [12] and take all baseline numbers directly from [12], who reproduced those results with the officially released checkpoints under the same protocol.
Method Params FDPANNs FDVGG IS↑ CLAP↑
AudioLDM 2-L [39] 712M 32.50 5.11 8.54 0.212
TANGO [1] 866M 26.13 1.87 8.23 0.185
TANGO 2 [43] 866M 19.77 2.74 8.45 0.264
Make-An-Audio [16] 453M 27.93 2.59 7.44 0.207
Make-An-Audio 2 [15] 937M 15.34 1.27 9.58 0.251
GenAU-Large [12] 1.25B 16.51 1.21 11.75 0.285
MMAudio-S-16kHz 157M 14.42 2.98 11.36 0.282
MMAudio-S-44.1kHz 157M 15.26 2.74 11.32 0.331
MMAudio-M-44.1kHz 621M 14.38 4.07 12.02 0.351
MMAudio-L-44.1kHz 1.03B 15.04 4.03 12.08 0.348
Visualization of spectrograms from different models

Figure 3. We visualize the spectrograms of generated audio (by prior works and our method) and the ground-truth. Note our method generates the audio effects most closely aligned to the ground-truth, while other methods often generate sounds not explained by the visual input and not present in the ground-truth.

Ablations

All ablations use our small-16kHz model evaluated on the VGGSound test set.

Cross-modal alignment

Joint multimodal training benefits from:

  1. Incorporating text modality, creating a unified feature space
  2. Including uncaptioned audio data, which improves natural sound distribution learning
  3. Training on large multimodal datasets rather than only utilizing class labels

Table 4. Results when we vary the training modalities. A: Audio, V: Video, T: Text.
Training modalities FDPaSST IS↑ IB-score↑ DeSync↓
AVT+AT 70.19 14.44 29.13 0.483
AV+AT 72.77 12.88 28.10 0.502
AVT+A 71.01 14.30 28.72 0.496
AV+A 77.38 12.53 27.98 0.562
AV 77.27 12.69 28.10 0.502

In the second and third rows, we mask away the text token in either audio-visual data or audio-text data. In the last two rows, we do not use any audio-text data.

Multimodal data

Increasing the amount of audio-text training data improves distribution matching, semantic alignment, and temporal alignment, with diminishing returns at larger scales.

Table 5. Results when we vary the amount of multimodal training data.
% audio-text data FDPaSST IS↑ IB-score↑ DeSync↓
100% 70.19 14.44 29.13 0.483
50% 71.03 14.62 29.11 0.489
25% 71.67 14.41 28.75 0.505
10% 79.21 13.55 27.47 0.514
None 77.38 12.53 27.98 0.562

For the first four rows, we sample audio-visual and audio-text data at a 1:1 ratio during training. For the last row, only audio-visual data is used.

Conditional synchronization module

Our conditional synchronization module achieves better temporal alignment than either summing the synchronization features into the visual branch or omitting them entirely.

Table 6. Results when we use synchronization features differently.
Variant FDPaSST IS↑ IB-score↑ DeSync↓
With sync module 70.19 14.44 29.13 0.483
Sum sync with visual 73.59 16.70 28.65 0.490
No sync features 69.33 15.05 29.31 0.973

RoPE embeddings

The aligned RoPE formulation improves audio-visual synchrony compared with both no RoPE embeddings and the non-aligned variant (a minimal sketch of the alignment idea follows the table).

Part of Table 6. Results when we use RoPE embeddings differently.
Variant FDPaSST IS↑ IB-score↑ DeSync↓
Aligned RoPE 70.19 14.44 29.13 0.483
No RoPE 70.24 14.54 29.23 0.509
Non-aligned RoPE 70.25 14.54 29.25 0.496
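
As I understand it, the aligned variant rescales the rotary position indices so that audio-latent tokens and visual tokens occurring at the same timestamp share the same phase, even though the two streams have different token rates. A minimal illustration of that rescaling (the sequence lengths and the helper name are assumptions, not the paper's code):

import torch

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE angles for a 1D tensor of (possibly fractional) positions."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(positions, inv_freq)      # (seq_len, dim // 2)

# Example: the same clip yields 250 audio-latent tokens but only 64 visual tokens
audio_len, visual_len = 250, 64
audio_pos = torch.arange(audio_len).float()

# Non-aligned: visual tokens use their own integer indices 0..63
visual_pos_naive = torch.arange(visual_len).float()

# Aligned: rescale visual indices onto the audio time axis, so tokens at the
# same timestamp get the same rotary phase
visual_pos_aligned = torch.arange(visual_len).float() * (audio_len / visual_len)

audio_angles = rope_angles(audio_pos, dim=64)
visual_angles = rope_angles(visual_pos_aligned, dim=64)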

ConvMLP

ConvMLP outperforms a standard MLP at capturing local temporal structure, particularly for synchronization (a minimal sketch follows the table).

Part of Table 7. Results when we vary the MLP architecture.
Variant FDPaSST IS↑ IB-score↑ DeSync↓
ConvMLP 70.19 14.44 29.13 0.483
MLP 73.84 13.01 28.99 0.533
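
A ConvMLP replaces the pointwise linear layers of a standard transformer feed-forward block with 1D convolutions, so each position also mixes information from its temporal neighbors. A minimal sketch (the kernel size and expansion ratio here are illustrative assumptions):

import torch.nn as nn

class ConvMLP(nn.Module):
    """Feed-forward block whose layers are 1D convolutions over the time axis."""
    def __init__(self, dim=1024, expansion=4, kernel_size=3):
        super().__init__()
        hidden = dim * expansion
        self.net = nn.Sequential(
            nn.Conv1d(dim, hidden, kernel_size, padding=kernel_size // 2),
            nn.GELU(),
            nn.Conv1d(hidden, dim, kernel_size, padding=kernel_size // 2),
        )

    def forward(self, x):                         # x: (batch, time, dim)
        return self.net(x.transpose(1, 2)).transpose(1, 2)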

Architecture ratio

Our default assignment of multimodal (N1=4) and single-modal (N2=8) transformer blocks balances performance and parameter efficiency.

Part of Table 7. Results when we vary the ratio between multi-/single-modality transformer blocks.
Variant FDPaSST IS↑ IB-score↑ DeSync↓
N1 = 4, N2 = 8 70.19 14.44 29.13 0.483
N1 = 2, N2 = 13 70.33 15.18 29.39 0.487
N1 = 6, N2 = 3 72.53 13.75 29.06 0.509

The ablation studies confirm that our design choices are essential for achieving high-quality video-to-audio synthesis. Joint multimodal training with text data, aligned RoPE embeddings, our conditional synchronization module, and ConvMLP all contribute significantly to the model's performance, especially for temporal alignment and audio quality.

Reflections

What Surprised Me?

  1. Multimodal Training Doesn't Compromise Single-Modality Performance

    The paper reveals that training on both video-audio and text-audio data doesn't dilute the model's ability to excel in text-to-audio tasks. This challenges the assumption that multimodal models must trade off specialization for versatility. For instance, MMAudio's CLAP score (text-audio alignment) rivals dedicated text-to-audio models like AudioLDM 2, suggesting that cross-modal training enriches semantic understanding.

  2. Efficiency Without Sacrificing Quality

    Despite its compact size (157M parameters), MMAudio outperforms larger models (e.g., FoleyCrafter: 1.2B params) in synchronization and audio quality. The use of flow matching—a less common alternative to diffusion models—proves unexpectedly effective for fast, stable generation.

  3. Synchformer's Dual Role

    The repurposing of Synchformer—a model originally designed to detect desynchronization—to improve synchronization is a clever inversion. Its high-FPS features (24 fps) enable frame-level precision, a stark contrast to prior methods relying on handcrafted proxies like energy curves.

Scope for Improvement

  1. Speech Synthesis: The Elephant in the Room

    While MMAudio excels at ambient sounds and Foley effects, its failure to generate intelligible speech highlights a critical gap. Human speech involves layers of complexity (phonetics, prosody, language structure) absent in non-vocal audio. Future work could:

    • Integrate speech-specific modules (e.g., phoneme aligners, tone predictors)
    • Leverage speech-text-video datasets (e.g., talking-head videos with transcripts) to bridge this gap
  2. Data Diversity and Bias Mitigation

    MMAudio trains on automated captions from WavCaps and class-labeled VGGSound data, which may embed biases (e.g., overrepresentation of common sounds). Curating balanced, ethically sourced datasets—with annotations for rare or culturally specific sounds—could improve fairness and generalization.

  3. Long-Form Temporal Consistency

    The model processes 8–10s clips, but real-world applications (e.g., films, podcasts) require coherence over minutes. Expanding the context window with memory-augmented transformers or hierarchical modeling could address this.

  4. Real-Time Applications

    While MMAudio is fast (1.23s for 8s audio), true real-time synthesis (e.g., for live streaming) demands sub-second latency. Optimizing the vocoder or exploring latent space streaming could unlock this.

  5. Integration with Video Generation Models

    Combining MMAudio with video generators (e.g., Sora, Stable Video Diffusion) could enable end-to-end audiovisual synthesis, where AI generates synchronized sight and sound from text prompts.

  6. Ethical Safeguards

    The paper sidesteps discussions on misuse (e.g., deepfake audio for misinformation). Future iterations should include watermarking or detection mechanisms to mitigate risks.

Final Thoughts

MMAudio's limitations are not failures but signposts for progress. Its inability to handle speech underscores the need for specialized submodules in multimodal frameworks, while its reliance on existing datasets calls for more inclusive data practices. By addressing these gaps, the next generation of multimodal models could achieve human-like audiovisual understanding—a leap toward AI that truly "hears" and "sees" in tandem.

Acknowledgements

This work is supported in part by Sony. A.S. is supported by NSF grants 2008387, 2045586, 2106825, and NIFA award 2020-67021-32799. I sincerely thank Kazuki Shimada and Zhi Zhong for their helpful feedback on this manuscript.

References

  1. Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, and Yuki Mitsufuji. (2025). MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis. arXiv preprint arXiv:2412.15322. https://arxiv.org/abs/2412.15322
  2. Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. VGGSound: A large-scale audio-visual dataset. In ICASSP, 2020.
  3. Ziyang Chen, Prem Seetharaman, Bryan Russell, Oriol Nieto, David Bourgin, Andrew Owens, and Justin Salamon. Video-guided foley sound generation with multimodal controls. arXiv, 2024.
  4. Yoonjin Chung, Junwon Lee, and Juhan Nam. T-foley: A controllable waveform-domain diffusion model for temporal event-guided foley sound synthesis. In ICASSP. IEEE, 2024.
  5. Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: An audio captioning dataset. In ICASSP. IEEE, 2020.
  6. Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, and Andrew Owens. Conditional generation of audio from video via foley analogies. In CVPR, 2023.
  7. Benjamin Elizalde, Soham Deshmukh, and Huaming Wang. Natural language supervision for general-purpose audio representations. In ICASSP, 2024.
  8. Patrick Esser et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024.
  9. Jort F Gemmeke et al. Audio set: An ontology and human-labeled dataset for audio events. In ICASSP. IEEE, 2017.
  10. Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, and Soujanya Poria. Text-to-audio generation using instruction tuned llm and latent diffusion model. arXiv preprint arXiv:2304.13731, 2023.
  11. Rohit Girdhar et al. Imagebind: One embedding space to bind them all. In CVPR, 2023.
  12. Moayed Haji-Ali et al. Taming data and transformers for audio generation. arXiv preprint arXiv:2406.19388, 2024.
  13. Jiawei Huang et al. Make-an-audio 2: Temporal-enhanced text-to-audio generation. arXiv preprint arXiv:2305.18474, 2023.
  14. Rongjie Huang et al. Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models. In ICML, 2023.
  15. Vladimir Iashin and Esa Rahtu. Taming visually guided sound generation. In BMVC, 2021.
  16. Vladimir Iashin, Weidi Xie, Esa Rahtu, and Andrew Zisserman. Synchformer: Efficient synchronization from sparse cues. In ICASSP. IEEE, 2024.
  17. Yujin Jeong, Yunji Kim, Sanghyuk Chun, and Jiyoung Lee. Read, watch and scream! sound generation from text and video. arXiv preprint arXiv:2407.05551, 2024.
  18. Chris Dongjoo Kim et al. AudioCaps: Generating captions for audios in the wild. In NAACL-HLT, 2019.
  19. Qiuqiang Kong et al. Panns: Large-scale pretrained audio neural networks for audio pattern recognition. TASLP, 2020.
  20. Andrew Owens et al. Visually indicated sounds. In CVPR, 2016.
  21. Alec Radford et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
  22. Yong Ren et al. STA-V2A: Video-to-audio generation with semantic and temporal alignment. arXiv preprint arXiv:2409.08601, 2024.
  23. Ludan Ruan et al. MM-Diffusion: Learning multi-modal diffusion models for joint audio and video generation. In CVPR, 2023.
  24. Tim Salimans et al. Improved techniques for training gans. In NeurIPS, 2016.
  25. Zineng Tang et al. Any-to-any generation via composable diffusion. In NeurIPS, 2024.
  26. Ashish Vaswani et al. Attention is all you need. In NeurIPS, 2017.
  27. Yongqi Wang et al. Frieren: Efficient video-to-audio generation with rectified flow matching. In NeurIPS, 2024.
  28. Yazhou Xing et al. Seeing and hearing: Open-domain visual-audio generation with diffusion latent aligners. In CVPR, 2024.
  29. Yiming Zhang et al. Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds. arXiv preprint arXiv:2407.01494, 2024.
  30. Bin Zhu et al. Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. In ICLR, 2024.
  31. Simian Luo, Chuanhao Yan, Chenxu Hu, and Hang Zhao. Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models. In NeurIPS, 2024.