MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

By Karan Kumawat (220150005)
May 7, 2025

Motivation

Ever faced the problem of having a video but no audio? Generating realistic, synchronized audio from video is a challenging task that has significant implications in various fields, including film production, gaming, and virtual reality. The ability to synthesize high-quality audio that matches the visual content can enhance user experience and immersion. Existing methods either rely on limited video-audio datasets or bolt "control modules" onto text-to-audio models, which can lead to suboptimal results. This is where the concept of multimodal joint training comes into play.

Compare the two versions below:

Video without Sound

The video without audio: notice what's missing

Video with Sound

The same video with sound: the full audiovisual experience

The Innovation: MMAudio proposes a novel approach to Foley, i.e., video-to-audio synthesis, built on multimodal joint training. By integrating video, text, and audio data into a single training framework, the model learns to generate high-quality audio that is not only synchronized with the visual content but also semantically relevant. This approach allows for better generalization across domains and improves the overall quality of the synthesized audio.

MMAudio Overview Diagram

MMAudio performs multimodal joint training with high-quality, abundant audio-text data, which enables effective data scaling. At inference, MMAudio generates audio aligned with the given video and/or text conditions.

Relevance Today

The Evolution of Multimodal Learning for Audio-Visual Synthesis

Multimodal learning has evolved significantly over the past few years, driven by advancements in deep learning and the availability of large-scale datasets. Early approaches focused on simple feature concatenation or shallow fusion techniques, which often failed to capture the complex relationships between different modalities.

Semantic Alignment has been a foundational concept in multimodal learning, focusing on establishing meaningful connections between different modalities at a conceptual level.

Temporal Alignment represents another critical dimension in multimodal learning, particularly for time-based media like video and audio.

Multimodal Conditioning approaches have transformed how models integrate information from multiple sources.

This historical trajectory shows a clear progression from isolated single-modality approaches toward more integrated systems that can understand and generate content across the full spectrum of human sensory experience, with each advancement building upon previous innovations to create increasingly sophisticated and capable models.

Key Learnings

Through my exploration of multimodal learning for video-audio synthesis, I've gained several important insights:

MMAudio Architecture Diagram

The architecture of MMAudio. It consists of a series of multimodal transformer blocks with visual/text/audio branches, followed by a series of audio-only transformer blocks. For synchrony, the authors devised a conditional synchronization module that extracts high-frame-rate synchronization features and integrates them into the generation process for temporal alignment.
Note: The figure is a conceptual representation and may not reflect the exact architecture used in the paper.
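To make the hybrid stack concrete, below is a minimal, self-contained sketch of the block layout under simplifying assumptions: the separate per-modality branches are collapsed into plain self-attention over concatenated tokens, and standard PyTorch encoder layers stand in for the paper's blocks. The class name MMAudioBackboneSketch and all sizes are hypothetical, not the released implementation.

import torch
import torch.nn as nn

class MMAudioBackboneSketch(nn.Module):
    """Hypothetical sketch: N1 joint multimodal blocks followed by N2
    audio-only blocks (N1=4, N2=8 is the paper's default ratio)."""
    def __init__(self, dim=1024, heads=16, n_multimodal=4, n_audio_only=8):
        super().__init__()
        # Joint blocks: audio, visual, and text tokens attend to each other
        # (the paper keeps per-modality branches; concatenation is a simplification)
        self.mm_blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True)
             for _ in range(n_multimodal)]
        )
        # Audio-only blocks: refine the audio latent stream after fusion
        self.audio_blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True)
             for _ in range(n_audio_only)]
        )

    def forward(self, audio_tokens, video_tokens, text_tokens):
        n_audio = audio_tokens.size(1)
        x = torch.cat([audio_tokens, video_tokens, text_tokens], dim=1)
        for block in self.mm_blocks:
            x = block(x)
        audio = x[:, :n_audio]          # keep only the audio tokens
        for block in self.audio_blocks:
            audio = block(audio)
        return audio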

Code and Demonstrations

Feature Extraction Pipeline

# Simplified audio/text feature extraction (illustrative pseudocode;
# load_clip_model, load_vae_model, and MelConverter are placeholders)
class AudioFeatureExtractor:
    def __init__(self):
        self.clip_model = load_clip_model()   # text encoder (CLIP-style)
        self.vae = load_vae_model()           # pretrained audio VAE
        self.mel_converter = MelConverter()   # waveform -> mel spectrogram

    def process(self, waveform, caption=None):
        mel = self.mel_converter(waveform)    # (batch, n_mels, frames)
        latents = self.vae.encode(mel)        # compressed audio latents
        # Text features come from the caption, not from the audio latents
        text_features = None
        if caption is not None:
            text_features = self.clip_model.encode_text(caption)
        return latents, text_features

# Simplified video feature extraction
class VideoFeatureExtractor:
    def __init__(self):
        self.synchformer = load_synchronization_model()  # high-frame-rate sync cues
        self.clip_encoder = load_video_encoder()          # semantic (CLIP) features

    def process(self, video_frames):
        clip_features = self.clip_encoder(video_frames)   # per-frame semantics
        sync_features = self.synchformer(video_frames)    # fine-grained timing cues
        return clip_features, sync_features

Multimodal Fusion Architecture

import torch
import torch.nn as nn

class MultimodalTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=1024, num_heads=16, batch_first=True
        )
        self.temporal_conv = nn.Conv1d(1024, 1024, kernel_size=3, padding=1)
        # The paper uses adaptive layer norm (adaLN) conditioned on a global
        # embedding; a plain LayerNorm stands in for it in this sketch
        self.adaLN = nn.LayerNorm(1024)

    def forward(self, video_feats, audio_feats, text_feats):
        # Joint attention: audio tokens attend to concatenated video+text tokens
        context = torch.cat([video_feats, text_feats], dim=1)
        combined, _ = self.cross_attn(query=audio_feats, key=context, value=context)
        # Temporal processing along the audio time axis (Conv1d expects (B, C, T))
        aligned = self.temporal_conv(combined.transpose(1, 2)).transpose(1, 2)
        # Normalization (adaptive/conditional in the paper)
        output = self.adaLN(aligned)
        return output
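
A quick shape check with dummy features, continuing from the block above (the batch size, token counts, and 1024-dim width are illustrative, not the paper's exact values):

# Dummy multimodal features: (batch, tokens, dim)
audio = torch.randn(2, 250, 1024)   # audio latent frames
video = torch.randn(2, 64, 1024)    # visual tokens
text = torch.randn(2, 16, 1024)     # text tokens

block = MultimodalTransformer()
out = block(video_feats=video, audio_feats=audio, text_feats=text)
print(out.shape)                    # torch.Size([2, 250, 1024])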

Training and Evaluation

import torch
import torch.nn.functional as F

def train_step(batch, models):
    # Encode the target audio into latents (x1) and extract conditioning features
    audio_latents = models.audio_extractor(batch['waveform'])
    video_feats = models.video_extractor(batch['video_frames'])
    text_feats = models.text_encoder(batch['text'])

    # Flow matching objective: interpolate between noise (x0) and data (x1)
    noise = torch.randn_like(audio_latents)
    t = torch.rand(audio_latents.size(0), device=audio_latents.device)
    x_t = t[:, None, None] * audio_latents + (1 - t[:, None, None]) * noise

    # The multimodal transformer acts as the velocity network: the noised audio
    # latents attend to the video/text features and predict the velocity x1 - x0
    pred_velocity = models.velocity_net(x_t, video_feats, text_feats, t)
    loss = F.mse_loss(pred_velocity, audio_latents - noise)

    return loss
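
At inference, flow matching turns the learned velocity field into audio by integrating an ODE from noise toward the data distribution. Below is a minimal Euler-integration sketch under the same simplified interfaces as the training snippet above (models.velocity_net, the feature extractors, models.vae, and the latent shape are placeholders, not the released API):

import torch

@torch.no_grad()
def sample(models, video_frames, text, num_steps=25, latent_shape=(1, 250, 1024)):
    # Conditioning features from the (silent) video and the optional text prompt
    video_feats = models.video_extractor(video_frames)
    text_feats = models.text_encoder(text)

    # Start from Gaussian noise (x0) and integrate dx/dt = v(x_t, t) up to t = 1
    x = torch.randn(latent_shape)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = torch.full((latent_shape[0],), step * dt)
        velocity = models.velocity_net(x, video_feats, text_feats, t)
        x = x + dt * velocity        # Euler step toward the data distribution

    # Decode the latents back into a mel spectrogram (a vocoder then gives a waveform)
    return models.vae.decode(x)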

Demo

Check out the demo of MMAudio in action. This was run in a Colab environment:

# Colab demo for MMAudio
# Check GPU availability
!nvidia-smi
import torch
if torch.cuda.is_available():
    print('GPU is available')
    device = 'cuda'
else:
    print('GPU not available, using CPU')    
    device = 'cpu'
    
# Install dependencies
!pip install torch torchvision torchaudio transformers
!git clone https://github.com/hkchengrex/MMAudio.git
%cd MMAudio
!pip install -e .
    
# Example data:
%cd /content/MMAudio
!curl -L https://i.imgur.com/8xHJTzI.mp4 -o video.mp4  # save as video.mp4 for the cells below
from IPython.display import HTML
from base64 import b64encode
data_url = "data:video/mp4;base64," + b64encode(open('video.mp4', 'rb').read()).decode()
HTML('<video width="400" height="300" controls><source src="' + data_url + '" type="video/mp4"></video>')    
    
# Run the MMAudio demo on the example video
!python demo.py --duration=10 --video=video.mp4 --prompt "waves and seagulls"
data_url = "data:video/mp4;base64," + b64encode(open('./output/video.mp4', 'rb').read()).decode()
HTML('<video width="400" height="300" controls><source src="' + data_url + '" type="video/mp4"></video>')    
        

Experiments

Metrics

We evaluate generation quality across four dimensions:

  • Distribution matching: Measures similarity between the ground-truth and generated audio distributions using Fréchet Distance (FD) with PaSST, PANNs, and VGGish embeddings, and Kullback-Leibler (KL) divergence with PANNs and PaSST classifiers (a generic FD computation sketch follows this list).
  • Audio quality: Assessed using Inception Score with PANNs classifier.
  • Semantic alignment: Measured via ImageBind cosine similarity (IB-score) between visual and audio features.
  • Temporal alignment: Quantified by Synchformer's desynchronization score (DeSync), measuring audio-visual misalignment in seconds.
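
To make the distribution-matching metric concrete, here is a generic sketch of the Fréchet Distance between two sets of audio embeddings (e.g., PaSST or VGGish features of real and generated clips). This is the standard FID-style formula, not the paper's exact evaluation code:

import numpy as np
from scipy import linalg

def frechet_distance(real_embeds, gen_embeds):
    """Fréchet distance between two embedding sets of shape (N, D)."""
    mu_r, mu_g = real_embeds.mean(axis=0), gen_embeds.mean(axis=0)
    sigma_r = np.cov(real_embeds, rowvar=False)
    sigma_g = np.cov(gen_embeds, rowvar=False)

    # Matrix square root of the product of the two covariance matrices
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))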

Main Results

Video-to-audio

Our smallest model (157M) outperforms prior methods on the VGGSound test set (~15K videos) across most metrics while being computationally efficient. Larger models show improved FDPaSST and IB-scores, though with diminishing returns likely due to data quality limitations. We evaluate 8-second generations following Wang et al.

Table 2. Onset accuracy, average precision (AP), and F1-score on Greatest Hits, with DeSync on VGGSound for reference.
Method Acc. ↑ AP ↑ F1↑ DeSync↓
Frieren [67] 0.6949 0.7846 0.6550 0.851
V-AURA [65] 0.5852 0.8567 0.6441 0.654
FoleyCrafter [73] 0.4533 0.6939 0.4319 1.225
Seeing&Hearing [69] 0.1156 0.8342 0.1591 1.204
MMAudio-S-16kHz 0.7637 0.9010 0.7928 0.483
MMAudio-S-44.1kHz 0.7150 0.9097 0.7666 0.444
MMAudio-M-44.1kHz 0.7226 0.9054 0.7620 0.443
MMAudio-L-44.1kHz 0.7158 0.9064 0.7535 0.442

Text-to-audio

Without fine-tuning, our multimodal framework demonstrates state-of-the-art semantic alignment (CLAP) and audio quality (IS) on the AudioCaps test set, despite not being primarily designed for this task.

Table 3. Text-to-audio results on the AudioCaps test set. For a fair comparison, we follow the evaluation protocol of [12] and take all baseline numbers directly from [12], who reproduced those results with the officially released checkpoints under the same protocol.
Method Params FDPANNs FDVGG IS↑ CLAP↑
AudioLDM 2-L [39] 712M 32.50 5.11 8.54 0.212
TANGO [1] 866M 26.13 1.87 8.23 0.185
TANGO 2 [43] 866M 19.77 2.74 8.45 0.264
Make-An-Audio [16] 453M 27.93 2.59 7.44 0.207
Make-An-Audio 2 [15] 937M 15.34 1.27 9.58 0.251
GenAU-Large [12] 1.25B 16.51 1.21 11.75 0.285
MMAudio-S-16kHz 157M 14.42 2.98 11.36 0.282
MMAudio-S-44.1kHz 157M 15.26 2.74 11.32 0.331
MMAudio-M-44.1kHz 621M 14.38 4.07 12.02 0.351
MMAudio-L-44.1kHz 1.03B 15.04 4.03 12.08 0.348
Visualization of spectrograms from different models

Figure 3. We visualize the spectrograms of generated audio (by prior works and our method) and the ground-truth. Note our method generates the audio effects most closely aligned to the ground-truth, while other methods often generate sounds not explained by the visual input and not present in the ground-truth.

Ablations

All ablations use our small-16kHz model evaluated on the VGGSound test set.

Cross-modal alignment

Joint multimodal training benefits from:

  1. Incorporating text modality, creating a unified feature space
  2. Including uncaptioned audio data, which improves natural sound distribution learning
  3. Training on large multimodal datasets rather than only utilizing class labels

Table 4. Results when we vary the training modalities. A: Audio, V: Video, T: Text.
Training modalities FDPaSST IS↑ IB-score↑ DeSync↓
AVT+AT 70.19 14.44 29.13 0.483
AV+AT 72.77 12.88 28.10 0.502
AVT+A 71.01 14.30 28.72 0.496
AV+A 77.38 12.53 27.98 0.562
AV 77.27 12.69 28.10 0.502

In the second and third rows, we mask away the text token in either audio-visual data or audio-text data. In the last two rows, we do not use any audio-text data.

Multimodal data

Increasing the amount of audio-text training data improves distribution matching, semantic alignment, and temporal alignment, with diminishing returns at larger scales.

Table 5. Results when we vary the amount of multimodal training data.
% audio-text data FDPaSST IS↑ IB-score↑ DeSync↓
100% 70.19 14.44 29.13 0.483
50% 71.03 14.62 29.11 0.489
25% 71.67 14.41 28.75 0.505
10% 79.21 13.55 27.47 0.514
None 77.38 12.53 27.98 0.562

For the first four rows, we sample audio-visual and audio-text data at a 1:1 ratio during training. For the last row, only audio-visual data is used.

Conditional synchronization module

Our conditional synchronization module achieves better temporal alignment than either summing the synchronization features into the visual branch or omitting them entirely.

Table 6. Results when we use synchronization features differently.
Variant FDPaSST IS↑ IB-score↑ DeSync↓
With sync module 70.19 14.44 29.13 0.483
Sum sync with visual 73.59 16.70 28.65 0.490
No sync features 69.33 15.05 29.31 0.973

RoPE embeddings

The aligned RoPE formulation improves audio-visual synchrony compared with both no RoPE embeddings and the non-aligned variant (a minimal sketch of the alignment idea follows the table).

Part of Table 6. Results when we use RoPE embeddings differently.
Variant FDPaSST IS↑ IB-score↑ DeSync↓
Aligned RoPE 70.19 14.44 29.13 0.483
No RoPE 70.24 14.54 29.23 0.509
Non-aligned RoPE 70.25 14.54 29.25 0.496
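
As I understand it, the aligned variant rescales the rotary position indices so that audio-latent tokens and visual tokens occurring at the same timestamp share the same phase, even though the two streams have different token rates. A minimal illustration of that rescaling (the sequence lengths and the helper name are assumptions, not the paper's code):

import torch

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE angles for a 1D tensor of (possibly fractional) positions."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(positions, inv_freq)      # (seq_len, dim // 2)

# Example: the same clip yields 250 audio-latent tokens but only 64 visual tokens
audio_len, visual_len = 250, 64
audio_pos = torch.arange(audio_len).float()

# Non-aligned: visual tokens use their own integer indices 0..63
visual_pos_naive = torch.arange(visual_len).float()

# Aligned: rescale visual indices onto the audio time axis, so tokens at the
# same timestamp get the same rotary phase
visual_pos_aligned = torch.arange(visual_len).float() * (audio_len / visual_len)

audio_angles = rope_angles(audio_pos, dim=64)
visual_angles = rope_angles(visual_pos_aligned, dim=64)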

ConvMLP

ConvMLP outperforms a standard MLP at capturing local temporal structure, particularly for synchronization (a minimal sketch follows the table).

Part of Table 7. Results when we vary the MLP architecture.
Variant FDPaSST IS↑ IB-score↑ DeSync↓
ConvMLP 70.19 14.44 29.13 0.483
MLP 73.84 13.01 28.99 0.533
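
A ConvMLP replaces the pointwise linear layers of a standard transformer feed-forward block with 1D convolutions, so each position also mixes information from its temporal neighbors. A minimal sketch (the kernel size and expansion ratio here are illustrative assumptions):

import torch.nn as nn

class ConvMLP(nn.Module):
    """Feed-forward block whose layers are 1D convolutions over the time axis."""
    def __init__(self, dim=1024, expansion=4, kernel_size=3):
        super().__init__()
        hidden = dim * expansion
        self.net = nn.Sequential(
            nn.Conv1d(dim, hidden, kernel_size, padding=kernel_size // 2),
            nn.GELU(),
            nn.Conv1d(hidden, dim, kernel_size, padding=kernel_size // 2),
        )

    def forward(self, x):                         # x: (batch, time, dim)
        return self.net(x.transpose(1, 2)).transpose(1, 2)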

Architecture ratio

Our default assignment of multimodal (N1=4) and single-modal (N2=8) transformer blocks balances performance and parameter efficiency.

Part of Table 7. Results when we vary the ratio between multi-/single-modality transformer blocks.
Variant FDPaSST IS↑ IB-score↑ DeSync↓
N1 = 4, N2 = 8 70.19 14.44 29.13 0.483
N1 = 2, N2 = 13 70.33 15.18 29.39 0.487
N1 = 6, N2 = 3 72.53 13.75 29.06 0.509

The ablation studies confirm that our design choices are essential for achieving high-quality video-to-audio synthesis. Joint multimodal training with text data, aligned RoPE embeddings, our conditional synchronization module, and ConvMLP all contribute significantly to the model's performance, especially for temporal alignment and audio quality.

Reflections

What Surprised Me?

  1. Multimodal Training Doesn't Compromise Single-Modality Performance

    The paper reveals that training on both video-audio and text-audio data doesn't dilute the model's ability to excel in text-to-audio tasks. This challenges the assumption that multimodal models must trade off specialization for versatility. For instance, MMAudio's CLAP score (text-audio alignment) rivals dedicated text-to-audio models like AudioLDM 2, suggesting that cross-modal training enriches semantic understanding.

  2. Efficiency Without Sacrificing Quality

    Despite its compact size (157M parameters), MMAudio outperforms larger models (e.g., FoleyCrafter: 1.2B params) in synchronization and audio quality. The use of flow matching—a less common alternative to diffusion models—proves unexpectedly effective for fast, stable generation.

  3. Synchformer's Dual Role

    The repurposing of Synchformer—a model originally designed to detect desynchronization—to improve synchronization is a clever inversion. Its high-FPS features (24 fps) enable frame-level precision, a stark contrast to prior methods relying on handcrafted proxies like energy curves.

Scope for Improvement

  1. Speech Synthesis: The Elephant in the Room

    While MMAudio excels at ambient sounds and Foley effects, its failure to generate intelligible speech highlights a critical gap. Human speech involves layers of complexity (phonetics, prosody, language structure) absent in non-vocal audio. Future work could:

    • Integrate speech-specific modules (e.g., phoneme aligners, tone predictors)
    • Leverage speech-text-video datasets (e.g., talking-head videos with transcripts) to bridge this gap
  2. Data Diversity and Bias Mitigation

    MMAudio trains on automated captions from WavCaps and class-labeled VGGSound data, which may embed biases (e.g., overrepresentation of common sounds). Curating balanced, ethically sourced datasets—with annotations for rare or culturally specific sounds—could improve fairness and generalization.

  3. Long-Form Temporal Consistency

    The model processes 8–10s clips, but real-world applications (e.g., films, podcasts) require coherence over minutes. Expanding the context window with memory-augmented transformers or hierarchical modeling could address this.

  4. Real-Time Applications

    While MMAudio is fast (1.23s for 8s audio), true real-time synthesis (e.g., for live streaming) demands sub-second latency. Optimizing the vocoder or exploring latent space streaming could unlock this.

  5. Integration with Video Generation Models

    Combining MMAudio with video generators (e.g., Sora, Stable Video Diffusion) could enable end-to-end audiovisual synthesis, where AI generates synchronized sight and sound from text prompts.

  6. Ethical Safeguards

    The paper sidesteps discussions on misuse (e.g., deepfake audio for misinformation). Future iterations should include watermarking or detection mechanisms to mitigate risks.

Final Thoughts

MMAudio's limitations are not failures but signposts for progress. Its inability to handle speech underscores the need for specialized submodules in multimodal frameworks, while its reliance on existing datasets calls for more inclusive data practices. By addressing these gaps, the next generation of multimodal models could achieve human-like audiovisual understanding—a leap toward AI that truly "hears" and "sees" in tandem.

Acknowledgements

This work is supported in part by Sony. A.S. is supported by NSF grants 2008387, 2045586, 2106825, and NIFA award 2020-67021-32799. I sincerely thank Kazuki Shimada and Zhi Zhong for their helpful feedback on this manuscript.

References

  1. Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, and Yuki Mitsufuji. (2025). MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis. arXiv preprint arXiv:2412.15322. https://arxiv.org/abs/2412.15322
  2. Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. VGGSound: A large-scale audio-visual dataset. In ICASSP, 2020.
  3. Ziyang Chen, Prem Seetharaman, Bryan Russell, Oriol Nieto, David Bourgin, Andrew Owens, and Justin Salamon. Video-guided foley sound generation with multimodal controls. arXiv, 2024.
  4. Yoonjin Chung, Junwon Lee, and Juhan Nam. T-foley: A controllable waveform-domain diffusion model for temporal event-guided foley sound synthesis. In ICASSP. IEEE, 2024.
  5. Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: An audio captioning dataset. In ICASSP. IEEE, 2020.
  6. Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, and Andrew Owens. Conditional generation of audio from video via foley analogies. In CVPR, 2023.
  7. Benjamin Elizalde, Soham Deshmukh, and Huaming Wang. Natural language supervision for general-purpose audio representations. In ICASSP, 2024.
  8. Patrick Esser et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024.
  9. Jort F Gemmeke et al. Audio set: An ontology and human-labeled dataset for audio events. In ICASSP. IEEE, 2017.
  10. Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, and Soujanya Poria. Text-to-audio generation using instruction tuned llm and latent diffusion model. arXiv preprint arXiv:2304.13731, 2023.
  11. Rohit Girdhar et al. Imagebind: One embedding space to bind them all. In CVPR, 2023.
  12. Moayed Haji-Ali et al. Taming data and transformers for audio generation. arXiv preprint arXiv:2406.19388, 2024.
  13. Jiawei Huang et al. Make-an-audio 2: Temporal-enhanced text-to-audio generation. arXiv preprint arXiv:2305.18474, 2023.
  14. Rongjie Huang et al. Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models. In ICML, 2023.
  15. Vladimir Iashin and Esa Rahtu. Taming visually guided sound generation. In BMVC, 2021.
  16. Vladimir Iashin, Weidi Xie, Esa Rahtu, and Andrew Zisserman. Synchformer: Efficient synchronization from sparse cues. In ICASSP. IEEE, 2024.
  17. Yujin Jeong, Yunji Kim, Sanghyuk Chun, and Jiyoung Lee. Read, watch and scream! sound generation from text and video. arXiv preprint arXiv:2407.05551, 2024.
  18. Chris Dongjoo Kim et al. AudioCaps: Generating captions for audios in the wild. In NAACL-HLT, 2019.
  19. Qiuqiang Kong et al. Panns: Large-scale pretrained audio neural networks for audio pattern recognition. TASLP, 2020.
  20. Andrew Owens et al. Visually indicated sounds. In CVPR, 2016.
  21. Alec Radford et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
  22. Yong Ren et al. STA-V2A: Video-to-audio generation with semantic and temporal alignment. arXiv preprint arXiv:2409.08601, 2024.
  23. Ludan Ruan et al. MM-Diffusion: Learning multi-modal diffusion models for joint audio and video generation. In CVPR, 2023.
  24. Tim Salimans et al. Improved techniques for training gans. In NeurIPS, 2016.
  25. Zineng Tang et al. Any-to-any generation via composable diffusion. In NeurIPS, 2024.
  26. Ashish Vaswani et al. Attention is all you need. In NeurIPS, 2017.
  27. Yongqi Wang et al. Frieren: Efficient video-to-audio generation with rectified flow matching. In NeurIPS, 2024.
  28. Yazhou Xing et al. Seeing and hearing: Open-domain visual-audio generation with diffusion latent aligners. In CVPR, 2024.
  29. Yiming Zhang et al. Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds. arXiv preprint arXiv:2407.01494, 2024.
  30. Bin Zhu et al. Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. In ICLR, 2024.
  31. Simian Luo, Chuanhao Yan, Chenxu Hu, and Hang Zhao. Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models. In NeurIPS, 2024.