SoundLoCD: An Efficient Conditional Discrete Contrastive Latent Diffusion Model for Text-to-Sound Generation
This is a demo page for our paper 'SoundLoCD: An Efficient Conditional Discrete Contrastive Latent Diffusion Model for Text-to-Sound Generation'. SoundLoCD is a LoRA-based conditional discrete contrastive latent diffusion model for text-to-sound-effect generation. Unlike recent large-scale audio generation models, SoundLoCD can be trained efficiently under limited computational resources. A contrastive learning strategy strengthens the connection between the textual conditions and the generated audio, yielding outputs that cohere with the text prompt.
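To illustrate the idea of aligning text and audio representations with a contrastive objective, here is a minimal sketch of a symmetric InfoNCE-style loss in PyTorch. The function name, embedding dimensions, and temperature value are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: symmetric InfoNCE-style contrastive loss between
# text and audio embeddings (not the official SoundLoCD code).
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb: torch.Tensor,
                     audio_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Pull matching text/audio pairs together, push mismatched pairs apart."""
    # Normalize so dot products become cosine similarities.
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)

    # Pairwise similarity matrix: logits[i, j] = sim(text_i, audio_j) / T.
    logits = text_emb @ audio_emb.t() / temperature
    targets = torch.arange(text_emb.size(0), device=text_emb.device)

    # Symmetric cross-entropy over text-to-audio and audio-to-text directions.
    loss_t2a = F.cross_entropy(logits, targets)
    loss_a2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2a + loss_a2t)

# Example usage with random embeddings (batch of 8, 256-dim).
if __name__ == "__main__":
    text = torch.randn(8, 256)
    audio = torch.randn(8, 256)
    print(contrastive_loss(text, audio))
```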
If you are interested in our work, please cite it as follows:
@INPROCEEDINGS{niu2023soundlocd,
  author={Niu, Xinlei and Zhang, Jing and Walder, Christian and Martin, Charles Patrick},
  title={SoundLoCD: An Efficient Conditional Discrete Contrastive Latent Diffusion Model for Text-to-Sound Generation},
  year={2023}
}