Release AnyEnhance pretrained models (#386)

open-mmlab · Jan 15, 2025 · c027722
1 parent 04dfe6e
commit c027722

Showing 19 changed files with 2,312 additions and 12 deletions.
diff --git a/README.md b/README.md
@@ -9,6 +9,7 @@
     <a href="egs/tts/README.md"><img src="https://img.shields.io/badge/README-TTS-blue"></a>
     <a href="models/vc/vevo/README.md"><img src="https://img.shields.io/badge/README-VC-blue"></a>
     <a href="models/vc/vevo/README.md"><img src="https://img.shields.io/badge/README-AC-blue"></a>
+    <a href="models/se/anyenhance/README.md"><img src="https://img.shields.io/badge/README-SE-blue"></a>
     <a href="egs/svc/README.md"><img src="https://img.shields.io/badge/README-SVC-blue"></a>
     <a href="egs/tta/README.md"><img src="https://img.shields.io/badge/README-TTA-blue"></a>
     <a href="egs/vocoder/README.md"><img src="https://img.shields.io/badge/README-Vocoder-purple"></a>
@@ -26,6 +27,7 @@
 - **SVS**: Singing Voice Synthesis (👨‍💻 developing)
 - **VC**: Voice Conversion (⛳ supported)
 - **AC**: Accent Conversion (⛳ supported)
+- **SE**: Speech Enhancement (⛳ supported)
 - **SVC**: Singing Voice Conversion (⛳ supported)
 - **TTA**: Text to Audio (⛳ supported)
 - **TTM**: Text to Music (👨‍💻 developing)
@@ -34,6 +36,7 @@
 In addition to the specific generation tasks, Amphion includes several **vocoders** and **evaluation metrics**. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for ensuring consistent metrics in generation tasks. Moreover, Amphion is dedicated to advancing audio generation in real-world applications, such as building **large-scale datasets** for speech synthesis.
 
 ## 🚀 News
+- **2025/01/15**: We release **AnyEnhance**, a unified generative model capable of addressing diverse speech and singing voice front-end tasks, including denoising, dereverberation, declipping, super-resolution, and target speaker extraction. [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/anyenhance) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](models/se/anyenhance/README.md)
 - **2024/12/22**: We release the reproduction of **Vevo**, a zero-shot voice imitation framework with controllable timbre and style. Vevo can be applied into a series of speech generation tasks, including VC, TTS, AC, and more. The released pre-trained models are trained on [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) dataset and achieve SOTA zero-shot VC performance. [![arXiv](https://img.shields.io/badge/OpenReview-Paper-COLOR.svg)](https://openreview.net/pdf?id=anQDiQZhDP) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/Vevo) [![WebPage](https://img.shields.io/badge/WebPage-Demo-red)](https://versavoice.github.io/) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](models/vc/vevo/README.md)
 - **2024/10/19**: We release **MaskGCT**, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision. MaskGCT is trained on [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) dataset and achieves SOTA zero-shot TTS performance.  [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2409.00750) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/maskgct) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-demo-pink)](https://huggingface.co/spaces/amphion/maskgct) [![ModelScope](https://img.shields.io/badge/ModelScope-space-purple)](https://modelscope.cn/studios/amphion/maskgct) [![ModelScope](https://img.shields.io/badge/ModelScope-model-cyan)](https://modelscope.cn/models/amphion/MaskGCT) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](models/tts/maskgct/README.md)
 - **2024/09/01**: [Amphion](https://arxiv.org/abs/2312.09911), [Emilia](https://arxiv.org/abs/2407.05361) and [DSFF-SVC](https://arxiv.org/abs/2310.11160) got accepted by IEEE SLT 2024! 🤗
@@ -72,6 +75,10 @@ Amphion supports the following voice conversion models:
 
 - Amphion supports AC with [Vevo-Style](models/vc/vevo/README.md). Particularly, it can conduct the accent conversion in a zero-shot manner. [![code](https://img.shields.io/badge/README-Code-blue)](models/vc/vevo/README.md)
 
+### SE: Speech Enhancement
+
+- Amphion supports SE with [AnyEnhance](models/se/anyenhance/README.md). Particularly, it can conduct enhancement and extraction tasks under various distortions (noise, reverberation, clipping, etc.). [![code](https://img.shields.io/badge/README-Code-blue)](models/se/anyenhance/README.md)
+
 ### SVC: Singing Voice Conversion
 
 - Amphion supports multiple content-based features from various pretrained models, including [WeNet](https://github.com/wenet-e2e/wenet), [Whisper](https://github.com/openai/whisper), and [ContentVec](https://github.com/auspicious3000/contentvec). Their specific roles in SVC have been investigated in our SLT 2024 paper. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2310.11160) [![code](https://img.shields.io/badge/README-Code-blue)](egs/svc/MultipleContentsSVC)
@@ -154,6 +161,7 @@ We detail the instructions of different tasks in the following recipes:
 - [Text to Speech (TTS)](egs/tts/README.md)
 - [Voice Conversion (VC)](models/vc/vevo/README.md)
 - [Accent Conversion (AC)](models/vc/vevo/README.md)
+- [Speech Enhancement (SE)](models/se/anyenhance/README.md)
 - [Singing Voice Conversion (SVC)](egs/svc/README.md)
 - [Text to Audio (TTA)](egs/tta/README.md)
 - [Vocoder](egs/vocoder/README.md)

diff --git a/imgs/anyenhance/anyenhance.png b/imgs/anyenhance/anyenhance.png
diff --git a/imgs/anyenhance/demo.png b/imgs/anyenhance/demo.png
diff --git a/imgs/anyenhance/enhancement-tasks.png b/imgs/anyenhance/enhancement-tasks.png
diff --git a/models/se/anyenhance/.gitignore b/models/se/anyenhance/.gitignore
@@ -0,0 +1,4 @@
+pretrained/
+
+!wav/enhancement/*.wav
+!wav/extraction/*.wav
diff --git a/models/se/anyenhance/README.md b/models/se/anyenhance/README.md
@@ -0,0 +1,113 @@
+# AnyEnhance: A Unified Generative Model with Prompt-Guidance and Self-Critic for Voice Enhancement and Extraction
+
+[![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/anyenhance)
+
+<br>
+<div align="center">
+<img src="../../../imgs/anyenhance/enhancement-tasks.png" width="100%">
+</div>
+<br>
+
+*Image adapted from [urgent challenge](https://urgent-challenge.github.io/urgent2024/baseline/).
+
+We present AnyEnhance, a unified model for both voice enhancement and extraction. Based on a masked generative model, AnyEnhance is capable of addressing diverse speech and singing voice front-end tasks, including denoising, dereverberation, declipping, super-resolution, and target speaker extraction.
+
+<br>
+<div align="center">
+<img src="../../../imgs/anyenhance/anyenhance.png" width="100%">
+</div>
+<br>
+
+## Preparation
+
+### 1. Clone the repository
+
+```bash
+git clone https://github.com/open-mmlab/Amphion.git
+cd Amphion
+```
+
+### 2. Install the environment
+
+Before installing, make sure you are in the `Amphion` directory. If not, use `cd` to enter it.
+
+```bash
+# root_dir: Amphion
+cd models/se/anyenhance
+conda create -n anyenhance python=3.10
+conda activate anyenhance
+pip install -r requirements.txt
+```
+
+### 3. Download the pretrained models
+
+```bash
+# root_dir: Amphion
+# Uncomment to use HuggingFace mirror for mainland China users
+# export HF_ENDPOINT=https://hf-mirror.com
+cd models/se/anyenhance
+mkdir -p pretrained
+# Download the w2v-bert-2.0
+huggingface-cli download facebook/w2v-bert-2.0 --local-dir ./pretrained/w2v-bert-2.0
+# Download anyenhance weights
+huggingface-cli download amphion/anyenhance --local-dir ./pretrained/anyenhance
+```
+
+## Inference Examples
+
+### Enhancement without Prompt
+
+```bash
+# root_dir: Amphion
+python -m models.se.anyenhance.infer_anyenhance \
+    --task_type "enhancement" \
+    --input_file models/se/anyenhance/wav/enhancement/p226_006.wav \
+    --output_folder models/se/anyenhance/wav/enhanced \
+    --device cuda:0
+```
+
+### Enhancement with Prompt
+
+```bash
+# root_dir: Amphion
+python -m models.se.anyenhance.infer_anyenhance \
+    --task_type "enhancement" \
+    --input_file models/se/anyenhance/wav/enhancement/13-noisy.wav \
+    --prompt_file models/se/anyenhance/wav/enhancement/13-prompt.wav \
+    --output_folder models/se/anyenhance/wav/enhanced \
+    --device cuda:0
+```
+
+### Extraction
+
+```bash
+# root_dir: Amphion
+python -m models.se.anyenhance.infer_anyenhance \
+    --task_type "extraction" \
+    --input_file models/se/anyenhance/wav/extraction/9-noisy.wav \
+    --prompt_file models/se/anyenhance/wav/extraction/9-prompt.wav \
+    --output_folder models/se/anyenhance/wav/enhanced \
+    --device cuda:0
+```
+
+## Visualization Examples
+
+<br>
+<div align="center">
+<img src="../../../imgs/anyenhance/demo.png" width="100%">
+</div>
+<br>
+
+
+## Citations
+
+If you use AnyEnhance in your research, please cite the following paper:
+
+```bibtex
+@inproceedings{amphion,
+    author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
+    title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
+    booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
+    year={2024}
+}
+```
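
The three inference examples in the new README differ only in `--task_type` and in whether `--prompt_file` is passed. As a minimal sketch (not part of this commit), the helper below assembles the same command lines in Python so they can be generated for many files. It uses only the flags shown in the README; the function names `build_infer_command` and `batch_commands` are hypothetical, and the rule that extraction requires a prompt is an assumption drawn from the README's single extraction example.

```python
import sys
from pathlib import Path
from typing import List, Optional


def build_infer_command(
    task_type: str,
    input_file: str,
    output_folder: str,
    prompt_file: Optional[str] = None,
    device: str = "cuda:0",
) -> List[str]:
    """Assemble the argv list for one infer_anyenhance call (hypothetical helper)."""
    # The README shows exactly these two task types.
    if task_type not in ("enhancement", "extraction"):
        raise ValueError(f"unknown task_type: {task_type!r}")
    # Assumption: extraction always needs a prompt, as in the README example.
    if task_type == "extraction" and prompt_file is None:
        raise ValueError("extraction needs a prompt_file")
    cmd = [
        sys.executable, "-m", "models.se.anyenhance.infer_anyenhance",
        "--task_type", task_type,
        "--input_file", input_file,
    ]
    # The prompt is optional for enhancement, so add the flag only when given.
    if prompt_file is not None:
        cmd += ["--prompt_file", prompt_file]
    cmd += ["--output_folder", output_folder, "--device", device]
    return cmd


def batch_commands(
    input_dir: str, output_folder: str, device: str = "cuda:0"
) -> List[List[str]]:
    """Build one prompt-free enhancement command per .wav file in input_dir."""
    return [
        build_infer_command("enhancement", str(wav), output_folder, device=device)
        for wav in sorted(Path(input_dir).glob("*.wav"))
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the `Amphion` root directory, which is where the README runs its `python -m` commands. Returning argv lists rather than shell strings avoids quoting problems with file names that contain spaces.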