Jan 2025: We released mWhisper-Flamingo, a SOTA audio-visual model for 9 languages (paper coming soon)!
Nov 2024: We achieved SOTA ASR (1.3% WER) and SOTA AVSR (1.4% WER) on LRS2 - checkpoints are released below.
Oct 2024: We achieved SOTA ASR (0.68% WER) and SOTA AVSR (0.72% WER) on LRS3 by training on LRS3 and VoxCeleb2 - checkpoints are released below.
Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation
We propose Whisper-Flamingo, which integrates visual features into the Whisper speech recognition and translation model using gated cross-attention. Our audio-visual Whisper-Flamingo outperforms audio-only Whisper on English speech recognition and En-X translation for 6 languages in noisy conditions. Moreover, Whisper-Flamingo is a versatile model that performs all of these tasks with a single set of parameters, whereas prior methods are trained separately for each language.
Check out the video demo below (turn sound on). We made several videos about Whisper-Flamingo:
- 30s demo of Whisper-Flamingo (same video below): YouTube link
- 2m demo comparing Whisper and Whisper-Flamingo: YouTube link
- 10m presentation: YouTube link
[Video demo: Whisper-Flamingo.teaser.mp4]
We provide two Colab demos (local copies in ./notebooks):
Since this project uses the MuAViC dataset, we base our virtual environment on theirs.
Create a fresh virtual environment:
conda create -n whisper-flamingo python=3.8 -y
conda activate whisper-flamingo
Clone MuAViC repo and install their requirements:
conda install -c conda-forge ffmpeg==4.2.2 -y
conda install -c conda-forge sox -y
git clone https://github.com/facebookresearch/muavic.git muavic-setup
cd muavic-setup
pip install -r requirements.txt
cd ..
Clone the "muavic" branch of av_hubert's repo and install Fairseq:
# downgrade pip https://github.com/roudimit/whisper-flamingo/issues/4 https://github.com/facebookresearch/fairseq/issues/5511
python -m pip install pip==24.0
pip --version
git clone -b muavic https://github.com/facebookresearch/av_hubert.git
cd av_hubert
git submodule init
git submodule update
# Install av-hubert's requirements
pip install -r requirements.txt
# Install fairseq
cd fairseq
pip install --editable ./
cd ../..
Install extra packages used in our project:
pip install tiktoken==0.5.2 pytorch-lightning==2.1.3 numba==0.58.1 transformers==4.36.2 evaluate tensorboardX
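To confirm that the environment resolved correctly, a quick sanity check like the following can help (an illustrative snippet, not part of the repo):

```python
# Quick environment sanity check (illustrative; not part of the repo scripts).
import torch
import fairseq
import pytorch_lightning as pl
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("fairseq:", fairseq.__version__)
print("pytorch-lightning:", pl.__version__)
print("transformers:", transformers.__version__)
```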
LRS3 / MuAViC: We provide all data needed to reproduce the results on the test set. For instructions on how to prepare the training set (and more details about the test noise), see preparation/README.md. For MuAViC non-En, we also provide our text labels, which were normalized by removing all punctuation except single apostrophes (the code we used for text normalization is in notebooks/mtedx_labels.ipynb). Note that we normalized all of the text (training / validation / test).
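For reference, below is a minimal sketch of this style of normalization; it is an approximation, not the exact code in notebooks/mtedx_labels.ipynb, so consult the notebook for the rules we actually applied.

```python
import re
import unicodedata

def normalize_text(text: str) -> str:
    """Illustrative normalization: drop all punctuation except single
    apostrophes and collapse whitespace (see notebooks/mtedx_labels.ipynb
    for the exact rules used for the released labels)."""
    text = unicodedata.normalize("NFKC", text)
    text = "".join(
        ch for ch in text
        if ch == "'" or not unicodedata.category(ch).startswith("P")
    )
    return re.sub(r"\s+", " ", text).strip()

print(normalize_text("¡Hola, mundo! C'est l'été."))  # -> "Hola mundo C'est l'été"
```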
Download and extract our resources:
wget https://data.csail.mit.edu/public-release-sls/whisper-flamingo/muavic.tar.gz # En
wget https://data.csail.mit.edu/public-release-sls/whisper-flamingo/muavic-multi.tar.gz # Ar, De, El, Es, It, Fr, Pt, Ru
# NOTE: you can also download muavic-ar.tar.gz, muavic-de.tar.gz, etc... if you need a specific language.
wget https://data.csail.mit.edu/public-release-sls/whisper-flamingo/noise.tar.gz
tar -xf muavic.tar.gz
tar -xf muavic-multi.tar.gz
tar -xf noise.tar.gz
echo $(pwd)/noise/babble/muavic/babble_all.wav > ./noise/babble/muavic/test.tsv
echo $(pwd)/noise/babble/muavic/babble_all.wav > ./noise/babble/muavic/valid.tsv
echo $(pwd)/noise/babble/lrs3/noise.wav > ./noise/babble/lrs3/test.tsv
echo $(pwd)/noise/babble/lrs3/noise.wav > ./noise/babble/lrs3/valid.tsv
LRS2: The data can be downloaded here after signing a license and sending it to the BBC (helper script: notebooks/lrs2_download.ipynb). In our experience, it took a week to receive the username & password for the data download. We used the AutoAVSR scripts to process LRS2 (using the provided facial landmarks). Finally, the AutoAVSR data lists must be converted to AV-HuBERT / Fairseq manifests; we provide a script to do this (notebooks/lrs2_make_tsv.ipynb).
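For orientation, below is a hedged sketch of what such a conversion can look like; the column layout (id, video path, audio path, video frame count, audio sample count) follows the AV-HuBERT LRS3-style manifests, and the helper names are hypothetical, so check notebooks/lrs2_make_tsv.ipynb for the exact format we used.

```python
# Hypothetical sketch of an AutoAVSR list -> AV-HuBERT / Fairseq manifest
# conversion; see notebooks/lrs2_make_tsv.ipynb for the actual script.
import soundfile as sf
import cv2

def count_video_frames(path: str) -> int:
    cap = cv2.VideoCapture(path)
    n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    return n

def write_manifest(ids, video_paths, audio_paths, out_tsv, root="/"):
    with open(out_tsv, "w") as f:
        f.write(root + "\n")  # first line: dataset root directory
        for uid, vid, aud in zip(ids, video_paths, audio_paths):
            n_video = count_video_frames(vid)   # mouth-crop video frames
            n_audio = sf.info(aud).frames       # audio samples
            f.write(f"{uid}\t{vid}\t{aud}\t{n_video}\t{n_audio}\n")
```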
Mod. | Size | Parameters | Langs. | Train GPUs | Download Link |
---|---|---|---|---|---|
A | Medium | 769M | En, Ar, De, El, Es, It, Fr, Pt, Ru | 4x A6000, 48GB | whisper_multi-all_medium |
A | Small | 244M | En, Ar, De, El, Es, It, Fr, Pt, Ru | 4x A6000, 48GB | whisper_multi-all_small |
Mod. | Size | Parameters | Langs. | Train GPUs | Download Link |
---|---|---|---|---|---|
AV | Medium | 1,390M | En, Ar, De, El, Es, It, Fr, Pt, Ru | 4x A6000, 48GB | whisper-flamingo_multi-all_medium |
AV | Small | 651M | En, Ar, De, El, Es, It, Fr, Pt, Ru | 4x A6000, 48GB | whisper-flamingo_multi-all_small |
We release our pre-trained models (GPUs = GPUs used for training).
- Our audio models are fine-tuned with noise from MUSAN and LRS3 (including babble noise, speech, and music), making them perform better in noise (see the paper and our video demo for more details).
- We also release the models trained on the combination of LRS3 and VoxCeleb2 (the transcripts of VoxCeleb2 were obtained by Whisper Large-V2, available from this repo). We release the models fine-tuned with noise (noisy) and without noise (clean). whisper_en_large_vc2_clean achieves SOTA ASR on LRS3 (0.68% WER) and whisper-flamingo_en_large_vc2_clean achieves SOTA AVSR on LRS3 (0.72% WER).
- LRS2 models: these models are trained on the LRS2 dataset with noise added from MUSAN and LRS3. whisper_lrs2_medium achieves SOTA ASR on LRS2 (1.3% WER) and whisper-flamingo_lrs2_medium achieves SOTA AVSR on LRS2 (1.4% WER).
- Our LRS3 models support transcription in English (En) and En-X translation into 6 languages: Greek (El), Spanish (Es), French (Fr), Italian (It), Portuguese (Pt), and Russian (Ru). Note that to enable the new En-X translation capabilities, we use the 'transcribe' token instead of the 'translate' token as input to the decoder, since the latter was already used for X-En translation (see the example after this list).
- For English, our models don't output punctuation and capitalization since the LRS3 English training text removed them. For En-X translation, our models output punctuation and capitalization since they were retained in the training translations.
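To make the token choice concrete, here is a small example built with the Whisper tokenizer API (assuming the bundled tokenizer matches openai-whisper's whisper.tokenizer; the actual prompt construction used at decode time is in whisper_decode_video.py):

```python
# Illustrative only: show the decoder prompt Whisper builds for a language/task
# pair. Our En-X models pair the target language token with 'transcribe',
# since the 'translate' token is reserved for X-En translation.
from whisper.tokenizer import get_tokenizer

tokenizer = get_tokenizer(multilingual=True, language="es", task="transcribe")
print(tokenizer.sot_sequence)  # (start-of-transcript, <|es|>, <|transcribe|>) token ids
```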
Mod. | Size | VoxCeleb2 | Parameters | En ASR | En-X ST | GPUs | Download Link |
---|---|---|---|---|---|---|---|
A | Large-V2 | yes | 1,550M | yes | no | 1x A6000, 48GB | noisy: whisper_en_large_vc2_noisy clean: whisper_en_large_vc2_clean |
A | Large-V2 | no | 1,550M | yes | no | 1x A6000, 48GB | whisper_en_large |
A | Large-V2 | no | 1,550M | yes | yes | 4x A6000, 48GB | whisper_en-x_large |
A | LRS2-Medium | no | 769M | yes | no | 1x A6000, 48GB | whisper_lrs2_medium |
A | Medium | no | 769M | yes | yes | 4x A5000, 24GB | whisper_en-x_medium |
A | Small | no | 244M | yes | yes | 4x A5000, 24GB | whisper_en-x_small |
Mod. | Size | VoxCeleb2 | Parameters | En ASR | En-X ST | GPUs | Download Link |
---|---|---|---|---|---|---|---|
AV | Large-V2 | yes | 2,497M | yes | no | 1x A6000, 48GB | noisy: whisper-flamingo_en_large_vc2_noisy clean: whisper-flamingo_en_large_vc2_clean |
AV | Large-V2 | no | 2,497M | yes | no | 1x A6000, 48GB | whisper-flamingo_en_large |
AV | Large-V2 | no | 2,497M | yes | yes | 4x A6000, 48GB | whisper-flamingo_en-x_large |
AV | LRS2-Medium | no | 1,390M | yes | no | 1x A6000, 48GB | whisper-flamingo_lrs2_medium |
AV | Medium | no | 1,390M | yes | yes | 4x A6000, 48GB | whisper-flamingo_en-x_medium |
AV | Small | no | 651M | yes | yes | 4x A5000, 24GB | whisper-flamingo_en-x_small |
First, download our models. For example, the audio-only Whisper model fine-tuned for En-X translation:
mkdir models
wget https://data.csail.mit.edu/public-release-sls/whisper-flamingo/models/whisper_en-x_small.pt -P models
Decode an audio-only model (see whisper_decode_video.py for argument details):
- For this model, to switch to En-X translation, change --lang to the target language and use --task En-X.
- Here we use babble noise from MuAViC at 0 SNR. Use noise/babble/lrs3/test.tsv for babble noise from LRS3. Use --noise-snr 1000 to evaluate in clean conditions.
- Here we use beam size 1. In the paper we report results with beam size 15.
- For GPUs without fp16 support, and for CPU, use --fp16 0.
python -u whisper_decode_video.py --lang en \
--model-type small \
--noise-snr 0 \
--noise-fn noise/babble/muavic/test.tsv \
--modalities asr \
--checkpoint-path models/whisper_en-x_small.pt
LRS2 ASR decoding (adjust --noise-snr as desired):
python -u whisper_decode_video.py --lang lrs2 \
--model-type medium \
--noise-snr 1000 \
--noise-fn noise/babble/lrs3/test.tsv \
--modalities asr \
--checkpoint-path models/whisper_lrs2_medium.pt
MuAViC Es decoding using our fine-tuned Whisper (you can change Es to any other supported language):
python -u whisper_decode_video.py --lang es \
--model-type small \
--noise-snr 0 \
--noise-fn noise/babble/lrs3/test.tsv \
--modalities asr \
--checkpoint-path models/whisper_multi-all_small.pt
First, download our models. For example, our audio-visual Whisper-Flamingo model fine-tuned for En-X translation. Note: the AV-HuBERT weights must also be downloaded, since Fairseq uses them to load the architecture.
mkdir models
wget https://data.csail.mit.edu/public-release-sls/whisper-flamingo/models/whisper-flamingo_en-x_small.pt -P models
wget https://data.csail.mit.edu/public-release-sls/whisper-flamingo/models/large_noise_pt_noise_ft_433h_only_weights.pt -P models
Decode an audio-visual model:
python -u whisper_decode_video.py --lang en \
--model-type small \
--noise-snr 0 \
--noise-fn noise/babble/muavic/test.tsv \
--modalities avsr \
--use_av_hubert_encoder 1 \
--av_fusion separate \
--checkpoint-path models/whisper-flamingo_en-x_small.pt \
--av-hubert-path av_hubert/avhubert/ \
--av-hubert-ckpt models/large_noise_pt_noise_ft_433h_only_weights.pt
LRS2 AVSR decoding (adjust --noise-snr as desired):
python -u whisper_decode_video.py --lang lrs2 \
--model-type medium \
--noise-snr 1000 \
--noise-fn noise/babble/lrs3/test.tsv \
--modalities avsr \
--use_av_hubert_encoder 1 \
--av_fusion separate \
--checkpoint-path models/whisper-flamingo_lrs2_medium.pt \
--av-hubert-path av_hubert/avhubert/ \
--av-hubert-ckpt models/large_noise_pt_noise_ft_433h_only_weights.pt
MuAViC Es decoding using mWhisper-Flamingo (you can change Es to any other supported language):
# NOTE: run this first to download the multilingual AV-HuBERT weights
wget https://data.csail.mit.edu/public-release-sls/mwhisper-flamingo/models/mavhubert_only_weights.pt -P models
python -u whisper_decode_video.py --lang es \
--model-type small \
--noise-snr 0 \
--noise-fn noise/babble/lrs3/test.tsv \
--modalities avsr \
--use_av_hubert_encoder 1 \
--av_fusion separate \
--checkpoint-path models/whisper-flamingo_multi-all_small.pt \
--av-hubert-path av_hubert/avhubert/ \
--av-hubert-ckpt models/mavhubert_only_weights.pt
We provide slurm/whisper_decode_wrapper.sh (En-X) and slurm/whisper_decode_multi_wrapper.sh (multilingual ASR) for submitting decoding jobs to SLURM. After submitting all jobs, e.g. source slurm/whisper_decode_wrapper.sh, use slurm/check_results.ipynb or slurm/multilingual_check_results.ipynb to print the results of all decoding runs. These notebooks load the decoding WER / BLEU scores and print them in a convenient table.
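If you want to score outputs outside the notebooks, the evaluate package installed above can compute WER directly (an illustrative snippet with hypothetical file names; the WER metric additionally requires pip install jiwer):

```python
# Illustrative WER computation with the `evaluate` package.
# Hypothetical file names; requires `pip install jiwer` for the WER metric.
import evaluate

wer_metric = evaluate.load("wer")

with open("hypotheses.txt") as f:   # one hypothesis per line
    hyps = [line.strip() for line in f]
with open("references.txt") as f:   # matching references, same order
    refs = [line.strip() for line in f]

print("WER: {:.2%}".format(wer_metric.compute(predictions=hyps, references=refs)))
```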
First, pick a config from config/audio/, for example config/audio/audio_en-x_large.yaml. Then replace noise_fn: '/data/sls/scratch/roudi/datasets/musan/tsv/all/train.tsv' with the path to your training noise. Training command:
python -u whisper_ft_muavic.py config/audio/audio_en-x_large.yaml
We also provide SLURM scripts: slurm/train_audio_4gpu.sh (En-X, multilingual models) and slurm/train_audio_1gpu.sh (En models).
It took about 2-3 days to fine-tune Whisper Large-V2 on our GPUs.
The medium and small models finish in about a day.
Once the audio model is fine-tuned, we freeze its weights and insert the gated cross-attention layers to train the audio-visual Whisper-Flamingo. Use the corresponding config in config/audio-visual/. Training command:
python -u whisper_ft_muavic_video.py config/audio-visual/av_en-x_large.yaml
We also provide SLURM scripts: slurm/train_video_4gpu.sh (En-X, multilingual models) and slurm/train_video_1gpu.sh (En models).
Training Whisper-Flamingo is faster since the cross-attention layers are the only trainable layers. It took about 1 day to train Whisper-Flamingo Large on our GPUs (not including the time to fine-tune the audio model in the first step).
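For readers curious about the architecture, below is a minimal PyTorch sketch of a Flamingo-style gated cross-attention block (tanh gates initialized to zero, in the spirit of Flamingo-pytorch); it is illustrative and not the exact layer definition used in this repo:

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Flamingo-style gated cross-attention (illustrative sketch).

    The frozen decoder hidden states attend to visual features; tanh gates
    start at zero so training begins from the original audio-only behavior.
    """

    def __init__(self, d_model: int, n_heads: int, d_ff: int = 4096):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_gate = nn.Parameter(torch.zeros(1))   # tanh(0) = 0 -> identity at init

        self.ff_norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ff_gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # x: (batch, text_len, d_model) decoder states; visual: (batch, vis_len, d_model)
        attn_out, _ = self.cross_attn(self.attn_norm(x), visual, visual)
        x = x + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ff_gate) * self.ff(self.ff_norm(x))
        return x

# Example: fuse 20 video tokens into 10 decoder states of width 1024.
block = GatedCrossAttentionBlock(d_model=1024, n_heads=16)
out = block(torch.randn(2, 10, 1024), torch.randn(2, 20, 1024))
print(out.shape)  # torch.Size([2, 10, 1024])
```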
Model weights will be saved in models/checkpoint.
Tensorboard can be opened to monitor several metrics.
cd slurm
tensorboard --logdir . --port 6008
- Training should work on 1 GPU or multiple GPUs, although some settings need to be adjusted (such as the batch size).
- The original Whisper code always pads audio to 30s. We avoid this and instead batch together samples of similar length and pad to the longest sample in the batch (this minimizes padding).
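As a rough illustration of the second point, here is a simplified length-bucketing and dynamic-padding sketch (not the dataloader used in this repo):

```python
import torch

def bucket_batches(lengths, batch_size):
    """Group sample indices so each batch contains similarly long samples."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def collate(waveforms):
    """Pad a batch only to its own longest sample instead of a fixed 30 s."""
    max_len = max(w.shape[-1] for w in waveforms)
    padded = torch.zeros(len(waveforms), max_len)
    for i, w in enumerate(waveforms):
        padded[i, : w.shape[-1]] = w
    return padded

# Example: three clips of 1 s, 1.1 s, and 3 s at 16 kHz.
clips = [torch.randn(16000), torch.randn(17600), torch.randn(48000)]
batches = bucket_batches([len(c) for c in clips], batch_size=2)
print([collate([clips[i] for i in b]).shape for b in batches])
```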
This code base builds on the following repos: Whisper Fine-Tuning Demo, Whisper, AV-HuBERT, MuAViC, ESPnet, AutoAVSR, Flamingo-pytorch, and e-mvsr.
Our work is licensed under BSD-3. However, please check the licenses of the works we build on, including AV-HuBERT.
mWhisper-Flamingo - coming soon!
@inproceedings{rouditchenko24_interspeech,
title = {Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation},
author = {Andrew Rouditchenko and Yuan Gong and Samuel Thomas and Leonid Karlinsky and Hilde Kuehne and Rogerio Feris and James Glass},
year = {2024},
booktitle = {Interspeech 2024},
pages = {2420--2424},
doi = {10.21437/Interspeech.2024-322},
issn = {2958-1796},
}