Quality #1

Closed
abylouw opened this issue Apr 23, 2024 · 14 comments
Comments

@abylouw

abylouw commented Apr 23, 2024

Hi,

Thank you for creating this repo and implementing the architecture in the paper. I have been looking at the paper and was going to try an implementation.

Do you have any preliminary results available? Do you think it is better than, for example, iSTFTNet or MB-MelGAN?

@wetdog
Owner

wetdog commented Apr 24, 2024

I haven't compared it with iSTFTNet or MB-MelGAN yet, but I'll try to listen to those models on the same sample. Here is a sample from a first model.
image
src: https://drive.google.com/file/d/1awn-oHt-wycZFyB7c_tQtC8hIewbrwR1/view?usp=drive_link
wavenext: https://drive.google.com/file/d/1jUDebB0oxuzo7VMu8pQEII3UXdtv0kLs/view?usp=drive_link

I'll probably try increasing the alpha (weight) of the MRD loss to 1.0, as suggested in gemelo-ai/vocos#48.
I'll also post the weights when the training runs finish.
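
For readers unfamiliar with the "alpha" mentioned above: in GAN vocoder training the generator objective is usually a weighted sum of several terms, and the MRD weight scales the terms coming from the multi-resolution discriminator. The sketch below is only an illustration of that weighting. The function name, argument names, and default values are assumptions for this example and are not taken from this repo's code.

```python
def generator_loss(adv_mpd, adv_mrd, fm_mpd, fm_mrd, mel,
                   mrd_coeff=0.1, mel_coeff=45.0):
    """Weighted sum of generator loss terms (all names are hypothetical).

    adv_*: adversarial losses from the multi-period (mpd) and
           multi-resolution (mrd) discriminators
    fm_*:  feature-matching losses from the same discriminators
    mel:   mel-spectrogram reconstruction loss
    """
    return (adv_mpd + mrd_coeff * adv_mrd
            + fm_mpd + mrd_coeff * fm_mrd
            + mel_coeff * mel)


# Raising mrd_coeff from 0.1 to 1.0 makes the MRD terms count ten times more.
low = generator_loss(1.0, 2.0, 3.0, 4.0, 0.5, mrd_coeff=0.1)
high = generator_loss(1.0, 2.0, 3.0, 4.0, 0.5, mrd_coeff=1.0)
```

With a larger coefficient, gradients from the multi-resolution discriminator have more influence on the generator relative to the mel reconstruction term.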

@egorsmkv

@abylouw It would be interesting to compare outputs for Ukrainian as well.

I have a RAD-TTS model with these vocoders:

@patriotyk

I am training your vocos-matcha with the MRD loss weight set to 1.0 at 44100 Hz, so it is very slow. After almost 1M iterations, it still sounds slightly worse than your https://huggingface.co/BSC-LT/vocos-mel-22khz, and the metrics are still worse.
The following logs cover about 3 weeks of continuous training on 2 RTX 3090 GPUs with a dataset of about 800 hours:
Screenshot 2024-05-02 at 15 55 22

@mush42

mush42 commented Jun 6, 2024

Hi @wetdog

What's the status of this implementation in terms of quality and speed?
Do you have pre-trained weights available?

I've great expectations for this repo 🙂
Best of luck!

@wetdog
Owner

wetdog commented Jun 7, 2024

Hi @mush42, I finished training the mel version this week. In terms of quality, it achieves better periodicity, pesq_score, and pitch_loss than vocos trained on the same datasets. You can find the weights here: https://huggingface.co/BSC-LT/wavenext-mel

image

I also fixed some things in the EnCodec experiment this week, and it is now training. For these training runs I used mel features compatible with HiFi-GAN, but it is probably worth training a 24 kHz version using the same features as the original vocos. Let me know if you have any questions.

@wetdog
Owner

wetdog commented Jun 7, 2024

@egorsmkv Great work! I will probably use your versions to run some metrics and compare the quality of those vocoders.

@patriotyk

@wetdog I have added your pretrained wavenext model to my Hugging Face app that runs a pflowtts model. Unfortunately, it does not sound very good. There are 4 vocoders that all generate waveforms from the same mel spectrogram produced by pflowtts, and wavenext sounds similar to hifigan but slightly worse. There is also a 44100 Hz vocos vocoder trained with your implementation, and it sounds the best. You can check it here: https://huggingface.co/spaces/patriotyk/pflowtts_ukr_demo

@wetdog
Owner

wetdog commented Jun 7, 2024

@patriotyk Thanks for the quick implementation. Do you think this could be due to the dataset it was trained on? I used LibriTTS for this run, but I would like to try a version with Common Phone (https://arxiv.org/abs/2201.05912) to make it more "universal".

@mush42

mush42 commented Jun 7, 2024

Hi

Heavy TTS user here.
I don't agree with @patriotyk on this.
My initial testing shows that wavenext is significantly better than vocos, both in inference speed and synthesis quality.

Specifically, there is an audible hissing noise in the audio vocoded by vocos, probably an ISTFT artifact.

Here's a sample from an unseen speaker, where Matcha TTS is used to generate the mel spectrogram.
vocos-vs-wavenext.zip

Best
Musharraf

@patriotyk

@wetdog I don't know, but it seems so. I will try a pflowtts model trained on LibriTTS and we will see.

@mush42 Which dataset was your Matcha TTS trained on? Also, your vocos sample is really bad. Which pretrained model did you use here? In my app, 'BSC-LT/vocos-mel-22khz' sounds much better.

@mush42

mush42 commented Jun 7, 2024

@patriotyk
Matcha was trained on the Hi-Fi-CAPTAIN US English female dataset.
I'm using an ONNX model converted from this model, with a custom ISTFT implementation that uses a CNN (so that it is ONNX exportable).
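
The custom ISTFT mentioned above is not shown in this thread. As a rough illustration of the general idea only, the pure-Python sketch below expresses an inverse STFT as a fixed set of cosine/sine synthesis kernels applied frame by frame with a stride equal to the hop size, followed by overlap-add. That is the same arithmetic a transposed 1-D convolution with frozen weights performs, which is why it can be exported without a native ISTFT operator. All function names here are hypothetical, `n_fft` is assumed to be even, and this is not mush42's code.

```python
import math


def istft_basis(n_fft, window):
    """Fixed synthesis kernels, one row per real and per imaginary bin."""
    real_rows, imag_rows = [], []
    for k in range(n_fft // 2 + 1):
        # DC and Nyquist occur once in the full spectrum, other bins twice.
        scale = 1.0 if k in (0, n_fft // 2) else 2.0
        real_rows.append([scale * math.cos(2 * math.pi * k * n / n_fft)
                          * window[n] / n_fft for n in range(n_fft)])
        imag_rows.append([-scale * math.sin(2 * math.pi * k * n / n_fft)
                          * window[n] / n_fft for n in range(n_fft)])
    return real_rows, imag_rows


def conv_istft(real, imag, n_fft, hop, window):
    """Inverse STFT as 'kernel times frame' plus overlap-add.

    real, imag: lists of frames, each holding n_fft // 2 + 1 values.
    Equivalent in effect to a transposed convolution with stride=hop
    whose weights are the rows returned by istft_basis.
    """
    real_rows, imag_rows = istft_basis(n_fft, window)
    length = n_fft + hop * (len(real) - 1)
    out = [0.0] * length
    envelope = [0.0] * length  # summed squared window, for normalization
    for t in range(len(real)):
        for n in range(n_fft):
            acc = 0.0
            for k in range(n_fft // 2 + 1):
                acc += real[t][k] * real_rows[k][n] + imag[t][k] * imag_rows[k][n]
            out[t * hop + n] += acc
            envelope[t * hop + n] += window[n] ** 2
    # Samples where the window envelope is zero cannot be recovered.
    return [o / e if e > 1e-8 else 0.0 for o, e in zip(out, envelope)]
```

In a real model the kernels would be stored as non-trainable convolution weights, and the normalization envelope would be precomputed for a fixed number of frames or handled with padding.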

@fd873630

fd873630 commented Jun 26, 2024

Hi! @wetdog

Could you please share the .ckpt checkpoint file in addition to the .bin file you provided?

I would like to fine-tune the model, but the .bin checkpoint contains only the generator.
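
Some background on why the two files differ: a Lightning-style .ckpt typically stores a `state_dict` with every module (generator and discriminators) along with optimizer state, while an inference-only .bin usually keeps just the generator weights. The sketch below shows the kind of key filtering that produces the smaller file. The module prefixes are assumptions for illustration and the toy checkpoint uses strings in place of tensors, so inspect the real key names before relying on this.

```python
# Assumed generator module prefixes; the real checkpoint may name them differently.
GENERATOR_PREFIXES = ("feature_extractor.", "backbone.", "head.")


def split_checkpoint(checkpoint, prefixes=GENERATOR_PREFIXES):
    """Split a Lightning-style checkpoint dict into generator and other weights."""
    state = checkpoint["state_dict"]
    generator = {k: v for k, v in state.items() if k.startswith(prefixes)}
    rest = {k: v for k, v in state.items() if k not in generator}
    return generator, rest


# Toy stand-in for a loaded checkpoint (strings instead of tensors).
toy_ckpt = {
    "state_dict": {
        "backbone.embed.weight": "w0",
        "head.out.weight": "w1",
        "multiperioddisc.discriminators.0.weight": "w2",
    },
    "optimizer_states": ["opt_g", "opt_d"],
}
gen, rest = split_checkpoint(toy_ckpt)
```

Fine-tuning a GAN vocoder generally needs the `rest` part as well, since the discriminators would otherwise restart from scratch against an already trained generator.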

@wetdog
Owner

wetdog commented Jun 28, 2024

@fd873630 I just uploaded the ckpt. You can find it here: https://huggingface.co/BSC-LT/wavenext-mel/blob/main/wavenext_2M_libritt_r.ckpt

abylouw closed this as completed Jul 5, 2024

@mush42

mush42 commented Jul 5, 2024

@wetdog
Thanks for open-sourcing your work, guys.
Really appreciate it.

Best
Musharraf
