Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Now fluentspeech.py also manages validation set. #271

Merged
merged 1 commit into from
Jun 16, 2022
Merged

Now fluentspeech.py also manages validation set. #271

merged 1 commit into from
Jun 16, 2022

Conversation

umbertocappellazzo
Copy link

Hi, I modified the fluentspeech.py file such that it can deal with the validation set as well (it was missing).
Cheers

@TLESORT
Copy link
Collaborator

TLESORT commented Jun 14, 2022

Hi Umberto,
Thanks for your PR. :)

The thing is that in CL, you are not supposed to have a valid set in the beginning. You are supposed to produce it at each task from the training data.
I would prefer that you adapt split_train_val(dataset, split_val = 0.1) to your use case or create a new function compatible with your task type to realize the same function.

@umbertocappellazzo
Copy link
Author

Hmm, I see. FluentSpeech Commands dataset provides the valid_data.csv file together w/ the train and test set, so I can't extract a portion of the train set because the valid already exists. Well, I'll see how to pull it off anyways 👍 Thank you Timothée :)

@TLESORT
Copy link
Collaborator

TLESORT commented Jun 14, 2022

Is it possible to merge train and valid to create train data?
If there is no way out, we can maybe keep the "valid" option you propose but this is not a recommended practice... :)

@umbertocappellazzo
Copy link
Author

I think, by and large, that it is possible, yet if we merge the train and valid, and then we split again, the original validation samples will not be the same as the new ones brought by split_train_val . In fact the validation set includes utterances from a certain number of speakers, and these speakers are different from the train set (and from the test set), and each speaker's utterances are kept into a specific subfolder. Therefore I think it would not be possible :/

@arthurdouillard
Copy link
Collaborator

@TLESORT I think we can allow having a define val dataset, I think a few others datasets in Continuum proposes that. It can be useful when everyone compare on the same val set.

@TLESORT
Copy link
Collaborator

TLESORT commented Jun 15, 2022

Yes, we can potentially accept it.

But first, having the same valid set is not supposed to be a requirement. The mandatory thing is to have the same test set.
Second, if the valid set is designed with additional knowledge, it is considered supplementary supervision. In particular, if the valid set is designed with the knowledge of the whole dataset, it could mean there is some bias toward the future ...

I also think we have already some datasets with valid sets, but it is against my will ;D :P

So, we can merge this, but we need to keep in mind that some particularities in the scenario make it not perfectly rigorous...

@umbertocappellazzo
Copy link
Author

Well, the split into train, test and valid has been made by the authors who created the corpus and I don't know whether they crafted then different sets. Since I'm the first to use FSC in a CL scenario, I think it could be ok to proceed in this way, and I understand your rigorousness for this matter. So, you have the last word about this.
I take advantage of this thread for asking one thing: does Continuum handle the case of unbalanced classes for rehearsal? I had a look at the I suppose not, but I wanna be sure. If the dataset contains unbalanced classes, it's not fair to keep the same # of samples for each class. If continnum doesn't cover this case, I can come up with a solution for my project and then I can make a PR (if you think this is worth it).

@arthurdouillard
Copy link
Collaborator

@umbertocappellazzo I'm merging this PR, and I create an issue to discuss what you're just talked about as it's quite different matter (#272).

@arthurdouillard arthurdouillard merged commit 9c0458c into Continvvm:master Jun 16, 2022
@arthurdouillard
Copy link
Collaborator

@umbertocappellazzo note that it has been deployed in the version 1.2.4. :)

Thanks for your contrib!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants