Reproducibility issue when training on a smaller dataset and fewer GPUs #3
Comments
Your evaluate script looks legit to me; this is so weird. Could you provide more details, like the training loss and ppl curves? They can be drawn with the script provided in the repo.
Hi @STayinloves: Here is the result after I executed the script you provided. Besides, I am not using Jupyter, so I adapted it accordingly. I also uploaded train.log. Thank you again!
You might want to see if
I got a zero; here is the result:
Your
That's weird, since I downloaded them from WMT and made sure the files aren't wrong.
I think the test examples are fine... Thank you for your response.
Update: I re-ran the preprocessing and was able to create 1996 sentences instead of the 15 examples you mentioned above. My preprocess.log
It seems great. However, after 1 epoch of training I still got 0.15. Since there is a huge difference between 20 and 0.15, I just want to know whether I did something wrong, or whether I should be patient and just wait for the result. I uploaded the train.log here; sorry for my lack of experience.
I would say just wait for one or two more epochs; the model changes dramatically during the first few updates, especially under the warmup scheduler. You can check the loss as an indicator. I worked on this repo one year ago, and I don't quite remember whether it differs across runs or seeds, but I did notice it reaches nearly its performance upper bound within the first few epochs. There's nothing wrong with a lack of experience :)
After 200,000 updates it is still 0.12, so I guess something went wrong. But still, thank you for your response.
You can try the interactive command (fairseq-interactive) to check some model output manually; a smaller dataset is also a good starting point.
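For anyone following along, a minimal sketch of such a manual check, assuming the en-zh data-bin directory and a checkpoint path like the ones mentioned elsewhere in this thread (both placeholders, not the exact paths from this repo):

```
# Placeholder paths; point these at your own binarized data and checkpoint.
# Note: the input must be tokenized/BPE-encoded the same way as the training data.
echo "your bpe-encoded source sentence here" | \
  fairseq-interactive data-bin/wmt17_en_zh \
    --path checkpoints/checkpoint_best.pt \
    --source-lang en --target-lang zh \
    --beam 5 --remove-bpe
```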
After changing to a smaller dataset (training-parallel-nc-v12.tgz), it's still the same result, so I guess something went wrong in the preprocessing step, and I still cannot replicate the result. Is there anything I need to do before executing those scripts?
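One quick sanity check before running the preprocessing scripts (the file names below are assumptions based on the usual en-zh naming, not this repo's exact layout): the source and target sides of each split should have matching line counts.

```
# Hypothetical file names; substitute whatever your preprocessing actually produces.
wc -l train.en train.zh valid.en valid.zh test.en test.zh
```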
I just noticed a few facts that I was unaware of in our previous discussion.
Unfortunately, I don't currently have the resources to train a model on the full dataset, but based on the observations from my little experiment, I hope this helps!
Update on my experiment yesterday: I tried to train the model on
It helps a lot!! I've tried
Adding to the discussion about different batch sizes: according to the results in Popel and Bojar, “Training Tips for the Transformer Model,” Figures 5 and 6, a small batch size can lead to training failure when training the big model.
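For readers tuning this themselves: the effective batch size in fairseq is roughly --max-tokens × number of GPUs × --update-freq, so a single GPU with --max-tokens 4096 and --update-freq 8 sees about the same ~32k tokens per update as 8 GPUs with --update-freq 1. These numbers are purely illustrative, not this repo's settings.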
@STayinloves
@sanxing-chen Hi, can you please guide me on the full dataset (~20M examples) and where I can get it? Thanks.
Hi @afaq-ahmad: after half a year of research and trial and error, I think if you have ~20M examples, then training a regular Transformer is totally fine.
Thanks a lot. I have 24 million sentences, but when I train with the example here, it takes 12 hours for 1 epoch with only a 0.2-point BLEU increase. It looks like it will take 30 days of training to reach around 20 BLEU. Do you have any idea how I can speed up the procedure? I am using these parameters: !CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/wmt17_en_zh
You can leverage
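The concrete suggestion above was cut off in this copy of the thread. As a hedged sketch only: two standard fairseq-train options that usually help throughput are mixed precision (--fp16) and gradient accumulation (--update-freq), plus more GPUs via CUDA_VISIBLE_DEVICES if available. The paths and values below are placeholders, not this repo's recipe:

```
# Placeholder values; tune --max-tokens to your GPU memory and --update-freq to the
# effective batch size you want. data-bin/wmt17_en_zh is taken from the command above.
CUDA_VISIBLE_DEVICES=0,1 fairseq-train data-bin/wmt17_en_zh \
    --arch transformer --share-decoder-input-output-embed \
    --optimizer adam --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --max-tokens 4096 --update-freq 8 --fp16
```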
I only have 1.05M sentences. How should I adjust the batch size or other parameters to achieve good results? The following are my training parameters and BLEU values:
BLEU = 21.13, 55.6/27.2/15.2/9.0 (BP=0.992, ratio=0.992, hyp_len=549536, ref_len=553932)
Hi @sunyi1123, you can play around
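The specific knobs were truncated above. As an assumption-laden sketch for a ~1M-sentence corpus, stronger regularization is usually the first thing to try; the flag values and data path below are illustrative, not the maintainer's recommendation:

```
# Illustrative values only; sweep these rather than copying them verbatim.
fairseq-train data-bin/my_small_corpus \
    --arch transformer \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --dropout 0.3 --weight-decay 0.0001 \
    --optimizer adam --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --max-tokens 4096 --update-freq 2
```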
Hi:
Just want to know how to replicate the result you mentioned in the README,
"The model reaches 20 BLEU on the testing dataset after training for only 2 epochs."
I simply used your setup to train my model; however, after 3 epochs, I got
My generate script is
and the training data I used are:
Thank you!
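Since the actual generate script and data list did not survive in this copy of the issue, here is a minimal sketch of a typical fairseq evaluation run, assuming the en-zh data-bin directory and checkpoint names used elsewhere in the thread (both placeholders):

```
# Placeholder paths; not the exact script from this issue.
fairseq-generate data-bin/wmt17_en_zh \
    --path checkpoints/checkpoint_best.pt \
    --source-lang en --target-lang zh \
    --batch-size 64 --beam 5 --remove-bpe
```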