Reproducibility issue when training on a smaller dataset and fewer GPUs #3
Comments
Your evaluate script looks legit to me; this is so weird. Could you provide more details, like the training loss and ppl curves? They can be drawn with the script provided in the repo.
Hi @STayinloves: Here is the result after I executed the script you provided. Besides, I am not using Jupyter, so I adapted it accordingly. I also uploaded train.log. Thank you again!
You might want to see if
I got a zero; here is the result:
Your
That's weird, since I downloaded them from WMT and made sure the files aren't wrong.
I think the test examples are fine... Thank you for your response.
Update: I re-ran the preprocessing and was able to create 1996 sentences instead of the 15 examples you mentioned above. My preprocess.log
It seems great. However, after 1 epoch of training I still got 0.15. Since there is a huge difference between 20 and 0.15, I just want to know whether I did something wrong, or whether I should be patient and just wait for the result. I uploaded the train.log here; sorry for my lack of experience.
I would say just wait for one or two more epochs; the model changes dramatically during the first few updates, especially under the warmup scheduler. You can check the loss as an indicator. I worked on this repo one year ago, and I don't quite remember whether it differs across runs or seeds, but I did notice it reaches nearly its performance upper bound within the first few epochs. There's nothing wrong with a lack of experience :)
After 200,000 updates it is still 0.12, so I guess something went wrong. But still, thank you for your response.
You can try the interactive command (fairseq-interactive) to check some model output manually; a smaller dataset is also a good starting point.
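For anyone following along, a minimal sketch of such a manual check, assuming the en-zh data-bin directory and a checkpoint path like the ones mentioned elsewhere in this thread (both placeholders, not the exact paths from this repo):

```
# Placeholder paths; point these at your own binarized data and checkpoint.
# Note: the input must be tokenized/BPE-encoded the same way as the training data.
echo "your bpe-encoded source sentence here" | \
  fairseq-interactive data-bin/wmt17_en_zh \
    --path checkpoints/checkpoint_best.pt \
    --source-lang en --target-lang zh \
    --beam 5 --remove-bpe
```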
After changing to a smaller dataset (training-parallel-nc-v12.tgz), it's still the same result, so I guess something went wrong in the preprocessing step, and I still cannot replicate the result. Is there anything I need to do before executing those scripts?
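One quick sanity check before running the preprocessing scripts (the file names below are assumptions based on the usual en-zh naming, not this repo's exact layout): the source and target sides of each split should have matching line counts.

```
# Hypothetical file names; substitute whatever your preprocessing actually produces.
wc -l train.en train.zh valid.en valid.zh test.en test.zh
```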
I just noticed a few facts that I was unaware of in our previous discussion.
Unfortunately, I don't currently have the resources to train a model on the full dataset, but based on the observations from my little experiment, I hope this helps!
Update on my experiment yesterday: I tried to train the model on
It helps a lot!! I've tried
Adding to the discussion about different batch sizes: according to the results in Popel and Bojar, “Training Tips for the Transformer Model,” Figures 5 and 6, a small batch size can lead to training failure when training the big model.
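For readers tuning this themselves: the effective batch size in fairseq is roughly --max-tokens × number of GPUs × --update-freq, so a single GPU with --max-tokens 4096 and --update-freq 8 sees about the same ~32k tokens per update as 8 GPUs with --update-freq 1. These numbers are purely illustrative, not this repo's settings.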
@STayinloves
@sanxing-chen Hi, can you please guide me on the full dataset (~20M examples) and where I can get it? Thanks.
Hi @afaq-ahmad: after half a year of research and trial and error, I think if you have ~20M examples, then training a regular Transformer is totally fine.
Thanks a lot. I have 24 million sentences, but when I train with the example here, it takes 12 hours for 1 epoch with only a 0.2-point BLEU increase. It looks like it will take 30 days of training to reach around 20 BLEU. Do you have any idea how I can speed up the procedure? I am using these parameters: !CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/wmt17_en_zh
You can leverage
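The concrete suggestion above was cut off in this copy of the thread. As a hedged sketch only: two standard fairseq-train options that usually help throughput are mixed precision (--fp16) and gradient accumulation (--update-freq), plus more GPUs via CUDA_VISIBLE_DEVICES if available. The paths and values below are placeholders, not this repo's recipe:

```
# Placeholder values; tune --max-tokens to your GPU memory and --update-freq to the
# effective batch size you want. data-bin/wmt17_en_zh is taken from the command above.
CUDA_VISIBLE_DEVICES=0,1 fairseq-train data-bin/wmt17_en_zh \
    --arch transformer --share-decoder-input-output-embed \
    --optimizer adam --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --max-tokens 4096 --update-freq 8 --fp16
```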
I only have 1.05M sentences. How should I adjust the batch size or other parameters to achieve good results? The following are my training parameters and BLEU values:
BLEU = 21.13, 55.6/27.2/15.2/9.0 (BP=0.992, ratio=0.992, hyp_len=549536, ref_len=553932)
Hi @sunyi1123, you can play around
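The specific knobs were truncated above. As an assumption-laden sketch for a ~1M-sentence corpus, stronger regularization is usually the first thing to try; the flag values and data path below are illustrative, not the maintainer's recommendation:

```
# Illustrative values only; sweep these rather than copying them verbatim.
fairseq-train data-bin/my_small_corpus \
    --arch transformer \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --dropout 0.3 --weight-decay 0.0001 \
    --optimizer adam --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --max-tokens 4096 --update-freq 2
```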
Hi:
Just want to know how to replicate the result you mentioned in the README,
"The model reaches 20 BLEU on the testing dataset after training for only 2 epochs."
I simply used your setup to train my model; however, after 3 epochs, I got
My generate script is
and the training data I used are:
Thank you!
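Since the actual generate script and data list did not survive in this copy of the issue, here is a minimal sketch of a typical fairseq evaluation run, assuming the en-zh data-bin directory and checkpoint names used elsewhere in the thread (both placeholders):

```
# Placeholder paths; not the exact script from this issue.
fairseq-generate data-bin/wmt17_en_zh \
    --path checkpoints/checkpoint_best.pt \
    --source-lang en --target-lang zh \
    --batch-size 64 --beam 5 --remove-bpe
```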