Replicate the training of CoVe embeddings using TensorFlow (the official implementation is in PyTorch).
The following will take a while, as it downloads and preprocesses MT-M and MT-L; it also downloads GloVe and the Kazuma character embeddings:

```sh
sh ./download_and_preprocess_to_tokens.sh
```
The 3 datasets are:
- MT-S: WMT'16 Multimodal Translation: Multi30k (de-en) - a corpus of 30,000 sentence pairs that briefly describe Flickr captions (generally referred to as Multi30k).
- MT-M: IWSLT'16 (de-en) - a corpus of 209,772 sentence pairs from transcribed TED presentations that cover a wide variety of topics.
- MT-L: WMT'17 (de-en) - a corpus of 7 million sentence pairs that comes from web crawl data, a news and commentary corpus, European Parliament proceedings, and European Union press releases.
Two kinds of embeddings are used here, one at the word level (GloVe) and one at the character level (Char-emb Kazuma):
- GloVe: word-level embeddings of dimension 300.
- Char-emb Kazuma: character-level embeddings (n-grams up to 4) of dimension 100.
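As a hedged illustration (not the notebooks' exact code), one way to combine the two tables is to look up a word's 300-d GloVe vector, pool the embeddings of its character n-grams (up to 4-grams) into a 100-d vector, and concatenate both into a 400-d representation. The pooling choice (mean) and the helper names below are assumptions.

```python
import numpy as np

def char_ngrams(word, max_n=4):
    """All character n-grams of a word, for n = 1..max_n."""
    return [word[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(word) - n + 1)]

def embed_word(word, glove, char_emb):
    """glove / char_emb: dicts mapping strings to 300-d / 100-d numpy vectors."""
    w = glove.get(word.lower(), np.zeros(300, dtype=np.float32))
    grams = [char_emb[g] for g in char_ngrams(word) if g in char_emb]
    c = np.mean(grams, axis=0) if grams else np.zeros(100, dtype=np.float32)
    return np.concatenate([w, c])  # 400-d word representation
```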
The MT-LSTM is a two-layer bidirectional LSTM encoder of an attentional sequence-to-sequence model trained on a machine translation task. It is trained in CoVe_training_MT_S.ipynb, CoVe_training_MT_M.ipynb and CoVe_training_MT_L.ipynb (for CoVe-S, CoVe-M and CoVe-L respectively). Each notebook preprocesses the data, builds and trains an MT-LSTM model, evaluates the translation quality on the validation and test sets, and finally shows how to compute CoVe embeddings; a minimal sketch of the encoder is given below.
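The sketch below is a minimal TensorFlow 1.x version of that encoder, assuming a hidden size of 300 per direction (as in the CoVe paper): a two-layer bidirectional LSTM run over the GloVe vectors of a sentence, whose top-layer outputs serve as the CoVe vectors. The function name and details are illustrative; the notebooks' exact code may differ.

```python
import tensorflow as tf

NUM_UNITS = 300  # assumed hidden size per direction

def mt_lstm_encoder(glove_inputs, sequence_length):
    """glove_inputs: [batch, time, 300] GloVe vectors of the source sentence."""
    outputs = glove_inputs
    states = None
    for layer in range(2):  # two stacked bidirectional LSTM layers
        with tf.variable_scope("bilstm_layer_%d" % layer):
            cell_fw = tf.nn.rnn_cell.LSTMCell(NUM_UNITS)
            cell_bw = tf.nn.rnn_cell.LSTMCell(NUM_UNITS)
            (out_fw, out_bw), states = tf.nn.bidirectional_dynamic_rnn(
                cell_fw, cell_bw, outputs,
                sequence_length=sequence_length, dtype=tf.float32)
            # Concatenate forward and backward outputs -> [batch, time, 2 * NUM_UNITS].
            outputs = tf.concat([out_fw, out_bw], axis=-1)
    # Top-layer outputs are the CoVe vectors (600-d per token with NUM_UNITS=300);
    # `states` feeds the attentional decoder during MT training.
    return outputs, states
```

After training, the same encoder is reused on its own: feeding the GloVe vectors of any sentence through it and taking `outputs` gives the CoVe embedding of that sentence.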
- Downloading and preprocessing the data can take a long time, around 30 min on average.
- Running one epoch on a recent MacBook Pro:
  - CoVe-S takes on average 1 min
  - CoVe-M takes on average 5 min
  - CoVe-L takes on average 30 min
- Commands from the CoVe GitHub repository, which explains how to download the data and preprocess it into tokenized files: https://github.com/salesforce/cove
- Official Neural Machine Translation (NMT) TensorFlow tutorial repository, which explains how to use tf.contrib.seq2seq for NMT (see the decoder sketch after this list): https://github.com/tensorflow/nmt
- Official PyTorch implementation of CoVe from the authors of the paper: https://github.com/salesforce/cove
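As a complement to the tf.contrib.seq2seq reference above, here is a hedged TensorFlow 1.x sketch of an attentional decoder in the style of the NMT tutorial; the function name, hyper-parameters and initial-state handling are assumptions, and the notebooks may differ.

```python
import tensorflow as tf

def attentional_decoder(encoder_outputs, source_lengths,
                        decoder_inputs, target_lengths, vocab_size,
                        num_units=300):
    """decoder_inputs: [batch, time, emb] embedded target tokens (teacher forcing)."""
    # Luong attention over the MT-LSTM encoder outputs.
    attention = tf.contrib.seq2seq.LuongAttention(
        num_units, memory=encoder_outputs,
        memory_sequence_length=source_lengths)
    cell = tf.nn.rnn_cell.LSTMCell(num_units)
    cell = tf.contrib.seq2seq.AttentionWrapper(
        cell, attention, attention_layer_size=num_units)
    helper = tf.contrib.seq2seq.TrainingHelper(decoder_inputs, target_lengths)
    projection = tf.layers.Dense(vocab_size, use_bias=False)
    # Zero initial state for brevity; the NMT tutorial clones it from the
    # encoder's final state instead.
    initial_state = cell.zero_state(tf.shape(decoder_inputs)[0], tf.float32)
    decoder = tf.contrib.seq2seq.BasicDecoder(
        cell, helper, initial_state, output_layer=projection)
    outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder)
    return outputs.rnn_output  # logits over the target vocabulary
```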