Research Analysis: Sophia Paper's Training Strategy
Architecture
Model: Autoregressive language models trained on OpenWebText
Context length: 1024
Model type: Decoder-only Transformers
Model sizes: 125M (small), 355M (medium), and 770M (large)
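A minimal sketch of the three model sizes, assuming the standard GPT-2 small/medium/large hyperparameters (layer count, heads, hidden size); the parameter counts are approximate and the exact configurations used in the paper are not reproduced here.

```python
# Sketch of the three GPT-2 configurations assumed above (standard GPT-2
# small/medium/large hyperparameters; parameter counts are approximate).
from transformers import GPT2Config, GPT2LMHeadModel

CONFIGS = {
    "small":  GPT2Config(n_layer=12, n_head=12, n_embd=768,  n_positions=1024),  # ~125M params
    "medium": GPT2Config(n_layer=24, n_head=16, n_embd=1024, n_positions=1024),  # ~355M params
    "large":  GPT2Config(n_layer=36, n_head=20, n_embd=1280, n_positions=1024),  # ~770M params
}

model = GPT2LMHeadModel(CONFIGS["small"])
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```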
Datasets
OpenWebText (Gokaslan & Cohen, 2019)
Baselines
Adam with decoupled weight decay (AdamW) (Loshchilov & Hutter, 2017)
Lion (Chen et al., 2023)
Algorithmic Pseudocode
Initialize the GPT-2 model at the desired size (small, medium, or large).
Load the OpenWebText dataset.
Set the context length to 1024.
Set the batch size to 480.
Use a cosine learning rate schedule with the final learning rate equal to 0.05 times the peak learning rate.
Apply gradient clipping with a threshold of 1.0.
Use a fixed 2k-step learning rate warm-up.
Train the model with the Sophia optimizer, using the chosen Hessian estimator (Sophia-H, Hutchinson; or Sophia-G, Gauss-Newton-Bartlett) and its hyperparameters.
Train the model for 100K, 200K, or 400K steps.
Evaluate the model using log perplexity on OpenWebText and in-context learning results on SuperGLUE.
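The schedule and clipping settings above can be sketched as follows: 2k warm-up steps, cosine decay to 0.05× the peak learning rate, and gradient clipping at 1.0. The peak learning rate (6e-4) and the AdamW stand-in for Sophia are placeholders, since the section does not fix those values.

```python
# A minimal sketch of the learning-rate schedule described above: 2k warm-up
# steps, then cosine decay to 0.05x the peak learning rate, with gradient
# clipping at 1.0. The peak learning rate (6e-4) and the AdamW stand-in for
# Sophia are placeholders, not values taken from the paper.
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

def cosine_with_warmup(warmup_steps, total_steps, final_ratio=0.05):
    """LambdaLR multiplier: linear warm-up, then cosine decay to final_ratio * peak."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return final_ratio + (1.0 - final_ratio) * 0.5 * (1.0 + math.cos(math.pi * progress))
    return lr_lambda

model = torch.nn.Linear(8, 8)  # stand-in for the GPT-2 model
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)  # replace with Sophia-H / Sophia-G
scheduler = LambdaLR(optimizer, cosine_with_warmup(warmup_steps=2_000, total_steps=100_000))

# In the training loop, clip gradients before each optimizer step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# optimizer.step(); scheduler.step(); optimizer.zero_grad()
```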
Training Code with Hugging Face Transformers API
High-Level Architecture
Load the OpenWebText dataset from Hugging Face Datasets.
Preprocess the dataset:
Tokenize the text with the GPT-2 tokenizer.
Group the tokenized text into chunks of a specified sequence length.
Save the preprocessed dataset.
Algorithmic Pseudocode: Data Preprocessing
Load the OpenWebText dataset.
Initialize the tokenizer.
Define a tokenize function that tokenizes the text and adds an end-of-sequence token.
Apply the tokenize function to the dataset using the map function.
Define a group_texts function that concatenates all texts and splits them into chunks of the specified sequence length.
Apply the group_texts function to the tokenized dataset using the map function.
Save the preprocessed dataset.
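A minimal sketch of this preprocessing pipeline using Hugging Face datasets and the GPT-2 tokenizer; the dataset identifier ("openwebtext") and the output path are assumptions.

```python
# A minimal sketch of the preprocessing pseudocode above. The dataset
# identifier ("openwebtext"), the GPT-2 tokenizer, and the output path are
# assumptions.
from datasets import load_dataset
from transformers import AutoTokenizer

SEQ_LEN = 1024

raw = load_dataset("openwebtext", split="train")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize(batch):
    # Append the end-of-sequence token so documents stay separated after packing.
    return tokenizer([text + tokenizer.eos_token for text in batch["text"]])

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

def group_texts(batch):
    # Concatenate all token ids, then split into fixed-length chunks of SEQ_LEN.
    concatenated = {k: sum(batch[k], []) for k in batch}
    total_len = (len(concatenated["input_ids"]) // SEQ_LEN) * SEQ_LEN
    return {
        k: [v[i : i + SEQ_LEN] for i in range(0, total_len, SEQ_LEN)]
        for k, v in concatenated.items()
    }

packed = tokenized.map(group_texts, batched=True)
packed.save_to_disk("openwebtext_gpt2_1024")
```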
Algorithmic Pseudocode: Training
Load the preprocessed OpenWebText dataset (tokenized and grouped into fixed-length chunks, as in the preprocessing pseudocode above).
Initialize the GPT-2 model and tokenizer.
Set up the training arguments.
Create the Trainer with the model, training arguments, and preprocessed dataset.
Train the model using the DecoupledSophia optimizer with the chosen Hessian estimator and hyperparameters.
Evaluate the model using log perplexity on OpenWebText and in-context learning results on SuperGLUE.
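A minimal sketch of wiring a Sophia-style optimizer into the Hugging Face Trainer. DecoupledSophia is not part of the Transformers library, so AdamW is used as a runnable stand-in; the batch-size split (12 × 40 accumulation steps = 480) and the remaining hyperparameter values are illustrative assumptions. Note that the built-in cosine schedule decays to zero rather than to 0.05× the peak learning rate, a small deviation from the recipe above.

```python
# Sketch of training with the Hugging Face Trainer and a custom optimizer.
# `DecoupledSophia` is not shipped with Transformers; AdamW is a runnable
# stand-in, and the batch-size split and hyperparameters are assumptions.
import torch
from datasets import load_from_disk
from transformers import (
    AutoTokenizer, GPT2Config, GPT2LMHeadModel,
    Trainer, TrainingArguments, DataCollatorForLanguageModeling,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = GPT2LMHeadModel(GPT2Config())      # GPT-2 small configuration

train_dataset = load_from_disk("openwebtext_gpt2_1024")  # output of the preprocessing sketch

# optimizer = DecoupledSophia(model.parameters(), ...)      # assumed Sophia API
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)  # runnable stand-in

args = TrainingArguments(
    output_dir="sophia-gpt2-small",
    max_steps=100_000,
    per_device_train_batch_size=12,   # 12 x 40 accumulation steps = 480 total batch size
    gradient_accumulation_steps=40,
    warmup_steps=2_000,
    lr_scheduler_type="cosine",       # decays to 0, not to 0.05x peak (see note above)
    max_grad_norm=1.0,                # gradient clipping threshold
    logging_steps=100,
    save_steps=10_000,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    optimizers=(optimizer, None),     # Trainer builds the schedule around the given optimizer
)
trainer.train()
```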