-
awesome-open-source-lms - allenai
· (docs.google)
-
Grokking at the Edge of Numerical Stability,
arXiv, 2501.04697
, arxiv, pdf, cication: -1
Lucas Prieto, Melih Barsbey, Pedro A. M. Mediano, ..., Tolga Birdal · (grokking-at-the-edge-of-numerical-stability. - LucasPrietoAl)
-
Scaling Laws for Floating Point Quantization Training,
arXiv, 2501.02423
, arxiv, pdf, cication: -1
Xingwu Sun, Shuaipeng Li, Ruobing Xie, ..., Di Wang, Jie Jiang
-
360-LLaMA-Factory - Qihoo360
-
Metadata Conditioning Accelerates Language Model Pre-training,
arXiv, 2501.01956
, arxiv, pdf, cication: -1
Tianyu Gao, Alexander Wettig, Luxi He, ..., Sadhika Malladi, Danqi Chen · (MeCo - princeton-pli)
-
🌟 picotron - huggingface
The minimalist & most-hackable repository for pre-training Llama-like models with 4D Parallelism · (youtube)
-
Multi-matrix Factorization Attention,
arXiv, 2412.19255
, arxiv, pdf, cication: -1
Jingcheng Hu, Houyi Li, Yinmin Zhang, ..., Xiangyu Zhang, Heung-Yeung Shum
-
Rate of Model Collapse in Recursive Training,
arXiv, 2412.17646
, arxiv, pdf, cication: -1
Ananda Theertha Suresh, Andrew Thangaraj, Aditya Nanda Kishore Khandavally
-
Establishing Task Scaling Laws via Compute-Efficient Model Ladders,
arXiv, 2412.04403
, arxiv, pdf, cication: -1
Akshita Bhagia, Jiacheng Liu, Alexander Wettig, ..., Jesse Dodge, Hannaneh Hajishirzi · (𝕏)
-
Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning,
arXiv, 2411.05193
, arxiv, pdf, cication: -1
Joey Hong, Anca Dragan, Sergey Levine · (𝕏)
-
🌟 Predicting Emergent Capabilities by Finetuning,
arXiv, 2411.16035
, arxiv, pdf, cication: -1
Charlie Snell, Eric Wallace, Dan Klein, ..., Sergey Levine · (𝕏)
-
🌟 Observational Scaling Laws and the Predictability of Language Model Performance,
arXiv, 2405.10938
, arxiv, pdf, cication: -1
Yangjun Ruan, Chris J. Maddison, Tatsunori Hashimoto
-
Balancing Pipeline Parallelism with Vocabulary Parallelism,
arXiv, 2411.05288
, arxiv, pdf, cication: -1
Man Tsung Yeung, Penghui Qi, Min Lin, ..., Xinyi Wan · (VocabularyParallelism - sail-sg)
-
Learning to (Learn at Test Time): RNNs with Expressive Hidden States,
arXiv, 2407.04620
, arxiv, pdf, cication: 19
Yu Sun, Xinhao Li, Karan Dalal, ..., Tatsunori Hashimoto, Carlos Guestrin · (yueatsprograms.github) · (ttt-lm-pytorch - test-time-training)
-
What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective,
arXiv, 2410.23743
, arxiv, pdf, cication: -1
Ming Li, Yanhong Li, Tianyi Zhou · (Layer_Gradient - MingLiiii) · (aimodels)
-
MiniPLM: Knowledge Distillation for Pre-Training Language Models,
arXiv, 2410.17215
, arxiv, pdf, cication: -1
Yuxian Gu, Hao Zhou, Fandong Meng, ..., Jie Zhou, Minlie Huang
-
Pre-training Distillation for Large Language Models: A Design Space Exploration,
arXiv, 2410.16215
, arxiv, pdf, cication: -1
Hao Peng, Xin Lv, Yushi Bai, ..., Lei Hou, Juanzi Li
-
$\text{Transformer}^2$: Self-adaptive LLMs,
arXiv, 2501.06252
, arxiv, pdf, cication: -1
Qi Sun, Edoardo Cetin, Yujin Tang
-
Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models,
arXiv, 2407.01906
, arxiv, pdf, cication: -1
Zihan Wang, Deli Chen, Damai Dai, ..., Zhuoshu Li, Y. Wu · (𝕏)
-
DELIFT: Data Efficient Language model Instruction Fine Tuning,
arXiv, 2411.04425
, arxiv, pdf, cication: -1
Ishika Agarwal, Krishnateja Killamsetty, Lucian Popa, ..., Marina Danilevsky · (𝕏)
-
SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation,
arXiv, 2410.14745
, arxiv, pdf, cication: -1
Junyu Luo, Xiao Luo, Xiusi Chen, ..., Wei Ju, Ming Zhang
-
Spike No More: Stabilizing the Pre-training of Large Language Models,
arXiv, 2312.16903
, arxiv, pdf, cication: -1
Sho Takase, Shun Kiyono, Sosuke Kobayashi, ..., Jun Suzuki
-
No More Adam: Learning Rate Scaling at Initialization is All You Need,
arXiv, 2412.11768
, arxiv, pdf, cication: -1
Minghao Xu, Lichuan Xiang, Xu Cai, ..., Hongkai Wen · (SGD_SaI - AnonymousAlethiometer)
-
Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN,
arXiv, 2412.13795
, arxiv, pdf, cication: -1
Pengxiang Li, Lu Yin, Shiwei Liu · (MixLN. - pixeli99)
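A minimal sketch of the layer-wise placement idea, assuming a simple layer-index cutoff; `mixln_placement` and the 0.25 fraction are illustrative, not the paper's tuned values:

```python
def mixln_placement(layer_idx: int, n_layers: int, post_ln_fraction: float = 0.25) -> str:
    """Mix-LN sketch: use Post-LN in the shallow layers and Pre-LN in the deeper
    layers, aiming for more uniform gradient norms across depth. The 0.25 cutoff
    is an illustrative placeholder."""
    return "post_ln" if layer_idx < int(post_ln_fraction * n_layers) else "pre_ln"

# e.g. [mixln_placement(i, 32) for i in range(32)] -> 8 post_ln blocks, then 24 pre_ln blocks
```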
-
APOLLO: SGD-like Memory, AdamW-level Performance,
arXiv, 2412.05270
, arxiv, pdf, cication: -1
Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, ..., Zhangyang Wang, Jinwon Lee · (zhuhanqing.github)
-
DeMo: Decoupled Momentum Optimization,
arXiv, 2411.19870
, arxiv, pdf, cication: -1
Bowen Peng, Jeffrey Quesnelle, Diederik P. Kingma · (DeMo - bloc97)
-
ADOPT: Modified Adam Can Converge with Any $β_2$ with the Optimal Rate,
arXiv, 2411.02853
, arxiv, pdf, cication: -1
Shohei Taniguchi, Keno Harada, Gouki Minegishi, ..., Yusuke Iwasawa, Yutaka Matsuo · (adopt - iShohei220)
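A hedged sketch of the update order the paper describes (normalize by the *previous* second-moment estimate before the momentum step); `adopt_step` is an illustrative helper, not the reference code in `adopt - iShohei220`:

```python
import torch

def adopt_step(param, grad, m, v, lr=1e-3, b1=0.9, b2=0.9999, eps=1e-6):
    """One ADOPT-style step (sketch): normalize the gradient by the previous
    second-moment estimate v, apply momentum, and only then refresh v.
    v is assumed to be initialized to grad**2 at the first step."""
    denom = torch.clamp(v.sqrt(), min=eps)
    m.mul_(b1).add_(grad / denom, alpha=1.0 - b1)   # momentum on the normalized gradient
    param.data.add_(m, alpha=-lr)
    v.mul_(b2).addcmul_(grad, grad, value=1.0 - b2)  # v updated after it was used
    return param, m, v
```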
-
unit-scaling - graphcore-research
-
Natural Language Reinforcement Learning,
arXiv, 2411.14251
, arxiv, pdf, cication: -1
Xidong Feng, Ziyu Wan, Haotian Fu, ..., Ying Wen, Jun Wang · (Natural-language-RL - waterhorse1)
-
Cautious Optimizers: Improving Training with One Line of Code,
arXiv, 2411.16085
, arxiv, pdf, cication: -1
Kaizhao Liang, Lizhang Chen, Bo Liu, ..., Qiang Liu · (C-Optim - kyleliang919) · (qbitai)
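The "one line" is essentially a sign-agreement mask on the optimizer's proposed update; a minimal sketch (helper name and rescaling constant are illustrative, see `C-Optim` for the actual patch):

```python
import torch

def cautious_mask(update: torch.Tensor, grad: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Zero out update components whose sign disagrees with the current gradient,
    then rescale so the average update magnitude is preserved."""
    mask = (update * grad > 0).to(update.dtype)
    return update * mask * (mask.numel() / (mask.sum() + eps))

# usage sketch: p.data.add_(-lr * cautious_mask(adam_update, p.grad))
```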
-
MARS: Unleashing the Power of Variance Reduction for Training Large Models,
arXiv, 2411.10438
, arxiv, pdf, cication: -1
Huizhuo Yuan, Yifeng Liu, Shuang Wu, ..., Xun Zhou, Quanquan Gu · (MARS. - AGI-Arena)
-
nGPT: Normalized Transformer with Representation Learning on the Hypersphere,
arXiv, 2410.01131
, arxiv, pdf, cication: -1
Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, ..., Boris Ginsburg · (ngpt - NVIDIA)
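A tiny sketch of the core operation, assuming plain L2 normalization along the model dimension (nGPT additionally keeps weight matrices normalized and learns per-block "eigen learning rates", omitted here):

```python
import torch

def to_hypersphere(x: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    """Project activations (or weight rows) back onto the unit hypersphere,
    the operation nGPT applies after every residual update."""
    return x / x.norm(dim=dim, keepdim=True).clamp_min(eps)

# e.g. h = to_hypersphere(h + alpha * (block_out - h))   # alpha is learned in nGPT
```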
-
Top-$nσ$: Not All Logits Are You Need,
arXiv, 2411.07641
, arxiv, pdf, cication: -1
Chenxia Tang, Jianchun Liu, Hongli Xu, ..., Liusheng Huang · (top_nsigma - Tomorrowdawn)
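A short sketch of the filtering rule, assuming the threshold is the max logit minus n standard deviations of the logits (function name is illustrative; see `top_nsigma` for the authors' code):

```python
import torch

def top_nsigma_filter(logits: torch.Tensor, n: float = 1.0) -> torch.Tensor:
    """Keep only tokens whose logit lies within n standard deviations of the
    maximum logit; everything else is masked to -inf before softmax sampling."""
    thresh = logits.max(dim=-1, keepdim=True).values - n * logits.std(dim=-1, keepdim=True)
    return logits.masked_fill(logits < thresh, float("-inf"))

# probs = torch.softmax(top_nsigma_filter(logits), dim=-1)
```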
-
Cut Your Losses in Large-Vocabulary Language Models,
arXiv, 2411.09009
, arxiv, pdf, cication: -1
Erik Wijmans, Brody Huval, Alexander Hertzberg, ..., Vladlen Koltun, Philipp Krähenbühl · (ml-cross-entropy - apple)
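A memory-saving sketch in the spirit of the paper: never materialize the full [tokens, vocab] logit matrix at once. This chunked variant is illustrative only; the official `ml-cross-entropy` kernels go further and avoid materializing logits at all:

```python
import torch
import torch.nn.functional as F

def chunked_cross_entropy(hidden, classifier_weight, targets, chunk_size=4096):
    """Compute LM cross-entropy over token chunks so only one [chunk, vocab]
    logit block lives in memory at a time (sketch, not the paper's kernel)."""
    losses = []
    for start in range(0, hidden.shape[0], chunk_size):
        h = hidden[start:start + chunk_size]
        logits = h @ classifier_weight.t()           # freed after each iteration
        losses.append(F.cross_entropy(logits, targets[start:start + chunk_size], reduction="sum"))
    return torch.stack(losses).sum() / hidden.shape[0]
```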
-
The Practitioner’s Guide to the Maximal Update Parameterization
· (nanoGPT-mup - EleutherAI)
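A rough sketch of a few headline muP rules, assuming Adam-style training and a small proxy ("base") width; the base values (0.02, 3e-4) are placeholders, and the full recipe is in the guide and `nanoGPT-mup`:

```python
# Rough muP bookkeeping sketch; numeric base values are illustrative placeholders.
base_width, width = 256, 2048
head_dim = 64
mup_mult = width / base_width

hidden_lr = 3e-4 / mup_mult              # Adam LR for hidden matrices shrinks with width
output_logit_scale = 1.0 / mup_mult      # scale applied to the unembedding/logit layer
attn_scale = 1.0 / head_dim              # attention uses 1/d instead of 1/sqrt(d)
hidden_init_std = 0.02 / mup_mult**0.5   # hidden init std scales like 1/sqrt(width)
```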
-
Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models,
arXiv, 2411.03884
, arxiv, pdf, cication: -1
Zhijian Zhuo, Ya Wang, Yutao Zeng, ..., Xun Zhou, Jinwen Ma · (PolyCom - BryceZhuo)
-
The Road Less Scheduled,
arXiv, 2405.15682
, arxiv, pdf, cication: -1
Aaron Defazio, Xingyu Alice Yang, Harsh Mehta, ..., Ahmed Khaled, Ashok Cutkosky · (schedule_free - facebookresearch)
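Usage is unusual in one respect: the optimizer has train/eval modes that must be toggled alongside the model's. A minimal sketch based on the repository's documented usage (check `schedule_free - facebookresearch` for the exact API):

```python
# pip install schedulefree
import torch
import schedulefree

model = torch.nn.Linear(16, 16)
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=2e-3)

model.train(); optimizer.train()        # schedule-free optimizers maintain a training-mode average
for _ in range(10):
    loss = model(torch.randn(4, 16)).pow(2).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()

model.eval(); optimizer.eval()          # switch to averaged weights before eval or checkpointing
```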
-
🎬 Hacks to Make LLM Training Faster - Daniel Han, Unsloth AI
-
Memory Layers at Scale,
arXiv, 2412.09764
, arxiv, pdf, cication: -1
Vincent-Pierre Berges, Barlas Oğuz, Daniel Haziza, ..., Luke Zettlemoyer, Gargi Ghosh
-
ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing,
arXiv, 2412.14711
, arxiv, pdf, cication: -1
Ziteng Wang, Jianfei Chen, Jun Zhu · (ReMoE - thu-ml)
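A sketch of the routing change only, assuming gates come straight from ReLU over router logits (ReMoE additionally regularizes the gates toward a target sparsity and balances load, omitted here):

```python
import torch
import torch.nn as nn

class ReLURouter(nn.Module):
    """ReMoE-style routing sketch: ReLU gates replace softmax top-k, so routing
    stays fully differentiable and sparsity comes from gates that are exactly zero."""
    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.router(x))   # [*, n_experts] gate weights, mostly zeros
```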
-
MoDEM: Mixture of Domain Expert Models,
arXiv, 2410.07490
, arxiv, pdf, cication: -1
Toby Simonds, Kemal Kurniawan, Jey Han Lau · (reddit)
-
MH-MoE: Multi-Head Mixture-of-Experts,
arXiv, 2411.16205
, arxiv, pdf, cication: -1
Shaohan Huang, Xun Wu, Shuming Ma, ..., Furu Wei
-
MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks,
arXiv, 2406.04801
, arxiv, pdf, cication: -1
Xingkui Zhu, Yiran Guan, Dingkang Liang, ..., Yuliang Liu, Xiang Bai · (MoE-Jetpack - Adlith)
-
Overview of the Largest Mixture of Expert Models Released So Far
· (reddit)
-
LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models,
arXiv, 2411.00918
, arxiv, pdf, cication: -1
Nam V. Nguyen, Thong T. Doan, Luong Tran, ..., Van Nguyen, Quang Pham · (LibMoE - Fsoft-AIC)
-
Mixture of Parrots: Experts improve memorization more than reasoning,
arXiv, 2410.19034
, arxiv, pdf, cication: -1
Samy Jelassi, Clara Mohri, David Brandfonbrener, ..., Sham M. Kakade, Eran Malach
-
Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design,
arXiv, 2410.19123
, arxiv, pdf, cication: -1
Ruisi Cai, Yeonju Ro, Geon-Woo Kim, ..., Aditya Akella, Zhangyang Wang · (READ-ME - VITA-Group)
-
Stealing User Prompts from Mixture of Experts,
arXiv, 2410.22884
, arxiv, pdf, cication: -1
Itay Yona, Ilia Shumailov, Jamie Hayes, ..., Nicholas Carlini
-
Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging,
arXiv, 2412.19512
, arxiv, pdf, cication: -1
Hua Farn, Hsuan Su, Shachi H Kumar, ..., Shang-Tse Chen, Hung-yi Lee
-
How to Merge Your Multimodal Models Over Time?,
arXiv, 2412.06712
, arxiv, pdf, cication: -1
Sebastian Dziadzio, Vishaal Udandarao, Karsten Roth, ..., Samuel Albanie, Matthias Bethge · (𝕏)
-
Exploring Model Kinship for Merging Large Language Models,
arXiv, 2410.12613
, arxiv, pdf, cication: -1
Yedi Hu, Yunzhi Yao, Ningyu Zhang, ..., Shumin Deng, Huajun Chen · (ModelKinship - zjunlp)
-
lingua - facebookresearch
-
unsloth - unslothai
-
cohere-finetune - cohere-ai
-
🌟 open-instruct - allenai
· (arxiv)
-
academic-pretraining - apoorvkh
Trade-offs when Pre-Training with Academic Resources · (arxiv)
-
🎬 torchtune: Easy and Accessible Finetuning in Native PyTorch - Evan Smothers, Meta
-
Fixed a bug which caused all training losses to diverge for large gradient accumulation sizes. 𝕏
-
AutoTrain: No-code training for state-of-the-art models,
arXiv, 2410.15735
, arxiv, pdf, cication: -1
Abhishek Thakur
· (autotrain-advanced - huggingface)
-
What's the deal with mid-training?
· (𝕏)
-
OLMo 2 and building effective teams for training language models
-
🌟 modded-nanogpt - KellerJordan
-
Pretraining on the Test Set Is All You Need,
arXiv, 2309.08632
, arxiv, pdf, cication: 14
Rylan Schaeffer
-
Argilla 2.4: Easily Build Fine-Tuning and Evaluation Datasets on the Hub — No Code Required 🤗