LLM Training

Survey

LLM Training

  • Grokking at the Edge of Numerical Stability, arXiv, 2501.04697, arxiv, pdf, cication: -1

    Lucas Prieto, Melih Barsbey, Pedro A. M. Mediano, ..., Tolga Birdal · (grokking-at-the-edge-of-numerical-stability. - LucasPrietoAl) Star

  • Scaling Laws for Floating Point Quantization Training, arXiv, 2501.02423, arxiv, pdf, cication: -1

    Xingwu Sun, Shuaipeng Li, Ruobing Xie, ..., Di Wang, Jie Jiang

  • Flash LLMs: Pipeline Parallel 🎬

  • 360-LLaMA-Factory - Qihoo360 Star

  • Metadata Conditioning Accelerates Language Model Pre-training, arXiv, 2501.01956, arxiv, pdf, cication: -1

    Tianyu Gao, Alexander Wettig, Luxi He, ..., Sadhika Malladi, Danqi Chen · (MeCo - princeton-pli) Star
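
    The idea of metadata conditioning (MeCo), as I understand it: prepend a source tag such as the document's URL to each pretraining example for most of training, then drop it during a short cooldown so the model also works on plain text. A minimal sketch of that data-side change; the helper name, the URL format, and the 90% cutoff are illustrative assumptions, not the paper's exact recipe.

    ```python
    # Hedged sketch of MeCo-style metadata conditioning; details are assumptions.
    def make_example(doc_text: str, source_url: str, step: int, total_steps: int,
                     metadata_fraction: float = 0.9) -> str:
        """Prepend the document's source for the first `metadata_fraction` of
        training steps, then train on raw text only (the cooldown phase)."""
        if step < metadata_fraction * total_steps:
            return f"{source_url}\n\n{doc_text}"   # conditioning phase
        return doc_text                            # cooldown: no metadata at all

    # Usage: build the training string before tokenization.
    example = make_example("The mitochondrion is ...", "en.wikipedia.org",
                           step=1_000, total_steps=100_000)
    ```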

  • 🌟 picotron - huggingface Star

    The minimalist & most-hackable repository for pre-training Llama-like models with 4D Parallelism · (youtube)

  • Multi-matrix Factorization Attention, arXiv, 2412.19255, arxiv, pdf, cication: -1

    Jingcheng Hu, Houyi Li, Yinmin Zhang, ..., Xiangyu Zhang, Heung-Yeung Shum

  • Rate of Model Collapse in Recursive Training, arXiv, 2412.17646, arxiv, pdf, cication: -1

    Ananda Theertha Suresh, Andrew Thangaraj, Aditya Nanda Kishore Khandavally

  • [10 December 2024, NeurIPS] Tutorial on Language Modeling

  • Establishing Task Scaling Laws via Compute-Efficient Model Ladders, arXiv, 2412.04403, arxiv, pdf, cication: -1

    Akshita Bhagia, Jiacheng Liu, Alexander Wettig, ..., Jesse Dodge, Hannaneh Hajishirzi · (𝕏)

  • Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning, arXiv, 2411.05193, arxiv, pdf, cication: -1

    Joey Hong, Anca Dragan, Sergey Levine · (𝕏)

  • 🌟 Predicting Emergent Capabilities by Finetuning, arXiv, 2411.16035, arxiv, pdf, cication: -1

    Charlie Snell, Eric Wallace, Dan Klein, ..., Sergey Levine · (𝕏)

  • 🌟 Observational Scaling Laws and the Predictability of Language Model Performance, arXiv, 2405.10938, arxiv, pdf, cication: -1

    Yangjun Ruan, Chris J. Maddison, Tatsunori Hashimoto

  • Balancing Pipeline Parallelism with Vocabulary Parallelism, arXiv, 2411.05288, arxiv, pdf, cication: -1

    Man Tsung Yeung, Penghui Qi, Min Lin, ..., Xinyi Wan · (VocabularyParallelism - sail-sg) Star

  • Learning to (Learn at Test Time): RNNs with Expressive Hidden States, arXiv, 2407.04620, arxiv, pdf, cication: 19

    Yu Sun, Xinhao Li, Karan Dalal, ..., Tatsunori Hashimoto, Carlos Guestrin · (yueatsprograms.github) · (ttt-lm-pytorch - test-time-training) Star

  • What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective, arXiv, 2410.23743, arxiv, pdf, cication: -1

    Ming Li, Yanhong Li, Tianyi Zhou · (Layer_Gradient - MingLiiii) Star · (aimodels)

Pretraining

  • MiniPLM: Knowledge Distillation for Pre-Training Language Models, arXiv, 2410.17215, arxiv, pdf, cication: -1

    Yuxian Gu, Hao Zhou, Fandong Meng, ..., Jie Zhou, Minlie Huang

  • Pre-training Distillation for Large Language Models: A Design Space Exploration, arXiv, 2410.16215, arxiv, pdf, cication: -1

    Hao Peng, Xin Lv, Yushi Bai, ..., Lei Hou, Juanzi Li

Post Training

Finetuning

  • $\text{Transformer}^2$: Self-adaptive LLMs, arXiv, 2501.06252, arxiv, pdf, cication: -1

    Qi Sun, Edoardo Cetin, Yujin Tang

  • Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models, arXiv, 2407.01906, arxiv, pdf, cication: -1

    Zihan Wang, Deli Chen, Damai Dai, ..., Zhuoshu Li, Y. Wu · (𝕏)

  • DELIFT: Data Efficient Language model Instruction Fine Tuning, arXiv, 2411.04425, arxiv, pdf, cication: -1

    Ishika Agarwal, Krishnateja Killamsetty, Lucian Popa, ..., Marina Danilevsky · (𝕏)

  • SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation, arXiv, 2410.14745, arxiv, pdf, cication: -1

    Junyu Luo, Xiao Luo, Xiusi Chen, ..., Wei Ju, Ming Zhang

Optimization

  • Spike No More: Stabilizing the Pre-training of Large Language Models, arXiv, 2312.16903, arxiv, pdf, cication: -1

    Sho Takase, Shun Kiyono, Sosuke Kobayashi, ..., Jun Suzuki

  • No More Adam: Learning Rate Scaling at Initialization is All You Need, arXiv, 2412.11768, arxiv, pdf, cication: -1

    Minghao Xu, Lichuan Xiang, Xu Cai, ..., Hongkai Wen · (SGD_SaI - AnonymousAlethiometer) Star

  • Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN, arXiv, 2412.13795, arxiv, pdf, cication: -1

    Pengxiang Li, Lu Yin, Shiwei Liu · (MixLN - pixeli99) Star
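
    Going by the title, Mix-LN uses Post-LN in the earlier transformer layers and Pre-LN in the deeper ones. A hedged toy sketch of layer-dependent norm placement; the split point, the attention-free block, and the class name are illustrative assumptions.

    ```python
    import torch.nn as nn

    class MixLNBlock(nn.Module):
        """Toy block that switches LayerNorm placement by depth (illustrative)."""
        def __init__(self, d_model: int, layer_idx: int, num_layers: int,
                     post_ln_fraction: float = 0.25):  # split point is an assumption
            super().__init__()
            self.use_post_ln = layer_idx < int(post_ln_fraction * num_layers)
            self.norm = nn.LayerNorm(d_model)
            self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                     nn.Linear(4 * d_model, d_model))

        def forward(self, x):
            if self.use_post_ln:                   # early layers: norm after the residual add
                return self.norm(x + self.mlp(x))
            return x + self.mlp(self.norm(x))      # later layers: norm before the sublayer
    ```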

  • 🌟 Muon scaling again 𝕏
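
    The core of the Muon optimizer referenced here: take the SGD-momentum update for each 2-D weight matrix and approximately orthogonalize it with a few Newton-Schulz iterations before applying it. A hedged sketch; the polynomial coefficients and normalization below are quoted from memory of the reference code and should be checked against it.

    ```python
    import torch

    def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
        """Approximately orthogonalize a 2-D update matrix (Muon-style sketch)."""
        a, b, c = 3.4445, -4.7750, 2.0315          # coefficients quoted from memory
        x = g / (g.norm() + 1e-7)                  # normalize so the iteration converges
        transposed = x.shape[0] > x.shape[1]
        if transposed:                             # iterate on the smaller Gram matrix
            x = x.T
        for _ in range(steps):
            s = x @ x.T
            x = a * x + (b * s + c * s @ s) @ x
        return x.T if transposed else x

    # Muon-style step (sketch): weight -= lr * newton_schulz_orthogonalize(momentum_buf)
    ```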

  • APOLLO: SGD-like Memory, AdamW-level Performance, arXiv, 2412.05270, arxiv, pdf, cication: -1

    Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, ..., Zhangyang Wang, Jinwon Lee · (zhuhanqing.github)

  • DeMo: Decoupled Momentum Optimization, arXiv, 2411.19870, arxiv, pdf, cication: -1

    Bowen Peng, Jeffrey Quesnelle, Diederik P. Kingma · (DeMo - bloc97) Star

  • ADOPT: Modified Adam Can Converge with Any $\beta_2$ with the Optimal Rate, arXiv, 2411.02853, arxiv, pdf, cication: -1

    Shohei Taniguchi, Keno Harada, Gouki Minegishi, ..., Yusuke Iwasawa, Yutaka Matsuo · (adopt - iShohei220) Star

  • unit-scaling - graphcore-research Star

  • Natural Language Reinforcement Learning, arXiv, 2411.14251, arxiv, pdf, cication: -1

    Xidong Feng, Ziyu Wan, Haotian Fu, ..., Ying Wen, Jun Wang · (Natural-language-RL - waterhorse1) Star

  • Cautious Optimizers: Improving Training with One Line of Code, arXiv, 2411.16085, arxiv, pdf, cication: -1

    Kaizhao Liang, Lizhang Chen, Bo Liu, ..., Qiang Liu · (C-Optim - kyleliang919) Star · (qbitai)
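
    The "one line of code" in Cautious Optimizers, as I read it, is a mask that zeroes any coordinate of the proposed update whose sign disagrees with the current gradient. A hedged sketch applied to a generic update direction; the rescaling by the kept fraction is how I recall the released C-Optim code, not a guaranteed detail.

    ```python
    import torch

    @torch.no_grad()
    def cautious_step(param: torch.Tensor, update: torch.Tensor,
                      grad: torch.Tensor, lr: float) -> None:
        """Apply `update` (e.g., an AdamW direction) only where it agrees in sign
        with the gradient. Illustrative, not the official C-Optim implementation."""
        mask = (update * grad > 0).to(update.dtype)
        mask = mask * (mask.numel() / (mask.sum() + 1))  # keep mean step size (assumption)
        param.add_(update * mask, alpha=-lr)
    ```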

  • MARS: Unleashing the Power of Variance Reduction for Training Large Models, arXiv, 2411.10438, arxiv, pdf, cication: -1

    Huizhuo Yuan, Yifeng Liu, Shuang Wu, ..., Xun Zhou, Quanquan Gu · (MARS. - AGI-Arena) Star

  • nGPT: Normalized Transformer with Representation Learning on the Hypersphere, arXiv, 2410.01131, arxiv, pdf, cication: -1

    Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, ..., Boris Ginsburg · (ngpt - NVIDIA) Star
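
    nGPT keeps token representations on the unit hypersphere. A hedged sketch of the residual update as I understand it: move the unit-norm hidden state toward the normalized block output by a small step, then re-project onto the sphere (weight normalization and per-dimension learnable step sizes are omitted here).

    ```python
    import torch
    import torch.nn.functional as F

    def ngpt_residual_update(h: torch.Tensor, block_out: torch.Tensor,
                             alpha: float = 0.05) -> torch.Tensor:
        """Spherical-style residual update (illustrative sketch of the nGPT idea)."""
        h = F.normalize(h, dim=-1)                 # stay on the unit hypersphere
        block_out = F.normalize(block_out, dim=-1)
        return F.normalize(h + alpha * (block_out - h), dim=-1)
    ```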

  • Top-$n\sigma$: Not All Logits Are You Need, arXiv, 2411.07641, arxiv, pdf, cication: -1

    Chenxia Tang, Jianchun Liu, Hongli Xu, ..., Liusheng Huang · (top_nsigma - Tomorrowdawn) Star
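
    Top-$n\sigma$, going by the title, filters the next-token distribution to logits within n standard deviations of the maximum before sampling. A hedged sketch; the exact statistic and any temperature handling in the paper may differ.

    ```python
    import torch

    def top_nsigma_sample(logits: torch.Tensor, n: float = 1.0,
                          temperature: float = 1.0) -> torch.Tensor:
        """Keep tokens whose logit is within n standard deviations of the max,
        then sample from the renormalized distribution. Illustrative sketch."""
        logits = logits / temperature
        threshold = logits.max(-1, keepdim=True).values - n * logits.std(-1, keepdim=True)
        filtered = logits.masked_fill(logits < threshold, float("-inf"))
        return torch.multinomial(torch.softmax(filtered, dim=-1), num_samples=1)
    ```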

  • Cut Your Losses in Large-Vocabulary Language Models, arXiv, 2411.09009, arxiv, pdf, cication: -1

    Erik Wijmans, Brody Huval, Alexander Hertzberg, ..., Vladlen Koltun, Philipp Krähenbühl · (ml-cross-entropy - apple) Star
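
    The point of "Cut Your Losses" is to avoid materializing the full [tokens × vocab] logit matrix when computing the loss; the released ml-cross-entropy package does this with a fused kernel that produces logits on the fly. Below is a hedged pure-PyTorch sketch of the same memory idea via vocabulary chunking, not the authors' kernel.

    ```python
    import torch

    def chunked_cross_entropy(hidden: torch.Tensor, classifier: torch.Tensor,
                              targets: torch.Tensor, chunk: int = 8192) -> torch.Tensor:
        """Cross-entropy without a full [N, vocab] logit matrix: stream over
        vocab chunks, accumulating a running logsumexp. hidden: [N, d],
        classifier: [vocab, d], targets: [N]. Illustrative sketch only."""
        n, vocab = hidden.shape[0], classifier.shape[0]
        lse = torch.full((n,), float("-inf"), device=hidden.device, dtype=hidden.dtype)
        target_logit = (hidden * classifier[targets]).sum(-1)        # [N]
        for start in range(0, vocab, chunk):
            logits = hidden @ classifier[start:start + chunk].T      # [N, chunk]
            lse = torch.logaddexp(lse, torch.logsumexp(logits, dim=-1))
        return (lse - target_logit).mean()
    ```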

  • The Practitioner’s Guide to the Maximal Update Parameterization

    · (nanoGPT-mup - EleutherAI) Star
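
    The practical core of μP is scaling initialization and per-layer learning rates with model width so hyperparameters tuned on a narrow proxy transfer to a wide model. A hedged sketch of width-scaled optimizer groups; treating every ≥2-D tensor as a "hidden matrix" is a simplification (embeddings and the output head have their own μP rules), so follow the guide and nanoGPT-mup for the real recipe.

    ```python
    import torch

    def mup_param_groups(model: torch.nn.Module, base_lr: float,
                         base_width: int, width: int):
        """Build optimizer groups with μP-style LR scaling (illustrative sketch)."""
        mult = base_width / width                   # hidden-matrix LRs shrink with width
        matrices, vectors = [], []
        for p in model.parameters():
            (matrices if p.ndim >= 2 else vectors).append(p)
        return [
            {"params": matrices, "lr": base_lr * mult},  # width-scaled (simplified rule)
            {"params": vectors, "lr": base_lr},          # biases/norms keep the base LR
        ]

    # optimizer = torch.optim.AdamW(mup_param_groups(model, 3e-3, base_width=256, width=2048))
    ```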

  • Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models, arXiv, 2411.03884, arxiv, pdf, cication: -1

    Zhijian Zhuo, Ya Wang, Yutao Zeng, ..., Xun Zhou, Jinwen Ma · (PolyCom - BryceZhuo) Star

  • The Road Less Scheduled, arXiv, 2405.15682, arxiv, pdf, cication: -1

    Aaron Defazio, Xingyu Alice Yang, Harsh Mehta, ..., Ahmed Khaled, Ashok Cutkosky · (schedule_free - facebookresearch) Star
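
    Schedule-free learning replaces the decaying LR schedule with interpolation/averaging inside the optimizer, so no training horizon has to be fixed up front. A hedged usage sketch of the released facebookresearch/schedule_free package; the package, class, and method names below are from memory and should be checked against the repo.

    ```python
    # pip install schedulefree   (package name as I recall it; verify against the repo)
    import schedulefree
    import torch

    model = torch.nn.Linear(16, 16)
    optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-3)

    optimizer.train()                # schedule-free optimizers track train/eval mode
    for _ in range(10):
        loss = model(torch.randn(4, 16)).pow(2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    optimizer.eval()                 # switch to the averaged weights before evaluation
    ```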

  • 🎬 Hacks to Make LLM Training Faster - Daniel Han, Unsloth AI

Architecture

  • Memory Layers at Scale, arXiv, 2412.09764, arxiv, pdf, cication: -1

    Vincent-Pierre Berges, Barlas Oğuz, Daniel Haziza, ..., Luke Zettlemoyer, Gargi Ghosh

Mixture Of Experts

  • ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing, arXiv, 2412.14711, arxiv, pdf, cication: -1

    Ziteng Wang, Jianfei Chen, Jun Zhu · (ReMoE - thu-ml) Star
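
    ReMoE's router, per the title, replaces discrete TopK routing with ReLU gates, so routing stays fully differentiable and sparsity comes from the gates that land at zero. A hedged sketch of such a router; the sparsity and load-balancing regularization the paper presumably pairs with it is omitted.

    ```python
    import torch
    import torch.nn as nn

    class ReLURouter(nn.Module):
        """Fully differentiable MoE routing via ReLU gates (illustrative sketch)."""
        def __init__(self, d_model: int, num_experts: int):
            super().__init__()
            self.gate = nn.Linear(d_model, num_experts)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Non-negative gate values; experts whose gate is exactly zero are skipped,
            # so sparsity is learned rather than imposed by a hard TopK.
            return torch.relu(self.gate(x))        # [..., num_experts]
    ```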

  • MoDEM: Mixture of Domain Expert Models, arXiv, 2410.07490, arxiv, pdf, cication: -1

    Toby Simonds, Kemal Kurniawan, Jey Han Lau · (reddit)

  • MH-MoE: Multi-Head Mixture-of-Experts, arXiv, 2411.16205, arxiv, pdf, cication: -1

    Shaohan Huang, Xun Wu, Shuming Ma, ..., Furu Wei

  • MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks, arXiv, 2406.04801, arxiv, pdf, cication: -1

    Xingkui Zhu, Yiran Guan, Dingkang Liang, ..., Yuliang Liu, Xiang Bai · (MoE-Jetpack - Adlith) Star

  • Overview of the Largest Mixture of Expert Models Released So Far

    · (reddit)

  • LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models, arXiv, 2411.00918, arxiv, pdf, cication: -1

    Nam V. Nguyen, Thong T. Doan, Luong Tran, ..., Van Nguyen, Quang Pham · (LibMoE - Fsoft-AIC) Star

  • Mixture of Parrots: Experts improve memorization more than reasoning, arXiv, 2410.19034, arxiv, pdf, cication: -1

    Samy Jelassi, Clara Mohri, David Brandfonbrener, ..., Sham M. Kakade, Eran Malach

  • Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design, arXiv, 2410.19123, arxiv, pdf, cication: -1

    Ruisi Cai, Yeonju Ro, Geon-Woo Kim, ..., Aditya Akella, Zhangyang Wang · (READ-ME - VITA-Group) Star

  • Stealing User Prompts from Mixture of Experts, arXiv, 2410.22884, arxiv, pdf, cication: -1

    Itay Yona, Ilia Shumailov, Jamie Hayes, ..., Nicholas Carlini

Merge

  • Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging, arXiv, 2412.19512, arxiv, pdf, cication: -1

    Hua Farn, Hsuan Su, Shachi H Kumar, ..., Shang-Tse Chen, Hung-yi Lee

  • How to Merge Your Multimodal Models Over Time?, arXiv, 2412.06712, arxiv, pdf, cication: -1

    Sebastian Dziadzio, Vishaal Udandarao, Karsten Roth, ..., Samuel Albanie, Matthias Bethge · (𝕏)

  • Exploring Model Kinship for Merging Large Language Models, arXiv, 2410.12613, arxiv, pdf, cication: -1

    Yedi Hu, Yunzhi Yao, Ningyu Zhang, ..., Shumin Deng, Huajun Chen · (ModelKinship - zjunlp) Star

Online Learning

Toolkits

Misc