-
awesome-open-source-lms - allenai
· (docs.google)
-
Grokking at the Edge of Numerical Stability,
arXiv, 2501.04697
, arxiv, pdf, cication: -1
Lucas Prieto, Melih Barsbey, Pedro A. M. Mediano, ..., Tolga Birdal · (grokking-at-the-edge-of-numerical-stability. - LucasPrietoAl)
-
Scaling Laws for Floating Point Quantization Training,
arXiv, 2501.02423
, arxiv, pdf, cication: -1
Xingwu Sun, Shuaipeng Li, Ruobing Xie, ..., Di Wang, Jie Jiang
-
360-LLaMA-Factory - Qihoo360
-
Metadata Conditioning Accelerates Language Model Pre-training,
arXiv, 2501.01956
, arxiv, pdf, cication: -1
Tianyu Gao, Alexander Wettig, Luxi He, ..., Sadhika Malladi, Danqi Chen · (MeCo - princeton-pli)
-
🌟 picotron - huggingface
The minimalist & most-hackable repository for pre-training Llama-like models with 4D Parallelism · (youtube)
-
Multi-matrix Factorization Attention,
arXiv, 2412.19255
, arxiv, pdf, cication: -1
Jingcheng Hu, Houyi Li, Yinmin Zhang, ..., Xiangyu Zhang, Heung-Yeung Shum
-
Rate of Model Collapse in Recursive Training,
arXiv, 2412.17646
, arxiv, pdf, cication: -1
Ananda Theertha Suresh, Andrew Thangaraj, Aditya Nanda Kishore Khandavally
-
Establishing Task Scaling Laws via Compute-Efficient Model Ladders,
arXiv, 2412.04403
, arxiv, pdf, cication: -1
Akshita Bhagia, Jiacheng Liu, Alexander Wettig, ..., Jesse Dodge, Hannaneh Hajishirzi · (𝕏)
-
Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning,
arXiv, 2411.05193
, arxiv, pdf, cication: -1
Joey Hong, Anca Dragan, Sergey Levine · (𝕏)
-
🌟 Predicting Emergent Capabilities by Finetuning,
arXiv, 2411.16035
, arxiv, pdf, cication: -1
Charlie Snell, Eric Wallace, Dan Klein, ..., Sergey Levine · (𝕏)
-
🌟 Observational Scaling Laws and the Predictability of Language Model Performance,
arXiv, 2405.10938
, arxiv, pdf, cication: -1
Yangjun Ruan, Chris J. Maddison, Tatsunori Hashimoto
-
Balancing Pipeline Parallelism with Vocabulary Parallelism,
arXiv, 2411.05288
, arxiv, pdf, cication: -1
Man Tsung Yeung, Penghui Qi, Min Lin, ..., Xinyi Wan · (VocabularyParallelism - sail-sg)
-
Learning to (Learn at Test Time): RNNs with Expressive Hidden States,
arXiv, 2407.04620
, arxiv, pdf, cication: 19
Yu Sun, Xinhao Li, Karan Dalal, ..., Tatsunori Hashimoto, Carlos Guestrin · (yueatsprograms.github) · (ttt-lm-pytorch - test-time-training)
-
What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective,
arXiv, 2410.23743
, arxiv, pdf, cication: -1
Ming Li, Yanhong Li, Tianyi Zhou · (Layer_Gradient - MingLiiii) · (aimodels)
-
MiniPLM: Knowledge Distillation for Pre-Training Language Models,
arXiv, 2410.17215
, arxiv, pdf, cication: -1
Yuxian Gu, Hao Zhou, Fandong Meng, ..., Jie Zhou, Minlie Huang
-
Pre-training Distillation for Large Language Models: A Design Space Exploration,
arXiv, 2410.16215
, arxiv, pdf, cication: -1
Hao Peng, Xin Lv, Yushi Bai, ..., Lei Hou, Juanzi Li
-
$\text{Transformer}^2$: Self-adaptive LLMs,
arXiv, 2501.06252
, arxiv, pdf, cication: -1
Qi Sun, Edoardo Cetin, Yujin Tang
-
Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models,
arXiv, 2407.01906
, arxiv, pdf, cication: -1
Zihan Wang, Deli Chen, Damai Dai, ..., Zhuoshu Li, Y. Wu · (𝕏)
-
DELIFT: Data Efficient Language model Instruction Fine Tuning,
arXiv, 2411.04425
, arxiv, pdf, cication: -1
Ishika Agarwal, Krishnateja Killamsetty, Lucian Popa, ..., Marina Danilevsky · (𝕏)
-
SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation,
arXiv, 2410.14745
, arxiv, pdf, cication: -1
Junyu Luo, Xiao Luo, Xiusi Chen, ..., Wei Ju, Ming Zhang
-
Spike No More: Stabilizing the Pre-training of Large Language Models,
arXiv, 2312.16903
, arxiv, pdf, cication: -1
Sho Takase, Shun Kiyono, Sosuke Kobayashi, ..., Jun Suzuki
-
No More Adam: Learning Rate Scaling at Initialization is All You Need,
arXiv, 2412.11768
, arxiv, pdf, cication: -1
Minghao Xu, Lichuan Xiang, Xu Cai, ..., Hongkai Wen · (SGD_SaI - AnonymousAlethiometer)
-
Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN,
arXiv, 2412.13795
, arxiv, pdf, cication: -1
Pengxiang Li, Lu Yin, Shiwei Liu · (MixLN. - pixeli99)
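A minimal sketch of the layer-wise placement idea, assuming a simple layer-index cutoff; `mixln_placement` and the 0.25 fraction are illustrative, not the paper's tuned values:

```python
def mixln_placement(layer_idx: int, n_layers: int, post_ln_fraction: float = 0.25) -> str:
    """Mix-LN sketch: use Post-LN in the shallow layers and Pre-LN in the deeper
    layers, aiming for more uniform gradient norms across depth. The 0.25 cutoff
    is an illustrative placeholder."""
    return "post_ln" if layer_idx < int(post_ln_fraction * n_layers) else "pre_ln"

# e.g. [mixln_placement(i, 32) for i in range(32)] -> 8 post_ln blocks, then 24 pre_ln blocks
```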
-
APOLLO: SGD-like Memory, AdamW-level Performance,
arXiv, 2412.05270
, arxiv, pdf, cication: -1
Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, ..., Zhangyang Wang, Jinwon Lee · (zhuhanqing.github)
-
DeMo: Decoupled Momentum Optimization,
arXiv, 2411.19870
, arxiv, pdf, cication: -1
Bowen Peng, Jeffrey Quesnelle, Diederik P. Kingma · (DeMo - bloc97)
-
ADOPT: Modified Adam Can Converge with Any $β_2$ with the Optimal Rate,
arXiv, 2411.02853
, arxiv, pdf, cication: -1
Shohei Taniguchi, Keno Harada, Gouki Minegishi, ..., Yusuke Iwasawa, Yutaka Matsuo · (adopt - iShohei220)
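A hedged sketch of the update order the paper describes (normalize by the *previous* second-moment estimate before the momentum step); `adopt_step` is an illustrative helper, not the reference code in `adopt - iShohei220`:

```python
import torch

def adopt_step(param, grad, m, v, lr=1e-3, b1=0.9, b2=0.9999, eps=1e-6):
    """One ADOPT-style step (sketch): normalize the gradient by the previous
    second-moment estimate v, apply momentum, and only then refresh v.
    v is assumed to be initialized to grad**2 at the first step."""
    denom = torch.clamp(v.sqrt(), min=eps)
    m.mul_(b1).add_(grad / denom, alpha=1.0 - b1)   # momentum on the normalized gradient
    param.data.add_(m, alpha=-lr)
    v.mul_(b2).addcmul_(grad, grad, value=1.0 - b2)  # v updated after it was used
    return param, m, v
```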
-
unit-scaling - graphcore-research
-
Natural Language Reinforcement Learning,
arXiv, 2411.14251
, arxiv, pdf, cication: -1
Xidong Feng, Ziyu Wan, Haotian Fu, ..., Ying Wen, Jun Wang · (Natural-language-RL - waterhorse1)
-
Cautious Optimizers: Improving Training with One Line of Code,
arXiv, 2411.16085
, arxiv, pdf, cication: -1
Kaizhao Liang, Lizhang Chen, Bo Liu, ..., Qiang Liu · (C-Optim - kyleliang919) · (qbitai)
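The "one line" is essentially a sign-agreement mask on the optimizer's proposed update; a minimal sketch (helper name and rescaling constant are illustrative, see `C-Optim` for the actual patch):

```python
import torch

def cautious_mask(update: torch.Tensor, grad: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Zero out update components whose sign disagrees with the current gradient,
    then rescale so the average update magnitude is preserved."""
    mask = (update * grad > 0).to(update.dtype)
    return update * mask * (mask.numel() / (mask.sum() + eps))

# usage sketch: p.data.add_(-lr * cautious_mask(adam_update, p.grad))
```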
-
MARS: Unleashing the Power of Variance Reduction for Training Large Models,
arXiv, 2411.10438
, arxiv, pdf, cication: -1
Huizhuo Yuan, Yifeng Liu, Shuang Wu, ..., Xun Zhou, Quanquan Gu · (MARS. - AGI-Arena)
-
nGPT: Normalized Transformer with Representation Learning on the Hypersphere,
arXiv, 2410.01131
, arxiv, pdf, cication: -1
Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, ..., Boris Ginsburg · (ngpt - NVIDIA)
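A tiny sketch of the core operation, assuming plain L2 normalization along the model dimension (nGPT additionally keeps weight matrices normalized and learns per-block "eigen learning rates", omitted here):

```python
import torch

def to_hypersphere(x: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    """Project activations (or weight rows) back onto the unit hypersphere,
    the operation nGPT applies after every residual update."""
    return x / x.norm(dim=dim, keepdim=True).clamp_min(eps)

# e.g. h = to_hypersphere(h + alpha * (block_out - h))   # alpha is learned in nGPT
```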
-
Top-$nσ$: Not All Logits Are You Need,
arXiv, 2411.07641
, arxiv, pdf, cication: -1
Chenxia Tang, Jianchun Liu, Hongli Xu, ..., Liusheng Huang · (top_nsigma - Tomorrowdawn)
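A short sketch of the filtering rule, assuming the threshold is the max logit minus n standard deviations of the logits (function name is illustrative; see `top_nsigma` for the authors' code):

```python
import torch

def top_nsigma_filter(logits: torch.Tensor, n: float = 1.0) -> torch.Tensor:
    """Keep only tokens whose logit lies within n standard deviations of the
    maximum logit; everything else is masked to -inf before softmax sampling."""
    thresh = logits.max(dim=-1, keepdim=True).values - n * logits.std(dim=-1, keepdim=True)
    return logits.masked_fill(logits < thresh, float("-inf"))

# probs = torch.softmax(top_nsigma_filter(logits), dim=-1)
```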
-
Cut Your Losses in Large-Vocabulary Language Models,
arXiv, 2411.09009
, arxiv, pdf, cication: -1
Erik Wijmans, Brody Huval, Alexander Hertzberg, ..., Vladlen Koltun, Philipp Krähenbühl · (ml-cross-entropy - apple)
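A memory-saving sketch in the spirit of the paper: never materialize the full [tokens, vocab] logit matrix at once. This chunked variant is illustrative only; the official `ml-cross-entropy` kernels go further and avoid materializing logits at all:

```python
import torch
import torch.nn.functional as F

def chunked_cross_entropy(hidden, classifier_weight, targets, chunk_size=4096):
    """Compute LM cross-entropy over token chunks so only one [chunk, vocab]
    logit block lives in memory at a time (sketch, not the paper's kernel)."""
    losses = []
    for start in range(0, hidden.shape[0], chunk_size):
        h = hidden[start:start + chunk_size]
        logits = h @ classifier_weight.t()           # freed after each iteration
        losses.append(F.cross_entropy(logits, targets[start:start + chunk_size], reduction="sum"))
    return torch.stack(losses).sum() / hidden.shape[0]
```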
-
The Practitioner’s Guide to the Maximal Update Parameterization
· (nanoGPT-mup - EleutherAI)
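A rough sketch of a few headline muP rules, assuming Adam-style training and a small proxy ("base") width; the base values (0.02, 3e-4) are placeholders, and the full recipe is in the guide and `nanoGPT-mup`:

```python
# Rough muP bookkeeping sketch; numeric base values are illustrative placeholders.
base_width, width = 256, 2048
head_dim = 64
mup_mult = width / base_width

hidden_lr = 3e-4 / mup_mult              # Adam LR for hidden matrices shrinks with width
output_logit_scale = 1.0 / mup_mult      # scale applied to the unembedding/logit layer
attn_scale = 1.0 / head_dim              # attention uses 1/d instead of 1/sqrt(d)
hidden_init_std = 0.02 / mup_mult**0.5   # hidden init std scales like 1/sqrt(width)
```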
-
Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models,
arXiv, 2411.03884
, arxiv, pdf, cication: -1
Zhijian Zhuo, Ya Wang, Yutao Zeng, ..., Xun Zhou, Jinwen Ma · (PolyCom - BryceZhuo)
-
The Road Less Scheduled,
arXiv, 2405.15682
, arxiv, pdf, cication: -1
Aaron Defazio, Xingyu Alice Yang, Harsh Mehta, ..., Ahmed Khaled, Ashok Cutkosky · (schedule_free - facebookresearch)
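Usage is unusual in one respect: the optimizer has train/eval modes that must be toggled alongside the model's. A minimal sketch based on the repository's documented usage (check `schedule_free - facebookresearch` for the exact API):

```python
# pip install schedulefree
import torch
import schedulefree

model = torch.nn.Linear(16, 16)
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=2e-3)

model.train(); optimizer.train()        # schedule-free optimizers maintain a training-mode average
for _ in range(10):
    loss = model(torch.randn(4, 16)).pow(2).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()

model.eval(); optimizer.eval()          # switch to averaged weights before eval or checkpointing
```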
-
🎬 Hacks to Make LLM Training Faster - Daniel Han, Unsloth AI
-
Memory Layers at Scale,
arXiv, 2412.09764
, arxiv, pdf, cication: -1
Vincent-Pierre Berges, Barlas Oğuz, Daniel Haziza, ..., Luke Zettlemoyer, Gargi Ghosh
-
ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing,
arXiv, 2412.14711
, arxiv, pdf, cication: -1
Ziteng Wang, Jianfei Chen, Jun Zhu · (ReMoE - thu-ml)
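A sketch of the routing change only, assuming gates come straight from ReLU over router logits (ReMoE additionally regularizes the gates toward a target sparsity and balances load, omitted here):

```python
import torch
import torch.nn as nn

class ReLURouter(nn.Module):
    """ReMoE-style routing sketch: ReLU gates replace softmax top-k, so routing
    stays fully differentiable and sparsity comes from gates that are exactly zero."""
    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.router(x))   # [*, n_experts] gate weights, mostly zeros
```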
-
MoDEM: Mixture of Domain Expert Models,
arXiv, 2410.07490
, arxiv, pdf, cication: -1
Toby Simonds, Kemal Kurniawan, Jey Han Lau · (reddit)
-
MH-MoE: Multi-Head Mixture-of-Experts,
arXiv, 2411.16205
, arxiv, pdf, cication: -1
Shaohan Huang, Xun Wu, Shuming Ma, ..., Furu Wei
-
MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks,
arXiv, 2406.04801
, arxiv, pdf, cication: -1
Xingkui Zhu, Yiran Guan, Dingkang Liang, ..., Yuliang Liu, Xiang Bai · (MoE-Jetpack - Adlith)
-
Overview of the Largest Mixture of Expert Models Released So Far
· (reddit)
-
LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models,
arXiv, 2411.00918
, arxiv, pdf, cication: -1
Nam V. Nguyen, Thong T. Doan, Luong Tran, ..., Van Nguyen, Quang Pham · (LibMoE - Fsoft-AIC)
-
Mixture of Parrots: Experts improve memorization more than reasoning,
arXiv, 2410.19034
, arxiv, pdf, cication: -1
Samy Jelassi, Clara Mohri, David Brandfonbrener, ..., Sham M. Kakade, Eran Malach
-
Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design,
arXiv, 2410.19123
, arxiv, pdf, cication: -1
Ruisi Cai, Yeonju Ro, Geon-Woo Kim, ..., Aditya Akella, Zhangyang Wang · (READ-ME - VITA-Group)
-
Stealing User Prompts from Mixture of Experts,
arXiv, 2410.22884
, arxiv, pdf, cication: -1
Itay Yona, Ilia Shumailov, Jamie Hayes, ..., Nicholas Carlini
-
Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging,
arXiv, 2412.19512
, arxiv, pdf, cication: -1
Hua Farn, Hsuan Su, Shachi H Kumar, ..., Shang-Tse Chen, Hung-yi Lee
-
How to Merge Your Multimodal Models Over Time?,
arXiv, 2412.06712
, arxiv, pdf, cication: -1
Sebastian Dziadzio, Vishaal Udandarao, Karsten Roth, ..., Samuel Albanie, Matthias Bethge · (𝕏)
-
Exploring Model Kinship for Merging Large Language Models,
arXiv, 2410.12613
, arxiv, pdf, cication: -1
Yedi Hu, Yunzhi Yao, Ningyu Zhang, ..., Shumin Deng, Huajun Chen · (ModelKinship - zjunlp)
-
lingua - facebookresearch
-
unsloth - unslothai
-
cohere-finetune - cohere-ai
-
🌟 open-instruct - allenai
· (arxiv)
-
academic-pretraining - apoorvkh
Trade-offs when Pre-Training with Academic Resources · (arxiv)
-
🎬 torchtune: Easy and Accessible Finetuning in Native PyTorch - Evan Smothers, Meta
-
Fixed a bug which caused all training losses to diverge for large gradient accumulation sizes. 𝕏
-
AutoTrain: No-code training for state-of-the-art models,
arXiv, 2410.15735
, arxiv, pdf, cication: -1
Abhishek Thakur
· (autotrain-advanced - huggingface)
-
What's the deal with mid-training?
· (𝕏)
-
OLMo 2 and building effective teams for training language models
-
🌟 modded-nanogpt - KellerJordan
-
Pretraining on the Test Set Is All You Need,
arXiv, 2309.08632
, arxiv, pdf, cication: 14
Rylan Schaeffer
-
Argilla 2.4: Easily Build Fine-Tuning and Evaluation Datasets on the Hub — No Code Required 🤗