-
A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression,
arXiv, 2412.17483
, arxiv, pdf, cication: -1Chenlong Deng, Zhisong Zhang, Kelong Mao, ..., Dong Yu, Zhicheng Dou
-
A Survey on Large Language Model Acceleration based on KV Cache Management,
arXiv, 2412.19442
, arxiv, pdf, cication: -1Haoyang Li, Yiming Li, Anxin Tian, ..., Qing Li, Lei Chen · (Awesome-KV-Cache-Management - TreeAI-Lab)
-
A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness,
arXiv, 2411.03350
, arxiv, pdf, cication: -1Fali Wang, Zhiwei Zhang, Xianren Zhang, ..., Ming Huang, Suhang Wang · (mp.weixin.qq)
-
A Survey of Small Language Models,
arXiv, 2410.20011
, arxiv, pdf, cication: -1Chien Van Nguyen, Xuan Shen, Ryan Aponte, ..., Ryan A. Rossi, Thien Huu Nguyen
-
🌟 Tensor Product Attention Is All You Need,
arXiv, 2501.06425
, arxiv, pdf, cication: -1Yifan Zhang, Yifeng Liu, Huizhuo Yuan, ..., Quanquan Gu, Andrew Chi-Chih Yao · (tensorgi.github) · (T6 - tensorgi)
-
🌟 Better & Faster Large Language Models via Multi-token Prediction,
arXiv, 2404.19737
, arxiv, pdf, cication: -1Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, ..., David Lopez-Paz, Gabriel Synnaeve
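The paper above trains several extra output heads on a shared trunk, each predicting a token further into the future. Below is a minimal, hedged sketch of that training objective; the module name, dimensions, and head count are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of multi-token prediction: n_future linear heads share one trunk,
# head i is trained to predict the token (i+1) steps ahead. Names/dims are illustrative.
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHeads(nn.Module):
    def __init__(self, d_model=256, vocab_size=32000, n_future=4):
        super().__init__()
        self.n_future = n_future
        self.heads = nn.ModuleList([nn.Linear(d_model, vocab_size) for _ in range(n_future)])

    def loss(self, trunk_hidden, targets):
        # trunk_hidden: (batch, seq, d_model) hidden states from a shared transformer trunk
        # targets:      (batch, seq) token ids
        total = 0.0
        for i, head in enumerate(self.heads):
            offset = i + 1
            logits = head(trunk_hidden[:, :-offset])          # predictions for token t+offset
            tgt = targets[:, offset:]
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), tgt.reshape(-1))
        return total / self.n_future
```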
-
SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator,
arXiv, 2412.12094
, arxiv, pdf, cication: -1Guoxuan Chen, Han Shi, Jiawei Li, ..., Weiyang Liu, Chao Huang
-
Yi-Lightning Technical Report,
arXiv, 2412.01253
, arxiv, pdf, cication: -101.AI, Alan Wake, ..., Zhiyuan Liu, Zirui Zhang
-
Knowledge Composition using Task Vectors with Learned Anisotropic Scaling,
arXiv, 2407.02880
, arxiv, pdf, cication: -1Frederic Z. Zhang, Paul Albert, Cristian Rodriguez-Opazo, ..., Anton van den Hengel, Ehsan Abbasnejad · (atlas - fredzzhang)
-
Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study,
arXiv, 2411.02462
, arxiv, pdf, cication: -1André Storhaug, Jingyue Li · (peft-unit-test-generation-replication-package - andstor)
-
LoRA vs Full Fine-tuning: An Illusion of Equivalence,
arXiv, 2410.21228
, arxiv, pdf, cication: -1Reece Shuttleworth, Jacob Andreas, Antonio Torralba, ..., Pratyusha Sharma · (𝕏)
-
Unsloth - Dynamic 4-bit Quantization
· (𝕏)
-
auto-round - intel
· (reddit)
-
Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens,
arXiv, 2411.17691
, arxiv, pdf, cication: -1Xu Ouyang, Tao Ge, Thomas Hartvigsen, ..., Haitao Mi, Dong Yu · (huggingface)
-
PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs,
arXiv, 2410.05265
, arxiv, pdf, cication: -1Mengzhao Chen, Yi Liu, Jiahao Wang, ..., Wenqi Shao, Ping Luo · (PrefixQuant - ChenMnZ) · (arxiv)
-
🌟 Scaling Laws for Precision,
arXiv, 2411.04330
, arxiv, pdf, cication: -1Tanishq Kumar, Zachary Ankner, Benjamin F. Spector, ..., Christopher Ré, Aditi Raghunathan · (𝕏) · (𝕏)
-
🌟 BitNet a4.8: 4-bit Activations for 1-bit LLMs,
arXiv, 2411.04965
, arxiv, pdf, cication: -1Hongyu Wang, Shuming Ma, Furu Wei
-
"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization,
arXiv, 2411.02355
, arxiv, pdf, cication: -1Eldar Kurtic, Alexandre Marques, Shubhra Pandit, ..., Mark Kurtz, Dan Alistarh
-
QTIP: Quantization with Trellises and Incoherence Processing,
arXiv, 2406.11235
, arxiv, pdf, cication: 1Albert Tseng, Qingyao Sun, David Hou, ..., Christopher De Sa · (qtip - Cornell-RelaxML) · (x) · (t)
-
Active Data Curation Effectively Distills Large-Scale Multimodal Models,
arXiv, 2411.18674
, arxiv, pdf, cication: -1Vishaal Udandarao, Nikhil Parthasarathy, Muhammad Ferjad Naeem, ..., Alessio Tonioni, Olivier J. Hénaff · (𝕏)
-
Stronger Models are NOT Stronger Teachers for Instruction Tuning,
arXiv, 2411.07133
, arxiv, pdf, cication: -1Zhangchen Xu, Fengqing Jiang, Luyao Niu, ..., Bill Yuchen Lin, Radha Poovendran
-
Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling,
arXiv, 2410.11325
, arxiv, pdf, cication: -1Wenda Xu, Rujun Han, Zifeng Wang, ..., Chen-Yu Lee, Tomas Pfister
-
The Super Weight in Large Language Models,
arXiv, 2411.07191
, arxiv, pdf, cication: -1Mengxia Yu, De Wang, Qi Shan, ..., Colorado Reed, Alvin Wan
-
Sparsing Law: Towards Large Language Models with Greater Activation Sparsity,
arXiv, 2411.02335
, arxiv, pdf, cication: -1Yuqi Luo, Chenyang Song, Xu Han, ..., Zhiyuan Liu, Maosong Sun
-
What Matters in Transformers? Not All Attention is Needed,
arXiv, 2406.15786
, arxiv, pdf, cication: 1Shwai He, Guoheng Sun, Zheyu Shen, ..., Ang Li
-
Efficiently Serving LLM Reasoning Programs with Certaindex,
arXiv, 2412.20993
, arxiv, pdf, cication: -1Yichao Fu, Junda Chen, Siqi Zhu, ..., Aurick Qiao, Hao Zhang
-
🌟 sglang - sgl-project
-
FlashInfer 0.2 - Efficient and Customizable Kernels for LLM Inference Serving
· (𝕏) · (𝕏)
-
ZhiLight - zhihu
-
Introducing SGLang Router: a cache-aware router for LLM Inference in SGLang v0.4 𝕏
-
Puzzle: Distillation-Based NAS for Inference-Optimized LLMs,
arXiv, 2411.19146
, arxiv, pdf, cication: -1Akhiad Bercovich, Tomer Ronen, Talor Abramovich, ..., Ran Zilberstein, Ran El-Yaniv
-
Star Attention: Efficient LLM Inference over Long Sequences,
arXiv, 2411.17116
, arxiv, pdf, cication: -1Shantanu Acharya, Fei Jia, Boris Ginsburg · (Star-Attention - NVIDIA)
-
Mooncake - kvcache-ai
A KVCache-centric Disaggregated Architecture for LLM Serving
-
Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding,
arXiv, 2411.18462
, arxiv, pdf, cication: -1Ziyin Zhang, Jiahao Xu, Tian Liang, ..., Rui Wang, Zhaopeng Tu
-
SAM Decoding: Speculative Decoding via Suffix Automaton,
arXiv, 2411.10666
, arxiv, pdf, cication: -1Yuxuan Hu, Ke Wang, Jing Zhang, ..., Cuiping Li, Hong Chen · (SAM-Decoding - hyx1999)
-
FastDraft: How to Train Your Draft,
arXiv, 2411.11055
, arxiv, pdf, cication: -1Ofir Zafrir, Igor Margulis, Dorin Shteyman, ..., Guy Boudoukh
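The three speculative-decoding entries above all build on the same draft-then-verify loop: a small draft model proposes several tokens, and the target model checks them in a single forward pass. Below is a hedged sketch of that loop for the greedy case; `draft_model` and `target_model` are hypothetical callables returning per-position logits, not any specific library's API.

```python
# Hedged sketch of plain draft-and-verify speculative decoding (greedy variant).
# `draft_model(tokens)` and `target_model(tokens)` are assumed to return a
# (len(tokens), vocab_size) logits tensor; they stand in for real model calls.
import torch

def speculative_step(prefix, draft_model, target_model, k=4):
    # 1) Draft k tokens autoregressively with the cheap model.
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        nxt = int(torch.argmax(draft_model(ctx)[-1]))
        drafted.append(nxt)
        ctx.append(nxt)

    # 2) Verify all k drafted positions with one target-model forward pass.
    logits = target_model(prefix + drafted)
    accepted = []
    for i, tok in enumerate(drafted):
        target_tok = int(torch.argmax(logits[len(prefix) - 1 + i]))
        if target_tok == tok:
            accepted.append(tok)            # target agrees: keep the drafted token
        else:
            accepted.append(target_tok)     # first mismatch: take the target's token and stop
            break
    return prefix + accepted
```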
-
SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration,
arXiv, 2411.10958
, arxiv, pdf, cication: -1Jintao Zhang, Haofeng Huang, Pengle Zhang, ..., Jun Zhu, Jianfei Chen · (SageAttention - thu-ml)
-
distributed-llama - b4rtaz
-
SGLang: Fast Serving Framework for Large Language and Vision-Language Models on AMD GPUs
-
OpenAI beats Anthropic and Fireworks to releasing Speculative Decoding
-
Latency optimization
Improve latency across a wide variety of LLM-related use cases.
-
A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression,
arXiv, 2406.11430
, arxiv, pdf, cication: 5Alessio Devoto, Yu Zhao, Simone Scardapane, ..., Pasquale Minervini
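The entry above ranks cached key/value pairs by the L2 norm of their key vectors (the paper reports that keys with small norm tend to attract most of the attention, so those are the ones worth keeping). The snippet below is only a rough sketch of that selection step under that assumption; shapes and the keep ratio are illustrative, not the authors' implementation.

```python
# Rough sketch of L2-norm-based KV cache compression for a single attention head:
# keep the cache entries whose key vectors have the smallest L2 norm (assumption
# based on the paper's observation that low-norm keys receive high attention).
import torch

def compress_kv(keys, values, keep_ratio=0.5):
    # keys, values: (seq_len, head_dim) cached tensors for one head
    n_keep = max(1, int(keys.size(0) * keep_ratio))
    norms = keys.norm(dim=-1)                                      # L2 norm of each cached key
    keep_idx = torch.topk(norms, n_keep, largest=False).indices    # indices of lowest-norm keys
    keep_idx, _ = torch.sort(keep_idx)                             # keep original token order
    return keys[keep_idx], values[keep_idx]
```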
-
SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration,
arXiv, 2410.02367
, arxiv, pdf, cication: -1Jintao Zhang, Jia Wei, Pengle Zhang, ..., Jun Zhu, Jianfei Chen
-
Fast Best-of-N Decoding via Speculative Rejection,
arXiv, 2410.20290
, arxiv, pdf, cication: -1Hanshi Sun, Momin Haider, Ruiqi Zhang, ..., Peter Bartlett, Andrea Zanette
-
Optimizing and Characterizing High-Throughput Low-Latency LLM Inference in MLCEngine
· (reddit)
-
Battle of Inference Engines: Llama.cpp vs MLC LLM vs vLLM
· (reddit)
-
Universal Assisted Generation: Faster Decoding with Any Assistant Model 🤗
-
Models continually pretrained using LayerSkip 🤗
· (arxiv)
-
SlimLM: An Efficient Small Language Model for On-Device Document Assistance,
arXiv, 2411.09944
, arxiv, pdf, cication: -1Thang M. Pham, Phat T. Nguyen, Seunghyun Yoon, ..., Franck Dernoncourt, Trung Bui
-
Hymba: A Hybrid-head Architecture for Small Language Models,
arXiv, 2411.13676
, arxiv, pdf, cication: -1Xin Dong, Yonggan Fu, Shizhe Diao, ..., Jan Kautz, Pavlo Molchanov
-
MobileLLM is an auto-regressive language model leveraging an optimized transformer architecture 🤗
· (arxiv)
-
🌟 SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration,
arXiv, 2411.10958
, arxiv, pdf, cication: -1Jintao Zhang, Haofeng Huang, Pengle Zhang, ..., Jun Zhu, Jianfei Chen · (SageAttention - thu-ml)
-
ThunderKittens: Simple, Fast, and Adorable AI Kernels,
arXiv, 2410.20399
, arxiv, pdf, cication: -1Benjamin F. Spector, Simran Arora, Aaryan Singhal, ..., Daniel Y. Fu, Christopher Ré
-
SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs,
arXiv, 2410.13276
, arxiv, pdf, cication: -1Yizhao Gao, Zhichen Zeng, Dayou Du, ..., Fan Yang, Mao Yang
-
MoH: Multi-Head Attention as Mixture-of-Head Attention,
arXiv, 2410.11842
, arxiv, pdf, cication: -1Peng Jin, Bo Zhu, Li Yuan, ..., Shuicheng Yan · (arxiv) · (MoH - SkyworkAI) · (huggingface)
-
nano-sparse-attention - PiotrNawrot
· (𝕏)
-
sgl-learning-materials - sgl-project
-
exo - exo-explore
Run your own AI cluster at home with everyday devices.