Add FP5 E2M2 support from upstream #399

gau-nernst · 2024-06-19T01:01:40Z

usyd-fsalab/fp6_llm@5df6737

Also close #402

New API

from torchao.quantization.quant_api import quantize
from torchao.prototype.quant_llm import fp6_llm_weight_only, quant_llm_fpx_weight_only

model = ...
model.half()  # not necessary, but recommeneded to maintain accuracy
quantize(model, fp6_llm_weight_only())  # convert nn.Lineaer.weight to FP6 E3M2 in-place

# for generic FPx EyMz where x = 1 + y + z
# quantize(model, quant_llm_fpx_weight_only(2, 2))  # use FP5 E2M2 instead

# fully compatible with torch.compile()
model.compile(mode="max-autotune", fullgraph=True)

Benchmark results

Benchmarks are run on a machine with a single 4070Ti SUPER GPU using the scripts in _models/llama. tokens/s is measured using generate.py which generates text in a latency optimized way (batchsize=1). wikitext perplexity is measured using eval.py which uses lm_eval. The model used is meta-llama/Llama-2-7b-chat-hf.

FPx quantization is run with --precision float16. The rest uses the default precision of bfloat16.

Quantization	wikitext perplexity	tokens/s
INT8	12.21	87.45
INT4-256 (tinygemm)	(bug)	157.10
FP6 E3M2	12.34	106.76
FP6 E2M3	12.23	106.77
FP5 E3M1	12.55	122.69
FP5 E2M2	12.47	122.66
FP4 E3M0	14.58	145.55
FP4 E2M1	15.01	146.05
FP3 E2M0	74625.18	164.49

pytorch-bot · 2024-06-19T01:01:43Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/399

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 4e585e9 with merge base c2cf973 ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

msaroufim · 2024-06-24T16:09:20Z

For the nightly failure in CI make sure to rebase to main in ao. There's some upstream issue with triton we haven't debugged yet #429

EDIT: Fix was merged

jerryzh168 · 2024-06-24T18:20:34Z

torchao/prototype/quant_llm/quant_llm.py

+
+
+class QuantLlmLinearWeight(Tensor):
+    _implements = classmethod(_implements)


haha this is clever

nit: we can just do implements = classmethod(_implements) I think, implements can be public classmethod

jerryzh168 · 2024-06-25T02:04:52Z

tensor subclass changes LGTM, I'll leave the rest to @msaroufim. I'll also take a closer look at the rest of the code tomorrow

jerryzh168 · 2024-06-25T02:05:40Z

another nit comment I have is the quant_llm name, should we rename this to something more specific like fp6/fp5? also what is the difference between this as mx?

gau-nernst · 2024-06-25T02:29:27Z

another nit comment I have is the quant_llm name, should we rename this to something more specific like fp6/fp5?

I also don't really like the Quant-LLM name since it's quite ambiguous. But the upstream repo renames it to that (https://github.com/usyd-fsalab/fp6_llm) so I follow it. The original name (first release) was FP6-LLM.

The kernel supports arbitrary FP2 -> FP7, so naming it as only FP6/FP5 is quite limited. My benchmarks show that FP4 is not competitive with INT4 tinygemm (might be improved by block-wise quantization). For now I only enable FP6 E3M2, FP6 E2M3, FP5 E2M2, and FP5 E3M1 (each dtype support is a template instantiation. upstream repo only tested for FP6E3M2 and F5 E2M2). Something better might be like quant_llm_fpx. In the future upstream repo might even support INTx.

I'm open to changing the name.

also what is the difference between this as mx?

The FPx dtype itself is quite similar to MX, thus I refactored dtype casting code from @vkuzo (#363) and re-use it here. However, there are key differences

MX specs only specify FP6 E3M2, FP6 E2M3 and FP4 E2M1, while Quant-LLM kernel supports arbitrary FPx as mentioned above
MX dtype use block-wise quantization scale, and scale is in E8M0 format. In contrast, Quant-LLM kernel only supports per-row quantization scale.
Quant-LLM uses a special layout optimized for tensor cores (see pack_tc_fpx()). The idea is similar to tinygemm I think.

I think those differences are significant enough to make separate subclasses for them. It also helps with maintenance.

jerryzh168 · 2024-06-25T17:19:40Z

Thanks for the detailed context @gau-nernst, OK makes sense to keep the name for now I guess, but maybe we can talk to the author to make the name a bit more descriptive like you said (e.g. quant_llm_fpx), in torchao I feel we should probably use fpx (as a name for dtype) in the end to indicate that it supports different floating point dtypes.

* first update from upstream * add some primitives to support fp5 * binding for ExMy * add QuantLlmLinear * fix * update README * update README * remove fp6_linear from C++ * fix * fix * fix * update * add more experimental config * update * add from tc_fpx * remove redundant code * fix import * fix test * avoid division by 0 * add subclass. use uint8 * subclass API * update doc * remove unused op * update * rename. update * update docs * rename * fix for PyTorch 2.2 * _implements -> implements * set CUDA context * fix __repr__

first update from upstream

ce8fd7d

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 19, 2024

gau-nernst added 27 commits June 19, 2024 10:00

add some primitives to support fp5

d8bd7b6

binding for ExMy

129adff

add QuantLlmLinear

64c6cee

fix

38ad773

update README

057367e

update README

a6ed669

remove fp6_linear from C++

0d409a8

fix

0bd9ee1

fix

8fbd3d4

fix

5906eed

update

3b008d5

add more experimental config

9076e58

update

442e9c5

add from tc_fpx

d2d8019

remove redundant code

bb52ad0

fix import

80661ab

fix test

edfbe3d

avoid division by 0

0ecdd86

Merge branch 'pytorch:main' into fp5_llm

4e44a8d

add subclass. use uint8

e6c7d6b

subclass API

ca43bf8

update doc

8de2722

remove unused op

50bfe82

update

b9375a4

rename. update

3072257

update docs

7b822ef

rename

ca45dda

fix for PyTorch 2.2

36fe61e

gau-nernst marked this pull request as ready for review June 24, 2024 15:07

gau-nernst requested a review from msaroufim June 24, 2024 15:07

msaroufim requested a review from jerryzh168 June 24, 2024 15:12

jerryzh168 reviewed Jun 24, 2024

View reviewed changes

msaroufim mentioned this pull request Jun 24, 2024

The next tutorials #426

Open

7 tasks

gau-nernst added 2 commits June 24, 2024 22:00

Merge branch 'main' into fp5_llm

30608f3

_implements -> implements

57ad040

gau-nernst added 2 commits June 25, 2024 20:15

set CUDA context

ceaa71c

fix __repr__

4e585e9

msaroufim approved these changes Jun 25, 2024

View reviewed changes

msaroufim merged commit 70aef5d into pytorch:main Jun 25, 2024
13 checks passed

gau-nernst deleted the fp5_llm branch June 25, 2024 21:30

msaroufim mentioned this pull request Jun 27, 2024

What should .dtype for tensor subclass return? #442

Open

yanbing-j pushed a commit to yanbing-j/ao that referenced this pull request Dec 9, 2024

ci: Add llama3 gpu workflow in perioidic (pytorch#399)

f8224d9

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add FP5 E2M2 support from upstream #399

Add FP5 E2M2 support from upstream #399

gau-nernst commented Jun 19, 2024 •

edited

Loading

pytorch-bot bot commented Jun 19, 2024 •

edited

Loading

msaroufim commented Jun 24, 2024 •

edited

Loading

jerryzh168 Jun 24, 2024 •

edited

Loading

jerryzh168 commented Jun 25, 2024 •

edited

Loading

jerryzh168 commented Jun 25, 2024

gau-nernst commented Jun 25, 2024

jerryzh168 commented Jun 25, 2024



		class QuantLlmLinearWeight(Tensor):
		_implements = classmethod(_implements)

Add FP5 E2M2 support from upstream #399

Add FP5 E2M2 support from upstream #399

Conversation

gau-nernst commented Jun 19, 2024 • edited Loading

New API

Benchmark results

pytorch-bot bot commented Jun 19, 2024 • edited Loading

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/399

✅ No Failures

msaroufim commented Jun 24, 2024 • edited Loading

jerryzh168 Jun 24, 2024 • edited Loading

Choose a reason for hiding this comment

jerryzh168 commented Jun 25, 2024 • edited Loading

jerryzh168 commented Jun 25, 2024

gau-nernst commented Jun 25, 2024

jerryzh168 commented Jun 25, 2024

gau-nernst commented Jun 19, 2024 •

edited

Loading

pytorch-bot bot commented Jun 19, 2024 •

edited

Loading

msaroufim commented Jun 24, 2024 •

edited

Loading

jerryzh168 Jun 24, 2024 •

edited

Loading

jerryzh168 commented Jun 25, 2024 •

edited

Loading