
[float8] Add support for blockwise fp8 quantization scheme used in DeepSeek v3 #1594

Open
danielvegamyhre opened this issue Jan 22, 2025 · 3 comments
Labels: float8, inference, topic: new feature

danielvegamyhre (Contributor) commented Jan 22, 2025
DeepSeek v3 uses a block-wise fp8 quantization strategy, where the scaling factor is computed independently for each block rather than per tensor or per row. The reference code is available in the DeepSeek-V3 repository.

It would be useful for torchao to support this as well, for users wishing to do research or development with this same quantization strategy.

cc @drisspg @vkuzo
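
For concreteness, here is a minimal eager-mode sketch of the block-wise scheme, assuming 128x128 weight blocks and the float8_e4m3fn dtype; the block size, dtype choice, and function name are illustrative, not an existing torchao API:

```python
import torch

def quantize_fp8_blockwise(w: torch.Tensor, block_size: int = 128):
    """Quantize a 2D tensor to float8_e4m3fn with one scale per (block_size x block_size) block."""
    M, N = w.shape
    assert M % block_size == 0 and N % block_size == 0
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    # Reshape so each (block_size, block_size) block can be reduced independently.
    blocks = w.reshape(M // block_size, block_size, N // block_size, block_size)
    amax = blocks.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = amax / fp8_max  # one scale per block
    w_fp8 = (blocks / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return w_fp8.reshape(M, N), scale.reshape(M // block_size, N // block_size)

w = torch.randn(256, 512)
w_fp8, scales = quantize_fp8_blockwise(w)
# Dequantized reconstruction error, as a sanity check.
w_hat = w_fp8.float().reshape(2, 128, 4, 128) * scales.reshape(2, 1, 4, 1)
print((w - w_hat.reshape(256, 512)).abs().max())
```

Per the DeepSeek v3 report, weights use 128x128 blocks while activations use 1x128 groups along the channel dim; an actual torchao integration would presumably feed these per-block scales into an fp8 GEMM rather than dequantizing in eager mode as above.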

gau-nernst (Collaborator) commented:

Just want to add some of my observations here. I played around a bit with block-wise FP8 on my consumer GPU (4070 Ti SUPER, sm89). A simple Triton kernel does not perform very well: only a 1.4x speedup over BF16 (for reference, row-wise FP8 gives ~1.9x). With the dynamic quantization overhead, the end-to-end speedup won't be very attractive. (Of course, optimizing for Hopper is a completely different story.)

I also tried block-wise INT8 (which is the main idea of JetFire). A simple Triton kernel performs somewhat OK on a consumer GPU (1.9x speedup over BF16, versus 2.9x for row-wise INT8; note that INT8 matmul is 4x faster than BF16 on consumer GPUs), but on A100 I couldn't get any speedup (speedup < 1). This is probably because block-wise INT8 requires a dtype conversion from INT32 to FP32 when scaling the MMA accumulator results, while FP8 does not.

Regarding the quantization BLOCK_SIZE_K (the number of elements along the K dim that share one scale value), I think only K <= 128 admits a simple and performant implementation: if BLOCK_SIZE_K is too large, the kernel uses too much shared memory. I tried a few workarounds, such as loading tiles smaller than the quantization BLOCK_SIZE_K, but couldn't make them fast.
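
To make the BLOCK_SIZE_K point concrete, here is a rough eager-mode reference (not a kernel) of how per-K-block scales fold into the accumulation; the shapes, names, and the scale layouts are assumptions for illustration:

```python
import torch

def blockwise_scaled_mm_ref(a_fp8, a_scale, b_fp8, b_scale, block_k=128):
    # a_fp8: (M, K) fp8 activations, a_scale: (M, K // block_k)
    # b_fp8: (K, N) fp8 weights,     b_scale: (K // block_k, N), weight block scales pre-broadcast along N
    M, K = a_fp8.shape
    _, N = b_fp8.shape
    out = torch.zeros(M, N, dtype=torch.float32)
    for kb in range(K // block_k):
        ks = slice(kb * block_k, (kb + 1) * block_k)
        # Within one K block both scales are constant, so they factor out of the partial dot product;
        # the accumulator therefore has to be rescaled once per BLOCK_SIZE_K chunk of K.
        partial = a_fp8[:, ks].float() @ b_fp8[ks, :].float()
        out += partial * a_scale[:, kb:kb + 1] * b_scale[kb:kb + 1, :]
    return out
```

Inside an actual Triton or CUTLASS kernel, the same rescale has to happen once per K block of the inner loop, which is why the K tile size is tied to the quantization BLOCK_SIZE_K and hence to shared memory usage, as noted above.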

drisspg (Contributor) commented Jan 22, 2025

We should also take a look at the new block-wise fp8 GEMM added in CUTLASS 3.7.

cc @alexsamardzic

Degnel commented Feb 5, 2025

A PR has been created for this issue: #1668
