DeepSeek v3 uses a blockwise fp8 quantization strategy, where the scaling factor is computed independently for each block, rather than for each tensor/row/etc. The code is available here.
It would be useful for torchao to support this as well, for users wishing to do research or development with this same quantization strategy.
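To make the layout concrete, here is a rough eager-mode sketch of per-block scale computation (128x128 blocks as used for weights in the DeepSeek-V3 code, as I understand it; the function name and clamping details below are just illustrative, not a proposed torchao API):

```python
import torch

def quantize_fp8_blockwise(x: torch.Tensor, block_size: int = 128):
    """Quantize a 2D tensor to FP8 (e4m3) with one scale per block_size x block_size block.

    Illustrative only: names and clamping details are placeholders, not a proposed API.
    """
    M, K = x.shape
    assert M % block_size == 0 and K % block_size == 0, "pad to a multiple of block_size first"
    fp8_max = torch.finfo(torch.float8_e4m3fn).max

    # View each (block_size, block_size) block as its own pair of dims, then reduce over it.
    xb = x.reshape(M // block_size, block_size, K // block_size, block_size)
    amax = xb.abs().amax(dim=(1, 3), keepdim=True).float().clamp(min=1e-12)
    scale = amax / fp8_max                                    # one scale per block
    x_fp8 = (xb / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return x_fp8.reshape(M, K), scale.reshape(M // block_size, K // block_size)
```

Activations in the DeepSeek-V3 code use 1x128 tiles instead (per-token groups along K), which is the same idea with the reduction taken over the last dim only.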
Just want to add some of my observations here. I played around a bit with block-wise FP8 on my consumer GPU (4070Ti SUPER, sm89). A simple Triton kernel does not perform particularly well: only a ~1.4x speedup over BF16 (for reference, row-wise FP8 gives ~1.9x). Once dynamic quantization overhead is added, the end-to-end speedup won't be very attractive. (Of course, optimizing for Hopper will be a completely different story.)
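Roughly, a simple kernel of this kind applies the per-block scales to each partial `tl.dot` result inside the K loop. A minimal sketch (not the exact kernel I benchmarked: no masking, BLOCK_K assumed equal to the quantization group size of 128, 1x128 activation scales and 128x128 weight scales):

```python
import triton
import triton.language as tl

@triton.jit
def blockwise_fp8_gemm_kernel(
    a_ptr, b_ptr, c_ptr,            # A: (M, K) fp8, B: (K, N) fp8, C: (M, N) bf16
    a_scale_ptr, b_scale_ptr,       # a_scale: (M, K // 128) fp32, b_scale: (K // 128, N // 128) fp32
    M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    stride_asm, stride_ask, stride_bsk, stride_bsn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # Each program computes one BLOCK_M x BLOCK_N tile of C.
    # Assumes BLOCK_K == 128 (the quantization group size), M/N/K divisible by the tile
    # sizes (no masking), and BLOCK_N dividing 128 so one b_scale covers the whole tile.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)

    for k in range(0, tl.cdiv(K, BLOCK_K)):
        a = tl.load(a_ptrs)                                   # fp8 tile (BLOCK_M, BLOCK_K)
        b = tl.load(b_ptrs)                                   # fp8 tile (BLOCK_K, BLOCK_N)
        a_s = tl.load(a_scale_ptr + offs_m * stride_asm + k * stride_ask)      # per-row scales
        b_s = tl.load(b_scale_ptr + k * stride_bsk + (pid_n * BLOCK_N) // 128 * stride_bsn)
        # Rescale each partial product before adding it to the fp32 accumulator.
        acc += tl.dot(a, b) * a_s[:, None] * b_s
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc.to(tl.bfloat16))
```

The extra scale loads and per-iteration multiplies in the inner loop are the main difference from a plain FP8 GEMM.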
I also tried block-wise INT8 (the main idea of JetFire). A simple Triton kernel performs somewhat OK on a consumer GPU (1.9x speedup over BF16, versus 2.9x for row-wise INT8; note that INT8 matmul is 4x faster than BF16 on consumer GPUs), but on A100 I couldn't get any speedup at all (speedup < 1). This is probably because block-wise INT8 requires a dtype conversion from INT32 to FP32 when scaling the MMA accumulation results, while FP8 does not.
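A minimal illustration of that conversion using the (private) `torch._int_mm` op; the scales here are placeholders and applied per-row rather than per-block, but the dtype flow is the same:

```python
import torch

M, K, N = 256, 512, 256
a8 = torch.randint(-127, 128, (M, K), dtype=torch.int8, device="cuda")
b8 = torch.randint(-127, 128, (K, N), dtype=torch.int8, device="cuda")
a_scale = torch.rand(M, 1, device="cuda")
b_scale = torch.rand(1, N, device="cuda")

acc_i32 = torch._int_mm(a8, b8)                        # INT8 MMA accumulates in INT32
out = acc_i32.to(torch.float32) * a_scale * b_scale    # extra INT32 -> FP32 convert before scaling
# With FP8, the tensor-core accumulator is already FP32, so the per-block rescale
# multiplies it directly with no dtype conversion. In the block-wise INT8 kernel this
# convert happens for every K group inside the main loop, not just once in the epilogue.
```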
Regarding the quantization BLOCK_SIZE_K (the number of elements along the K dim that share one scale value), I think only BLOCK_SIZE_K <= 128 allows a simple and performant implementation, since a larger BLOCK_SIZE_K uses too much shared memory. I tried a few workarounds, such as loading tiles smaller than the quantization BLOCK_SIZE_K, but couldn't make them fast.
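Rough shared-memory math behind that limit (tile sizes here are just examples):

```python
# Back-of-the-envelope shared memory per pipeline stage for one fp8 GEMM tile
# (ignores Triton's exact allocation, swizzling, and the scale tensors themselves).
BLOCK_M, BLOCK_N = 128, 128
for block_k in (128, 256, 512):
    smem_bytes = (BLOCK_M * block_k + block_k * BLOCK_N) * 1   # 1 byte per fp8 element
    print(f"BLOCK_K={block_k}: ~{smem_bytes // 1024} KiB per stage")
# 128 -> 32 KiB, 256 -> 64 KiB, 512 -> 128 KiB per stage; with 2-3 stages of software
# pipelining, BLOCK_SIZE_K > 128 quickly runs past the ~100 KB of shared memory per SM
# on sm89, forcing smaller M/N tiles or fewer stages.
```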
cc @drisspg @vkuzo