Tensor subclass methods for DTensor and FSDP2 #1664

Open · jeromeku opened this issue Feb 5, 2025 · 3 comments

Labels: question (Further information is requested)

Comments

jeromeku (Collaborator) commented Feb 5, 2025

Is there a protocol / interface that a tensor subclass must implement in order to be used with DTensor primitives and for training with FSDP2?

I've been walking through NF4 as an example since it covers both. However, the relevant methods are scattered across __torch_function__ and __torch_dispatch__ (though the unit tests make it clear which ops are exercised for FSDP).

Is there a cleaner / expected format for subclassing a tensor such that

  • it can be used with DTensor collectives and FSDP2, and
  • composed with subclass-specific overrides for streamlined use with torch.compile?

@msaroufim @awgu @weifengpy @jerryzh168


P.S. FWIW, I also looked at the developer-guide tensor subclass example but found the abstractions a bit hard to follow; I'd personally prefer to use torch-native functionality.

gau-nernst (Collaborator) commented:
I worked on some tensor subclasses that work with DTensor + FSDP2. You might find this useful (it's more concise than NF4):

https://github.com/pytorch/ao/blob/v0.8.0/torchao/prototype/quantized_training/int8.py

drisspg added the question (Further information is requested) label on Feb 5, 2025
drisspg (Contributor) commented Feb 5, 2025

@weifengpy Do we have any documentation on this?

weifengpy (Contributor) commented Feb 5, 2025

> However, the methods are scattered across __torch_function__ and __torch_dispatch__

@jeromeku I always prefer __torch_dispatch__, and it should be enough. NF4 has legacy __torch_dispatch__ implementations for single device; when I extended NF4 to FSDP2, I had to use __torch_function__ to stay backward compatible.
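
As a rough sketch of what the __torch_dispatch__-centric route looks like (a hypothetical wrapper subclass, not the NF4 or float8 code): __torch_function__ fires at the Python API level before autograd, __torch_dispatch__ sees the decomposed aten ops afterwards, and for a wrapper subclass the dispatch handler can simply unwrap to the inner tensor, run the op, and re-wrap:

```python
import torch
from torch.utils._pytree import tree_map_only

class InnerWrappedTensor(torch.Tensor):
    """Hypothetical wrapper subclass: all data lives in `self.inner`."""

    @staticmethod
    def __new__(cls, inner: torch.Tensor):
        # The outer tensor mirrors the inner tensor's metadata but holds no storage.
        return torch.Tensor._make_wrapper_subclass(
            cls, inner.shape, dtype=inner.dtype, device=inner.device,
            requires_grad=inner.requires_grad,
        )

    def __init__(self, inner: torch.Tensor):
        self.inner = inner

    def __repr__(self):
        return f"InnerWrappedTensor({self.inner!r})"

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        # Fires at the Python API level (e.g. F.linear), before autograd.
        # Just fall through so everything reaches __torch_dispatch__ below.
        with torch._C.DisableTorchFunctionSubclass():
            return func(*args, **(kwargs or {}))

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        # Fires once per aten op (e.g. aten.mm.default), below autograd.
        # Unwrap every InnerWrappedTensor argument, run the op on plain
        # tensors, then re-wrap tensor outputs to keep the subclass alive.
        args, kwargs = tree_map_only(cls, lambda t: t.inner, (args, kwargs or {}))
        out = func(*args, **kwargs)
        return tree_map_only(torch.Tensor, lambda t: cls(t), out)
```

The float8 snippet later in this comment is this same unwrap-and-redispatch pattern.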

> Is there a cleaner / expected format for subclassing a tensor such that it can be used with DTensor collectives and FSDP2?

For FSDP2 + NF4, as you mentioned, the required tensor ops are defined in TestFSDPOps (class TestFSDPOps(TestCase)). I implemented those tensor ops one by one because NF4 contains many attributes (scalars or small tensors) that are not shardable, so I had to manually define the behavior for each op.
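
A condensed sketch of that "one op at a time" pattern (hypothetical names, not the actual NF4 code): keep a table from aten op to handler, route __torch_dispatch__ through it, and make each handler rebuild the subclass while carrying the non-shardable attributes along explicitly:

```python
import torch

OPS_TABLE = {}  # aten overload -> handler

def implements(aten_ops):
    # Register a handler for specific aten ops.
    def decorator(fn):
        for op in aten_ops:
            OPS_TABLE[op] = fn
        return fn
    return decorator

class QuantizedParam(torch.Tensor):
    """Toy subclass with a non-shardable attribute (`scale`)."""

    @staticmethod
    def __new__(cls, codes: torch.Tensor, scale: torch.Tensor):
        return torch.Tensor._make_wrapper_subclass(
            cls, codes.shape, dtype=torch.float32, device=codes.device
        )

    def __init__(self, codes: torch.Tensor, scale: torch.Tensor):
        self.codes = codes  # shardable quantized payload
        self.scale = scale  # per-tensor scale: replicated, not sharded

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        if func in OPS_TABLE:
            return OPS_TABLE[func](func, args, kwargs or {})
        raise NotImplementedError(f"{cls.__name__} has no handler for {func}")

# FSDP2's sharding path exercises ops such as detach / split / new_zeros /
# copy_ (see TestFSDPOps for the list actually tested); each handler must
# decide what happens to `scale`.
@implements([torch.ops.aten.detach.default])
def _detach(func, args, kwargs):
    src = args[0]
    return QuantizedParam(src.codes.detach(), src.scale)
```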

For FSDP2 + float8, it's simply dispatching every tensor op to inner tensors with 3 lines of code:

```python
args, kwargs = pytree.tree_map_only(
    WeightWithDynamicFloat8CastTensor, unwrap, (args, kwargs or {})
)
out = func(*args, **kwargs)
```

For DTensor, if you try DTensor(local_tensor=your_tensor_subclass) and call .full_tensor(), you should see an unimplemented tensor op related to all-gather. I guess this is required for the state dict. I don't have an example because we issue collectives on the local_tensors directly in torchtune:

https://github.com/pytorch/torchtune/blob/9475b5adab6aa2746b08c73059ca9af9f791559a/torchtune/training/_distributed.py#L259
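
A minimal way to reproduce that experiment, using the hypothetical InnerWrappedTensor from the earlier sketch (this assumes a recent PyTorch with the public torch.distributed.tensor API and a torchrun launch):

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import DTensor, Shard

# Run under torchrun so rank / world size are set; use "nccl" + "cuda" on GPUs.
dist.init_process_group("gloo")
mesh = init_device_mesh("cpu", (dist.get_world_size(),))

# Each rank wraps its local shard in the subclass, then in DTensor.
local = InnerWrappedTensor(torch.randn(128, 64))
dt = DTensor.from_local(local, mesh, [Shard(0)])

# .full_tensor() all-gathers the shards; any aten op that the gather path
# hits but the subclass does not implement surfaces here.
full = dt.full_tensor()
```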

> Do we have any documentation on this?

@drisspg I commented with examples above, but there is no official document yet. Agreed, we should document this better; I will think about it.
