[BUG]: Low_Level_Zero plugin crashes with LoRA #5909
Comments
Hey @Fallqs, thanks for reporting the bug; I will look into this. Btw, would it be possible to share the code you are using, or a minimal repro for the LoRA crash?
Sorry to bother you, but could you please describe it in more detail? I am using version 0.3.6 of ColossalAI, and I put the following code in the corresponding position according to your implementation, but it didn't work. Is it because I put it in the wrong place? I also want to use LoRA tuning. This is my code:
This is my issue:
Please share a minimal script to reproduce the error. Your code is wrong, as `_run_reduction` reduces grads for all bucketed parameters.
Thank you for your reply. Regarding the above issue, I found that my code was added in the wrong location; the original poster was referring to line 808 of version 0.4.1. Now I have a new question:
When I tried to use `p.grad`, an error occurred. After checking, I found that after using ColossalAI I cannot directly access the gradient via `p.grad`. So the question is: how can we obtain gradient information?

[rank0]: Traceback (most recent call last):

This is the website I searched for:

Thank you again for your enthusiastic response.
You can get the grads by calling `optimizer._grad_store.get_partitioned_gradients_by_param_id(0, id(p))` on the zero optimizer.
I have read the above code before, but my code does not involve a zero_optimizer. Can you be more specific about how to implement it?
Does your training code involve an optimizer? That's what you're looking for.
Sorry to bother you again; let me refine my question. The following is a minimal reproduction of my problem, although it involves several opensora methods that need to be imported. I used the call `optimizer._grad_store.get_partitioned_gradients_by_param_id(0, id(p))` mentioned above to try to access the parameters' gradients, but I did not get any values. The optimizer I use here is HybridAdam, and I use Booster, which is not used in the link you gave. My question is: how can I get the gradient with the code above?
This is my run command:

I also tried some code that can successfully obtain the gradient, as follows:
I don't know what the difference between the two is. I think it is that one uses `booster.backward(loss, optimizer)` and the other uses `loss.backward()` to backpropagate the gradient. Is it possible that I can't get the gradient when I use the booster, or is there something wrong with my code?
Hey @281LinChenjian, regarding the problem you've got:

Code snippet 1: here, after the optimizer updates the params, it clears the stored grads, so the grad store is empty by the time you query it.

Code snippet 2: don't call …
Since there is no universal API for accessing gradients, it can be a bit tricky and confusing, so feel free to ask here or open another issue if you still have problems :)
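To make the ordering concrete, here is a rough sketch (names like `model`, `criterion`, and `batch` are placeholders; it assumes the Booster + HybridAdam + LowLevelZero setup discussed in this thread and gradient group id 0):

```python
# Sketch only: shows when the partitioned grads are available, based on the
# explanation above (the grad store is cleared once the optimizer has stepped).
output = model(batch)                      # placeholder forward pass
loss = criterion(output)
booster.backward(loss, optimizer)          # booster.backward, as in your first snippet

# Read the shards *before* optimizer.step(); group id 0 is assumed here.
for p in model.parameters():
    if p.requires_grad:
        shard = optimizer._grad_store.get_partitioned_gradients_by_param_id(0, id(p))
        print(p.shape, shard)

optimizer.step()                           # after this, the stored grads are cleared
optimizer.zero_grad()
```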
Thank you for your generous help. I have thoroughly understood how to use this method. Thank you again for your patient answers!!!
Sorry to bother you again. I found that when training with multiple graphics cards, the shape obtained by `get_partitioned_gradients_by_param_id` no longer matches the parameter's shape.

This is the error when training with two graphics cards:

This is the error when training with four graphics cards:

Interestingly, 3456×1152 = 1990656×2, and 3456×1152 = 995328×4.
ZeRO splits gradients evenly across devices.
Is there any way to put them back together, or to get the corresponding gradients from the different graphics cards?
@281LinChenjian I guess you'll have to manually do an all-gather across ranks. For your reference:
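A rough sketch of what that might look like (assumptions: each rank holds an equal flat shard with no padding, the grad-store call behaves as discussed above, and group id 0 is used; the exact return format may differ between ColossalAI versions):

```python
import torch
import torch.distributed as dist

def gather_full_grad(optimizer, param, group_id=0):
    """Illustrative only: reassemble a ZeRO-partitioned gradient on every rank."""
    shard = optimizer._grad_store.get_partitioned_gradients_by_param_id(group_id, id(param))
    # The call may return a list of tensors or a single flat tensor; flatten either way.
    if isinstance(shard, (list, tuple)):
        shard = torch.cat([g.flatten() for g in shard])
    else:
        shard = shard.flatten()
    # Collect every rank's shard and stitch them back into the full parameter shape,
    # e.g. 2 x 1990656 or 4 x 995328 pieces -> the 3456 x 1152 gradient above.
    pieces = [torch.empty_like(shard) for _ in range(dist.get_world_size())]
    dist.all_gather(pieces, shard)
    return torch.cat(pieces).view(param.shape)
```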
Is there an existing issue for this bug?
🐛 Describe the bug
Line 808 of `zero/low_level/low_level_optim.py` assumes that every single parameter in `model.parameters()` is trainable. However, this is not true for LoRA tuning, which results in training crashes. To solve this issue, you may just add a shortcut below this `for` loop (see the sketch below):
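A minimal sketch of the kind of guard meant here (illustrative only, not the exact patch; it assumes the loop iterates over `model.parameters()` as described above):

```python
import torch.nn as nn

# Stand-in model: under LoRA tuning, most base parameters are frozen.
model = nn.Linear(1152, 3456)
model.weight.requires_grad_(False)

for param in model.parameters():
    if not param.requires_grad:
        # The suggested shortcut: skip frozen parameters instead of
        # assuming every parameter has a gradient to handle.
        continue
    # ... the optimizer's original per-parameter handling goes here ...
    print("trainable:", param.shape)
```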
Environment
CUDA 12.1
PyTorch 2.1.2
ColossalAI 0.4.0 [This BUG is not observed in 0.3.5]