
[BUG]: Low_Level_Zero plugin crashes with LoRA #5909

Open
Fallqs opened this issue Jul 15, 2024 · 14 comments
Labels
bug Something isn't working

Comments

@Fallqs

Fallqs commented Jul 15, 2024

Is there an existing issue for this bug?

  • I have searched the existing issues

🐛 Describe the bug

Line 808 of zero/low_level/low_level_optim.py assumes that every parameter in model.parameters() is trainable. This does not hold for LoRA tuning, where most base-model parameters are frozen, so training crashes.

To solve this issue, you can simply add a guard at the top of this for-loop:

for p in model.parameters():  # line 808
    if not p.requires_grad:
        continue
    ...
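For context, here is a minimal sketch (not from the original report; the toy model is made up for illustration) of the kind of setup that produces frozen parameters and therefore hits this assumption:

import torch.nn as nn

# Hypothetical toy model: a frozen "base" layer plus a trainable adapter layer,
# mimicking LoRA-style fine-tuning where only the adapters require gradients.
model = nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 1024))
for p in model[0].parameters():
    p.requires_grad = False  # frozen base weights

# model.parameters() now mixes trainable and frozen params, so any code that
# assumes every parameter is trainable needs the requires_grad guard above.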

Environment

CUDA 12.1
PyTorch 2.1.2
ColossalAI 0.4.0 [This BUG is not observed in 0.3.5]

@Fallqs Fallqs added the bug Something isn't working label Jul 15, 2024
@botbw
Member

botbw commented Jul 19, 2024

Hey @Fallqs, thanks for reporting the bug; I will look into it. By the way, would it be possible to share the code you are using, or a minimal repro of the LoRA crash?

@281LinChenjian

Sorry to bother you, but could you describe this in more detail? I am using ColossalAI 0.3.6, and I added the following code at the corresponding position according to your suggestion, but it didn't work. Did I put it in the wrong place? I also want to use LoRA tuning.

This is my code:

def _sync_grad(self):
    for group_id in range(self.num_param_groups):
        param_group = self._working_param_groups[group_id]
        for param in param_group:
            if param.requires_grad and param.grad is not None:
                self._add_to_bucket(param, group_id)

    for p in model.parameters():  # line 808
        if not p.requires_grad:
            continue
        self._run_reduction()

This is the error I get:
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/yangl/LCJ_97/Open-Sora/scripts/finetune_lora.py", line 427, in
[rank0]: main()
[rank0]: File "/home/yangl/LCJ_97/Open-Sora/scripts/finetune_lora.py", line 331, in main
[rank0]: optimizer.step()
[rank0]: File "/home/yangl/.local/lib/python3.9/site-packages/colossalai/zero/low_level/low_level_optim.py", line 597, in step
[rank0]: working_grads = self._grad_store.get_working_grads_by_group_id(group_id)
[rank0]: File "/home/yangl/.local/lib/python3.9/site-packages/colossalai/zero/low_level/bookkeeping/gradient_store.py", line 85, in get_working_grads_by_group_id
[rank0]: for param_grads in self._grads_of_params[group_id].values():
[rank0]: KeyError: 0

@Edenzzzz
Contributor

Please share a minimal script to reproduce the error. Your code is wrong, as _run_reduction reduces grads for all bucketed parameters. As far as I can tell, non-trainable params are not added to the bucket for reduction:

if param.requires_grad and param.grad is not None:

@281LinChenjian

Please share a minimal script to reproduce the error. Your code is wrong, as _run_reduction reduces grads for all bucketed parameters. As far as I can tell, non-trainable params are not added to the bucket for reduction:

if param.requires_grad and param.grad is not None:

Thank you for your reply. Regarding the above issue, I found that I had added my code in the wrong place; the line 808 you referenced is from version 0.4.1. Now I have a new question:

def update_ipt(self, model): 
    for n,p in model.named_parameters():
        if "lora_" in n:
            # if p.grad is not None:
            #     print("grad:",p.grad)
            # print(p.requires_grad)
            # if not p.requires_grad:
            #     p.requires_grad = True  # Ensure requires_grad is True for 'lora_' parameters
            # p.retain_grad()
            if n not in self.ipt:
                self.ipt[n] = torch.zeros_like(p)
                self.exp_avg_ipt[n] = torch.zeros_like(p) 
                self.exp_avg_unc[n] = torch.zeros_like(p) 
            with torch.no_grad():
                # Calculate sensitivity 
                print("p.grad:",p.grad)
                self.ipt[n] = (p * p.grad).abs().detach()
                # Update sensitivity 
                self.exp_avg_ipt[n] = self.beta1 * self.exp_avg_ipt[n] + \
                                    (1-self.beta1)*self.ipt[n]
                # Update uncertainty 
                self.exp_avg_unc[n] = self.beta2 * self.exp_avg_unc[n] + \
                                    (1-self.beta2)*(self.ipt[n]-self.exp_avg_ipt[n]).abs()

When I tried to use p.grad, an error occurred. After checking, I found that after switching to ColossalAI I cannot access the gradient directly through p.grad. So the question is: how can I obtain the gradient information?

[rank0]: Traceback (most recent call last):
[rank0]: File "/home/yangl/LCJ_97/Open-Sora/scripts/finetune_lora.py", line 439, in
[rank0]: main()
[rank0]: File "/home/yangl/LCJ_97/Open-Sora/scripts/finetune_lora.py", line 346, in main
[rank0]: rankallocator.update_and_mask(model, epoch)
[rank0]: File "/home/yangl/LCJ_97/AdaLoRA/loralib/loralib/adalora.py", line 320, in update_and_mask
[rank0]: self.update_ipt(model)
[rank0]: File "/home/yangl/LCJ_97/AdaLoRA/loralib/loralib/adalora.py", line 228, in update_ipt
[rank0]: self.ipt[n] = (p * p.grad).abs().detach()
[rank0]: TypeError: unsupported operand type(s) for *: 'Parameter' and 'NoneType'

This is the related issue I found:
hpcaitech/Open-Sora#283

Thank you again for your enthusiastic response.

@Edenzzzz
Contributor

Edenzzzz commented Jul 22, 2024

You can get the grads by calling get_partitioned_gradients_by_param_id, as described in the issue you mentioned:
hpcaitech/Open-Sora#283 (comment)

@281LinChenjian

You can get the grads this way, described in the issue you mentioned hpcaitech/Open-Sora#283 (comment)

I have read that code before, but my implementation does not involve the ZeRO optimizer directly. Could you be more specific about how to implement it?

@Edenzzzz
Contributor

Edenzzzz commented Jul 22, 2024

Does your training code involve an optimizer? That's what you're looking for

@281LinChenjian

Does your training code involve an optimizer? That's what you're looking for

Sorry to bother you again; let me refine my question. The following is a minimal reproduction of my problem, though it imports a few Open-Sora helpers. I used the optimizer._grad_store.get_partitioned_gradients_by_param_id(0, id(p)) call mentioned above to try to access the parameter gradients, but I did not get any values. The optimizer here is HybridAdam, and I use Booster, which the link you gave does not use. My question is: how can I get the gradients with the code below?
This is my code:

import torch
import torch.nn as nn
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.nn.optimizer import HybridAdam

from opensora.utils.train_utils import MaskGenerator, create_colossalai_plugin, update_ema
from opensora.utils.config_utils import define_experiment_workspace, parse_configs, save_training_config
colossalai.launch_from_torch({})
cfg = parse_configs(training=True)
cfg_dtype = cfg.get("dtype", "bf16")
plugin = create_colossalai_plugin(
    plugin=cfg.get("plugin", "zero2"),
    dtype=cfg_dtype,
    grad_clip=cfg.get("grad_clip", 0),
    sp_size=cfg.get("sp_size", 1),
    reduce_bucket_size_in_m=cfg.get("reduce_bucket_size_in_m", 20),
)
booster = Booster(plugin=plugin)

class Model(nn.Module):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.embedding = nn.Embedding(100, 1024)
        self.lora_linear = nn.Linear(1024,1024)
        # self.lora_linear = loralib.SVDLinear(1024, 1024, r=12)
    
    def forward(self, x):
        embed = self.embedding(x)
        transform = self.lora_linear(embed)
        loss = (transform ** 2).sum()
        return loss
    
model = Model().train().cuda()

optimizer = HybridAdam(model.parameters(), lr=5e-5, betas=(0.9, 0.999), weight_decay=0)
model, optimizer = booster.boost(model, optimizer)[:2]

global_step=0
inputs = torch.tensor([1,2,3], device="cuda")
loss = model(inputs)

booster.backward(loss, optimizer)
print("loss:",loss)  # loss: tensor(1088., device='cuda:0', dtype=torch.bfloat16, grad_fn=<SumBackward0>)
optimizer.step()

for n, p in model.named_parameters():
    _grad = optimizer._grad_store.get_partitioned_gradients_by_param_id(0, id(p))
    print("grad:", _grad) # output:     grad:[]

This is my run command:
python3 -m torch.distributed.run --nproc_per_node 1 /home/yangl/LCJ_97/Open-Sora/scripts/little_check.py configs/opensora-v1-2/train/stage1.py

I also tried a version that successfully obtains the gradient, as follows:

import torch
import torch.nn as nn
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.nn.optimizer import HybridAdam
import loralib 
from loralib import RankAllocator
from loralib import compute_orth_regu 
from opensora.utils.train_utils import MaskGenerator, create_colossalai_plugin, update_ema
from opensora.utils.config_utils import define_experiment_workspace, parse_configs, save_training_config
colossalai.launch_from_torch({})
cfg = parse_configs(training=True)
cfg_dtype = cfg.get("dtype", "bf16")
plugin = create_colossalai_plugin(
    plugin=cfg.get("plugin", "zero2"),
    dtype=cfg_dtype,
    grad_clip=cfg.get("grad_clip", 0),
    sp_size=cfg.get("sp_size", 1),
    reduce_bucket_size_in_m=cfg.get("reduce_bucket_size_in_m", 20),
)
booster = Booster(plugin=plugin)

class Model(nn.Module):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.embedding = nn.Embedding(100, 1024)
        # self.embedding.requires_grad_(False)
        
        self.lora_linear = loralib.SVDLinear(1024, 1024, r=12)
    
    def forward(self, x):
        embed = self.embedding(x)
        transform = self.lora_linear(embed)
        loss = (transform ** 2).sum()
        return loss
    
model = Model().cuda()
loralib.mark_only_lora_as_trainable(model)
optimizer = HybridAdam(model.parameters(), lr=5e-5, betas=(0.9, 0.999), weight_decay=0)
# model, optimizer = booster.boost(model, optimizer)[:2]
rankallocator = RankAllocator(
    model, lora_r=12, target_rank=8,
    init_warmup=500, final_warmup=1500, mask_interval=10, 
    total_step=3000, beta1=0.85, beta2=0.85, 
)
global_step=0
inputs = torch.tensor([1,2,3], device="cuda")
loss = model(inputs)
# booster.backward(loss, optimizer)
print("loss:",loss)
(loss+compute_orth_regu(model, regu_weight=0.1)).backward()
optimizer.step()

for n, p in model.named_parameters():
    if p.grad is not None:
        print("不为None")
    if "lora_" in n:
        print("n,p:",n,p.shape)
        # grad = optimizer._grad_store.get_partitioned_gradients_by_param_id(0, id(p))
        print("grad:",p.grad)
        # print("grad:",grad)
rankallocator.update_and_mask(model, global_step)

I don't know what the difference between the two is. I think it is that one uses booster.backward(loss, optimizer) and the other uses loss.backward() to backpropagate. Is it that I can't get the gradient when I use the booster, or is there something wrong with my code?
Because I plan to use LoRA to fine-tune a large model, getting the gradient is very important to me. Could you provide some help here?

@botbw
Member

botbw commented Jul 24, 2024

hey @281LinChenjian ,

Regarding the problem you've got:

Code snippet 1

here: after the optimizer updates the params, it clears the _grad_store and you can no longer access the gradients, so please access them after optimizer.backward(loss) and before optimizer.step().

Code snippet 2

Don't call loss.backward() if you are using our optimizer.

  • booster.backward(loss, optimizer) calls optimizer.backward(loss) and finally reaches here, where loss.backward() is called.
  • after loss.backward() is called, grad reduction runs to make sure your gradients are reduced and correct (so don't call loss.backward() yourself, since a plain loss.backward() doesn't do this).
  • after gradient reduction, the grads on the params (i.e. param.grad) are zeroed here, and that's why you see that all param.grad are None.

Since there is no universal API for gradient access, it might be a bit tricky and confusing; feel free to ask here or open another issue if you still have problems :)
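Putting this together, a minimal sketch of the intended ordering (adapted from the reproduction script above; the grad store is not a public API and its internals may change between versions):

loss = model(inputs)
booster.backward(loss, optimizer)  # reduces gradients and fills the optimizer's grad store

# Read the gradients here, BEFORE optimizer.step() clears the grad store.
for name, p in model.named_parameters():
    if p.requires_grad:
        shards = optimizer._grad_store.get_partitioned_gradients_by_param_id(0, id(p))
        print(name, [g.shape for g in shards])  # list of flat gradient shards held by this rank

optimizer.step()  # updates params and clears the grad store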

@281LinChenjian

Thank you for your generous help. I have thoroughly understood how to use this method. Thank you again for your patient answers!!!

@281LinChenjian

hey @281LinChenjian ,

Regarding the problem you've got:

Code snippet 1

here: after the optimizer updates the params, it clears the _grad_store and you can no longer access the gradients, so please access them after optimizer.backward(loss) and before optimizer.step().

Code snippet 2

Don't call loss.backward() if you are using our optimizer.

  • booster.backward(loss, optimizer) calls optimizer.backward(loss) and finally reaches here, where loss.backward() is called.
  • after loss.backward() is called, grad reduction runs to make sure your gradients are reduced and correct (so don't call loss.backward() yourself, since a plain loss.backward() doesn't do this).
  • after gradient reduction, the grads on the params (i.e. param.grad) are zeroed here, and that's why you see that all param.grad are None.

Since there is no universal API for gradient access, it might be a bit tricky and confusing; feel free to ask here or open another issue if you still have problems :)

Sorry to bother you again. I found that when training with multiple GPUs, the shape returned by _grad = optimizer._grad_store.get_partitioned_gradients_by_param_id(0, id(p))
is smaller than p.shape. Specifically, when I train with two GPUs the gradient I get is half the size of p, and with four GPUs it is a quarter of the size. I think this is related to how your distributed training framework is implemented. Is this normal? How should I solve this?
The following is my code and error:

        for n,p in model.named_parameters():
            if "lora_" in n:
                if n not in self.ipt:
                    self.ipt[n] = torch.zeros_like(p)
                    self.exp_avg_ipt[n] = torch.zeros_like(p) 
                    self.exp_avg_unc[n] = torch.zeros_like(p) 
                with torch.no_grad():
                    # Calculate sensitivity 
                    print("n,p:",n,p.shape)
                    # print("p.grad:",optimizer._grad_store.get_partitioned_gradients_by_param_id(0, id(p)))
                    _grad = optimizer._grad_store.get_partitioned_gradients_by_param_id(0, id(p)) # meet some problems
                    print("_grad:",_grad[0].shape,len(_grad))
                    self.ipt[n] = (p * _grad[0].view(p.shape)).abs().detach()

This is the error when training with two GPUs: [screenshot omitted]

This is the error when training with four GPUs: [screenshot omitted]

Interestingly, 3456 × 1152 = 1990656 × 2 and 3456 × 1152 = 995328 × 4.
So my gradient size and parameter size do not match: the gradient I get this way has only 1/2 or 1/4 as many elements as the parameter itself.
Is it because each parameter's gradient is also evenly distributed across the cards?

@Edenzzzz
Contributor

Edenzzzz commented Aug 6, 2024

ZeRO splits gradients evenly across devices

@281LinChenjian

ZeRO splits gradients evenly across devices

Is there any way to combine them back together, or to get the corresponding gradient shards from the different GPUs?

@botbw
Member

botbw commented Aug 6, 2024

@281LinChenjian I guess you'll have to manually do torch.distributed.all_gather

For your reference
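A minimal sketch of that idea (not an official ColossalAI API; it assumes each rank holds one equal-sized flat shard of the gradient, as the grad-store output above suggests, and that any padding can simply be trimmed):

import torch
import torch.distributed as dist

def gather_full_grad(optimizer, param, group_id=0):
    # Sketch: all-gather the flat gradient shards of `param` and reshape to param.shape.
    # Assumes zero2-style partitioning where every rank holds an equal flat shard.
    shards = optimizer._grad_store.get_partitioned_gradients_by_param_id(group_id, id(param))
    local = shards[0].flatten().contiguous()
    world_size = dist.get_world_size()
    gathered = [torch.empty_like(local) for _ in range(world_size)]
    dist.all_gather(gathered, local)              # collective call: must run on every rank
    full = torch.cat(gathered)[: param.numel()]   # drop trailing padding, if any
    return full.view(param.shape)

As with the earlier examples, this would be called after booster.backward(...) and before optimizer.step(), on all ranks.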

@botbw botbw removed their assignment Aug 6, 2024