
[Question] Why does text-only data use the empty image token? #1792

Open
MSungK opened this issue Dec 9, 2024 · 1 comment

Comments


MSungK commented Dec 9, 2024

Question

For text-only data, training is implemented so that learning proceeds with the visual tokens set to an empty (all-zero) image. Since the visual token sequence is quite long, it seems more efficient not to feed meaningless visual tokens for text-only samples. Moreover, a sampler that batches data of the same modality together is already implemented, which makes this even more puzzling.
Is there a specific reason for this?

@raja-7-c

Yes, indeed: for text-only data, the visual input is set to an all-zero image tensor.

    # image exists in the data
    if 'image' in self.list_data_dict[i]:
        data_dict['image'] = image
    elif self.data_args.is_multimodal:
        # image does not exist in the data, but the model is multimodal:
        # substitute an all-zero image of the processor's crop size
        crop_size = self.data_args.image_processor.crop_size
        data_dict['image'] = torch.zeros(3, crop_size['height'], crop_size['width'])
    return data_dict

This causes the vision encoder to produce embeddings for the blank image anyway, driven largely by its bias terms rather than by any visual content. These embeddings, 576 visual tokens in total, are appended to the text input for every text-only sample, and the combined sequence is then used as the condition for autoregressively predicting the next text token.
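
To see this concretely, here is a minimal, self-contained sketch (assuming the CLIP ViT-L/14-336 vision tower that LLaVA-1.5 uses; the checkpoint name and shapes are my assumptions, not taken from this repo's code): an all-zero image still yields nonzero patch embeddings.

    import torch
    from transformers import CLIPVisionModel

    # Hypothetical standalone check, not LLaVA's actual forward path.
    encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
    encoder.eval()

    blank = torch.zeros(1, 3, 336, 336)  # same all-zero image as above
    with torch.no_grad():
        hidden = encoder(pixel_values=blank).last_hidden_state  # (1, 577, 1024)

    patch_tokens = hidden[:, 1:, :]   # drop the CLS token -> 576 patch tokens
    print(patch_tokens.shape)         # torch.Size([1, 576, 1024])
    print(patch_tokens.abs().mean())  # nonzero: bias terms alone produce activity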

It's not entirely clear why this approach should help. One hypothesis is that incorporating these visual embeddings, even though they carry no meaningful information, keeps the input shape consistent across modalities and may act as a form of regularization. But since a modality-specific sampler already exists, the approach does seem counterintuitive without a more detailed justification.
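
For illustration, here is a toy sketch of the kind of modality-grouped batching the question refers to (illustrative only, not LLaVA's actual sampler): if each batch contains only one modality, text-only batches could skip the zero-filled images entirely.

    from collections import defaultdict

    def modality_grouped_batches(samples, batch_size):
        """Toy sampler: yield index batches that never mix modalities."""
        groups = defaultdict(list)
        for idx, sample in enumerate(samples):
            groups['image' if 'image' in sample else 'text'].append(idx)
        for indices in groups.values():
            for i in range(0, len(indices), batch_size):
                yield indices[i:i + batch_size]

    # Usage: the text-only batch [0, 2] needs no dummy images at all.
    samples = [{'text': 'a'}, {'image': object(), 'text': 'b'}, {'text': 'c'}]
    for batch in modality_grouped_batches(samples, batch_size=2):
        print(batch)  # [0, 2] then [1]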
