For text-only data, training proceeds with the visual input set to an empty (all-zero) image. Since the visual tokens take up a significant share of the sequence length, it seems more efficient to skip these meaningless visual tokens for text-only samples. Moreover, a sampler that groups samples of the same modality into a batch is already implemented, which makes this choice even more puzzling.
Is there a specific reason for this?
Yes, indeed: for text-only samples, the visual input is set to an all-zero image in the dataset's `__getitem__`:
```python
# image exists in the data
if 'image' in self.list_data_dict[i]:
    data_dict['image'] = image
elif self.data_args.is_multimodal:
    # image does not exist in the data, but the model is multimodal
    crop_size = self.data_args.image_processor.crop_size
    data_dict['image'] = torch.zeros(3, crop_size['height'], crop_size['width'])
return data_dict
```
This causes the vision encoder to produce nonzero embeddings even for the all-zero image, driven by its bias terms. These embeddings, comprising 576 visual tokens, are appended to the text input for every text-only sample, and the combined sequence is then used as the condition for autoregressively predicting the next text token.
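To make this concrete, here is a minimal sketch of why the all-zero placeholder does not yield all-zero embeddings. The layer below is hypothetical (a single biased patch-projection conv, not the actual pretrained CLIP encoder), with shapes chosen so that a 336×336 input produces exactly 576 patch tokens:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the patch-embedding layer of a ViT encoder
# (the real model is a pretrained CLIP ViT; bias=True here just to
# illustrate the bias-term effect described above).
patch_embed = nn.Conv2d(3, 1024, kernel_size=14, stride=14, bias=True)

zero_image = torch.zeros(1, 3, 336, 336)     # the all-zero placeholder
patches = patch_embed(zero_image)            # shape: (1, 1024, 24, 24)
tokens = patches.flatten(2).transpose(1, 2)  # shape: (1, 576, 1024), i.e. 576 "visual tokens"

print(tokens.shape)        # torch.Size([1, 576, 1024])
print(tokens.abs().sum())  # nonzero, coming entirely from the conv bias
```

In a real CLIP-style encoder, positional embeddings and normalization layers would add further nonzero structure on top of this bias contribution.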
It's not entirely clear why this approach should help. One hypothesis is that incorporating these visual embeddings, even though they carry no meaningful information, regularizes the model by keeping input shapes consistent across modalities. Given that a modality-grouped sampler already exists, the choice does seem counterintuitive without a more detailed justification.
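For reference, the modality-grouped sampler mentioned above could in principle make the placeholder unnecessary. Below is a hypothetical sketch of such a sampler (the function name and the `has_image` flags are illustrative, not this repository's actual implementation): it partitions indices by modality so that each batch is homogeneous, letting text-only batches drop the 576 placeholder tokens entirely.

```python
import random
from typing import List

def modality_grouped_batches(has_image: List[bool], batch_size: int,
                             seed: int = 0) -> List[List[int]]:
    """Partition dataset indices into single-modality batches.

    Hypothetical sketch: has_image[i] says whether sample i carries an
    image. Each returned batch is purely multimodal or purely text-only,
    so text-only batches need no zero-image placeholder at all.
    """
    rng = random.Random(seed)
    multimodal = [i for i, h in enumerate(has_image) if h]
    text_only = [i for i, h in enumerate(has_image) if not h]
    rng.shuffle(multimodal)
    rng.shuffle(text_only)
    batches = [multimodal[i:i + batch_size]
               for i in range(0, len(multimodal), batch_size)]
    batches += [text_only[i:i + batch_size]
                for i in range(0, len(text_only), batch_size)]
    rng.shuffle(batches)  # interleave modalities across training steps
    return batches

# Example: indices 0 and 2 have images, the rest are text-only.
print(modality_grouped_batches([True, False, True, False, False], batch_size=2))
```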