
[Question] Why does text-only data use the empty image token? #1792

Open
MSungK opened this issue Dec 9, 2024 · 1 comment

Comments


MSungK commented Dec 9, 2024

Question

For text-only data, training is implemented so that learning proceeds with the visual tokens set to an empty (all-zero) image. Since the visual token sequence is quite long, it seems more efficient not to feed meaningless visual tokens for text-only samples. Moreover, a sampler that batches data of the same modality together is already implemented, which makes this even more puzzling.
Is there a specific reason for this?

@raja-7-c

Yes, indeed: for text-only data, the visual input is set to an all-zero image tensor.

    # image exists in the data
    if 'image' in self.list_data_dict[i]:
        data_dict['image'] = image
    elif self.data_args.is_multimodal:
        # image does not exist in the data, but the model is multimodal:
        # substitute an all-zero image of the processor's crop size
        crop_size = self.data_args.image_processor.crop_size
        data_dict['image'] = torch.zeros(3, crop_size['height'], crop_size['width'])
    return data_dict

This causes the vision encoder to produce embeddings for the blank image anyway, driven largely by its bias terms rather than by any visual content. These embeddings, 576 visual tokens in total, are appended to the text input for every text-only sample, and the combined sequence is then used as the condition for autoregressively predicting the next text token.
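
To see this concretely, here is a minimal, self-contained sketch (assuming the CLIP ViT-L/14-336 vision tower that LLaVA-1.5 uses; the checkpoint name and shapes are my assumptions, not taken from this repo's code): an all-zero image still yields nonzero patch embeddings.

    import torch
    from transformers import CLIPVisionModel

    # Hypothetical standalone check, not LLaVA's actual forward path.
    encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
    encoder.eval()

    blank = torch.zeros(1, 3, 336, 336)  # same all-zero image as above
    with torch.no_grad():
        hidden = encoder(pixel_values=blank).last_hidden_state  # (1, 577, 1024)

    patch_tokens = hidden[:, 1:, :]   # drop the CLS token -> 576 patch tokens
    print(patch_tokens.shape)         # torch.Size([1, 576, 1024])
    print(patch_tokens.abs().mean())  # nonzero: bias terms alone produce activity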

It's not entirely clear why this approach should help. One hypothesis is that incorporating these visual embeddings, even though they carry no meaningful information, keeps the input shape consistent across modalities and may act as a form of regularization. But since a modality-specific sampler already exists, the approach does seem counterintuitive without a more detailed justification.
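
For illustration, here is a toy sketch of the kind of modality-grouped batching the question refers to (illustrative only, not LLaVA's actual sampler): if each batch contains only one modality, text-only batches could skip the zero-filled images entirely.

    from collections import defaultdict

    def modality_grouped_batches(samples, batch_size):
        """Toy sampler: yield index batches that never mix modalities."""
        groups = defaultdict(list)
        for idx, sample in enumerate(samples):
            groups['image' if 'image' in sample else 'text'].append(idx)
        for indices in groups.values():
            for i in range(0, len(indices), batch_size):
                yield indices[i:i + batch_size]

    # Usage: the text-only batch [0, 2] needs no dummy images at all.
    samples = [{'text': 'a'}, {'image': object(), 'text': 'b'}, {'text': 'c'}]
    for batch in modality_grouped_batches(samples, batch_size=2):
        print(batch)  # [0, 2] then [1]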
