
bad results with OCR inference #76

Open
katie312 opened this issue Feb 12, 2025 · 9 comments

Comments


katie312 commented Feb 12, 2025

I input a pic like this:

[input image attached]

It seems like a very easy task, but there are a lot of problems in the output (the structure is fine, but some words are wrong). Is there something wrong with the inference code, or does the model only support English?

the code:

import torch
from transformers import AutoModelForCausalLM

from deepseek_vl2.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM
from deepseek_vl2.utils.io import load_pil_images


# specify the path to the model
model_path = "/deepseek-vl2-model"
vl_chat_processor: DeepseekVLV2Processor = DeepseekVLV2Processor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: DeepseekVLV2ForCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

## single image conversation example
## Please note that <|ref|> and <|/ref|> are designed specifically for the object localization feature. These special tokens are not required for normal conversations.
## If you would like to experience the grounded captioning functionality (responses that include both object localization and reasoning), you need to add the special token <|grounding|> at the beginning of the prompt. Examples could be found in Figure 9 of our paper.
conversation = [
    {
        "role": "<|User|>",
        "content": "<image>\n把图片里面的所有内容进行ocr识别,markdown格式输出",
        "images": ["./396_2352fdc7zf.png"],
    },
    {"role": "<|Assistant|>", "content": ""},
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True,
    system_prompt=""
).to(vl_gpt.device)

# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=False)
print(f"{prepare_inputs['sft_format'][0]}", answer)
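One more thing I'm unsure about: max_new_tokens=512 may be too small for a full page rendered as markdown, so the end of the output could simply be getting truncated. The variant below only raises the token budget (untested on the real document):

outputs = vl_gpt.language.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=2048,  # larger budget so a dense page of markdown is not cut off
    do_sample=False,
    use_cache=True
)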

ibbol commented Feb 12, 2025

It works fine for me.

[image attached]


ibbol commented Feb 12, 2025

[image attached]

katie312 (author) commented:

[image attached]

Which model is the online demo running? I'm using the locally downloaded small version.
The original image actually isn't this one; it's company data, so I can't share it, but the layout is roughly the same, just a few icons plus a paragraph of text, and the output quality is really poor 🥲


ibbol commented Feb 12, 2025

> [image]
>
> Which model is the online demo running? I'm using the locally downloaded small version. The original image actually isn't this one; it's company data, so I can't share it, but the layout is roughly the same, just a few icons plus a paragraph of text, and the output quality is really poor 🥲

The tiny version. I'd also like to deploy the small version, but I have four 24GB 4090s and I don't know how to allocate the VRAM; splitting the layers the way others described still doesn't seem to leave enough memory.
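For reference, what I've been experimenting with is roughly the sketch below. It assumes the remote-code model class works with the standard transformers/accelerate device_map and max_memory sharding, which I haven't confirmed for deepseek-vl2; the model path and per-card limits are placeholders.

import torch
from transformers import AutoModelForCausalLM

# sketch only: assumes the trust_remote_code model class supports
# accelerate-style sharding via device_map (not verified for deepseek-vl2)
model_path = "/deepseek-vl2-small-model"  # placeholder local path
vl_gpt = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",                          # let accelerate spread layers across the 4 GPUs
    max_memory={i: "20GiB" for i in range(4)},  # leave headroom on each 24GB 4090
).eval()

If some layers still OOM during generation, lowering the per-card max_memory should force a more even split at the cost of extra cross-GPU traffic.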

katie312 (author) commented:

> [image]
>
> Which model is the online demo running? I'm using the locally downloaded small version. The original image actually isn't this one; it's company data, so I can't share it, but the layout is roughly the same, just a few icons plus a paragraph of text, and the output quality is really poor 🥲
>
> The tiny version. I'd also like to deploy the small version, but I have four 24GB 4090s and I don't know how to allocate the VRAM; splitting the layers the way others described still doesn't seem to leave enough memory.

I'm running inference on an L40 and it does run, but why does tiny perform better than small? I seriously suspect there's a problem with my code 🥲 Could you try a more complex document and see how it does?


ibbol commented Feb 12, 2025

[two images attached]


ibbol commented Feb 12, 2025

[image attached]

Giserlei123 commented:

Using the web demo UI directly, the recognition quality is very good, but when I run this same code myself the results are also very poor.


ibbol commented Feb 12, 2025

[image attached]
The OP's code seems to work fine for me?
