
[vLLM accelerated inference] CosyVoice 2.0's LLM module is Qwen-based; will a vLLM acceleration adaptation be provided? #873

Open
wang-TJ-20 opened this issue Jan 11, 2025 · 8 comments

Comments

@wang-TJ-20

I see the LLM computation happens here (screenshot below). Will a vLLM acceleration adaptation be made? Is there any reference for this?
[screenshot: the LLM inference code in question]

@aluminumbox
Collaborator

vLLM does not support embedding inputs for now. We will consider vLLM acceleration, but it is still under investigation.
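
For context on why this matters: CosyVoice 2's LLM is driven with a mixed sequence of text and speech-token embeddings rather than plain token IDs, while vLLM's generate path takes text or token IDs. A minimal sketch of the pattern, with hypothetical shapes (896 is Qwen2-0.5B's hidden size; the tensor names are illustrative, not CosyVoice internals):

```python
import torch
from transformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")

# Illustrative inputs: text prompt and speech tokens already embedded.
text_emb = torch.randn(1, 12, 896)    # (batch, text_len, hidden)
speech_emb = torch.randn(1, 30, 896)  # (batch, speech_len, hidden)

# The LM consumes inputs_embeds, not input_ids; a token-in/token-out
# serving engine has no public equivalent of this call.
inputs_embeds = torch.cat([text_emb, speech_emb], dim=1)
out = llm(inputs_embeds=inputs_embeds)
next_token_logits = out.logits[:, -1]
```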

@darkacorn

I don't think that will make much of a difference. We're talking about a 0.5B-parameter model, and it's not as if it needs to be batched.

It would be a nice-to-have optimization, but I don't think it's a big-impact play, to be honest.

@aluminumbox
Collaborator

> I don't think that will make much of a difference. We're talking about a 0.5B-parameter model, and it's not as if it needs to be batched.
>
> It would be a nice-to-have optimization, but I don't think it's a big-impact play, to be honest.

I'm not an expert in vLLM, but I think vLLM also has inference optimizations like PagedAttention. In any case, we may or may not provide inference optimizations similar to our Aliyun service.
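
For reference, vLLM's standard entry point looks like this; PagedAttention and continuous batching are applied behind it, but the API accepts text (or token IDs) only, which is the limitation discussed above (the model name is just an example):

```python
from vllm import LLM, SamplingParams

# Standard vLLM usage: prompts in, text out. PagedAttention and
# continuous batching happen inside the engine; there is no public
# inputs_embeds path analogous to a Hugging Face forward() call.
llm = LLM(model="Qwen/Qwen2-0.5B")
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```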

@darkacorn

vLLM has embeddings, but that's only relevant for the LLM part. There's a lot more to the pipeline; I think the biggest bottleneck here is the flow matcher.

@aluminumbox
Collaborator

> vLLM has embeddings, but that's only relevant for the LLM part. There's a lot more to the pipeline; I think the biggest bottleneck here is the flow matcher.

We already provide TensorRT for flow matching; this is the inference method used by our Aliyun service.
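
For anyone reproducing this locally: the usual route is an ONNX export of the flow-matching estimator followed by a TensorRT engine build. A hedged sketch, where the estimator module, input names, and shapes are all illustrative assumptions rather than CosyVoice's actual export script:

```python
import torch

# Hypothetical stand-in for the flow-matching estimator network;
# the real module would come from the flow decoder.
class Estimator(torch.nn.Module):
    def forward(self, x, mask, mu, t):
        # Predict dx/dt; the real model is a conditional network.
        return (mu - x) * mask * t.view(-1, 1, 1)

estimator = Estimator().eval()

x = torch.randn(1, 80, 200)   # noisy mel (shapes are assumptions)
mask = torch.ones(1, 1, 200)
mu = torch.randn(1, 80, 200)  # conditioning features
t = torch.rand(1)             # flow-matching time step

torch.onnx.export(
    estimator, (x, mask, mu, t), "estimator.onnx",
    input_names=["x", "mask", "mu", "t"], output_names=["dx"],
    dynamic_axes={"x": {2: "T"}, "mask": {2: "T"}, "mu": {2: "T"}},
    opset_version=17,
)
# Build the engine afterwards, e.g.:
#   trtexec --onnx=estimator.onnx --saveEngine=estimator.plan --fp16
```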

@wang-TJ-20
Author

> > I don't think that will make much of a difference. We're talking about a 0.5B-parameter model, and it's not as if it needs to be batched.
> >
> > It would be a nice-to-have optimization, but I don't think it's a big-impact play, to be honest.
>
> I'm not an expert in vLLM, but I think vLLM also has inference optimizations like PagedAttention. In any case, we may or may not provide inference optimizations similar to our Aliyun service.

So are there any other optimization ideas for the LLM module?

@WuNein

WuNein commented Jan 16, 2025

SGLang can take input embeddings:

sgl-project/sglang#2052
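
A hypothetical sketch of what that could look like for the LLM step, based on the linked PR; the `input_embeds` keyword and payload layout here are assumptions, so check the PR for the actual interface:

```python
import sglang as sgl

# Offline SGLang engine; model name is just an example.
engine = sgl.Engine(model_path="Qwen/Qwen2-0.5B")

# CosyVoice-style mixed text/speech embeddings, as a nested list of
# shape [seq_len][hidden_size] (896 = Qwen2-0.5B hidden size).
embeds = [[0.0] * 896 for _ in range(42)]

# NOTE: the `input_embeds` argument is assumed from sgl-project/sglang#2052,
# not a documented stable API at the time of this thread.
out = engine.generate(
    input_embeds=embeds,
    sampling_params={"temperature": 0.8, "max_new_tokens": 64},
)
print(out)
```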

@wang-TJ-20
Author

@WuNein Hi, have you tried adapting CosyVoice to SGLang for acceleration?
