Slow performance with the latest Outlines + the latest (or second-to-latest) vLLM #1351
Comments
I second this. Further, aside from taking longer, it also accumulates far more RAM during (what seems to be) the compilation stage: it uses >50 GB of RAM and crashes when using vLLM with ~2k samples. The previous version is much faster and only uses ~4 GB of RAM. |
Is this still the case on the latest release? |
Yes it is; I tried to upgrade to the latest version last week and had to revert the change as a result. |
any update on this? |
I haven’t had the chance to debug this yet, but I wonder if it might be a good idea to track performance regressions in CI, WDYT? |
I am switching temporarily to xgrammar :/ It is not solved. BTW, maybe related: vllm-project/vllm#12122 |
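For reference, a minimal sketch of what that backend switch can look like, assuming a vLLM release (~v0.6.5+) that exposes the `guided_decoding_backend` engine argument and `GuidedDecodingParams`; the model name and schema are placeholders:

```python
# Hypothetical sketch: selecting xgrammar as the guided-decoding backend in vLLM.
# Assumes a vLLM release (~v0.6.5+) with the `guided_decoding_backend` engine
# argument and `GuidedDecodingParams`; model name and schema are placeholders.
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(
    model="meta-llama/Llama-3.2-1B-Instruct",
    guided_decoding_backend="xgrammar",  # instead of the default "outlines"
)

# Constrain outputs to a small JSON schema.
schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}},
    "required": ["city"],
}
params = SamplingParams(
    max_tokens=64,
    guided_decoding=GuidedDecodingParams(json=schema),
)

outputs = llm.generate(["Name a city in Sweden, as JSON."], params)
print(outputs[0].outputs[0].text)
```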
Just for the record, this is still an issue for me, meaning that the newest version of Outlines I can use is v0.0.46, as all newer versions cause me to use >50GB memory, crashing the script. |
Could you share an example so I can try to reproduce the issue locally? |
See #1386. @denadai2 The Outlines integration in vLLM is completely non-optimal; we are working on updating it. |
Thanks so much. I love Outlines, and at Spotify we have been enjoying it! |
I'm using Outlines with vLLM as well, so I suppose my issue is the same as @denadai2's. @rlouf Is there a way to use Outlines with vLLM that makes the integration "optimal"? Using logits processors instead of the built-in vLLM structured generation, perhaps? |
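One concrete shape that route can take, as a sketch assuming Outlines v0.0.46 (where `outlines.integrations.vllm` exposes `JSONLogitsProcessor`; newer releases have moved these classes), with a placeholder model name and schema:

```python
# A sketch of the logits-processor route, assuming Outlines v0.0.46 where
# `outlines.integrations.vllm` exposes JSONLogitsProcessor; newer Outlines
# releases moved/renamed these classes. Model name and schema are placeholders.
from vllm import LLM, SamplingParams
from outlines.integrations.vllm import JSONLogitsProcessor

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}},
    "required": ["name"],
}

# Attach the Outlines processor per request instead of relying on vLLM's
# built-in guided-decoding integration.
params = SamplingParams(
    max_tokens=64,
    logits_processors=[JSONLogitsProcessor(schema, llm)],
)
print(llm.generate(["Give me a name as JSON."], params)[0].outputs[0].text)
```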
Describe the issue as clearly as possible:
I have some issues with vLLM + Outlines. Performance seems far worse in the new Outlines version. Could you help me? I am using it with an H100, and I observe that GPU compute utilization is much lower.
Note that here I use Llama 3.2 1B, but the difference increases with larger models: with Llama 3.3 70B in a private use case, the old setup reaches ~800 output tokens per second while the new Outlines reaches ~70.
New Outlines: vLLM v0.6.5 (see the processed-prompts speed) with Outlines 0.1.8
Old Outlines: vLLM v0.6.1 with Outlines 0.0.46
Steps/code to reproduce the bug:
Old outlines
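A hypothetical stand-in for the "old Outlines" setup (vLLM v0.6.1 + Outlines v0.0.46) that measures output-token throughput; the model name, schema, and prompt count are illustrative assumptions, not the reporter's original script:

```python
# Hypothetical stand-in for the "old Outlines" reproduction (vLLM v0.6.1 +
# Outlines v0.0.46): generate schema-constrained JSON for a batch of prompts
# and report output-token throughput. Model, schema, and prompt count are
# illustrative assumptions, not the original script.
import time

from vllm import LLM, SamplingParams
from outlines.integrations.vllm import JSONLogitsProcessor

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")

schema = {
    "type": "object",
    "properties": {"title": {"type": "string"}, "year": {"type": "integer"}},
    "required": ["title", "year"],
}
params = SamplingParams(
    max_tokens=128,
    logits_processors=[JSONLogitsProcessor(schema, llm)],
)

prompts = ["Describe a movie as JSON."] * 200

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

n_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{n_tokens / elapsed:.1f} output tokens/s")
```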
Error message:
No response
Outlines/Python version information:
You can see them in the logs.
Context for the issue:
No response