Slow performance with the latest Outlines + the latest (or second-to-latest) vLLM #1351
Comments
I second this. Further, aside from taking longer, it also accumulates far more RAM during (what seems to be) the compilation stage: it uses >50 GB of RAM and crashes when using vLLM with ~2k samples. The previous version is much faster and only uses ~4 GB of RAM. |
Is this still the case on the latest release? |
Yes it is; I tried to upgrade to the latest version last week and had to revert the change as a result. |
any update on this? |
I haven’t had the chance to debug this yet, but I wonder if it might be a good idea to track performance regressions in CI, WDYT? |
I am switching temporarily to xgrammar :/ It is not solved. BTW, maybe related: vllm-project/vllm#12122 |
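For reference, a minimal sketch of what that backend switch can look like, assuming a vLLM release (~v0.6.5+) that exposes the `guided_decoding_backend` engine argument and `GuidedDecodingParams`; the model name and schema are placeholders:

```python
# Hypothetical sketch: selecting xgrammar as the guided-decoding backend in vLLM.
# Assumes a vLLM release (~v0.6.5+) with the `guided_decoding_backend` engine
# argument and `GuidedDecodingParams`; model name and schema are placeholders.
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(
    model="meta-llama/Llama-3.2-1B-Instruct",
    guided_decoding_backend="xgrammar",  # instead of the default "outlines"
)

# Constrain outputs to a small JSON schema.
schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}},
    "required": ["city"],
}
params = SamplingParams(
    max_tokens=64,
    guided_decoding=GuidedDecodingParams(json=schema),
)

outputs = llm.generate(["Name a city in Sweden, as JSON."], params)
print(outputs[0].outputs[0].text)
```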
Just for the record, this is still an issue for me, meaning that the newest version of Outlines I can use is v0.0.46, as all newer versions cause me to use >50GB memory, crashing the script. |
Could you share an example so I can try to reproduce the issue locally? |
See #1386. @denadai2 The Outlines integration in vLLM is completely non-optimal; we are working on updating it. |
Thanks so much. I love Outlines, and at Spotify we have been enjoying it! |
I'm using Outlines with vLLM as well, so I suppose my issue is the same as @denadai2's. @rlouf Is there a way to use Outlines with vLLM that makes the integration "optimal"? Using logits processors instead of the built-in vLLM structured generation, perhaps? |
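One concrete shape that route can take, as a sketch assuming Outlines v0.0.46 (where `outlines.integrations.vllm` exposes `JSONLogitsProcessor`; newer releases have moved these classes), with a placeholder model name and schema:

```python
# A sketch of the logits-processor route, assuming Outlines v0.0.46 where
# `outlines.integrations.vllm` exposes JSONLogitsProcessor; newer Outlines
# releases moved/renamed these classes. Model name and schema are placeholders.
from vllm import LLM, SamplingParams
from outlines.integrations.vllm import JSONLogitsProcessor

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}},
    "required": ["name"],
}

# Attach the Outlines processor per request instead of relying on vLLM's
# built-in guided-decoding integration.
params = SamplingParams(
    max_tokens=64,
    logits_processors=[JSONLogitsProcessor(schema, llm)],
)
print(llm.generate(["Give me a name as JSON."], params)[0].outputs[0].text)
```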
Describe the issue as clearly as possible:
I have some issues with vLLM + Outlines. Performance seems far worse in the new Outlines version. Could you help me? I am using it with an H100, and I observe that GPU compute utilization is much lower.
Note that here I use Llama 3.2 1B, but the difference increases with larger models: with Llama 3.3 70B in a private use case, the old setup reaches ~800 output tokens per second while the new Outlines reaches ~70.
New Outlines: vLLM v0.6.5 (see the processed-prompts speed) with Outlines 0.1.8
Old Outlines: vLLM v0.6.1 with Outlines 0.0.46
Steps/code to reproduce the bug:
Old outlines
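A hypothetical stand-in for the "old Outlines" setup (vLLM v0.6.1 + Outlines v0.0.46) that measures output-token throughput; the model name, schema, and prompt count are illustrative assumptions, not the reporter's original script:

```python
# Hypothetical stand-in for the "old Outlines" reproduction (vLLM v0.6.1 +
# Outlines v0.0.46): generate schema-constrained JSON for a batch of prompts
# and report output-token throughput. Model, schema, and prompt count are
# illustrative assumptions, not the original script.
import time

from vllm import LLM, SamplingParams
from outlines.integrations.vllm import JSONLogitsProcessor

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")

schema = {
    "type": "object",
    "properties": {"title": {"type": "string"}, "year": {"type": "integer"}},
    "required": ["title", "year"],
}
params = SamplingParams(
    max_tokens=128,
    logits_processors=[JSONLogitsProcessor(schema, llm)],
)

prompts = ["Describe a movie as JSON."] * 200

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

n_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{n_tokens / elapsed:.1f} output tokens/s")
```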
Error message:
No response
Outlines/Python version information:
You can see them in the logs.
Context for the issue:
No response