From 1bb383ce698c8e17875b6cb9c183a7f52a19d4eb Mon Sep 17 00:00:00 2001
From: Cong Liu
Date: Mon, 6 Jan 2025 15:10:36 -0800
Subject: [PATCH] Add model server protocol proposal

---
 .../003-model-server-protocol/protocol.md | 72 +++++++++++++++++++
 1 file changed, 72 insertions(+)
 create mode 100644 docs/proposals/003-model-server-protocol/protocol.md

diff --git a/docs/proposals/003-model-server-protocol/protocol.md b/docs/proposals/003-model-server-protocol/protocol.md
new file mode 100644
index 00000000..34d1142e
--- /dev/null
+++ b/docs/proposals/003-model-server-protocol/protocol.md
@@ -0,0 +1,72 @@
# Model Server Protocol for Gateway API Inference Extension

## Inference API Protocol

The model server MUST implement OpenAI’s [Completions](https://platform.openai.com/docs/api-reference/completions)
and [Chat](https://platform.openai.com/docs/api-reference/chat) APIs. In the future we are open to
supporting more API protocols.

To explain this in more detail, the extension makes intelligent request scheduling decisions based
on certain information from the request body, such as the `model` field.


## Metrics Reporting

The inference extension scrapes metrics from the model servers to make optimal request scheduling
decisions. The PREFERRED metrics format is Prometheus. We do not intend to dictate the exact metric
naming and format, especially if a corresponding metric already exists. We will leverage the
[model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
effort to bring as much unification as possible across model server communities.

We also list the corresponding metrics in vLLM, which is already integrated with the inference
extension. We are working on integrating with more model servers.


| Metric | Type | Description | vLLM metric |
| ----- | ---- | ---- | ---- |
| TotalQueuedRequests | Gauge | The current total number of requests in the queue. | `vllm:num_requests_waiting` |
| KVCacheUtilization | Gauge | The current KV cache utilization in percentage. | `vllm:gpu_cache_usage_perc` |
| MaxActiveModels | Gauge | Maximum number of models/adapters that can be loaded into GPU memory to serve a batch. Requests will be queued if the model server has reached MaxActiveModels and cannot load the requested model/adapter. | `vllm:lora_requests_info.max_lora` |
| ActiveModels | String (can be a label of a Prometheus Gauge metric) | Comma-separated list of models/adapters that are currently loaded into GPU memory, so that new requests for the same models/adapters do not require evicting other models/adapters. | `vllm:lora_requests_info.running_lora_adapters` |

The following metrics MAY be needed in the future for further optimization.

| Metric | Type | Description | vLLM metric |
| ----- | ---- | ---- | ---- |
| TotalTokensInCurrentBatch | Gauge | Number of tokens in the current batch. | `vllm:num_tokens_running` |
| TotalQueuedTokens | Gauge | The current total number of tokens in the queued requests. | `vllm:num_tokens_waiting` (needs to be added) |
| MaxTokenCapacity | Gauge | The total size of the KV cache in number of tokens. | `vllm:max_token_capacity`. NOTE: this info is already available indirectly in the [`cache_config_info`](https://github.com/vllm-project/vllm/blob/15702038642192002cd8973cf8948751b750fd07/vllm/engine/metrics.py#L551) metric, and it has also been proposed to be added [here](https://github.com/vllm-project/vllm/blob/22f5851b807376a836eb3551903c7fc6c81eaa9b/vllm/engine/llm_engine.py#L1588). |
| AvailableModels | String | All the models/adapters that the model server is able to serve; requests for other models may return an error. | This is already available from the `/models` API. |
| TimePerPrefillToken | Histogram | The prefill latency per token in the last W seconds, where W will be determined by simulation/benchmarking. In time series metrics, latency is typically reported as a Histogram, from which the average can be derived. | `vllm:time_to_first_token_seconds` |
| TimePerDecodeToken | Histogram | The decode latency per token in the last W seconds, where W will be determined by simulation/benchmarking. | `vllm:time_per_output_token_seconds` |
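To make the preferred format concrete, below is an illustrative sketch of what a Prometheus scrape of a vLLM server could expose for the required metrics. The model name (`base-model`), adapter names (`adapter-a`, `adapter-b`), values, and exact label sets are assumptions for illustration only; they vary by model server and version.

```
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="base-model"} 3.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="base-model"} 0.27
# TYPE vllm:lora_requests_info gauge
vllm:lora_requests_info{max_lora="4",running_lora_adapters="adapter-a,adapter-b"} 1.0
```

Note how ActiveModels and MaxActiveModels are not metric values at all: they are carried as the `running_lora_adapters` and `max_lora` labels of the `vllm:lora_requests_info` Gauge, which is why the table above describes them as strings/labels.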
## LoRA Adapter Serving


### Dynamic LoRA Serving

Model servers that support dynamic LoRA serving can gain additional benefit from the inference
extension's LoRA affinity algorithm. Generally we expect model servers to:

* Support running multiple LoRA adapters in parallel in the same decode batch.
* Dynamically load/unload adapters between host memory and GPU memory depending on the adapters
  requested in the current batch.


#### Register/Unregister Adapters

Model servers SHOULD provide APIs to dynamically register/unregister models (usually LoRA adapters). This enables platform teams to multiplex multiple LoRA adapters on shared model servers and to dynamically roll out LoRA adapters.

NOTE: this is not a strict requirement of the inference extension, but it is a critical feature for CI/CD integration.

While we don’t intend to dictate how model servers should implement this API, a reference REST API can look like this:

```
POST ${server_endpoint}/adapters/{adapter-id}
{
    "path": "path/to/my/adapter"
}

DELETE ${server_endpoint}/adapters/{adapter-id}
```
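As a usage sketch against the reference API above (not the API of any specific model server; `${server_endpoint}`, the adapter id `my-adapter`, and the path are placeholders), a CI/CD pipeline could register and later unregister an adapter like this:

```
# Register a hypothetical adapter "my-adapter" from a local path.
curl -X POST "${server_endpoint}/adapters/my-adapter" \
  -H "Content-Type: application/json" \
  -d '{"path": "path/to/my/adapter"}'

# Unregister the adapter once it is no longer needed.
curl -X DELETE "${server_endpoint}/adapters/my-adapter"
```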