Add model server protocol proposal

kubernetes-sigs · Jan 7, 2025 · 1bb383c · 1bb383c
1 parent adad31c
commit 1bb383c
Showing 1 changed file with 72 additions and 0 deletions.
diff --git a/docs/proposals/003-model-server-protocol/protocol.md b/docs/proposals/003-model-server-protocol/protocol.md
@@ -0,0 +1,72 @@
+# Model Server Protocol for Gateway API Inference Extension
+
+## Inference API Protocol
+
+The model server MUST implement OpenAI’s [Completions](https://platform.openai.com/docs/api-reference/completions)
+and [Chat](https://platform.openai.com/docs/api-reference/chat) API. In the future we are open to
+supporting more API protocols.
+
+To explain this in more detail, the extension makes intelligent request scheduling decisions based
+on certain information from the request body, such as the `model` field.
+
+
+## Metrics Reporting
+
+The inference extension scrapes metrics from the model servers to make optimal request scheduling
+decisions. The PREFERRED metrics format is Prometheus. We do not intend to dictate the exact metric
+naming and format, especially if the corresponding metric already exists. We will leverage the
+[model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
+effort to bring as much unification as possible across model server communities.
+
+We also show the metrics in vLLM, which is already integrated into the inference extension. We are
+working on integrating with more model servers.
+
+
+
+| Metric | Type | Description | vLLM metric |
+| ----- | ---- | ---- | ---- |
+| TotalQueuedRequests         | Gauge     | The current total number of requests in the queue.| `vllm:num_requests_waiting`|
+| KVCacheUtilization| Gauge     | The current KV cache utilization in percentage.| `vllm:gpu_cache_usage_perc`|
+| MaxActiveModels| Gauge     | Maximum number of models/adapters that can be loaded to GPU memory to serve a batch. Requests will be queued if the model server has reached MaxActiveModels and cannot load the requested model/adapter.| `vllm:lora_requests_info.max_lora`|
+| ActiveModels| String (can be a label of a Prometheus Gauge metric)     | Comma separated list of models/adapters that are currently loaded into GPU memory and therefore new requests of the same models/adapters don't require eviction of models/adapters. | `vllm:lora_requests_info.running_lora_adapters`|
+
+The following metrics MAY be needed in the future for further optimization.
+
+| Metric |Type | Description | vLLM   metric |
+| ----- | ---- | ---- | ---- |
+| TotalTokensInCurrentBatch   | Gauge     | Number of tokens in the current batch.| `vllm:num_tokens_running`|
+| TotalQueuedTokens| Gauge     | The current total number of tokens in the queued requests.| `vllm:num_tokens_waiting` (need to be added)|
+| MaxTokenCapacity| Gauge     | The total size of the KV cache in number of tokens.| `vllm:max_token_capacity` <br> NOTE: This info is available indirectly in [`cache_config_info`](https://github.com/vllm-project/vllm/blob/15702038642192002cd8973cf8948751b750fd07/vllm/engine/metrics.py#L551) metric already , and also  proposed in, can be added [here](https://github.com/vllm-project/vllm/blob/22f5851b807376a836eb3551903c7fc6c81eaa9b/vllm/engine/llm_engine.py#L1588). |
+| AvailableModels| String     | All the available models/adapters that the model server is able to serve, otherwise an error may be returned.| This is already available from the /models API.|
+| TimePerPrefillToken | Histogram | The prefill latency per token in the last W seconds. W will be decided by simulation/benchmarking.  In time series metric the latency is typically reported as Histogram and we can derive the average from the Histogram. | `vllm:time_to_first_token_seconds` | 
+| TimePerDecodeToken | Histogram | The decode latency per token in the last W seconds. W will be decided by simulation/benchmarking. | `vllm:time_per_output_token_seconds` | 
+
+## LoRA Adapter Serving
+
+
+### Dynamic LoRA Serving
+
+Model servers that support dynamic LoRA serving can gain additional benefit from the inference
+extension's LoRA affinity algorithm. Generally we expect model servers to:
+
+* Support running multiple LoRA adapters in parallel in the same decode batch.
+* Dynamically load/unload adapters in GPU memory from/to host memory depending on the requested
+  adapters in the current batch.
+
+
+#### Register/Unregister Adapters
+
+Model servers SHOULD have APIs to dynamically register/unregister models (usually LoRA adapters). This enables platform teams to multiplex multiple LoRA adapters on shared model servers and dynamically rollout LoRA adapters. 
+
+NOTE this is not a strict requirement from the inference extension, but a critical feature for CI/CD integration.
+
+While we don’t intend to dictate how model servers should implement this API, a reference REST API can look this:
+
+```
+POST ${server_endpoint}/adapters/{adapter-id}
+{
+        "path": "path/to/my/adapter"
+}
+
+DELETE ${server_endpoint}/adapters/{adapter-id}
+```