# Model Server Protocol for Gateway API Inference Extension

## Inference API Protocol

The model server MUST implement OpenAI’s [Completions](https://platform.openai.com/docs/api-reference/completions)
and [Chat](https://platform.openai.com/docs/api-reference/chat) APIs. We are open to supporting
additional API protocols in the future.

This requirement exists because the extension makes intelligent request scheduling decisions based
on information in the request body, such as the `model` field.
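
For illustration, a minimal OpenAI-compatible Chat Completions request is sketched below; the endpoint placeholder and model name are hypothetical, but the `model` field is the part of the body the extension relies on.

```
POST ${server_endpoint}/v1/chat/completions
{
  "model": "my-finetuned-adapter",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}
```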


## Metrics Reporting

The inference extension scrapes metrics from the model servers to make optimal request scheduling
decisions. The PREFERRED metrics format is Prometheus. We do not intend to dictate the exact metric
naming and format, especially if the corresponding metric already exists. We will leverage the
[model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
effort to bring as much unification as possible across model server communities.

The table below also lists the corresponding metrics in vLLM, which is already integrated with the
inference extension. We are working on integrations with more model servers.



| Metric | Type | Description | vLLM metric |
| ----- | ---- | ---- | ---- |
| TotalQueuedRequests | Gauge | The current total number of requests in the queue.| `vllm:num_requests_waiting`|
| KVCacheUtilization| Gauge | The current KV cache utilization in percentage.| `vllm:gpu_cache_usage_perc`|
| MaxActiveModels| Gauge | Maximum number of models/adapters that can be loaded to GPU memory to serve a batch. Requests will be queued if the model server has reached MaxActiveModels and cannot load the requested model/adapter.| `vllm:lora_requests_info.max_lora`|
| ActiveModels| String (can be a label of a Prometheus Gauge metric) | Comma separated list of models/adapters that are currently loaded into GPU memory and therefore new requests of the same models/adapters don't require eviction of models/adapters. | `vllm:lora_requests_info.running_lora_adapters`|
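
As an illustration of the PREFERRED Prometheus format, a scrape of a vLLM server might include lines like the following; the values and adapter names are made up, and the exact label sets depend on the vLLM version.

```
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="my-model"} 12.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="my-model"} 0.75
# TYPE vllm:lora_requests_info gauge
vllm:lora_requests_info{max_lora="4",running_lora_adapters="adapter1,adapter2"} 1.0
```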

The following metrics MAY be needed in the future for further optimization.

| Metric |Type | Description | vLLM metric |
| ----- | ---- | ---- | ---- |
| TotalTokensInCurrentBatch | Gauge | Number of tokens in the current batch.| `vllm:num_tokens_running`|
| TotalQueuedTokens| Gauge | The current total number of tokens in the queued requests.| `vllm:num_tokens_waiting` (needs to be added)|
| MaxTokenCapacity| Gauge | The total size of the KV cache in number of tokens.| `vllm:max_token_capacity` <br> NOTE: This information is already available indirectly in the [`cache_config_info`](https://github.com/vllm-project/vllm/blob/15702038642192002cd8973cf8948751b750fd07/vllm/engine/metrics.py#L551) metric, and a dedicated metric can be added [here](https://github.com/vllm-project/vllm/blob/22f5851b807376a836eb3551903c7fc6c81eaa9b/vllm/engine/llm_engine.py#L1588). |
| AvailableModels| String | All the models/adapters that the model server is able to serve; requests for other models may return an error.| This is already available from the `/models` API.|
| TimePerPrefillToken | Histogram | The prefill latency per token in the last W seconds. W will be determined by simulation/benchmarking. In time-series metrics, latency is typically reported as a Histogram, from which the average can be derived. | `vllm:time_to_first_token_seconds` |
| TimePerDecodeToken | Histogram | The decode latency per token in the last W seconds. W will be determined by simulation/benchmarking. | `vllm:time_per_output_token_seconds` |
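
For example, assuming the standard Prometheus histogram series (`_sum` and `_count`), the average decode latency per token over a window could be derived with a PromQL query like the one below; the 30s window stands in for W.

```
rate(vllm:time_per_output_token_seconds_sum[30s])
  / rate(vllm:time_per_output_token_seconds_count[30s])
```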

## LoRA Adapter Serving


### Dynamic LoRA Serving

Model servers that support dynamic LoRA serving can gain additional benefit from the inference
extension's LoRA affinity algorithm. Generally we expect model servers to:

* Support running multiple LoRA adapters in parallel in the same decode batch.
* Dynamically load/unload adapters in GPU memory from/to host memory depending on the requested
adapters in the current batch.


#### Register/Unregister Adapters

Model servers SHOULD provide APIs to dynamically register/unregister models (usually LoRA adapters). This enables platform teams to multiplex multiple LoRA adapters on shared model servers and to dynamically roll out LoRA adapters.

NOTE: this is not a strict requirement of the inference extension, but it is a critical feature for CI/CD integration.

While we don’t intend to dictate how model servers should implement this API, a reference REST API could look like this:

```
POST ${server_endpoint}/adapters/{adapter-id}
{
  "path": "path/to/my/adapter"
}

DELETE ${server_endpoint}/adapters/{adapter-id}
```
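
As a purely illustrative usage example against the sketch above, an operator could register and later unregister an adapter with curl; the endpoint, adapter id, and path are placeholders.

```
curl -X POST ${server_endpoint}/adapters/my-adapter \
  -H "Content-Type: application/json" \
  -d '{"path": "path/to/my/adapter"}'

curl -X DELETE ${server_endpoint}/adapters/my-adapter
```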
