Add model server configurations to InferencePool #163

Closed
liu-cong wants to merge 1 commit

Conversation

@liu-cong (Contributor) commented Jan 7, 2025

Add a ModelServerAttributes field to capture the following information:

  • Whether LoRA is enabled: we need this to decide whether to scrape LoRA metrics and apply the LoRA affinity algorithm. Today, if LoRA is not enabled, the LoRA metrics are not available and the ext-proc prints spammy error logs.
  • The type of the model server: this allows us to pick bespoke logic for each model server. For example, vLLM implements LoRA adapter reporting as a Gauge metric, with the LoRA adapters as a label and a timestamp as the value. While we would like all model servers to use the same implementation, we cannot guarantee that. A rough sketch of such a field follows after this list.
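As a rough illustration only (this is not the exact schema in the commit; the type and field names below are hypothetical), the proposed attributes could look roughly like this:

```go
package v1alpha1

// ModelServerType names a known model server implementation; values are illustrative.
type ModelServerType string

const (
	ModelServerTypeVLLM ModelServerType = "vllm"
)

// ModelServerAttributes is a hypothetical sketch of the proposed field.
type ModelServerAttributes struct {
	// LoRAEnabled indicates whether the model servers in this pool serve LoRA
	// adapters. When false, the ext-proc would skip LoRA metric scraping and
	// LoRA affinity scheduling, avoiding the spammy error logs.
	LoRAEnabled bool `json:"loraEnabled,omitempty"`

	// Type selects any bespoke, per-server logic, e.g. how vLLM reports LoRA
	// adapters via a Gauge metric.
	Type ModelServerType `json:"type,omitempty"`
}
```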

@k8s-ci-robot (Contributor)
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: liu-cong
Once this PR has been reviewed and has the lgtm label, please assign arangogutierrez for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot requested review from ahg-g and kfswain on Jan 7, 2025 04:37
@k8s-ci-robot added labels on Jan 7, 2025: cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) and size/L (denotes a PR that changes 100-499 lines, ignoring generated files)
@k8s-ci-robot (Contributor)
@liu-cong: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-gateway-api-inference-extension-verify-main | b0df805 | link | true | /test pull-gateway-api-inference-extension-verify-main |
| pull-gateway-api-inference-extension-test-unit-main | b0df805 | link | true | /test pull-gateway-api-inference-extension-test-unit-main |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@kfswain (Collaborator) commented Jan 7, 2025

I think we should be cautious about adjusting the API.

I think both your bullet points are handled by a well-defined model server protocol.

@kfswain (Collaborator) commented Jan 7, 2025

Additionally, a user could, in theory, mix and match their model servers and we should not be concerned with that, since they would all implement the same protocol.

@liu-cong (Contributor, Author) commented Jan 7, 2025

> I think both your bullet points are handled by a well-defined model server protocol

Valid point. The challenge is that it will take time to get the protocol implemented, and in the meantime we don't want to block development of the extension.

What do you think about the tradeoff of adding these as flags to the ext-proc binary instead? That way we don't need to change the API while still unblocking short-term development.
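A minimal sketch of that tradeoff, assuming hypothetical flag names (the flags actually added to the ext-proc binary may differ):

```go
package main

import (
	"flag"
	"fmt"
)

// Hypothetical flags for illustration only.
var (
	loraEnabled = flag.Bool("lora-enabled", false,
		"Whether the model servers serve LoRA adapters; enables LoRA metric scraping and LoRA affinity.")
	modelServerType = flag.String("model-server-type", "vllm",
		"Type of the model server, used to select bespoke metric parsing.")
)

func main() {
	flag.Parse()
	// The values would be wired into the metrics scraper and scheduler here.
	fmt.Printf("loraEnabled=%v modelServerType=%s\n", *loraEnabled, *modelServerType)
}
```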

@kfswain (Collaborator) commented Jan 7, 2025

> What do you think about the tradeoff of adding these as flags to the ext-proc binary instead? That way we don't need to change the API while still unblocking short-term development.

Yeah I think that's a great tradeoff. @ahg-g WDYT?

@liu-cong (Contributor, Author) commented Jan 7, 2025

/close

Given the discussion above, this can be handled via an ext-proc flag in the short term, and as much as possible by the model server protocol in the long term (#164).

@liu-cong closed this on Jan 7, 2025
@ahg-g (Contributor) commented Jan 7, 2025

+1; we certainly don't want the API to explicitly list the different model servers. We need to define a protocol, and if the protocol requires setting some parameters, those could be part of the API, but they should not be specific to a particular model server.

For example, we could hypothetically allow configuring metric names in the API (e.g., the name of the metric that represents kv-cache utilization). But even those could be expressed as a generic key/value map in the API, because different extensions may rely on completely different sets of metrics, and the model server protocol could define the keys for well-known metrics.
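A sketch of that key/value idea, with hypothetical field, key, and metric names (not an agreed-upon API):

```go
package v1alpha1

// Hypothetical well-known keys that a model server protocol could define.
const (
	MetricKeyKVCacheUtilization = "kv-cache-utilization"
	MetricKeyQueueLength        = "queue-length"
)

// InferencePoolSpec sketch: only the generic metric-name map is shown.
type InferencePoolSpec struct {
	// MetricNames maps protocol-defined keys to the metric names actually
	// exposed by the model servers in this pool, e.g.
	//   kv-cache-utilization -> the server's cache-usage gauge.
	MetricNames map[string]string `json:"metricNames,omitempty"`
}
```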
