Add model server configurations to InferencePool #163

Closed
liu-cong wants to merge 1 commit

Conversation

@liu-cong (Contributor) commented Jan 7, 2025

Add a ModelServerAttributes field to capture the following information:

  • Whether LoRA is enabled: we need this to decide whether to scrape LoRA metrics and apply the LoRA affinity algorithm. Today, if LoRA is not enabled, the LoRA metrics are not available and the ext-proc prints spammy error logs.
  • The type of the model server: this allows us to pick bespoke logic for each model server. For example, vLLM implements LoRA adapter reporting as a Gauge metric, with the LoRA adapters as a label and a timestamp as the value. While we would like all model servers to use the same implementation, we cannot guarantee that. A rough sketch of such a field follows after this list.
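As a rough illustration only (this is not the exact schema in the commit; the type and field names below are hypothetical), the proposed attributes could look roughly like this:

```go
package v1alpha1

// ModelServerType names a known model server implementation; values are illustrative.
type ModelServerType string

const (
	ModelServerTypeVLLM ModelServerType = "vllm"
)

// ModelServerAttributes is a hypothetical sketch of the proposed field.
type ModelServerAttributes struct {
	// LoRAEnabled indicates whether the model servers in this pool serve LoRA
	// adapters. When false, the ext-proc would skip LoRA metric scraping and
	// LoRA affinity scheduling, avoiding the spammy error logs.
	LoRAEnabled bool `json:"loraEnabled,omitempty"`

	// Type selects any bespoke, per-server logic, e.g. how vLLM reports LoRA
	// adapters via a Gauge metric.
	Type ModelServerType `json:"type,omitempty"`
}
```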

@k8s-ci-robot (Contributor)
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: liu-cong
Once this PR has been reviewed and has the lgtm label, please assign arangogutierrez for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot requested review from ahg-g and kfswain on Jan 7, 2025 04:37
@k8s-ci-robot added labels on Jan 7, 2025: cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) and size/L (denotes a PR that changes 100-499 lines, ignoring generated files)
@k8s-ci-robot (Contributor)
@liu-cong: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-gateway-api-inference-extension-verify-main | b0df805 | link | true | /test pull-gateway-api-inference-extension-verify-main |
| pull-gateway-api-inference-extension-test-unit-main | b0df805 | link | true | /test pull-gateway-api-inference-extension-test-unit-main |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@kfswain (Collaborator) commented Jan 7, 2025

I think we should be cautious about adjusting the API.

I think both your bullet points are handled by a well-defined model server protocol.

@kfswain (Collaborator) commented Jan 7, 2025

Additionally, a user could, in theory, mix and match their model servers and we should not be concerned with that, since they would all implement the same protocol.

@liu-cong (Contributor, Author) commented Jan 7, 2025

> I think both your bullet points are handled by a well-defined model server protocol

Valid point. The challenge is that it will take time to get the protocol implemented, and in the meantime we don't want to block development of the extension.

What do you think about the tradeoff of adding these as flags to the ext-proc binary instead? That way we don't need to change the API while still unblocking short-term development.
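A minimal sketch of that tradeoff, assuming hypothetical flag names (the flags actually added to the ext-proc binary may differ):

```go
package main

import (
	"flag"
	"fmt"
)

// Hypothetical flags for illustration only.
var (
	loraEnabled = flag.Bool("lora-enabled", false,
		"Whether the model servers serve LoRA adapters; enables LoRA metric scraping and LoRA affinity.")
	modelServerType = flag.String("model-server-type", "vllm",
		"Type of the model server, used to select bespoke metric parsing.")
)

func main() {
	flag.Parse()
	// The values would be wired into the metrics scraper and scheduler here.
	fmt.Printf("loraEnabled=%v modelServerType=%s\n", *loraEnabled, *modelServerType)
}
```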

@kfswain (Collaborator) commented Jan 7, 2025

> What do you think about the tradeoff of adding these as flags to the ext-proc binary instead? That way we don't need to change the API while still unblocking short-term development.

Yeah I think that's a great tradeoff. @ahg-g WDYT?

@liu-cong (Contributor, Author) commented Jan 7, 2025

/close

Given the discussion above, this can be handled via an ext-proc flag in the short term, and as much as possible by the model server protocol in the long term (#164).

@liu-cong closed this on Jan 7, 2025
@ahg-g (Contributor) commented Jan 7, 2025

+1; we certainly don't want the API to explicitly list the different model servers. We need to define a protocol, and if the protocol requires setting some parameters, those could be part of the API, but they should not be specific to a particular model server.

For example, we could hypothetically allow configuring metric names in the API (e.g., the name of the metric that represents kv-cache utilization). But even those could be expressed as a generic key/value map in the API, because different extensions may rely on completely different sets of metrics, and the model server protocol could define the keys for well-known metrics.
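A sketch of that key/value idea, with hypothetical field, key, and metric names (not an agreed-upon API):

```go
package v1alpha1

// Hypothetical well-known keys that a model server protocol could define.
const (
	MetricKeyKVCacheUtilization = "kv-cache-utilization"
	MetricKeyQueueLength        = "queue-length"
)

// InferencePoolSpec sketch: only the generic metric-name map is shown.
type InferencePoolSpec struct {
	// MetricNames maps protocol-defined keys to the metric names actually
	// exposed by the model servers in this pool, e.g.
	//   kv-cache-utilization -> the server's cache-usage gauge.
	MetricNames map[string]string `json:"metricNames,omitempty"`
}
```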
