
KEP-3243: Update the design to mutate the label selector based on matchLabelKeys at api-server instead of the scheduler handling it #5033

Open
wants to merge 2 commits into base: master

Conversation

@mochizuki875 (Member)

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 10, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mochizuki875
Once this PR has been reviewed and has the lgtm label, please assign ahg-g for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 10, 2025
@mochizuki875 (Member Author)

/cc @sanposhiho

@sanposhiho (Member)

sanposhiho commented Jan 10, 2025

/cc @alculquicondor
Do we need /lead-opt-in at #3243?

@wojtek-t (or @alculquicondor) do we need PRR review in this case? (I suppose Yes?)

@sanposhiho (Member)

/retitle KEP-3243: Update the design to mutate the label selector based on matchLabelKeys at api-server instead of the scheduler handling it

@k8s-ci-robot k8s-ci-robot changed the title KEP-3243: Update description related to a labelSelector KEP-3243: Update the design to mutate the label selector based on matchLabelKeys at api-server instead of the scheduler handling it Jan 10, 2025
@alculquicondor (Member)

cc @dom4ha

@sanposhiho (Member) left a comment:

Update the content based on the conclusion kubernetes/kubernetes#129480 (comment)

existing pods over which spreading will be calculated.

A new field named `MatchLabelKeys` will be introduced to `TopologySpreadConstraint`:
A new optional field named `MatchLabelKeys` will be introduced to `TopologySpreadConstraint`.

We should keep this part.

Suggested change
A new optional field named `MatchLabelKeys` will be introduced to `TopologySpreadConstraint`.
A new optional field named `MatchLabelKeys` will be introduced to `TopologySpreadConstraint`.
Currently, when scheduling a pod, the `LabelSelector` defined in the pod is used
to identify the group of pods over which spreading will be calculated.
`MatchLabelKeys` adds another constraint to how this group of pods is identified

@sanposhiho (Member) left a comment:

You need to update the other sections as well, such as the Test Plan and the other PRR questions.

@sanposhiho (Member) left a comment:

Also, please add the current implementation to the Alternatives section and describe why we decided to move to a new approach.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 14, 2025
@mochizuki875
Copy link
Member Author

I appreciate your comments.
I've addressed them.

@mochizuki875 mochizuki875 force-pushed the matchlabelkeys-to-podtopologyspread branch from 71f1c8c to c6e0a76 Compare January 14, 2025 09:04
Use `pod.generateName` to distinguish new/old pods that belong to the
revisions of the same workload in the scheduler plugin. It was decided not to
support this for the following reason: the scheduler needs to stay general-purpose,
and a scheduler plugin shouldn't give special treatment to any particular labels/fields.

### remove MatchLabelKeys implementation from the scheduler plugin

Suggested change
### remove MatchLabelKeys implementation from the scheduler plugin
### implement MatchLabelKeys in only either the scheduler plugin or kube-apiserver

Then, briefly mention why we have to implement it in kube-apiserver too.

Comment on lines +376 to +378
kube-scheduler will also be aware of `matchLabelKeys` and gracefully handle the same labels.
This is for the Cluster-level default constraints by
`matchLabelKeys: ["pod-template-hash"]`.([#129198](https://github.com/kubernetes/kubernetes/issues/129198))

Suggested change
kube-scheduler will also be aware of `matchLabelKeys` and gracefully handle the same labels.
This is for the Cluster-level default constraints by
`matchLabelKeys: ["pod-template-hash"]`.([#129198](https://github.com/kubernetes/kubernetes/issues/129198))
Also, kube-scheduler handles `matchLabelKeys` if the cluster-level default constraints are configured with `matchLabelKeys`.
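For reference, a cluster-level default constraint carrying `matchLabelKeys` would be defined in the scheduler configuration, roughly along these lines (an illustrative sketch using the `PodTopologySpread` plugin args; the exact values here are assumptions, not part of this PR):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: PodTopologySpread
        args:
          defaultingType: List
          defaultConstraints:
            - maxSkew: 1
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: ScheduleAnyway
              matchLabelKeys:
                - pod-template-hash
```

Because these defaults are applied by kube-scheduler, not persisted through kube-apiserver, the scheduler has to resolve `matchLabelKeys` itself in this case.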

disabled, the field `matchLabelKeys` is preserved if it was already set in the
persisted Pod object, otherwise it is silently dropped; moreover, kube-scheduler
will ignore the field and continue to behave as before.
disabled, the field `matchLabelKeys` and corresponding`labelSelector` are preserved

Suggested change
disabled, the field `matchLabelKeys` and corresponding`labelSelector` are preserved
disabled, the field `matchLabelKeys` and corresponding `labelSelector` are preserved

Comment on lines +383 to +384
creation will be rejected by kube-apiserver; moreover, kube-scheduler will ignore the
field and continue to behave as before.

kube-scheduler cannot determine which label selector(s) were generated from matchLabelKeys at kube-apiserver, and hence it cannot ignore matchLabelKeys even after the downgrade. The cluster-level default constraints configuration is the exception, though.

Suggested change
creation will be rejected by kube-apiserver; moreover, kube-scheduler will ignore the
field and continue to behave as before.
creation will be rejected by kube-apiserver.
Also, kube-scheduler will ignore matchLabelKeys in the cluster-level default constraints configuration.

In the event of a downgrade, kube-scheduler will ignore `MatchLabelKeys` even if it was set.
In the event of a downgrade, kube-apiserver will reject pod creation with `matchLabelKeys` in `TopologySpreadConstraint`.
But, regarding existing pods, we leave `matchLabelKeys` and the generated `LabelSelector` as they are even after the downgrade.
kube-scheduler will ignore `MatchLabelKeys` even if it was set.

ditto

Suggested change
kube-scheduler will ignore `MatchLabelKeys` even if it was set.
kube-scheduler will ignore `MatchLabelKeys` if it was set in the cluster-level default constraints configuration.

Comment on lines +654 to +655
disabling the feature gate, however kube-scheduler will not take the MatchLabelKeys
field into account.

Suggested change
disabling the feature gate, however kube-scheduler will not take the MatchLabelKeys
field into account.
disabling the feature gate.

Comment on lines +937 to +939
kube-scheduler also looks up the label values from the pod and checks if those labels
are included in `LabelSelector`. If not, kube-scheduler will take those labels and AND
with `LabelSelector`.

Suggested change
kube-scheduler also looks up the label values from the pod and checks if those labels
are included in `LabelSelector`. If not, kube-scheduler will take those labels and AND
with `LabelSelector`.
kube-scheduler also handles matchLabelKeys if the cluster-level default constraints have it.

Comment on lines +187 to +190
kube-scheduler will also look up the label values from the pod and check if those
labels are included in `LabelSelector`. If not, kube-scheduler will take those labels
and AND with `LabelSelector` to identify the group of existing pods over which the
spreading skew will be calculated.

Suggested change
kube-scheduler will also look up the label values from the pod and check if those
labels are included in `LabelSelector`. If not, kube-scheduler will take those labels
and AND with `LabelSelector` to identify the group of existing pods over which the
spreading skew will be calculated.
kube-scheduler will also handle it if the cluster-level default constraints include one with `MatchLabelKeys`.

which the spreading skew will be calculated.
`TopologySpreadConstraint` which represents a set of label keys only.
kube-apiserver will use those keys to look up label values from the incoming pod
and those labels are merged to `LabelSelector`.

Suggested change
and those labels are merged to `LabelSelector`.
and those key-value labels are ANDed with `LabelSelector` to identify the group of existing pods over
which the spreading skew will be calculated.

@sanposhiho (Member)

/assign @wojtek-t

for a PRR part reviewing. Please assign another person if needed.

@wojtek-t (Member)

for a PRR part reviewing. Please assign another person if needed.

Queued - although I will wait for SIG approval to happen first.

@mochizuki875 (Member Author)

mochizuki875 commented Jan 17, 2025

@sanposhiho
Thank you for your comments.

I have a question.
I understand that kube-scheduler has two roles:

  1. Read the matchLabelKeys of the cluster-level default constraints from the scheduler configuration, get the key-value labels corresponding to matchLabelKeys from the pod's labels, and logically AND them with labelSelector internally to schedule the pod.
  2. Check whether the key-value labels corresponding to matchLabelKeys have been set in labelSelector by kube-apiserver or by users, and if not, logically AND them with labelSelector internally to schedule the pod.

Is my understanding correct?
I'm concerned about 2.

For example, if the following Pod manifest is applied:

apiVersion: v1
kind: Pod
metadata:
  name: sample
  labels:
    app: sample
...
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector: {}
    matchLabelKeys:
    - app

kube-apiserver will usually merge the key-value labels corresponding to matchLabelKeys into labelSelector, so kube-scheduler will not need to handle matchLabelKeys:

  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchExpressions:
+      - key: app
+        operator: In
+        values:
+        - sample
    matchLabelKeys:
    - app

It is the same if the user sets the key-value labels corresponding to matchLabelKeys in labelSelector themselves:

apiVersion: v1
kind: Pod
metadata:
  name: sample
  labels:
    app: sample
...
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchExpressions:
      - key: app
        operator: In
        values:
        - sample
    matchLabelKeys:
    - app

However, if key-value labels corresponding to matchLabelKeys are not merged into labelSelector by kube-apiserver for some reason, kube-scheduler will get key-value labels from pod labels and logically combine them with labelSelector internally to schedule pod.(I'm wondering if this case should be considered.)

  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector: 
      matchExpressions:
      - key: foo
        operator: In
        values:
        - bar
    matchLabelKeys:
    - app
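To make the merge semantics in these examples concrete, here is a rough Go sketch (the types and the `mergeMatchLabelKeys` helper are illustrative stand-ins, not the actual kube-apiserver code): keys already constrained in the selector are left alone, and any other matchLabelKeys are looked up in the pod's labels and ANDed in.

```go
package main

import "fmt"

// Simplified stand-ins for the API types; the real ones live in
// k8s.io/apimachinery and k8s.io/api. This is an illustrative sketch only.
type Requirement struct {
	Key      string
	Operator string
	Values   []string
}

type LabelSelector struct {
	MatchExpressions []Requirement
}

// mergeMatchLabelKeys sketches the discussed apiserver-side behavior:
// for each key in matchLabelKeys, look up the pod's label value and AND it
// into the selector, unless the selector already constrains that key.
func mergeMatchLabelKeys(sel *LabelSelector, matchLabelKeys []string, podLabels map[string]string) {
	constrained := map[string]bool{}
	for _, r := range sel.MatchExpressions {
		constrained[r.Key] = true
	}
	for _, key := range matchLabelKeys {
		if constrained[key] {
			continue // already set by the user; merging would be redundant
		}
		if v, ok := podLabels[key]; ok {
			sel.MatchExpressions = append(sel.MatchExpressions, Requirement{
				Key: key, Operator: "In", Values: []string{v},
			})
		}
	}
}

func main() {
	podLabels := map[string]string{"app": "sample"}

	// Case 1: empty selector -- a requirement for `app` is merged in.
	sel := &LabelSelector{}
	mergeMatchLabelKeys(sel, []string{"app"}, podLabels)
	fmt.Println(sel.MatchExpressions)

	// Case 2: the user already constrained `app` -- the merge is a no-op.
	sel2 := &LabelSelector{MatchExpressions: []Requirement{
		{Key: "app", Operator: "In", Values: []string{"sample"}},
	}}
	mergeMatchLabelKeys(sel2, []string{"app"}, podLabels)
	fmt.Println(len(sel2.MatchExpressions))
}
```

Under this sketch, the third manifest above (selector on `foo`, unmerged `app` key) is exactly the state this helper would never produce, which is why the question is whether the scheduler must still defend against it.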

@sanposhiho (Member)

sanposhiho commented Jan 17, 2025

As my suggested changes above imply, the scheduler only has one role, which is to handle matchLabelKeys within the cluster wide default constraints. In other cases (like a user created a Pod with matchLabelKeys), we assume kube-apiserver handles it and kube-scheduler doesn't have to worry about matchLabelKeys, right?

... and you know what, I found I forgot about the update path issue that I raised myself while I'm writing this comment. 😅
I'll revisit my suggestions made above.

So, answering your questions,

However, if key-value labels corresponding to matchLabelKeys are not merged into labelSelector by kube-apiserver for some reason, kube-scheduler will get key-value labels from pod labels and logically combine them with labelSelector internally to schedule pod.(I'm wondering if this case should be considered.)

Usually, no. Like my above strikethrough-ed comment argued, we can just assume kube-apiserver handles all matchLabelKeys that users added. And, the scheduler has to only worry about the cluster wide default constraints.
HOWEVER, there's an exception, which is during a cluster upgrade. We need to consider scenarios where kube-apiserver does not handle matchLabelKeys because, at the time of the cluster upgrade, there could be unscheduled pods that were created through the old kube-apiserver and have to be handled by the new version of kube-scheduler. In this case, kube-scheduler needs to handle matchLabelKeys even though those matchLabelKeys came from users, not from the cluster-wide default constraints.
That being said, after one release with this new design, so basically at the next release cycle, we can change kube-scheduler to only handle matchLabelKeys from the cluster wide default constraints.

disabled, the field `matchLabelKeys` and corresponding`labelSelector` are preserved
if it was already set in the persisted Pod object, otherwise new Pod with the field
creation will be rejected by kube-apiserver; moreover, kube-scheduler will ignore the
field and continue to behave as before.


So, I think here you can add another section [v1.33] design change and a safe upgrade path, and describe:

  • For a safe upgrade path from v1.32 to v1.33, kube-scheduler would handle not only matchLabelKeys from the default constraints, but also matchLabelKeys on all incoming pods during v1.33. And, you also need to mention the reason, as described in my comment.
  • So, matchLabelKeys within incoming pods is handled by both kube-apiserver and kube-scheduler in v1.33.
  • We'll change kube-scheduler to only handle matchLabelKeys from the default constraints in v1.34 for efficiency, assuming matchLabelKeys of all incoming pods is handled by kube-apiserver.

@mochizuki875 (Member Author)

Thanks! This is exactly what concerned me while revising the KEP draft.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: Needs Triage
5 participants