# KEP-3243: Update the design to mutate the label selector based on matchLabelKeys at api-server instead of the scheduler handling it (#5033)
@@ -109,6 +109,8 @@ tags, and then generate with `hack/update-toc.sh`.
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
  - [use pod generateName](#use-pod-generatename)
  - [remove MatchLabelKeys implementation from the scheduler plugin](#remove-matchlabelkeys-implementation-from-the-scheduler-plugin)
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
<!-- /toc -->

@@ -179,10 +181,13 @@ which spreading is applied using a LabelSelector. This means the user should
know the exact label key and value when defining the pod spec.

This KEP proposes a complementary field to LabelSelector named `MatchLabelKeys` in
`TopologySpreadConstraint` which represents a set of label keys only. The scheduler
will use those keys to look up label values from the incoming pod; those key-value
labels are ANDed with `LabelSelector` to identify the group of existing pods over
which the spreading skew will be calculated.
`TopologySpreadConstraint` which represents a set of label keys only.
kube-apiserver will use those keys to look up label values from the incoming pod,
and those labels are merged into `LabelSelector`.
kube-scheduler will also look up the label values from the pod and check whether those
labels are already included in `LabelSelector`. If not, kube-scheduler will AND those
labels with `LabelSelector` to identify the group of existing pods over which the
spreading skew will be calculated.
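
Conceptually, both paths produce the same effective grouping selector. The following is a minimal sketch of that semantics (illustrative only; the package and function names are invented for this sketch and are not the actual kube-apiserver or kube-scheduler code):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/selection"
)

// effectiveSelector ANDs the constraint's labelSelector with a
// "key in (value)" requirement for every matchLabelKeys entry that
// exists in the incoming pod's labels; keys missing from the pod are skipped.
func effectiveSelector(pod *corev1.Pod, c corev1.TopologySpreadConstraint) (labels.Selector, error) {
	selector, err := metav1.LabelSelectorAsSelector(c.LabelSelector)
	if err != nil {
		return nil, err
	}
	for _, key := range c.MatchLabelKeys {
		value, ok := pod.Labels[key]
		if !ok {
			continue
		}
		req, err := labels.NewRequirement(key, selection.In, []string{value})
		if err != nil {
			return nil, err
		}
		selector = selector.Add(*req)
	}
	return selector, nil
}
```

For example, for a pod labeled `app: sample` with `matchLabelKeys: ["app"]`, the resulting selector is the user-provided `labelSelector` ANDed with `app in (sample)`.
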
The main case that this new way for identifying pods will enable is constraining
skew spreading calculation to happen at the revision level in Deployments during

@@ -307,15 +312,10 @@ required) or even code snippets. If there's any ambiguity about HOW your
proposal will be implemented, this is the place to discuss them.
-->

A new field named `MatchLabelKeys` will be added to `TopologySpreadConstraint`.
A new optional field named `MatchLabelKeys` will be introduced to `TopologySpreadConstraint`.

Currently, when scheduling a pod, the `LabelSelector` defined in the pod is used
to identify the group of pods over which spreading will be calculated.
`MatchLabelKeys` adds another constraint to how this group of pods is identified:
the scheduler will use those keys to look up label values from the incoming pod;
and those key-value labels are ANDed with `LabelSelector` to select the group of
existing pods over which spreading will be calculated.

A new field named `MatchLabelKeys` will be introduced to `TopologySpreadConstraint`:
`MatchLabelKeys` adds another constraint to how this group of pods is identified.
```go
type TopologySpreadConstraint struct {
	MaxSkew int32

@@ -333,27 +333,55 @@ type TopologySpreadConstraint struct {
}
```
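
Because the diff above elides the middle of the struct, here is an abridged sketch of how the type could look with the new field added (an assumption for illustration, not the final API definition; the doc comment is paraphrased from this KEP):

```go
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// UnsatisfiableConstraintAction mirrors the existing enum in the Pod API.
type UnsatisfiableConstraintAction string

// TopologySpreadConstraint is an abridged sketch of the API type with the new field.
type TopologySpreadConstraint struct {
	MaxSkew           int32
	TopologyKey       string
	WhenUnsatisfiable UnsatisfiableConstraintAction
	LabelSelector     *metav1.LabelSelector

	// MatchLabelKeys is a set of pod label keys used to select the pods over which
	// spreading will be calculated. The keys are used to look up values from the
	// incoming pod's labels; those key-value labels are merged into LabelSelector by
	// kube-apiserver and ANDed with LabelSelector by kube-scheduler.
	// +optional
	MatchLabelKeys []string
}
```
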
Examples of use are as follows:
When a Pod is created, kube-apiserver will obtain the labels from the pod
by the keys in `matchLabelKeys`, and `key in (value)` is merged into the `LabelSelector`
of `TopologySpreadConstraint`.

For example, when this sample Pod is created,

```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  matchLabelKeys:
    - app
    - pod-template-hash
apiVersion: v1
kind: Pod
metadata:
  name: sample
  labels:
    app: sample
...
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector: {}
    matchLabelKeys: # ADDED
      - app
```

kube-apiserver modifies the `labelSelector` like the following:

```diff
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
+     matchExpressions:
+     - key: app
+       operator: In
+       values:
+       - sample
    matchLabelKeys:
    - app
```
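
A minimal sketch of that mutation, assuming it runs in kube-apiserver's pod-create path (the function name `mergeMatchLabelKeys` and its exact placement are illustrative, not the actual registry/strategy code):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// mergeMatchLabelKeys rewrites each topology spread constraint's labelSelector by
// appending a "key in (value)" requirement for every matchLabelKeys entry found in
// the pod's labels, mirroring the example above.
func mergeMatchLabelKeys(pod *corev1.Pod) {
	for i := range pod.Spec.TopologySpreadConstraints {
		c := &pod.Spec.TopologySpreadConstraints[i]
		if len(c.MatchLabelKeys) == 0 {
			continue
		}
		if c.LabelSelector == nil {
			c.LabelSelector = &metav1.LabelSelector{}
		}
		for _, key := range c.MatchLabelKeys {
			value, ok := pod.Labels[key]
			if !ok {
				continue // keys not present on the pod's labels are skipped in this sketch
			}
			c.LabelSelector.MatchExpressions = append(c.LabelSelector.MatchExpressions,
				metav1.LabelSelectorRequirement{
					Key:      key,
					Operator: metav1.LabelSelectorOpIn,
					Values:   []string{value},
				})
		}
	}
}
```
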
The scheduler plugin `PodTopologySpread` will obtain the labels from the pod
labels by the keys in `matchLabelKeys`. The obtained labels will be merged
into `labelSelector` of `topologySpreadConstraints` to filter and group pods.
The pods belonging to the same group will be part of the spreading in
`PodTopologySpread`.
kube-scheduler will also be aware of `matchLabelKeys` and gracefully handle labels
that have already been merged into `labelSelector` by kube-apiserver.
This is required to support cluster-level default constraints with
`matchLabelKeys: ["pod-template-hash"]` ([#129198](https://github.com/kubernetes/kubernetes/issues/129198)).
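
A sketch of that scheduler-side handling (the helper below is illustrative, not the actual `PodTopologySpread` plugin code): requirements that kube-apiserver already merged into `labelSelector` are detected and not ANDed a second time, while `matchLabelKeys` coming only from cluster-level default constraints still takes effect.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/selection"
)

// groupSelector builds the selector used to group pods for skew calculation.
// A matchLabelKeys entry is skipped when an equivalent "key in (value)"
// requirement is already present in the constraint's labelSelector.
func groupSelector(pod *corev1.Pod, c corev1.TopologySpreadConstraint) (labels.Selector, error) {
	selector, err := metav1.LabelSelectorAsSelector(c.LabelSelector)
	if err != nil {
		return nil, err
	}
	merged := map[string]string{}
	if c.LabelSelector != nil {
		for _, req := range c.LabelSelector.MatchExpressions {
			if req.Operator == metav1.LabelSelectorOpIn && len(req.Values) == 1 {
				merged[req.Key] = req.Values[0]
			}
		}
	}
	for _, key := range c.MatchLabelKeys {
		value, ok := pod.Labels[key]
		if !ok {
			continue
		}
		if merged[key] == value {
			continue // already merged into labelSelector by kube-apiserver
		}
		req, err := labels.NewRequirement(key, selection.In, []string{value})
		if err != nil {
			return nil, err
		}
		selector = selector.Add(*req)
	}
	return selector, nil
}
```
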
Finally, the feature will be guarded by a new feature flag. If the feature is
disabled, the field `matchLabelKeys` is preserved if it was already set in the
persisted Pod object, otherwise it is silently dropped; moreover, kube-scheduler
will ignore the field and continue to behave as before.
disabled, the field `matchLabelKeys` and the corresponding `labelSelector` are preserved
if they were already set in the persisted Pod object; otherwise, creation of a new Pod
with the field will be rejected by kube-apiserver. Moreover, kube-scheduler will ignore
the field and continue to behave as before.
Comment on lines
+383
to
+384
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. kube-scheduler cannot determine which label selector(s) is generated by matchLabelKeys at kube-apiserver, and hence it couldn't ignore
Suggested change
|
||||||||||||
|
||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So, I think here you can add another section
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks! This is exactly what concerned me while revising the KEP draft. |
||||||||||||
### Test Plan

@@ -400,8 +428,9 @@ This can inform certain test coverage improvements that we want to do before
extending the production code to implement this enhancement.
-->

- `k8s.io/kubernetes/pkg/scheduler/framework/plugins/podtopologyspread`: `06-07` - `86%`
- `k8s.io/kubernetes/pkg/scheduler/framework/plugins/podtopologyspread/plugin.go`: `06-07` - `73.1%`
- `k8s.io/kubernetes/pkg/scheduler/framework/plugins/podtopologyspread`: `2025-01-14 JST (The commit hash: ccd2b4e8a719dabe8605b1e6b2e74bb5352696e1)` - `87.5%`
- `k8s.io/kubernetes/pkg/scheduler/framework/plugins/podtopologyspread/plugin.go`: `2025-01-14 JST (The commit hash: ccd2b4e8a719dabe8605b1e6b2e74bb5352696e1)` - `84.8%`
- `k8s.io/kubernetes/pkg/registry/core/pod/strategy.go`: `2025-01-14 JST (The commit hash: ccd2b4e8a719dabe8605b1e6b2e74bb5352696e1)` - `65%`

##### Integration tests

@@ -532,7 +561,9 @@ enhancement:

In the event of an upgrade, kube-apiserver will start to accept and store the field `MatchLabelKeys`.

In the event of a downgrade, kube-scheduler will ignore `MatchLabelKeys` even if it was set.
In the event of a downgrade, kube-apiserver will reject pod creation with `matchLabelKeys` in `TopologySpreadConstraint`.
However, existing pods keep `matchLabelKeys` and the generated `LabelSelector` even after the downgrade.
kube-scheduler will ignore `MatchLabelKeys` even if it was set.
### Version Skew Strategy

@@ -619,8 +650,11 @@ NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
The feature can be disabled in Alpha and Beta versions by restarting
kube-apiserver and kube-scheduler with the feature gate off.
One caveat is that pods that used the feature will continue to have the
MatchLabelKeys field set even after disabling the feature gate,
however kube-scheduler will not take the field into account.
MatchLabelKeys field and the corresponding LabelSelector set even after
disabling the feature gate; however, kube-scheduler will not take the MatchLabelKeys
field into account.
In terms of Stable versions, users can choose to opt out by not setting
the matchLabelKeys field.

###### What happens if we reenable the feature if it was previously rolled back?
Newly created pods need to follow this policy when scheduling. Old pods will

@@ -659,7 +693,8 @@ feature flags will be enabled on some API servers and not others during the
rollout. Similarly, consider large clusters and how enablement/disablement
will roll out across nodes.
-->
It won't impact already running workloads because it is an opt-in feature in scheduler.
It won't impact already running workloads because it is an opt-in feature in kube-apiserver
and kube-scheduler.
But during a rolling upgrade, if some apiservers have not enabled the feature, they will not
be able to accept and store the field "MatchLabelKeys", and the pods associated with these
apiservers will not be able to use this feature. As a result, pods belonging to the

@@ -896,8 +931,14 @@ Think about adding additional work or introducing new steps in between

[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
-->
Yes, there is additional work: the scheduler will use the keys in `matchLabelKeys` to look up label values from the pod and AND them with `LabelSelector`.
It may result in a very small impact on scheduling latency, which directly contributes to the pod-startup-latency SLO.
Yes, there is additional work:
kube-apiserver uses the keys in `matchLabelKeys` to look up label values from the pod,
and changes `LabelSelector` accordingly.
kube-scheduler also looks up the label values from the pod and checks if those labels
are included in `LabelSelector`. If not, kube-scheduler will AND those labels
with `LabelSelector`.

The impact on the latency of pod creation requests in kube-apiserver and kube-scheduler
should be negligible.

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

@@ -937,7 +978,7 @@ details). For now, we leave it here.

###### How does this feature react if the API server and/or etcd is unavailable?
If the API server and/or etcd is not available, this feature will not be available.
This is because the scheduler needs to update the scheduling results to the pod via the API server/etcd.
This is because the kube-scheduler needs to update the scheduling results to the pod via the API server/etcd.

###### What are other known failure modes?

@@ -963,7 +1004,7 @@ N/A
- Check the metric `schedule_attempts_total{result="error|unschedulable"}` to determine if the number
  of attempts increased. If it increased, you need to determine the cause of the failure from the events of
  the pod. If it's caused by the plugin `PodTopologySpread`, you can further analyze the problem by looking
  at the scheduler log.
  at the kube-scheduler log.

## Implementation History

@@ -996,11 +1037,20 @@ not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->

### use pod generateName
Use `pod.generateName` to distinguish new/old pods that belong to the
revisions of the same workload in the scheduler plugin. It was decided not to
support this for the following reason: the scheduler needs to stay general-purpose,
and the scheduler plugin shouldn't give special treatment to any particular labels/fields.

### remove MatchLabelKeys implementation from the scheduler plugin
Remove the handling of `MatchLabelKeys` from the scheduler plugin so that
only kube-apiserver handles `MatchLabelKeys` and updates `LabelSelector`.

However, this idea was rejected because:
- This approach prevents cluster-level default constraints with `matchLabelKeys: ["pod-template-hash"]` from working ([#129198](https://github.com/kubernetes/kubernetes/issues/129198)), because kube-apiserver can't be aware of the kube-scheduler configuration.
- The current implementation of the scheduler plugin is simple, and the risk of increased maintenance overhead is low.

## Infrastructure Needed (Optional)

<!--