Commit 6a2cc3f

Expand on review comments
1 parent 0ff3958 commit 6a2cc3f

File tree

1 file changed: +182 −81 lines changed
  • keps/sig-scheduling/5710-workload-aware-preemption

keps/sig-scheduling/5710-workload-aware-preemption/README.md

Lines changed: 182 additions & 81 deletions
@@ -143,6 +143,10 @@ and many others) and bring the true value for every Kubernetes user.
   (e.g. caused by hardware failures)
 - Design rescheduling for workloads that will be preempted (rescheduling will
   be addressed in a separate dedicated KEP)
+- Change the preemption principle of avoiding preemption if a workload/pod can be scheduled without it.
+  If we decide to change that, it will be addressed in a dedicated KEP.
+- Propose any tradeoff between preemption and cluster scale-up.
+- Design workload-level preemption triggered by external schedulers.
 
 ## Proposal
 
@@ -248,6 +252,10 @@ unit being an arbitrary group of pods, in majority of real world usecases this i
 with the scheduling unit. In other words, the group of pods that should be preempted together matches
 a group that was initially scheduled together as a gang.
 
+Trying to formalize it, we define `WorkloadPortion` as one of {all pods in a PodGroup replica or
+a single pod}. With that definition, both the scheduling unit and preemption units can only be
+`WorkloadPortions`.
+
 In the future, we may want to support usecases when a single scheduling unit consists of multiple
 preemption groups, but we leave that usecase as a future extension (it can be addressed when we
 decide to extend Workload API with PodSubGroup concept - for more details see
@@ -257,12 +265,25 @@ be larger than scheduling unit.
 Based on that, we will extend the existing `GangSchedulingPolicy` as follows:
 
 ```golang
+// PreemptionMode describes the mode in which a PodGroup can be preempted.
+// +enum
+type PreemptionMode string
+
+const (
+  // PreemptionModePod means that individual pods can be preempted independently.
+  PreemptionModePod PreemptionMode = "Pod"
+  // PreemptionModePodGroup means that the whole PodGroup replica needs to be
+  // preempted together.
+  PreemptionModePodGroup PreemptionMode = "PodGroup"
+)
+
 type GangSchedulingPolicy struct {
   // Existing field(s).
 
-  // IsGangPreemptable defines whether all pods from this group should
-  // be preempted in all-or-nothing fashion.
-  IsGangPreemtable *bool
+  // PreemptionMode defines the mode in which a given PodGroup can be preempted.
+  // One of Pod, PodGroup.
+  // Defaults to Pod if unset.
+  PreemptionMode *PreemptionMode
 }
 ```
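
To illustrate how a scheduler could consume the new field, below is a minimal sketch of grouping candidate victims into preemption units. The `candidatePod` type and the `preemptionUnits` helper are purely illustrative and not part of this KEP; the only assumption taken from the proposal is that a PodGroup replica with `PreemptionModePodGroup` must be preempted as a whole, while everything else can be preempted pod by pod.

```golang
package main

import "fmt"

// PreemptionMode mirrors the enum proposed in the diff above.
type PreemptionMode string

const (
	PreemptionModePod      PreemptionMode = "Pod"
	PreemptionModePodGroup PreemptionMode = "PodGroup"
)

// candidatePod is a hypothetical, simplified view of a running pod considered as a victim.
type candidatePod struct {
	Name     string
	PodGroup string         // empty for pods that are not part of any PodGroup
	Mode     PreemptionMode // effective preemption mode of the owning PodGroup
}

// preemptionUnits groups candidate victims into units that must be preempted atomically:
// a whole PodGroup replica when its mode is PodGroup, a single pod otherwise.
func preemptionUnits(pods []candidatePod) [][]candidatePod {
	var units [][]candidatePod
	groups := map[string][]candidatePod{}
	var groupOrder []string
	for _, p := range pods {
		if p.PodGroup != "" && p.Mode == PreemptionModePodGroup {
			if _, seen := groups[p.PodGroup]; !seen {
				groupOrder = append(groupOrder, p.PodGroup)
			}
			groups[p.PodGroup] = append(groups[p.PodGroup], p)
			continue
		}
		units = append(units, []candidatePod{p}) // individually preemptable
	}
	for _, name := range groupOrder {
		units = append(units, groups[name])
	}
	return units
}

func main() {
	pods := []candidatePod{
		{Name: "train-0", PodGroup: "train", Mode: PreemptionModePodGroup},
		{Name: "train-1", PodGroup: "train", Mode: PreemptionModePodGroup},
		{Name: "batch-0", PodGroup: "batch", Mode: PreemptionModePod},
		{Name: "standalone", Mode: PreemptionModePod},
	}
	for _, unit := range preemptionUnits(pods) {
		fmt.Println(unit) // the two "train" pods come out as a single unit
	}
}
```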

@@ -317,30 +338,78 @@ object (Workload, PodGroup, PodSubGroup, ...) corresponding to this pod.
 
 
 There is one direct implication of the above - the `pod.Spec.PriorityClassName` and `pod.Spec.Priority`
-may no longer reflect the actual pod priority. This can be misleading to users.
+may no longer reflect the actual pod priority, which could be misleading to users.
 
 ```
 <<[UNRESOLVED priority divergence]>>
 There are several options for how we can approach it (from least to most invasive):
-- Explain via documentation
-- Validating that if a pod is referencing a workload, `pod.Spec.PriorityClassName` equals
-  `workload.Spec.PriorityClassName`. However, `Workload` object potentially may not exist
-  yet on pod creation.
-- Making `pod.Spec.PriorityClassName` and `pod.Spec.Priority` mutable fields and having a
-  controller responsible for reconciling these. However, that doesn't fully address the
-  problems as divergence between the pod and PodTemplate in true workload object could also
-  be misleading.
-
-The validation option seems like the best option, if we can address the problem of not-yet
-existing `Workload` object (reversed validation?).
+- Describe the possible divergence via documentation.
+- Expose the information about divergence in the API.
+  This would require introducing a new `Conditions` field in `workload.Status` and introducing
+  a dedicated condition like `PodsNotMatchingPriority` that will be set by either kube-scheduler
+  or a new workload-controller whenever it observes pods referencing a given `Workload` object
+  whose priority doesn't match the priority of the workload object.
+- Introducing an admission check to validate that if a pod is referencing a workload object, its
+  `pod.Spec.PriorityClassName` equals `workload.Spec.PriorityClassName`. However, we allow creating
+  pods before the workload object, and there doesn't seem to be an easy way to avoid races.
+- Making `pod.Spec.PriorityClassName` and `pod.Spec.Priority` mutable fields and having a new
+  workload controller responsible for reconciling these. However, that could introduce another
+  divergence between the priority of pods and the priority defined in the PodTemplate in true
+  workload objects, which would introduce a similar level of confusion for users.
+
+If we could address the race in validation, that would seem like the desired option. However,
+I don't see an easy way to do it.
+Given that, we suggest proceeding with just exposing the information about divergence in the
+Workload status (second option) and potentially improving it later.
 <<[/UNRESOLVED]>>
 ```
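
For the second option, a rough sketch of how the proposed `PodsNotMatchingPriority` condition could be maintained, assuming `workload.Status` gains a standard `Conditions` field. The `workloadStatus` type, the helper name, and the reason/message strings are illustrative only; `meta.SetStatusCondition` is the existing apimachinery helper for this kind of update.

```golang
package workload

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// workloadStatus is a stand-in for the proposed workload.Status with a Conditions field.
type workloadStatus struct {
	Conditions []metav1.Condition
}

// markPriorityDivergence records whether any pods referencing the Workload carry a priority
// that differs from workload.Spec.PriorityClassName. Hypothetical helper, not an agreed API.
func markPriorityDivergence(status *workloadStatus, divergentPods int, observedGeneration int64) {
	cond := metav1.Condition{
		Type:               "PodsNotMatchingPriority",
		Status:             metav1.ConditionFalse,
		Reason:             "PodPrioritiesMatch",
		Message:            "all observed pods match the workload priority",
		ObservedGeneration: observedGeneration,
	}
	if divergentPods > 0 {
		cond.Status = metav1.ConditionTrue
		cond.Reason = "PodPriorityDivergence"
		cond.Message = fmt.Sprintf("%d pod(s) reference a different PriorityClass than the Workload", divergentPods)
	}
	// SetStatusCondition only bumps LastTransitionTime when the condition status actually flips.
	meta.SetStatusCondition(&status.Conditions, cond)
}
```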
 
-The similar argument holds for preemption priority, but we argue that its mutable nature
-makes it infeasible for reconciling this information back to pod for scalability reasons
-(we can absolutely handle frequent updates to `Workload.Spec.PreemptionPriorityClassName`
-but we can't handle updating potentially hundreds of thousands of pods within that workload
-that frequently). In this case, we limit ourselves to documentation.
+It's worth mentioning here that we want to introduce the same defaulting rules for
+`workload.Spec.PriorityClassName` that we have for pods. Namely, if `PriorityClassName` is unset
+and there exists a PriorityClass marked as `globalDefault`, we default it to that value.
+This consistency will allow us to properly handle the case when users set neither pod
+nor workload priorities.
+Similarly, we will ensure that `PriorityClass.preemptionPolicy` works exactly the same way for
+workloads as for pods. Such level of consistency would make adoption of the Workload API much easier.
+
+Moving to `PreemptionPriorityClassName`, the same issue of confusion holds (the actual priority
+set at the pod level may not reflect the priority used for preemption). We argue that its mutable
+nature makes it infeasible to reconcile this information back to pods for scalability reasons
+(we can absolutely handle frequent updates to `Workload.Spec.PreemptionPriorityClassName`,
+but we can't handle updating potentially hundreds of thousands of pods within that workload
+that frequently). So in this case, we limit ourselves to documentation.
+
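
As an illustration of the defaulting rule described above, here is a sketch of resolving the effective workload priority (illustrative only; the real logic would live in the Workload defaulting/admission path, mirroring what the Priority admission plugin does for pods today):

```golang
package workload

import schedulingv1 "k8s.io/api/scheduling/v1"

// effectivePriority resolves the priority for a workload the same way it is resolved for pods:
// an explicit PriorityClassName wins; otherwise the PriorityClass marked globalDefault (if any)
// is used; otherwise the priority falls back to 0.
func effectivePriority(priorityClassName string, classes []schedulingv1.PriorityClass) (string, int32) {
	if priorityClassName != "" {
		for _, pc := range classes {
			if pc.Name == priorityClassName {
				return pc.Name, pc.Value
			}
		}
		// A real implementation would reject an unknown class in validation.
		return priorityClassName, 0
	}
	for _, pc := range classes {
		if pc.GlobalDefault {
			return pc.Name, pc.Value
		}
	}
	return "", 0
}
```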
+```
+<<[UNRESOLVED preemption cycles]>>
+If we allowed an arbitrary relation between scheduling priority and preemption priority,
+we could hit an infinite cycle of preemption. Consider an example when:
+- workload A has scheduling priority `high` and preemption priority `low`
+- workload B has scheduling priority `high` and preemption priority `low`
+In such case, workload A can preempt workload B (`high` > `low`), but then workload B can
+also preempt workload A. This is definitely not desired.
+We can avoid the infinite cycle by ensuring that `scheduling priority <= preemption priority`.
+
+However, this also opens a question of whether we should allow setting an arbitrarily high preemption
+priority for low scheduling priority workloads. Arguably, we can claim that scheduling priority
+should be the ultimate truth and if there is a workload with higher priority it should be
+able to preempt it.
+So the alternative model that we can consider is, instead of adding the concept of preemption
+priority, to introduce a concept of "preemption cost". In such a model, the workload with
+higher priority can always preempt lower priority ones, but if we need to choose between
+two workloads to preempt, such preemption cost may result in choosing the one with higher
+priority amongst these two. Consider the following example:
+- we want to schedule workload A with scheduling priority `high`
+- it needs to preempt one of the already running workloads
+- workload B has scheduling priority `med` but preemption cost `low`
+- workload C has scheduling priority `low` but preemption cost `high`
+In such case, the preemption cost would result in choosing workload B for preemption. But
+if it gets recreated, it will preempt workload C, causing unnecessary cascading preemption.
+This is the reason why a cost-based model was discarded.
+
+So for now, we suggest introducing only additional validation that the scheduling priority is
+not higher than the preemption priority.
+<<[/UNRESOLVED]>>
+```
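
A minimal sketch of the suggested validation (the helper name and the numeric-priority signature are illustrative; real validation would operate on the resolved values of the two PriorityClass references):

```golang
package validation

import "fmt"

// validatePreemptionPriority enforces the rule suggested above: a workload's scheduling
// priority must not be higher than its preemption priority, otherwise two workloads could
// preempt each other in an infinite cycle.
func validatePreemptionPriority(schedulingPriority, preemptionPriority int32) error {
	if schedulingPriority > preemptionPriority {
		return fmt.Errorf("scheduling priority (%d) must not be higher than preemption priority (%d)",
			schedulingPriority, preemptionPriority)
	}
	return nil
}
```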
 
 ```
 <<[UNRESOLVED priority status]>>
@@ -356,6 +425,7 @@ We should introduce/describe `workload.status` to reflect:
 We start with describing at the high level how the existing pod-level preemption algorithm works.
 Below, we will show how to generalize it to workloads.
 
+If a pod P can be scheduled without triggering preemption, we don't consider preemption at all.
 To check if a pod P can be scheduled on a given node with preemption we:
 
 1. Identify the list of potential victims - all running pods with priority lower than the new pod P.
@@ -368,8 +438,8 @@ To check if a pod P can be scheduled on a given node with preemption we:
 1. From remaining potential victims, we start to reprieve pods starting from the highest priority
    and working down until the set of remaining victims still keeps the node feasible.
 
-Once we compute the feasibility and list of victims for all nodes, we score that and choose the
-best options.
+Once we find enough nodes feasible for preemption and the list of victims for them, we score that and
+choose the best options.
 
 The above algorithm achieves our principles, as by eliminating highest priority pods first, it
 effectively tries to minimize the cascading preemptions later.
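
For illustration, a simplified sketch of the reprieve step above. The `victim` type and the `nodeStillFits` callback are stand-ins (the real check re-runs the Filter plugins against the node with the remaining victims removed); the point is only the highest-priority-first reprieve order.

```golang
package preemption

import "sort"

// victim is a simplified view of a lower-priority pod considered for preemption on a node.
type victim struct {
	Name     string
	Priority int32
}

// reprieve walks potential victims from highest to lowest priority and keeps (reprieves)
// every pod whose removal is not needed for the node to stay feasible for the preemptor.
// nodeStillFits must report whether the preemptor still fits if only `removed` are evicted.
func reprieve(potential []victim, nodeStillFits func(removed []victim) bool) []victim {
	sort.Slice(potential, func(i, j int) bool { return potential[i].Priority > potential[j].Priority })
	removed := append([]victim(nil), potential...) // start by assuming every potential victim is evicted
	for _, v := range potential {                  // walk from the highest priority down
		kept := without(removed, v.Name)
		if nodeStillFits(kept) {
			removed = kept // v is not needed as a victim; reprieve it
		}
	}
	return removed // the final, minimized victim list
}

func without(vs []victim, name string) []victim {
	out := make([]victim, 0, len(vs))
	for _, v := range vs {
		if v.Name != name {
			out = append(out, v)
		}
	}
	return out
}
```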
@@ -380,61 +450,86 @@ moving to the level of `Workload`, but also no longer operating at the level of
 We need to look at the cluster as a whole. With that in mind, keeping the algorithm efficient
 becomes a challenge, thus we modify to the approach below.
 
-To check if a workload W can be scheduled on a given cluster with preemption we:
-
-1. Identify the list of potential victims:
-   - all running workloads with (preemption) priority lower than the new workload W
-   - all individual pods (not being part of workloads) with priority lower than the new workload W
-
-1. If removing all the potential victims would not make the new workload W schedulable,
-   the workload is unschedulable even with preemption.
-
-```
-<<[UNRESOLVED PodDisruptionBudget violations]>>
-How critical is reprieving workloads and pods violating PodDisruptionBudgets? We no longer can
-afford full workload scheduling trying to reprieve every individual pod and workload.
-
-We could consider finding the first one that can't be reprieved using binary search, but if we
-can't reprieve any of those, learning about that would require O(N) full workload schedulings
-with N being number of workload/pods violating PDB.
-<<[/UNRESOLVED]>>
-```
-
-1. For remaining potential victims, using binary search across priorities find the minimal priority P
-   for which scheduling the new workload W doesn't require preempting any workloads and/or pods with
-   priority higher than P. This allows to reduce the potential cascading preemptions later.
-
-```
-<<[UNRESOLVED minimizing preemptions]>>
-The following algorithm is by far no optimal, but is simple to reason about and I would suggest it as
-a starting point:
-- assume that all potential victims on the list are removed and schedule the new workload W
-- go over the remaining potential victims starting from the highest priority and check if these can
-  be placed in the place they are currently running; if so remove from the potential victims
-
-As a bonus we may consider few potential placements of the new workload W here and choose the one that
-somehow optimizes the number of victims. But that will become more critical once we get to
-Topology-Aware-Scheduling and I would leave that optimization until then.
-<<[/UNRESOLVED]>>
-```
-
-```
-<<[UNRESOLVED sharing algorithms]>>
-The remaining question is to what extent we want to unify the preemption mechanism across
-pod-triggerred (existing algorithm) and workload-triggerred preemption.
-
-It might be tempting to start with a dedicated new implementation to reduce the risk. But the above
-proposal was structured such way to facilitate sharing:
-- once the new workload W is placed, going over the remaining potential victims and trying to
-  place them where they are currently running, is pretty much exactly what the current algorithm is
-  doing
-- considering "few potential placements" in the pod-triggerred case can be used as "try every node"
-  so effectively it's also the existing algorithm (just viewed from a slightly different angle
-
-So I would actually argue to we should refactor the existing preemption code and use that in both
-cases.
-<<[/UNRESOLVED]
-```
+At the same time, we need to support four cases:
+- individual pod as preemptor, individual pod(s) as victim(s)
+- individual pod as preemptor, pod group(s) (and individual pod(s)) as victim(s)
+- pod group as preemptor, individual pod(s) as victim(s)
+- pod group as preemptor, pod group(s) (and individual pod(s)) as victim(s)
+
+To achieve that, we don't want to multiply preemption algorithms and rather want to have a
+unified high-level approach (with potential minor tweaks per option).
+
+To check if a given preemptor (either a (gang) PodGroup G or an individual pod P) can be scheduled
+with preemption:
+
+1. Split the cluster into mutually-exclusive domains where a preemptor will be put:
+   - for pod P, it will always be individual nodes
+   - for pod group G, we will start with just one "whole cluster" domain; eventually, once we have
+     topology-aware scheduling, we will most probably inject some domain-based split here
+
+1. For every domain D computed above, run the following steps:
+
+1. Identify the list of all potential victims in that domain:
+   - all running workloads with (preemption) priority lower than the preemptor priority; note that
+     some pods from such a workload may be running outside of the currently considered domain D - they
+     need to contribute to scoring, but they won't contribute to feasibility of domain D
+   - all individual pods with priority lower than the preemptor priority
+
+1. If removing all potential victims would not make the preemptor schedulable, the preemptor
+   is unschedulable with preemption in the currently considered domain D.
+
+1. Sort all the potential victims to reflect their "importance" (from the most important to the
+   least important ones). Tentatively, the function will sort first by priority, and within a single
+   priority it will prioritize workloads over individual pods.
+
+1. Perform a best-effort reprieve of workloads and pods violating PodDisruptionBudgets. We achieve
+   it by scheduling and assuming the preemptor (assuming that all potential victims are removed),
+   and then iterating over potential victims that would violate a PodDisruptionBudget to check if
+   these can be placed in the exact same place they are running now. If they can, we simply leave
+   them where they are running now and remove them from the potential victims list.
+
+```
+<<[UNRESOLVED PodDisruptionBudget violations]>>
+The above reprieve works identically to the current algorithm if the domain D is a single node.
+For larger domains, different placements of a preemptor are potentially possible and may result
+in different sets of victims violating PodDisruptionBudgets remaining feasible.
+This means that the above algorithm is not optimizing for minimizing the number of victims that
+would violate their PodDisruptionBudgets.
+However, we claim that an algorithm optimizing for it would be extremely expensive computationally
+and propose to stick with this simple version at least for the foreseeable future.
+<<[/UNRESOLVED]>>
+```
+
+1. For the remaining potential victims, using binary search across priorities, find the minimal
+   priority N for which scheduling the preemptor can be achieved without preempting any victims
+   with priority higher than N (see the sketch after this section). This allows us to reduce the
+   potential cascading preemptions later.
+
+1. Eliminate all victims from the potential victims list that have priority higher than N.
+
+1. Schedule and assume the preemptor (assuming that all remaining potential victims are removed).
+
+1. Iterate over the list of potential victims (in the order achieved with the sorting above), checking
+   if they can be placed where they are currently running. If so, assume them back and remove them from
+   the potential victims list.
+
+```
+<<[UNRESOLVED minimizing preemptions]>>
+The above algorithm is definitely not optimal, but it is (a) compatible with the current pod-based
+algorithm, (b) computationally feasible, and (c) simple to reason about.
+As a result, I suggest that we proceed with it at least as a starting point.
+
+As a bonus, we may consider a few potential placements of the preemptor and choose the one that
+somehow optimizes the number of victims. However, that will become more critical once we
+get to Topology-Aware-Scheduling and I would leave that improvement until then.
+<<[/UNRESOLVED]>>
+```
+
+1. We score the scheduling decisions for each of the domains and choose the best one. The exact criteria
+   for that will be figured out during the implementation phase.
+
+It's worth noting that, as structured, this algorithm addresses all four cases mentioned above that
+we want to support and is compatible with the current pod-based preemption algorithm. This means
+we will be able to achieve an in-place replacement with relatively localized changes.
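
To make the sorting and binary-search steps above more concrete, here is a rough sketch. The `victimUnit` type and both helpers are illustrative; `fits` stands in for a full scheduling attempt of the preemptor in the domain with the given victims removed, and it is assumed to be monotonic in the priority cutoff.

```golang
package preemption

import "sort"

// victimUnit is a simplified preemption unit: a whole workload or an individual pod.
type victimUnit struct {
	Name       string
	Priority   int32 // preemption priority for workloads, pod priority otherwise
	IsWorkload bool
}

// sortByImportance orders potential victims from most to least important: higher priority
// first, and workloads before individual pods within the same priority.
func sortByImportance(units []victimUnit) {
	sort.SliceStable(units, func(i, j int) bool {
		if units[i].Priority != units[j].Priority {
			return units[i].Priority > units[j].Priority
		}
		return units[i].IsWorkload && !units[j].IsWorkload
	})
}

// minimalCutoff binary-searches the smallest priority N such that the preemptor still fits
// when only victims with priority <= N are removed, reducing later cascading preemptions.
// fits(n) must report whether the preemptor is schedulable in the domain after removing
// every potential victim with priority <= n.
func minimalCutoff(priorities []int32, fits func(n int32) bool) (int32, bool) {
	sort.Slice(priorities, func(i, j int) bool { return priorities[i] < priorities[j] })
	idx := sort.Search(len(priorities), func(i int) bool { return fits(priorities[i]) })
	if idx == len(priorities) {
		return 0, false // not schedulable even if all potential victims are removed
	}
	return priorities[idx], true
}
```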
 
 ### Delayed preemption
 
@@ -445,15 +540,21 @@ Should we leave it as part of this KEP or should this be moved to the Gang-Sched
 ```
 
 As part of the minimizing preemptions goal, arguably the most important thing to do is to avoid unnecessary
-preemptions. However, this is not true for the current gang scheduling implementation.
-In the current implementation, preemption is triggered in the `PostFiler`. However, it's entirely
-possible that a given pod may actually not even proceed to binding, because we can't schedule the
-whole gang. In such case, the preemption ended up being a completely unnecessary disruption.
+preemptions. However, the current model of preemption, where preemption is triggered immediately
+after the victims are decided (in `PostFilter`), doesn't achieve this goal. The reason for that is
+that the proposed placement (nomination) can actually turn out to be invalid and not be proceeded with.
+In such case, we will not even proceed to binding and the preemption will be a completely unnecessary
+disruption.
+Note that this problem already exists in the current gang scheduling implementation. A given gang may
+not proceed with binding if the `minCount` pods from it can't be scheduled. But the preemptions are
+currently triggered immediately after choosing a place for individual pods. So similarly as above,
+we may end up with completely unnecessary disruptions.
 
 We will address it with what we call the `delayed preemption` mechanism, as follows:
 
 1. We will modify the `DefaultPreemption` plugin to just compute preemptions, without actuating those.
    At this point these should only be stored in kube-scheduler's memory.
+   We advise maintainers of custom PostFilter implementations to do the same.
 
 ```
 <<[UNRESOLVED storing victims]>>