@@ -143,6 +143,10 @@ and many others) and bring the true value for every Kubernetes user.
143143 (e.g. caused by hardware failures)
144144- Design rescheduling for workloads that will be preempted (rescheduling will
145145 be addressed in a separate dedicated KEP)
146+ - Change the principle that preemption is avoided if a workload/pod can be scheduled without it.
147+ If we decide to change that, it will be addressed in a dedicated KEP.
148+ - Propose any tradeoff between preemption and cluster scale-up.
149+ - Design workload-level preemption triggered by external schedulers
146150
147151## Proposal
148152
@@ -248,6 +252,10 @@ unit being an arbitrary group of pods, in majority of real world usecases this i
248252with the scheduling unit. In other words, the group of pods that should be preempted together matches
249253a group that was initially scheduled together as a gang.
250254
255+ Trying to formalize it, we define `WorkloadPortion` as one of {all pods in a PodGroup replica or
256+ a single pod}. With that definition, both scheduling units and preemption units can only be
257+ `WorkloadPortions`.
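To make the union nature of this definition concrete, here is a minimal sketch; `WorkloadPortion` is a descriptive concept rather than a proposed API type, and all Go type and field names below are purely illustrative:

```go
package workloadpreemption

// WorkloadPortion is a sketch of the concept above: a unit is either all pods
// of a single PodGroup replica or a single standalone pod, never an arbitrary
// subset of a gang. Exactly one of the fields is set.
type WorkloadPortion struct {
	PodGroupReplica *PodGroupReplicaRef // all pods of one PodGroup replica
	Pod             *PodRef             // a single pod scheduled on its own
}

// Placeholder reference types, for illustration only.
type PodGroupReplicaRef struct{ Namespace, Name string }
type PodRef struct{ Namespace, Name string }
```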
258+
251259In the future, we may want to support usecases when a single scheduling unit consists of multiple
252260preemption groups, but we leave that usecase as a future extension (it can be addressed when we
253261decide to extend Workload API with PodSubGroup concept - for more details see
@@ -317,30 +325,78 @@ object (Workload, PodGroup, PodSubGroup, ...) corresponding to this pod.
317325
318326
319327There is one direct implication of the above - the ` pod.Spec.PriorityClassName ` and ` pod.Spec.Priority `
320- may no longer reflect the actual pod priority. This can be misleading to users.
328+ may no longer reflect the actual pod priority, which could be misleading to users.
321329
322330```
323331<<[UNRESOLVED priority divergence]>>
324332There are several options we can approach it (from least to most invasive):
325- - Explain via documentation
326- - Validating that if a pod is referencing a workload, `pod.Spec.PriorityClassName` equals
327- `workload.Spec.PriorityClassName`. However, `Workload` object potentially may not exist
328- yet on pod creation.
329- - Making `pod.Spec.PriorityClassName` and `pod.Spec.Priority` mutable fields and having a
330- controller responsible for reconciling these. However, that doesn't fully address the
331- problems as divergence between the pod and PodTemplate in true workload object could also
332- be misleading.
333-
334- The validation option seems like the best option, if we can address the problem of not-yet
335- existing `Workload` object (reversed validation?).
333+ - Describe the possible divergence via documentation
334+ - Expose the information about divergence in the API.
335+ This would require introducing a new `Conditions` field in `workload.Status` and introducing
336+ a dedicated condition like `PodsNotMatchingPriority` that will be set by either kube-scheduler
337+ or a new workload-controller whenever it observes pods referencing a given `Workload` object
338+ whose priority doesn't match the priority of the workload object.
339+ - Introducing admission validation that if a pod is referencing a workload object, its
340+ `pod.Spec.PriorityClassName` equals `workload.Spec.PriorityClassName`. However, we allow creating
341+ pods before the workload object, and there doesn't seem to be an easy way to avoid races.
342+ - Making `pod.Spec.PriorityClassName` and `pod.Spec.Priority` mutable fields and having a new
343+ workload controller responsible for reconciling these. However, that could introduce another
344+ divergence between the priority of pods and the priority defined in the PodTemplate in true
345+ workload objects, which would introduce a similar level of confusion for users.
346+
347+ If we could address the race in validation, that would be the preferred option. However,
348+ I don't see an easy way to do it.
349+ Given that, we suggest proceeding with just exposing the information about divergence in the
350+ Workload status (second option) and potentially improving it later.
336351<<[/UNRESOLVED]>>
337352```
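Assuming we proceed with the second option, the sketch below shows how the observer (kube-scheduler or a new workload controller) could record the divergence; the condition type, reason, and helper name are illustrative only, not part of the proposed API:

```go
package workloadstatus

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markPriorityDivergence sets a hypothetical `PodsNotMatchingPriority` condition
// on the Workload's status conditions whenever divergent pods are observed.
func markPriorityDivergence(conditions *[]metav1.Condition, divergentPods int) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    "PodsNotMatchingPriority",
		Status:  metav1.ConditionTrue,
		Reason:  "PodPriorityDivergence",
		Message: fmt.Sprintf("%d pods reference this Workload but specify a different priority", divergentPods),
	})
}
```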
338353
339- The similar argument holds for preemption priority, but we argue that its mutable nature
340- makes it infeasible for reconciling this information back to pod for scalability reasons
341- (we can absolutely handle frequent updates to ` Workload.Spec.PreemptionPriorityClassName `
342- but we can't handle updating potentially hundreds of thousands of pods within that workload
343- that frequently). In this case, we limit ourselves to documentation.
354+ It's worth mentioning here that we want to introduce the same defaulting rules for
355+ `workload.Spec.PriorityClassName` that we have for pods. Namely, if `PriorityClassName` is unset
356+ and there exists a PriorityClass marked as `globalDefault`, we default it to that value.
357+ This consistency will allow us to properly handle the case when users set neither pod
358+ nor workload priorities.
359+ Similarly, we will ensure that `PriorityClass.preemptionPolicy` works exactly the same way for
360+ workloads as for pods. Such a level of consistency would make adoption of the Workload API much easier.
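As a rough sketch of this defaulting rule (mirroring the existing pod behavior; the helper and the way the PriorityClass list is obtained are illustrative):

```go
package workloaddefaulting

import schedulingv1 "k8s.io/api/scheduling/v1"

// defaultPriorityClassName mirrors the pod-level defaulting rule for workloads:
// if no PriorityClassName is set and some PriorityClass is marked as the global
// default, that class is used; otherwise the name stays empty.
func defaultPriorityClassName(current string, classes []schedulingv1.PriorityClass) string {
	if current != "" {
		return current // explicitly set by the user, keep as-is
	}
	for _, pc := range classes {
		if pc.GlobalDefault {
			return pc.Name
		}
	}
	return ""
}
```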
361+
362+ Moving to `PreemptionPriorityClassName`, the same issue of confusion holds (the actual priority
363+ set at the pod level may not reflect the priority used for preemption). We argue that its mutable
364+ nature makes it infeasible to reconcile this information back to pods for scalability reasons
365+ (we can absolutely handle frequent updates to `Workload.Spec.PreemptionPriorityClassName`,
366+ but we can't handle updating potentially hundreds or thousands of pods within that workload
367+ that frequently). So in this case, we limit ourselves to documentation.
368+
369+ ```
370+ <<[UNRESOLVED preemption cycles]>>
371+ If we allowed for an arbitrary relation between scheduling priority and preemption priority,
372+ we could hit an infinite cycle of preemption. Consider an example where:
373+ - workload A has scheduling priority `high` and preemption priority `low`
374+ - workload B has scheduling priority `high` and preemption priority `low`
375+ In such a case, workload A can preempt workload B (`high` > `low`), but then workload B can
376+ also preempt workload A. This is definitely not desired.
377+ We can avoid the infinite cycle by ensuring that `scheduling priority <= preemption priority`.
378+
379+ However, this also opens the question of whether we should allow setting an arbitrarily high
380+ preemption priority for low scheduling priority workloads. Arguably, scheduling priority
381+ should be the ultimate truth, and if there is a workload with higher priority it should be
382+ able to preempt it.
383+ So an alternative model that we can consider is, instead of adding the concept of preemption
384+ priority, to introduce a concept of "preemption cost". In such a model, a workload with
385+ higher priority can always preempt lower priority ones, but if we need to choose between
386+ two workloads to preempt, the preemption cost may result in choosing the one with higher
387+ priority amongst these two. Consider the following example:
388+ - we want to schedule workload A with scheduling priority `high`
389+ - it needs to preempt one of the already running workloads
390+ - workload B has scheduling priority `med` but preemption cost `low`
391+ - workload C has scheduling priority `low` but preemption cost `high`
392+ In such a case, the preemption cost would result in choosing workload B for preemption. But
393+ if it gets recreated, it will preempt workload C, causing unnecessary cascading preemption.
394+ This is the reason why a cost-based model was discarded.
395+
396+ So for now, we suggest introducing only additional validation that scheduling priority is
397+ not higher than preemption priority.
398+ <<[/UNRESOLVED]>>
399+ ```
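A minimal sketch of the suggested validation, operating on already-resolved priority values (how the values are looked up from the two PriorityClass names is elided here; the function name is illustrative):

```go
package workloadvalidation

import "fmt"

// validatePreemptionPriority enforces the invariant discussed above:
// a workload's scheduling priority must not exceed its preemption priority,
// which rules out two workloads being able to preempt each other in a cycle.
func validatePreemptionPriority(schedulingPriority, preemptionPriority int32) error {
	if schedulingPriority > preemptionPriority {
		return fmt.Errorf("scheduling priority (%d) must not be higher than preemption priority (%d)",
			schedulingPriority, preemptionPriority)
	}
	return nil
}
```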
344400
345401```
346402<<[UNRESOLVED priority status]>>
@@ -356,6 +412,7 @@ We should introduce/describe `workload.status` to reflect:
356412We start with describing at the high-level how existing pod-level preemption algorithm works.
357413Below, we will show how to generalize it to workloads.
358414
415+ If a pod P can be scheduled without triggering preemption, we don't consider preemption at all.
359416To check if a pod P can be scheduled on a given node with preemption we:
360417
3614181 . Identify the list of potential victims - all running pods with priority lower than the new pod P.
@@ -368,8 +425,8 @@ To check if a pod P can be scheduled on a given node with preemption we:
3684251 . From remaining potential victims, we start to reprieve pods starting from the highest priority
369426 and working down until the set of remaining victims still keeps the node feasible.
370427
371- Once we compute the feasibility and list of victims for all nodes , we score that and choose the
372- best options.
428+ Once we find enough nodes feasible for preemption, together with the list of victims for each of them,
429+ we score them and choose the best option.
373430
374431The above algorithm achieves our principles, as by eliminating highest priority pods first, it
375432effectively tries to minimize the cascading preemptions later.
@@ -380,61 +437,86 @@ moving to the level of `Workload`, but also no longer operating at the level of
380437We need to look at the cluster as a whole. With that in mind, keeping the algorithm efficient
381438becomes a challenge, thus we modify to the approach below.
382439
383- To check if a workload W can be scheduled on a given cluster with preemption we:
384-
385- 1 . Identify the list of potential victims:
386- - all running workloads with (preemption) priority lower than the new workload W
387- - all individual pods (not being part of workloads) with priority lower than the new workload W
388-
389- 1 . If removing all the potential victims would not make the new workload W schedulable,
390- the workload is unschedulable even with preemption.
391-
392- ```
393- <<[UNRESOLVED PodDisruptionBudget violations]>>
394- How critical is reprieving workloads and pods violating PodDisruptionBudgets? We no longer can
395- afford full workload scheduling trying to reprieve every individual pod and workload.
396-
397- We could consider finding the first one that can't be reprieved using binary search, but if we
398- can't reprieve any of those, learning about that would require O(N) full workload schedulings
399- with N being number of workload/pods violating PDB.
400- <<[/UNRESOLVED]>>
401- ```
402-
403- 1 . For remaining potential victims, using binary search across priorities find the minimal priority P
404- for which scheduling the new workload W doesn't require preempting any workloads and/or pods with
405- priority higher than P. This allows to reduce the potential cascading preemptions later.
406-
407- ```
408- <<[UNRESOLVED minimizing preemptions]>>
409- The following algorithm is by far no optimal, but is simple to reason about and I would suggest it as
410- a starting point:
411- - assume that all potential victims on the list are removed and schedule the new workload W
412- - go over the remaining potential victims starting from the highest priority and check if these can
413- be placed in the place they are currently running; if so remove from the potential victims
414-
415- As a bonus we may consider few potential placements of the new workload W here and choose the one that
416- somehow optimizes the number of victims. But that will become more critical once we get to
417- Topology-Aware-Scheduling and I would leave that optimization until then.
418- <<[/UNRESOLVED]>>
419- ```
420-
421- ```
422- <<[UNRESOLVED sharing algorithms]>>
423- The remaining question is to what extent we want to unify the preemption mechanism across
424- pod-triggerred (existing algorithm) and workload-triggerred preemption.
425-
426- It might be tempting to start with a dedicated new implementation to reduce the risk. But the above
427- proposal was structured such way to facilitate sharing:
428- - once the new workload W is placed, going over the remaining potential victims and trying to
429- place them where they are currently running, is pretty much exactly what the current algorithm is
430- doing
431- - considering "few potential placements" in the pod-triggerred case can be used as "try every node"
432- so effectively it's also the existing algorithm (just viewed from a slightly different angle
433-
434- So I would actually argue to we should refactor the existing preemption code and use that in both
435- cases.
436- <<[/UNRESOLVED]
437- ```
440+ At the same time, we need to support four cases:
441+ - individual pod as preemptor, individual pod(s) as victim(s)
442+ - individual pod as preemptor, pod group(s) (and individual pod(s)) as victim(s)
443+ - pod group as preemptor, individual pod(s) as victim(s)
444+ - pod group as preemptor, pod group(s) (and individual pod(s)) as victim(s)
445+
446+ To achieve that, we don't want to multiply preemption algorithms, but rather want to have a
447+ unified high-level approach (with potential minor tweaks per case).
448+
449+ To check if a given preemptor (either (gang) PodGroup G or an individual pod P) can be scheduled
450+ with preemption:
451+
452+ 1. Split the cluster into mutually-exclusive domains where the preemptor may be placed:
453+ - for pod P, these will always be individual nodes
454+ - for pod group G, we will start with just one domain being the whole cluster; eventually, once we
455+ have topology-aware scheduling, we will most probably inject some domain-based split here
456+
457+ 1. For every domain computed above, run the following steps:
458+
459+ 1. Identify the list of all potential victims in that domain:
460+ - all running workloads with (preemption) priority lower than the preemptor priority; note that
461+ some pods from such a workload may be running outside of the currently considered domain D - they
462+ need to contribute to scoring, but they won't contribute to feasibility of domain D.
463+ - all individual pods with priority lower than the preemptor priority
464+
465+ 1. If removing all potential victims would not make the preemptor schedulable, the preemptor
466+ is unschedulable with preemption in the currently considered domain D.
467+
468+ 1. Sort all the potential victims to reflect their "importance" (from the most to the least
469+ important). Tentatively, the sort will order victims first by their priority, and within a single
470+ priority, prioritize workloads over individual pods.
471+
472+ 1. Perform best-effort reprieval of workloads and pods violating PodDisruptionBudgets. We achieve
473+ it by scheduling and assuming the preemptor (assuming that all potential victims are removed),
474+ and then iterating over potential victims whose preemption would violate a PodDisruptionBudget
475+ to check if these can be placed in the exact same place they are running now. If they can, we
476+ simply leave them where they are running and remove them from the potential victims list.
477+
478+ ```
479+ <<[UNRESOLVED PodDisruptionBudget violations]>>
480+ The above reprieval works identically to the current algorithm if the domain D is a single node.
481+ For larger domains, different placements of the preemptor are possible and may allow different
482+ sets of victims violating PodDisruptionBudgets to be reprieved.
483+ This means that the above algorithm is not optimizing for minimizing the number of victims that
484+ would violate their PodDisruptionBudgets.
485+ However, we claim that an algorithm optimizing for it would be extremely expensive computationally
486+ and propose to stick with this simple version at least for the foreseeable future.
487+ <<[/UNRESOLVED]>>
488+ ```
489+
490+ 1. For the remaining potential victims, using binary search across priorities, find the minimal
491+ priority N such that the preemptor can be scheduled without preempting any victims with priority
492+ higher than N (see the sketch after this algorithm). This reduces potential cascading preemptions later.
493+
494+ 1. Eliminate all victims from the potential victims list that have priority higher than N.
495+
496+ 1. Schedule and assume the preemptor (assuming that all remaining potential victims are removed).
497+
498+ 1. Iterate over the list of potential victims (in the order achieved with the sorting above), checking
499+ if they can be placed where they are currently running. If so, assume them back and remove them
500+ from the potential victims list.
501+
502+ ```
503+ <<[UNRESOLVED minimizing preemptions]>>
504+ The above algorithm is definitely not optimal, but it is (a) compatible with the current pod-based
505+ algorithm, (b) computationally feasible, and (c) simple to reason about.
506+ As a result, I suggest that we proceed with it at least as a starting point.
507+
508+ As a bonus, we may consider a few potential placements of the preemptor and choose the one that
509+ somehow optimizes the number of victims. However, that will become more critical once we
510+ get to Topology-Aware-Scheduling and I would leave that improvement until then.
511+ <<[/UNRESOLVED]>>
512+ ```
513+
514+ 1. We score scheduling decisions for each of the domains and choose the best one. The exact criteria
515+ for that will be figured out during the implementation phase.
516+
517+ It's worth noting that, as structured, this algorithm addresses all four cases mentioned above that
518+ we want to support and is compatible with the current pod-based preemption algorithm. This means
519+ we will be able to achieve in-place replacement with relatively localized changes.
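Below is a minimal sketch of the binary-search step marked above. It assumes the distinct victim priorities are available in ascending order and that `feasibleUpTo` stands in for the (expensive) check of whether the preemptor can be scheduled when only victims with priority up to the given value may be preempted; all names are illustrative:

```go
package workloadpreemption

// minimalVictimPriority finds the lowest priority N (from the ascending list of
// distinct victim priorities) such that the preemptor becomes schedulable while
// preempting only victims with priority <= N. It assumes feasibility is
// monotonic in N: if preempting up to priority N is enough, so is any higher N.
func minimalVictimPriority(priorities []int32, feasibleUpTo func(maxVictimPriority int32) bool) (int32, bool) {
	lo, hi := 0, len(priorities)-1
	found := -1
	for lo <= hi {
		mid := (lo + hi) / 2
		if feasibleUpTo(priorities[mid]) {
			found = mid // feasible; try to lower the bound further
			hi = mid - 1
		} else {
			lo = mid + 1
		}
	}
	if found == -1 {
		return 0, false // not schedulable even when preempting all potential victims
	}
	return priorities[found], true
}
```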
438520
439521### Delayed preemption
440522
@@ -445,15 +527,21 @@ Should we leave it as part of this KEP or should this be moved to the Gang-Sched
445527```
446528
447529As part of minimizing preemptions goal, arguably the most important thing to do is to avoid unnecessary
448- preemptions. However, this is not true for the current gang scheduling implementation.
449- In the current implementation, preemption is triggered in the ` PostFiler ` . However, it's entirely
450- possible that a given pod may actually not even proceed to binding, because we can't schedule the
451- whole gang. In such case, the preemption ended up being a completely unnecessary disruption.
530+ preemptions. However, the current model of preemption, where preemption is triggered immediately
531+ after the victims are decided (in `PostFilter`), doesn't achieve this goal. The reason for that is
532+ that the proposed placement (nomination) can actually turn out to be invalid and not be proceeded with.
533+ In such a case we will not even proceed to binding and the preemption will end up being a completely
534+ unnecessary disruption.
535+ Note that this problem already exists in the current gang scheduling implementation. A given gang may
536+ not proceed with binding if the `minCount` pods from it can't be scheduled. But the preemptions are
537+ currently triggered immediately after choosing a place for individual pods. So, similarly to the above,
538+ we may end up with completely unnecessary disruptions.
452539
453540We will address it with what we call `delayed preemption` mechanism as following:
454541
4555421. We will modify the `DefaultPreemption` plugin to just compute preemptions, without actuating those.
456543 At this point these should only be stored in kube-scheduler's memory.
544+ We advise maintainers of custom PostFilter implementations to do the same.
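As a rough illustration (purely hypothetical names, and without prejudging the unresolved question below of how victims are stored), the plugin could record computed victims in scheduler memory along these lines instead of deleting them in `PostFilter`:

```go
package delayedpreemption

import "k8s.io/apimachinery/pkg/types"

// pendingPreemption captures a preemption decision that has been computed but
// not yet actuated; actuation happens only once the whole gang has a valid
// placement and proceeds towards binding.
type pendingPreemption struct {
	Preemptor types.NamespacedName   // the pod on whose behalf preemption was computed
	Victims   []types.NamespacedName // pods that would be deleted if the placement holds
}

// pendingPreemptions is kept purely in kube-scheduler's memory, keyed by the
// UID of the workload/gang the preemptor belongs to.
var pendingPreemptions = map[types.UID][]pendingPreemption{}
```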
457545
458546```
459547<<[ UNRESOLVED storing victims] >>