
Conversation

@44past4 (Contributor) commented Dec 11, 2025

/sig scheduling

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Dec 11, 2025
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 11, 2025
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: 44past4
Once this PR has been reviewed and has the lgtm label, please assign dom4ha for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Dec 11, 2025
@44past4
Copy link
Contributor Author

44past4 commented Dec 11, 2025

/cc @wojtek-t @erictune @johnbelamaric

// Level specifies the key of the node label representing the topology domain.
// All pods within the PodGroup must be colocated within the same domain instance.
// Examples: "topology.kubernetes.io/rack"
Level string
Member:

Let's specify what happens if a PodGroup is replicated (there are multiple PodGroup instances/replicas).
I'm assuming that a domain at that level is chosen for each of those instances, but there is no coordination between them (i.e. different pod group instances may be scheduled in different domains, but they can also share one).
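
For concreteness, a hedged sketch of how the excerpt above could spell out the per-replica semantics discussed in this comment; this reflects the reviewer's assumed reading, not the KEP's final wording:

type TopologyConstraint struct {
  // Level specifies the key of the node label representing the topology domain,
  // e.g. "topology.kubernetes.io/rack".
  // Assumed semantics for replicated PodGroups: all pods within a single PodGroup
  // replica must be colocated in one domain instance at this level, while each
  // replica picks its domain independently and may or may not share it with others.
  Level string
}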

type DRAConstraint struct {
// ResourceClaimName specifies the name of a specific ResourceClaim
// within the PodGroup's pods that this constraint applies to.
ResourceClaimName *string
Member:

What does ResourceClaimName mean if a given PodGroup is replicated (there are multiple podgroup instances/replicas)?

This would effectively mean sharing the same RC across multiple instances, which in many cases would be highly misleading.
However, arguably there can be use cases for it too, but then the algorithm would effectively have to consider all podgroup instances in a single round, and for that we don't even know how many groups we have.
@macsko - FYI (as this is slightly colliding with the kep-4671 update)

So thinking about that more, I'm wondering if we can introduce that without further enhancing the API now (i.e. adding the replicas field to PodGroup).

Another alternative would be to very explicitly split the pod-group-replica constraints from the constraints across all pod-group-replicas and (at least for Alpha) focus only on the former.
So something more like (exact names and structures to be refined):

type PodGroupAllReplicasSchedulingConstraints {
  ResourceClaimName *string  // This one is supported only if Replicas=1
}

type PodGroupReplicaSchedulingConstraints {
  ResourceClaimTemplateName *string // Separate RC is created from this template for every replica.
}

Contributor Author:

If the PodGroup is replicated, the meaning of ResourceClaimName will depend on whether we schedule those replicas together or not. If they are scheduled separately, scheduling the first replica will lock the referenced ResourceClaim, and the subsequent replicas will have no freedom in its allocation: there will be only one possible placement for them. When scheduling multiple replicas at once, we can try to choose a DRA allocation which allows us to schedule the highest number of replicas (assuming that we do not provide all-or-nothing semantics for multiple replicas).

Member:

I wasn't asking about the implementation aspect.
I wanted to take a step back, understand what the actual use case is that we're trying to address, and figure out if/how we should represent it so that it is intuitive to users when they have a replicated PodGroup. I feel that the API as currently described can be pretty confusing in this case.


// ResourceClaimTemplateName specifies the name of a ResourceClaimTemplate.
// This applies to all ResourceClaim instances generated from this template.
ResourceClaimTemplateName *string
Member:

Who creates and manages the lifecycle of the RCs created from that template?

Contributor Author:

Here we are assuming that the lifecycle of the RC is managed outside of kube-scheduler. One option is to have it managed by a specific workload controller, for instance LeaderWorkerSet, which could create an RC when creating a new replica. That would be quite inconvenient, so we should probably have a single controller which could do this just by watching Workload objects. We had a discussion with @johnbelamaric about this. Either way, this should be outside of the scope of this KEP.
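
As an illustration of that external lifecycle, such a controller could stamp one ResourceClaim per PodGroup replica from the template. A minimal sketch, assuming resourceapi refers to the resource.k8s.io API types and metav1 to k8s.io/apimachinery/pkg/apis/meta/v1 (the helper and naming scheme are hypothetical):

// stampClaimForReplica builds one ResourceClaim per PodGroup replica from a
// ResourceClaimTemplate, so each replica's claim can be allocated independently.
func stampClaimForReplica(tmpl *resourceapi.ResourceClaimTemplate, workloadName string, replicaIndex int) *resourceapi.ResourceClaim {
  return &resourceapi.ResourceClaim{
    ObjectMeta: metav1.ObjectMeta{
      // Hypothetical naming scheme; a real controller would also set owner references.
      Name:      fmt.Sprintf("%s-%s-%d", workloadName, tmpl.Name, replicaIndex),
      Namespace: tmpl.Namespace,
    },
    // Copy the claim spec verbatim from the template.
    Spec: *tmpl.Spec.Spec.DeepCopy(),
  }
}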

Comment:

The lifecycle is what I plan to address in #5729.

Member:

OK - so this matches my thinking.

But the primary question now is - why do we need it then?
If we have some external entity (whether it's a dedicated controller or e.g. the LWS controller) that will create the RC whenever it is needed (it should create it before we actually do the scheduling), then what the scheduler really needs to be aware of, and what is an input for it, is that RC (for which it will be finding the best allocation), not the template itself. It doesn't care about the template.

So I think we're aligned on the intention, but I don't really understand how that will be used.

// PodGroupInfo holds information about a specific PodGroup within a Workload,
// including a reference to the Workload, the PodGroup's name, and its replica index.
// This struct is designed to be extensible with more fields in the future.
type PodGroupInfo struct {
Member:

PodGroupInfo was already introduced in scheduler as part of initial gang-scheduling implementation. However, this is now focused on the pods and their state:
https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/backend/workloadmanager/podgroupinfo.go#L52

Do you suggest reusing that structure or creating a second one for this?
@macsko

Contributor Author:

The existing PodGroupInfo is much closer to the PodSetInfo proposed below, so if we consider using it we should probably rename the proposed PodGroupInfo to something like PodGroupMetadata or WorkloadPodGroupReference.

Member:

It's closer regarding what kind of information it keeps, but the granularity is different (it may contain pods of different signatures).

So my point is - we need to align that. Having two different things with the same name would be pretty misleading.

// PodSetInfo holds information about a specific PodSet within a PodGroup,
// primarily the list of Pods.
// This struct is designed to be extensible with more fields in the future.
type PodSetInfo struct {
Member:

Should PodGroupInfo keep a list of its PodSetInfos?

Contributor Author:

In my mind, PodGroupInfo should only provide basic information about the PodGroup configuration and not the actual state of the PodGroup or the pods which are part of it, so I do not see the need for a reference from PodGroupInfo to PodSetInfos.

Member:

Should PodSetInfo then have a link to PodGroupInfo?

I think we will need a way to iterate over podsets within a podgroup, and this seems like a good place to allow for it.
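
A hedged sketch of that back-link (names and shapes are illustrative, not the KEP's final structs; v1 refers to k8s.io/api/core/v1):

type PodSetInfo struct {
  // PodGroup points back to the owning PodGroupInfo, so code that starts from a
  // PodSet can reach the group-level configuration.
  PodGroup *PodGroupInfo

  // Pods holds the pods that make up this PodSet.
  Pods []*v1.Pod
}

// podSetsOf is a hypothetical helper showing how the podsets of a podgroup could
// be iterated without PodGroupInfo itself holding the list.
func podSetsOf(group *PodGroupInfo, all []*PodSetInfo) []*PodSetInfo {
  var out []*PodSetInfo
  for _, ps := range all {
    if ps.PodGroup == group {
      out = append(out, ps)
    }
  }
  return out
}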

// PodSetAssignment represents the assignment of pods to nodes within a PodSet for a specific Placement.
type PodSetAssignment struct {
// PodToNodeMap maps a Pod name (string) to a Node name (string).
PodToNodeMap map[string]string
Member:

Do we need dra assignments too?

Contributor Author:

This is a good question. We might need them for the PodGroup pods binding phase, which comes after the placement for a PodGroup has been selected. So provided that we can capture those when checking placement feasibility, then yes, we should have DRA assignments here as well.
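
If we go that route, the struct could carry the captured results next to the node map. A hedged sketch (the DRAAllocations field is an assumption; resourceapi refers to the resource.k8s.io API types):

type PodSetAssignment struct {
  // PodToNodeMap maps a Pod name (string) to a Node name (string).
  PodToNodeMap map[string]string

  // DRAAllocations maps a ResourceClaim name to the AllocationResult captured
  // while the placement's feasibility was checked, so the binding phase can
  // reuse it instead of recomputing the allocation.
  DRAAllocations map[string]*resourceapi.AllocationResult
}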

// DRA's AllocationResult from DRAAllocations.
// All pods within the PodSet, when being evaluated against this Placement,
// are restricted to the nodes matching this NodeAffinity.
NodeAffinity *corev1.NodeAffinity
Member:

Implementation detail: given a NodeAffinity, finding the nodes that match it is an O(N) operation, with N being the number of nodes in the cluster. So together with the NodeAffinity here, we should probably also store the exact list of nodes to avoid recomputing it over and over again.
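
A hedged sketch of that caching, with a hypothetical field name (sets refers to k8s.io/apimachinery/pkg/util/sets):

type Placement struct {
  // NodeAffinity restricts the PodSet's pods to the nodes of this placement.
  NodeAffinity *corev1.NodeAffinity

  // NodeNames caches the names of the nodes that matched NodeAffinity when the
  // placement was generated, so feasibility checks and scoring do not repeat the
  // O(N) match over all cluster nodes.
  NodeNames sets.Set[string]
}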

with fallbacks (e.g., prefer Rack, fallback to Block). This would introduce
a Rank field to the Placement struct.

2. **Optional/Preferred Scheduling Constraints:** Constraints that serve purely
Member:

I would make it out-of-scope for this KEP even for further releases - we can do that in a follow-up KEP if needed.

2. **Optional/Preferred Scheduling Constraints:** Constraints that serve purely
as scoring mechanisms without hard requirements.

3. **Multi-level Scheduling Constraints:** Handling nested constraints (e.g.,
Member:

Same here - let's land TAS in a reasonably small scope first and iterate on that in follow-ups.

Member:

I guess rephrasing these two comments - I think the extensions generally make sense to me, but I would claim that we should make them explicitly non-goals for this KEP and mark them more as future follow-up extensions.

Contributor Author:

That makes sense. I will update the description to state that these are potential extensions of this feature which can be implemented after alpha, but will require separate KEPs.

Block -> Rack). This would involve iterative placement generation and a
Parent field in the Placement struct.

4. **Pod Group Replicas Support:** Optimizing scheduling for identical
Member:

The way it's described, it's not really pod group replicas support - we need to support pod group replicas from the very beginning.
What you're saying here is that we can optimize the scheduling latency for them, right?

Contributor Author:

Yes, this is correct. Replicas will be supported, but they will be scheduled one by one. With some changes to the algorithm we could try to optimize this process, but that is out of scope for this KEP. I will rephrase it to make this clear.


- **Selection:** Select the Placement with the highest score.

- **Binding:** Proceed to bind pods to the assigned nodes and resources.
Member:

We might need to move the pods once again through their pod-by-pod cycles. They have to go through Reserve, Permit, etc. to reach binding successfully. Will the expected placement be expressed using NominatedNodeNames or differently?
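
For illustration only, one way to express the placement is to pin each pod's candidate node via its nominated node name before it re-enters the pod-by-pod cycle. A toy sketch under that assumption (not the KEP's chosen mechanism; a real scheduler would go through the nominator or a status update rather than mutating the object in place):

func nominateFromAssignment(pods []*v1.Pod, assignment PodSetAssignment) {
  for _, pod := range pods {
    if node, ok := assignment.PodToNodeMap[pod.Name]; ok {
      // The pod still runs Filter/Reserve/Permit, but its candidate node is
      // pinned to the placement's choice.
      pod.Status.NominatedNodeName = node
    }
  }
}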

// -- Add other fields below for future extensions --
}

// PodSetInfo holds information about a specific PodSet within a PodGroup,
Member:

Can you define in the KEP what a PodSet is, and include any information about homogeneity?

Comment on lines +265 to +271
// ResourceClaimName specifies the name of a specific ResourceClaim
// within the PodGroup's pods that this constraint applies to.
ResourceClaimName *string

// ResourceClaimTemplateName specifies the name of a ResourceClaimTemplate.
// This applies to all ResourceClaim instances generated from this template.
ResourceClaimTemplateName *string
Comment:

How do these fields relate to the ResourceClaim references that Pods already have? What happens if the sets of claims referenced by a Workload and its Pods are different?

Member:

+1 to this question, it needs to be answered here

type PodGroupSchedulingConstraints struct {
// TopologyConstraints specifies desired topological placements for all pods
// within this PodGroup.
TopologyConstraints []TopologyConstraint
Member:

Do multiple topology constraints actually make sense here? What would be the use case for that?


// DRAConstraints specifies constraints on how Dynamic Resources are allocated
// across the PodGroup.
DRAConstraints []DRAConstraint
Member:

Continuing my thoughts from other comments here.

The primary goals that we wanted to ensure with this KEP are:

  • building the foundations for TAS and having the first version of the algorithm
  • proving that the algorithm is compatible with both DRA and topology-based requirements

I think this KEP is achieving it.

However, the more I think about it, the more concerns I have about this kind of API. Up until now I thought that we can actually decouple and postpone the discussion of lifecycle of pod-group-owned (or workload-owned) RCs to later, but some of my comments below already suggest it's not that clear and may influence the API.

So I actually started thinking if (for the sake of faster and incremental progress), we shouldn't slightly revise the scope and goals of this KEP, in particular:

  • remove the "DRAConstraints" from the scope (and couple it with the lifecycle of PodGroup/RC discussion we'll have in DRA: ResourceClaim Support for Workloads #5729 - @nojnhuh)
  • ensure that the proposal is compatible with DRA-based constraints at a lower level;
    namely, the scheduler should not really manage the lifecycle of RCs, and those RCs should just be an input to the scheduler (whether at the PodGroup level, the Workload level, or some to-be-introduced level).
    So what if instead we prove that it works by simply:
  1. ensuring that some internal interface in the scheduler (or maybe a scheduler-framework level one?) can actually accept RCs as an additional constraint to the WorkloadCycle
  2. adding a test at that level showing that scheduling works if we pass topology constraints as RCs

That would allow us to decouple the core of the changes in this KEP from all the discussions about how to represent it in the API, how it is coupled with lifecycle, etc., and hopefully unblock this KEP much faster while still proving the core of what we need.

@johnbelamaric @erictune @44past4 @dom4ha @sanposhiho @macsko - for your thoughts too
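
To illustrate point 1 above, the internal hook might look roughly like this; all names here are hypothetical and the shape is only a sketch:

// WorkloadCycleConstraints bundles ResourceClaims created externally (by an
// LWS-like or dedicated controller) that the chosen placement must be able to
// allocate; resourceapi refers to the resource.k8s.io API types.
type WorkloadCycleConstraints struct {
  Claims []*resourceapi.ResourceClaim
}

// WorkloadScheduler is a hypothetical internal (or scheduler-framework level)
// interface for the workload cycle that accepts those claims as an additional
// constraint, rather than managing their lifecycle.
type WorkloadScheduler interface {
  SchedulePodGroup(ctx context.Context, group *PodGroupInfo, constraints WorkloadCycleConstraints) ([]PodSetAssignment, error)
}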

@johnbelamaric (Member) commented Dec 15, 2025:

I think that makes sense. Decoupling can help execution. We would treat the lifecycle and allocation of RCs in #5729. Allocation implies the constraint. #5194 should also merge with #5729, I think. It was conceived prior to the existence of the Workload API and I think #5729 encompasses a more holistic set of functionality.

Contributor:

Agree with decoupling.

It is possible to implement #5729 without #5732.
Even if we only implement one of the two for 1.36, we still learn something.

- **Action:** Iterate over distinct values of the topology label (TAS) or
available ResourceSlices (DRA).

- **Output:** A list of Placement objects.
Member:

Will this phase generate all possible placements or only a subset of those? To reduce scheduling latency, in pod-by-pod scheduling we limit the number of feasible nodes to process (percentageOfNodesToScore setting). It may be important if the placements could overlap.

Contributor Author:

This phase is assumed to generate all possible placements. However, what we could do is stop checking them in phase 2 once enough feasible placements have been found. The minimum number of feasible placements could be configurable, which would limit the scheduling latency in the case of a large number of placements. However, I would treat this as a future optimization which could be done after some scalability testing.
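
A hedged sketch of that future optimization (the threshold and helper names are hypothetical):

// selectFeasible stops feasibility checking once minFeasiblePlacements candidates
// have been found, bounding phase-2 latency when the cluster exposes a very large
// number of generated placements.
func selectFeasible(placements []Placement, isFeasible func(Placement) bool, minFeasiblePlacements int) []Placement {
  feasible := make([]Placement, 0, minFeasiblePlacements)
  for _, p := range placements {
    if !isFeasible(p) {
      continue
    }
    feasible = append(feasible, p)
    if len(feasible) >= minFeasiblePlacements {
      break
    }
  }
  return feasible
}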

@sanposhiho (Member):

/assign

I have limited bandwidth these days, but will take a look at this one for sure.

@erictune (Contributor) left a comment:

Looks great overall!

- **State:** Temporarily assigns AllocationResults to ResourceClaims during
the Assume phase.

**PlacementBinPackingPlugin (New)** Implements `PlacementScorer`. Scores
Contributor:

I think this plugin can prevent the current PodGroup from fragmenting larger Levels, but it cannot prevent the current PodGroup from fragmenting smaller levels. If the current podgroup uses fewer than all the nodes in this Placement, then there could be multiple podsAssignment options, and different options may have different fragmentation effects. Since pod-at-a-time scheduling within the Placement is greedy, we won't consider multiple podsAssignment options.

It's not clear to me that you can influence this enough using the per-pod Score plugins.

Name() string

// GeneratePlacements generates a list of potential Placements for the given PodGroup and PodSet.
// Each Placement represents a candidate set of resources (e.g., nodes matching a selector)
Contributor:

Consider saying that the GeneratePlacements interface does not have any compatibility guarantees across versions. If/when we later add Prioritized Placement Scheduling, or Multi-level Scheduling Constraints, we will want to change GeneratePlacements.
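
For example, the caveat could be recorded on the extension point itself. A hedged sketch, with the interface name and signature abridged from the excerpt above (not authoritative):

type PlacementGenerator interface {
  Name() string

  // GeneratePlacements generates a list of potential Placements for the given
  // PodGroup and PodSet.
  //
  // NOTE: this interface carries no compatibility guarantees across versions;
  // future extensions such as Prioritized Placement Scheduling or Multi-level
  // Scheduling Constraints are expected to change it.
  GeneratePlacements(ctx context.Context, podGroup *PodGroupInfo, podSet *PodSetInfo) ([]Placement, error)
}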

// Placement represents a candidate domain for scheduling a PodSet.
// It defines a set of nodes and/or proposed Dynamic Resource Allocation (DRA)
// resource bindings necessary to satisfy the PodSet's requirements within that domain.
type Placement struct {
Contributor:

What is the valid lifetime of a Placement object? Is it only a single Workload Cycle?
If so, then we don't need to worry about updating Placements when we update our view of the cluster.
Maybe state this.


5. **Explicit Topology Definition:** Using a Custom Resource (NodeTopology) to
define and alias topology levels, removing the need for users to know exact
node label keys.
@erictune (Contributor) commented Dec 15, 2025:

Explicit Topology Information also provided these things:

  • An explicit total order on levels within one Topology object (needed for Multi-level and Prioritized Placement Scheduling)
  • An implicit label hierarchy requirement
    • A level n label's nodes must be a subset of only one level n+1 label.
    • Useful for Multi-level placement, and for the hierarchical aggregated capacity optimization.
  • A way to limit the number of levels
    • Limit by validating the list length in a Topology object.
    • Limiting levels limits one term of algorithm complexity.
  • A way to discourage creation of too many Topology objects
    • Only admins or cloud providers should create these usually.

Taken together, these properties make it easier to avoid the case where there are many more TAS-relevant labels (key/value pairs) than there are nodes.

Also, while the initial algorithm is going to be greedy, in the sense that it examines one workload at a time, future algorithms may want to examine multiple workloads at once to find jointly optimal placements. By allowing excess complexity in the structure of topology labels at the outset, we will limit our ability to do future global optimizations.

I think it is fine to leave Explicit Topology Definition out of Alpha. However, before GA, we should either have a beta Explicit Topology Definition, or have documented requirements for (1) the maximum number of label keys used for TAS, (2) a partial order over all TAS keys, and (3) nesting of TAS labels.

Otherwise, it will be hard to enforce those later.
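
For reference, a hypothetical sketch of what such a NodeTopology object could capture to enforce those properties (this stays out of scope for Alpha; all names are illustrative; metav1 refers to k8s.io/apimachinery/pkg/apis/meta/v1):

// NodeTopology would be an admin- or cloud-provider-owned object describing the
// cluster's topology levels.
type NodeTopology struct {
  metav1.TypeMeta
  metav1.ObjectMeta

  Spec NodeTopologySpec
}

type NodeTopologySpec struct {
  // Levels is ordered from the smallest domain to the largest (e.g. rack before
  // block), giving the explicit total order needed for multi-level and
  // prioritized placement. Validation would cap the list length to bound
  // algorithm complexity.
  Levels []TopologyLevel
}

type TopologyLevel struct {
  // Alias is the user-facing name of the level (e.g. "rack").
  Alias string
  // NodeLabelKey is the node label that identifies domains at this level
  // (e.g. "topology.kubernetes.io/rack"); each level-n domain must nest within
  // exactly one level-(n+1) domain.
  NodeLabelKey string
}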


