Control Topology Management Policies on a node

FEATURE STATE: Kubernetes v1.18 beta

An increasing number of systems leverage a combination of CPUs and hardware accelerators to support latency-critical execution and high-throughput parallel computation. These include workloads in fields such as telecommunications, scientific computing, machine learning, financial services and data analytics. Such hybrid systems comprise a high performance environment.

In order to extract the best performance, optimizations related to CPU isolation, memory and device locality are required. However, in Kubernetes, these optimizations are handled by a disjoint set of components.

Topology Manager is a Kubelet component that aims to coordinate the set of components that are responsible for these optimizations.

Before you begin

You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. If you do not already have a cluster, you can create one by using Minikube, or you can use one of the Kubernetes playgrounds.

Your Kubernetes server must be at or later than version v1.18. To check the version, enter kubectl version.

How Topology Manager Works

Prior to the introduction of Topology Manager, the CPU and Device Manager in Kubernetes made resource allocation decisions independently of each other. This can result in undesirable allocations on multi-socket systems; performance- and latency-sensitive applications will suffer due to these undesirable allocations. Undesirable in this case means, for example, CPUs and devices being allocated from different NUMA Nodes, thus incurring additional latency.

The Topology Manager is a Kubelet component, which acts as a source of truth so that other Kubelet components can make topology aligned resource allocation choices.

The Topology Manager provides an interface for components, called Hint Providers, to send and receive topology information. Topology Manager has a set of node level policies which are explained below.

The Topology Manager receives topology information from the Hint Providers as a bitmask denoting the NUMA Nodes available and a preferred allocation indication. The Topology Manager policies perform a set of operations on the hints provided and converge on the hint determined by the policy to give the optimal result; if an undesirable hint is stored, the preferred field for that hint is set to false. In the current policies, the preferred hint is the narrowest preferred mask. The selected hint is stored as part of the Topology Manager. Depending on the policy configured, the pod can be accepted or rejected from the node based on the selected hint. The hint is then stored in the Topology Manager for use by the Hint Providers when making resource allocation decisions.
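
As a purely illustrative sketch (this is not a Kubernetes API object; the field names numaNodes and preferred are invented for this example), the hints that a Hint Provider such as the CPU Manager might generate for a container requesting 2 CPUs, on a node with two NUMA Nodes that each have enough free CPUs, could be summarised as:

# Illustrative only: each hint names the NUMA Nodes an allocation could
# use and whether that combination is preferred (narrower masks win).
- numaNodes: [0]     # request satisfiable from NUMA node 0 alone
  preferred: true
- numaNodes: [1]     # request satisfiable from NUMA node 1 alone
  preferred: true
- numaNodes: [0, 1]  # allocation would span both NUMA nodes
  preferred: false

The policies described below merge such hints from all Hint Providers and, where possible, select a single preferred NUMA Node affinity for the container.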

Enable the Topology Manager feature

Support for the Topology Manager requires the TopologyManager feature gate to be enabled. It is enabled by default starting with Kubernetes 1.18.
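
For example, assuming the kubelet is configured with a configuration file (see Set Kubelet parameters via a config file), a minimal sketch that makes the gate explicit looks like the following; the equivalent command-line form is the kubelet flag --feature-gates=TopologyManager=true:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # Already enabled by default in Kubernetes 1.18; shown here only to
  # make the setting explicit.
  TopologyManager: true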

Topology Manager Policies

The Topology Manager currently:

  • Works on Nodes with the static CPU Manager Policy enabled. See Control CPU Management Policies on the Node.
  • Works on Pods making CPU requests or Device requests via extended resources

If these conditions are met, Topology Manager will align the requested resources.

Topology Manager supports four allocation policies. You can set a policy via a Kubelet flag, --topology-manager-policy; a configuration example is shown after the list. The supported policies are:

  • none (default)
  • best-effort
  • restricted
  • single-numa-node
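
For example, a minimal KubeletConfiguration sketch selecting the single-numa-node policy (assuming the kubelet is configured with a configuration file and that the static CPU Manager policy has already been set up as described above) could look like this:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Equivalent to the kubelet flag --topology-manager-policy=single-numa-node.
topologyManagerPolicy: single-numa-node

The kubelet must be restarted for a changed policy to take effect.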

none policy

This is the default policy and does not perform any topology alignment.

best-effort policy

For each container in a Guaranteed Pod, the kubelet, with the best-effort topology management policy, calls each Hint Provider to discover its resource availability. Using this information, the Topology Manager stores the preferred NUMA Node affinity for that container. If the affinity is not preferred, the Topology Manager stores this and admits the pod to the node anyway.

The Hint Providers can then use this information when making the resource allocation decision.

restricted policy

For each container in a Guaranteed Pod, the kubelet, with the restricted topology management policy, calls each Hint Provider to discover its resource availability. Using this information, the Topology Manager stores the preferred NUMA Node affinity for that container. If the affinity is not preferred, the Topology Manager rejects the pod from the node. This results in a pod in a Terminated state with a pod admission failure.

Once the pod is in a Terminated state, the Kubernetes scheduler will not attempt to reschedule it. It is recommended to use a ReplicaSet or Deployment to trigger a redeploy of the pod. An external control loop could also be implemented to trigger a redeployment of pods that have the Topology Affinity error.

If the pod is admitted, the Hint Providers can then use this information when making the resource allocation decision.

single-numa-node policy

For each container in a Guaranteed Pod, the kubelet, with the single-numa-node topology management policy, calls each Hint Provider to discover its resource availability. Using this information, the Topology Manager determines whether a single NUMA Node affinity is possible. If it is, the Topology Manager stores this and the Hint Providers can then use the information when making resource allocation decisions. If it is not possible, the Topology Manager rejects the pod from the node. This results in a pod in a Terminated state with a pod admission failure.

Once the pod is in a Terminated state, the Kubernetes scheduler will not attempt to reschedule it. It is recommended to use a Deployment with replicas to trigger a redeploy of the Pod. An external control loop could also be implemented to trigger a redeployment of pods that have the Topology Affinity error.
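
As an illustration of that recommendation, the following is a minimal Deployment sketch (the name my-app is a placeholder; the CPU and memory amounts mirror the Guaranteed example in the next section) whose Pod template runs in the Guaranteed QoS class, so its ReplicaSet creates a replacement Pod if one fails admission with a Topology Affinity error:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app            # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:           # equal limits and requests -> Guaranteed QoS
            memory: "200Mi"
            cpu: "2"
          requests:
            memory: "200Mi"
            cpu: "2"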

Pod Interactions with Topology Manager Policies

Consider the containers in the following pod specs:

spec:
  containers:
  - name: nginx
    image: nginx

This pod runs in the BestEffort QoS class because no resource requests or limits are specified.

spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
      requests:
        memory: "100Mi"

This pod runs in the Burstable QoS class because requests are less than limits.

If the selected policy is anything other than none, Topology Manager would not consider either of these Pod specifications.

spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
        example.com/device: "1"
      requests:
        memory: "200Mi"
        cpu: "2"
        example.com/device: "1"

This pod runs in the Guaranteed QoS class because requests are equal to limits.

spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        example.com/deviceA: "1"
        example.com/deviceB: "1"
      requests:
        example.com/deviceA: "1"
        example.com/deviceB: "1"

This pod runs in the BestEffort QoS class because there are no CPU and memory requests.

The Topology Manager would consider both of the above pods. It would consult the Hint Providers, which are the CPU Manager and the Device Manager, to get topology hints for the pods. In the case of the Guaranteed pod, the static CPU Manager policy would return hints relating to the CPU request, and the Device Manager would send back hints for the requested device.

In the case of the BestEffort pod, the CPU Manager would send back the default hint, as there is no CPU request, and the Device Manager would send back the hints for each of the requested devices.

Using this information, the Topology Manager calculates the optimal hint for the pod and stores it; the Hint Providers use this hint when making their resource assignments.

Known Limitations

  1. The maximum number of NUMA nodes that Topology Manager allows is 8. With more than 8 NUMA nodes, there would be a state explosion when trying to enumerate the possible NUMA affinities and generate their hints.

  2. The scheduler is not topology-aware, so it is possible for a pod to be scheduled on a node and then fail on that node due to the Topology Manager.

  3. The Device Manager and the CPU Manager are the only components to adopt the Topology Manager’s HintProvider interface. This means that NUMA alignment can only be achieved for resources managed by the CPU Manager and the Device Manager. Memory or Hugepages are not considered by the Topology Manager for NUMA alignment.
