Introduction: Overcoming GPU management challenges
In Part 1 of this blog series, we explored the challenges of hosting large language models (LLMs) on CPU-based workloads on an EKS cluster. We discussed the inefficiencies associated with using CPUs for such tasks, primarily due to large model sizes and slower inference speeds. The introduction of GPU resources offered a significant increase in performance, but it also brought the need to efficiently manage these high-cost resources.
In this second part, we’ll dive deeper into how to optimize GPU usage for these workloads. We will cover the following key areas:
- NVIDIA device plugin settings: This section will explain the importance of the NVIDIA device plugin for Kubernetes and detail its role in resource discovery, allocation, and isolation.
- Time slicing: We will discuss how time slicing allows efficient sharing of GPU resources and maximizes utilization.
- Karpenter node autoscaling: This section will describe how Karpenter scales nodes based on real-time demand, optimizing resource utilization and reducing costs.
Challenges solved
- Efficient GPU management: Ensuring that GPUs are fully utilized to offset their high costs.
- Handling concurrency: Allowing multiple workloads to efficiently share GPU resources.
- Dynamic scaling: Automatically adjusting the number of nodes based on workload requirements.
Section 1: Introduction to the NVIDIA Device Plugin
The NVIDIA device plugin for Kubernetes is a component that simplifies the management and use of NVIDIA GPUs in Kubernetes clusters. It allows Kubernetes to recognize and allocate GPU resources to pods, enabling GPU-accelerated workloads.
Why we need the NVIDIA device plugin
- Resource discovery: Automatically detects NVIDIA GPU resources on each node.
- Resource allocation: Distributes GPU resources to pods based on their requests.
- Isolation: Ensures safe and efficient use of GPU resources between different pods.
The NVIDIA device plugin simplifies GPU management in Kubernetes clusters. It automates the installation of the NVIDIA driver, the NVIDIA container toolkit, and CUDA, ensuring GPU resources are available for workloads without requiring manual setup.
- NVIDIA driver: Required for nvidia-smi and basic GPU operations, and interfaces with the GPU hardware. The output of the nvidia-smi command shows key information such as the driver version, CUDA version, and detailed GPU configuration, indicating that the GPU is properly configured and ready to use.
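For reference, this check can be run directly on the node; the exact values reported will depend on the driver and GPU in your environment.

nvidia-smi
# Reports the driver version, CUDA version, GPU model, memory usage, and any running GPU processes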
- NVIDIA container toolkit: Required to use GPUs with containers. Below we see the installed version of the container toolkit and the status of the service running on the instance:
#Installed version
rpm -qa | grep -i nvidia-container-toolkit
nvidia-container-toolkit-base-1.15.0-1.x86_64
nvidia-container-toolkit-1.15.0-1.x86_64
- CUDA: Required for GPU-accelerated applications and libraries. Below is the output of the nvcc command, showing the version of CUDA installed on the system:
/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
NVIDIA device plugin settings
To ensure that the DaemonSet runs exclusively on GPU-based instances, we label the node with the key "nvidia.com/gpu" and the value "true". This is achieved by using node affinity, node selectors, and taints and tolerations.
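For example, using the node referenced later in this post, the label can be applied as follows:

kubectl label node ip-10-20-23-199.us-west-1.compute.internal nvidia.com/gpu=true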
Let’s now look into each of these components.
- Node affinity: Node affinity allows pods to be scheduled on nodes based on node labels. With requiredDuringSchedulingIgnoredDuringExecution, the scheduler cannot place the pod unless the rule is satisfied; here the key is "nvidia.com/gpu", the operator is "In", and the value is "true".
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: feature.node.kubernetes.io/pci-10de.present
              operator: In
              values:
                - "true"
        - matchExpressions:
            - key: feature.node.kubernetes.io/cpu-model.vendor_id
              operator: In
              values:
                - NVIDIA
        - matchExpressions:
            - key: nvidia.com/gpu
              operator: In
              values:
                - "true"
- Node selector: The node selector is the simplest form of node selection constraint: nvidia.com/gpu: "true".
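In a pod or DaemonSet spec, that constraint looks like the following:

nodeSelector:
  nvidia.com/gpu: "true"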
- Taints and tolerations: Tolerations are added to the DaemonSet to ensure it can be scheduled on the tainted GPU nodes (nvidia.com/gpu=true:NoSchedule).
kubectl taint node ip-10-20-23-199.us-west-1.compute.internal nvidia.com/gpu=true:NoSchedule

kubectl describe node ip-10-20-23-199.us-west-1.compute.internal | grep -i taint
Taints: nvidia.com/gpu=true:NoSchedule

tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
After implementing node labeling, affinity, the node selector, and taints/tolerations, the DaemonSet runs exclusively on GPU-based instances. We can verify the deployment of the NVIDIA device plugin using the following command:
kubectl get ds -n kube-system
NAME                                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                     AGE
nvidia-device-plugin                      1         1         1       1            1           nvidia.com/gpu=true                               75d
nvidia-device-plugin-mps-control-daemon   0         0         0       0            0           nvidia.com/gpu=true,nvidia.com/mps.capable=true   75d
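As an additional sanity check, we can confirm that the node now advertises GPU capacity; on a g4dn.xlarge this is typically a single physical GPU before time slicing is enabled:

kubectl describe node ip-10-20-23-199.us-west-1.compute.internal | grep -i "nvidia.com/gpu"
# nvidia.com/gpu should appear under both Capacity and Allocatable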
However, the challenge here is that GPUs are very expensive, and we need to ensure maximum GPU utilization; this is where GPU concurrency comes in.
GPU concurrency:
GPU concurrency refers to the ability to execute multiple tasks or threads simultaneously on the GPU:
- Single process: In a single-process setup, only one application or container uses the GPU at a time. This approach is simple, but it can lead to underutilization of GPU resources if the application does not fully load the GPU.
- Multi-Process Service (MPS): NVIDIA's Multi-Process Service (MPS) allows multiple CUDA applications to share a single GPU concurrently, improving GPU utilization and reducing context-switching overhead.
- Time slicing: Time slicing divides GPU time between different processes; in other words, multiple processes take turns on the GPU (round-robin context switching).
- Multi-Instance GPU (MIG): MIG is a feature available on NVIDIA A100 GPUs that allows a single GPU to be partitioned into multiple smaller, isolated instances, each of which behaves as a separate GPU.
- Virtualization: GPU virtualization allows a single physical GPU to be shared between multiple virtual machines (VMs) or containers, providing each with a virtual GPU.
Section 2: Implementing Time Slicing on GPUs
In the context of NVIDIA GPUs and Kubernetes, time slicing refers to sharing a physical GPU between multiple containers or pods in a Kubernetes cluster. The technique divides GPU processing time into smaller intervals and allocates those intervals to different containers or pods.
- Time slice allocation: The GPU scheduler allocates time slices to each vGPU configured on the physical GPU.
- Preemption and context switching: At the end of a vGPU's time slice, the GPU scheduler preempts its execution, saves its context, and switches to the next vGPU's context.
- Context switching: The GPU scheduler ensures smooth context switching between vGPUs, minimizing overhead and ensuring efficient use of GPU resources.
- Task completion: Containerized processes complete their GPU-accelerated tasks within their allocated time slices.
- Resource management and monitoring
- Releasing resources: When tasks are completed, GPU resources are released back to Kubernetes for reallocation to other pods or containers.
Why do we need time slicing?
- Cost efficiency: Ensures high-cost GPUs are not underutilized.
- Concurrency: Allows multiple applications to use the GPU simultaneously.
Example for configuring time slicing
Apply the time slicing configuration using a ConfigMap as shown below. Here, replicas: 3 specifies the number of replicas for the GPU resource, meaning that one GPU resource can be sliced into 3 shared instances.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 3

#We can verify the GPU resources available on the nodes using the following command:
kubectl get nodes -o json | jq -r '.items[] | select(.status.capacity."nvidia.com/gpu" != null) | {name: .metadata.name, capacity: .status.capacity}'
{
  "name": "ip-10-20-23-199.us-west-1.compute.internal",
  "capacity": {
    "cpu": "4",
    "ephemeral-storage": "104845292Ki",
    "hugepages-1Gi": "0",
    "hugepages-2Mi": "0",
    "memory": "16069060Ki",
    "nvidia.com/gpu": "3",
    "pods": "110"
  }
}

#The above output shows that the node ip-10-20-23-199.us-west-1.compute.internal has 3 virtual GPUs available.
#We can request GPU resources in the pod specification by setting resource requests and limits:
resources:
  limits:
    cpu: "1"
    memory: 2G
    nvidia.com/gpu: "1"
  requests:
    cpu: "1"
    memory: 2G
    nvidia.com/gpu: "1"
In our case, the node ip-10-20-23-199.us-west-1.compute.internal can host up to 3 pods, and because time slicing is enabled, these pods can use the 3 virtual GPUs as shown below.
The GPUs were effectively shared between the pods, and we can see the PIDs assigned to each of the processes below.
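As a sketch of what such a workload could look like (the deployment name and placeholder command are illustrative, not taken from the original setup; the image is the TensorFlow GPU image referenced later in this post), a three-replica deployment in which each pod requests one GPU would fit on that single time-sliced node:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-workload              # illustrative name
spec:
  replicas: 3                     # matches the 3 time-sliced GPU replicas on the node
  selector:
    matchLabels:
      app: gpu-workload
  template:
    metadata:
      labels:
        app: gpu-workload
    spec:
      nodeSelector:
        nvidia.com/gpu: "true"
      tolerations:
        - effect: NoSchedule      # tolerate the GPU node taint applied earlier
          key: nvidia.com/gpu
          operator: Exists
      containers:
        - name: app
          image: tensorflow/tensorflow:2.12.0-gpu
          command: ["sleep", "infinity"]   # placeholder command to keep the pod running for demonstration
          resources:
            requests:
              nvidia.com/gpu: "1"          # each pod receives one time-sliced GPU share
            limits:
              nvidia.com/gpu: "1"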
Now that we’ve optimized GPU usage at the pod level, let’s focus on optimizing GPU resources at the node level. We can achieve this using a cluster autoscaling solution called Karpenter. This is especially important because the learning labs may not always have constant load or user activity, and GPUs are extremely expensive. By leveraging Karpenter, we can dynamically scale GPU nodes up or down based on demand, ensuring cost efficiency and optimal resource utilization.
Section 3: Node Autoscaling with Karpenter
Karpenter is an open-source node lifecycle management solution for Kubernetes. It automates node provisioning and deprovisioning based on pod scheduling needs, enabling efficient scaling and cost optimization.
- Dynamic Node Provisioning: Automatically provision nodes based on demand.
- Optimizes resource utilization: matches node capacity with workload needs.
- Reduces operating costs: Minimizes unnecessary resource usage.
- Improves cluster efficiency: Increases overall performance and responsiveness.
Why use Karpenter for dynamic scaling
- Dynamic scaling: Automatically adjusts the number of nodes based on workload demand.
- Cost optimization: Ensures resources are provisioned only when needed, reducing overhead.
- Efficient resource management: Karpenter tracks pods that cannot be scheduled due to lack of resources, reviews their requirements, provisions nodes to accommodate them, schedules the pods, and decommissions the nodes when they are no longer needed.
Installing Karpenter:
#Install Karpenter using Helm:
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version "${KARPENTER_VERSION}" \
  --namespace "${KARPENTER_NAMESPACE}" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi

#Verify Karpenter installation:
kubectl get pod -n kube-system | grep -i karpenter
karpenter-7df6c54cc-rsv8s   1/1   Running   2 (10d ago)   53d
karpenter-7df6c54cc-zrl9n   1/1   Running   0             53d
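The Helm command above assumes the referenced environment variables are already set, for example as follows (the version shown is a placeholder; the namespace matches the pods shown above and the cluster name reuses the one in the bootstrap script later in this post):

export KARPENTER_NAMESPACE="kube-system"
export KARPENTER_VERSION="0.37.0"               # placeholder; substitute the version you have validated
export CLUSTER_NAME="nextgen-learninglab-eks"   # cluster name used in the bootstrap script below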
Configuring Karpenter with NodePools and NodeClasses:
Karpenter can be configured with NodePools and NodeClasses to automate node provisioning and scaling based on the specific needs of your workloads.
- Karpenter NodePool: A NodePool is a custom resource that defines a set of nodes with shared specifications and constraints in a Kubernetes cluster. Karpenter uses NodePools to dynamically manage and scale node resources based on workload requirements.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: g4-nodepool
spec:
  template:
    metadata:
      labels:
        nvidia.com/gpu: "true"
    spec:
      taints:
        - effect: NoSchedule
          key: nvidia.com/gpu
          value: "true"
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g4dn.xlarge"]
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: g4-nodeclass
  limits:
    cpu: 1000
  disruption:
    expireAfter: 120m
    consolidationPolicy: WhenUnderutilized
- Karpenter NodeClass: NodeClasses are configurations that define the characteristics and parameters of the nodes that Karpenter can provision in a Kubernetes cluster. A NodeClass specifies basic infrastructure details for nodes, such as instance types, launch template configurations, and cloud-provider-specific settings.
Note: The userData section contains scripts that bootstrap the EC2 instance, including pulling a TensorFlow GPU Docker image and configuring the instance to join the Kubernetes cluster.
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: g4-nodeclass
spec:
  amiFamily: AL2
  launchTemplate:
    name: "ack_nodegroup_template_new"
    version: "7"
  role: "KarpenterNodeRole"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "nextgen-learninglab"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "nextgen-learninglab"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 10000
        encrypted: true
        deleteOnTermination: true
        throughput: 125
  tags:
    Name: Learninglab-Staging-Auto-GPU-Node
  userData: |
    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="//"
    --//
    Content-Type: text/x-shellscript; charset="us-ascii"
    set -ex
    sudo ctr -n=k8s.io image pull docker.io/tensorflow/tensorflow:2.12.0-gpu
    --//
    Content-Type: text/x-shellscript; charset="us-ascii"
    B64_CLUSTER_CA=" "
    API_SERVER_URL=""
    /etc/eks/bootstrap.sh nextgen-learninglab-eks --kubelet-extra-args '--node-labels=eks.amazonaws.com/capacityType=ON_DEMAND --pod-max-pids=32768 --max-pods=110' --b64-cluster-ca $B64_CLUSTER_CA --apiserver-endpoint $API_SERVER_URL --use-max-pods false
    --//
    Content-Type: text/x-shellscript; charset="us-ascii"
    KUBELET_CONFIG=/etc/kubernetes/kubelet/kubelet-config.json
    echo "$(jq ".podPidsLimit=32768" $KUBELET_CONFIG)" > $KUBELET_CONFIG
    --//
    Content-Type: text/x-shellscript; charset="us-ascii"
    systemctl stop kubelet
    systemctl daemon-reload
    systemctl start kubelet
    --//--
In this scenario, each node (for example, ip-10-20-23-199.us-west-1.compute.internal) can host up to three pods. If the deployment is scaled to add more pods, resources will be insufficient, causing the new pods to remain in a pending state.
Karpenter monitors these unschedulable pods and evaluates their resource requirements to act accordingly. A NodeClaim is created against the NodePool, and Karpenter then provisions a node on demand.
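To observe this behavior, the pending pods and the resulting NodeClaims can be inspected with standard kubectl commands (output omitted here; resource names will vary by environment):

# Pods waiting for a GPU node
kubectl get pods --field-selector=status.phase=Pending

# NodeClaims created by Karpenter from the NodePool
kubectl get nodeclaims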
Conclusion: Effective management of GPU resources in Kubernetes
With the increasing demand for GPU-based workloads in Kubernetes, managing GPU resources effectively is essential. Combining the NVIDIA device plugin, time slicing, and Karpenter provides a powerful approach to managing, optimizing, and scaling GPU resources in a Kubernetes cluster, delivering high performance with efficient resource utilization. This solution was implemented to host GPU-enabled pilot learning labs on Developer.cisco.com/learning, providing a GPU-powered learning experience.