Introduction: Overcoming GPU management challenges
In Part 1 of this blog series, we explored the challenges of hosting large language models (LLMs) on CPU-based workloads on an EKS cluster. We discussed the inefficiencies associated with using CPUs for such tasks, primarily due to large model sizes and slower inference speeds. The introduction of GPU resources offered a significant increase in performance, but it also brought the need to efficiently manage these high-cost resources.
In this second part, we’ll dive deeper into how to optimize GPU usage for these workloads. We will cover the following key areas:
- NVIDIA device plugin settings: This section will explain the importance of the NVIDIA device plugin for Kubernetes and detail its role in resource discovery, allocation, and isolation.
- Time slicing: We will discuss how time slicing allows efficient sharing of GPU resources and maximizes utilization.
- Karpenter node autoscaling: This section will describe how Karpenter scales nodes based on real-time demand, optimizing resource utilization and reducing costs.
Challenges solved
- Efficient GPU management: Ensuring that GPUs are fully utilized to offset their high costs.
- Handling concurrency: Allowing multiple workloads to efficiently share GPU resources.
- Dynamic scaling: Automatically adjusting the number of nodes based on workload requirements.
Section 1: Introduction to the NVIDIA Device Plugin
The NVIDIA device plugin for Kubernetes is a component that simplifies the management and use of NVIDIA GPUs in Kubernetes clusters. It allows Kubernetes to recognize and allocate GPU resources to pods, enabling GPU-accelerated workloads.
Why we need the NVIDIA device plugin
- Resource discovery: Automatically detects NVIDIA GPU resources on each node.
- Resource allocation: Distributes GPU resources to pods based on their requests.
- Isolation: Ensures safe and efficient use of GPU resources between different pods.
The NVIDIA device plugin simplifies GPU management in Kubernetes clusters. It automates the installation of the NVIDIA driver, the NVIDIA container toolkit, and CUDA, ensuring GPU resources are available for workloads without requiring manual setup.
- NVIDIA driver: Required for nvidia-smi and basic GPU operations, and interfaces with the GPU hardware. The output of the nvidia-smi command shows key information such as the driver version, CUDA version, and detailed GPU configuration, indicating that the GPU is properly configured and ready to use.
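For reference, this check can be run directly on the node; the exact values reported will depend on the driver and GPU in your environment.

nvidia-smi
# Reports the driver version, CUDA version, GPU model, memory usage, and any running GPU processes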
- NVIDIA container toolkit: Required to use GPUs with containers. Below we see the installed version of the container toolkit and the status of the service running on the instance:
#Installed version
rpm -qa | grep -i nvidia-container-toolkit
nvidia-container-toolkit-base-1.15.0-1.x86_64
nvidia-container-toolkit-1.15.0-1.x86_64
- CUDA: Required for GPU-accelerated applications and libraries. Below is the output of the nvcc command, showing the version of CUDA installed on the system:
/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
NVIDIA device plugin settings
To ensure that the DaemonSet runs exclusively on GPU-based instances, we label the node with the key "nvidia.com/gpu" and the value "true". This is achieved by using node affinity, node selectors, and taints and tolerations.
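For example, using the node referenced later in this post, the label can be applied as follows:

kubectl label node ip-10-20-23-199.us-west-1.compute.internal nvidia.com/gpu=true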
Let’s now look into each of these components.
- Node affinity: Node affinity allows pods to be scheduled on nodes based on node labels. With requiredDuringSchedulingIgnoredDuringExecution, the scheduler cannot place the pod unless the rule is satisfied; here the key is "nvidia.com/gpu", the operator is "In", and the value is "true".
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: feature.node.kubernetes.io/pci-10de.present
              operator: In
              values:
                - "true"
        - matchExpressions:
            - key: feature.node.kubernetes.io/cpu-model.vendor_id
              operator: In
              values:
                - NVIDIA
        - matchExpressions:
            - key: nvidia.com/gpu
              operator: In
              values:
                - "true"
- Node selector: The node selector is the simplest form of node selection constraint: nvidia.com/gpu: "true".
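In a pod or DaemonSet spec, that constraint looks like the following:

nodeSelector:
  nvidia.com/gpu: "true"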
- Taints and tolerations: Tolerations are added to the DaemonSet to ensure it can be scheduled on the tainted GPU nodes (nvidia.com/gpu=true:NoSchedule).
kubectl taint node ip-10-20-23-199.us-west-1.compute.internal nvidia.com/gpu=true:NoSchedule

kubectl describe node ip-10-20-23-199.us-west-1.compute.internal | grep -i taint
Taints: nvidia.com/gpu=true:NoSchedule

tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
After implementing node labeling, affinity, the node selector, and taints/tolerations, the DaemonSet runs exclusively on GPU-based instances. We can verify the deployment of the NVIDIA device plugin using the following command:
kubectl get ds -n kube-system
NAME                                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                     AGE
nvidia-device-plugin                      1         1         1       1            1           nvidia.com/gpu=true                               75d
nvidia-device-plugin-mps-control-daemon   0         0         0       0            0           nvidia.com/gpu=true,nvidia.com/mps.capable=true   75d
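As an additional sanity check, we can confirm that the node now advertises GPU capacity; on a g4dn.xlarge this is typically a single physical GPU before time slicing is enabled:

kubectl describe node ip-10-20-23-199.us-west-1.compute.internal | grep -i "nvidia.com/gpu"
# nvidia.com/gpu should appear under both Capacity and Allocatable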
However, the challenge here is that GPUs are very expensive, and we need to ensure maximum GPU utilization; this is where GPU concurrency comes in.
GPU concurrency:
GPU concurrency refers to the ability to execute multiple tasks or threads simultaneously on the GPU:
- Single process: In a single-process setup, only one application or container uses the GPU at a time. This approach is simple, but it can lead to underutilization of GPU resources if the application does not fully load the GPU.
- Multi-Process Service (MPS): NVIDIA's Multi-Process Service (MPS) allows multiple CUDA applications to share a single GPU concurrently, improving GPU utilization and reducing context-switching overhead.
- Time slicing: Time slicing divides GPU time between different processes; in other words, multiple processes take turns on the GPU (round-robin context switching).
- Multi-Instance GPU (MIG): MIG is a feature available on NVIDIA A100 GPUs that allows a single GPU to be partitioned into multiple smaller, isolated instances, each of which behaves as a separate GPU.
- Virtualization: GPU virtualization allows a single physical GPU to be shared between multiple virtual machines (VMs) or containers, providing each with a virtual GPU.
Section 2: Implementing Time Slicing on GPUs
In the context of NVIDIA GPUs and Kubernetes, time slicing refers to sharing a physical GPU between multiple containers or pods in a Kubernetes cluster. The technique divides GPU processing time into smaller intervals and allocates those intervals to different containers or pods.
- Time slice allocation: The GPU scheduler allocates time slices to each vGPU configured on the physical GPU.
- Preemption and context switching: At the end of a vGPU's time slice, the GPU scheduler preempts its execution, saves its context, and switches to the next vGPU's context.
- Context switching: The GPU scheduler ensures smooth context switching between vGPUs, minimizing overhead and ensuring efficient use of GPU resources.
- Task completion: Containerized processes complete their GPU-accelerated tasks within their allocated time slices.
- Resource management and monitoring
- Releasing resources: When tasks are completed, GPU resources are released back to Kubernetes for reallocation to other pods or containers.
Why do we need time slicing?
- Cost efficiency: Ensures high-cost GPUs are not underutilized.
- Concurrency: Allows multiple applications to use the GPU simultaneously.
Example for configuring time slicing
Apply the time slicing configuration using a ConfigMap as shown below. Here, replicas: 3 specifies the number of replicas for the GPU resource, meaning that one GPU resource can be sliced into 3 shared instances.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 3

#We can verify the GPU resources available on the nodes using the following command:
kubectl get nodes -o json | jq -r '.items[] | select(.status.capacity."nvidia.com/gpu" != null) | {name: .metadata.name, capacity: .status.capacity}'
{
  "name": "ip-10-20-23-199.us-west-1.compute.internal",
  "capacity": {
    "cpu": "4",
    "ephemeral-storage": "104845292Ki",
    "hugepages-1Gi": "0",
    "hugepages-2Mi": "0",
    "memory": "16069060Ki",
    "nvidia.com/gpu": "3",
    "pods": "110"
  }
}

#The above output shows that the node ip-10-20-23-199.us-west-1.compute.internal has 3 virtual GPUs available.
#We can request GPU resources in the pod specification by setting resource requests and limits:
resources:
  limits:
    cpu: "1"
    memory: 2G
    nvidia.com/gpu: "1"
  requests:
    cpu: "1"
    memory: 2G
    nvidia.com/gpu: "1"
In our case, the node ip-10-20-23-199.us-west-1.compute.internal can host up to 3 pods, and because time slicing is enabled, these pods can use the 3 virtual GPUs as shown below.
The GPUs were effectively shared between the pods, and we can see the PIDs assigned to each of the processes below.
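As a sketch of what such a workload could look like (the deployment name and placeholder command are illustrative, not taken from the original setup; the image is the TensorFlow GPU image referenced later in this post), a three-replica deployment in which each pod requests one GPU would fit on that single time-sliced node:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-workload              # illustrative name
spec:
  replicas: 3                     # matches the 3 time-sliced GPU replicas on the node
  selector:
    matchLabels:
      app: gpu-workload
  template:
    metadata:
      labels:
        app: gpu-workload
    spec:
      nodeSelector:
        nvidia.com/gpu: "true"
      tolerations:
        - effect: NoSchedule      # tolerate the GPU node taint applied earlier
          key: nvidia.com/gpu
          operator: Exists
      containers:
        - name: app
          image: tensorflow/tensorflow:2.12.0-gpu
          command: ["sleep", "infinity"]   # placeholder command to keep the pod running for demonstration
          resources:
            requests:
              nvidia.com/gpu: "1"          # each pod receives one time-sliced GPU share
            limits:
              nvidia.com/gpu: "1"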
Now that we’ve optimized GPU usage at the pod level, let’s focus on optimizing GPU resources at the node level. We can achieve this using a cluster autoscaling solution called Karpenter. This is especially important because the learning labs may not always have constant load or user activity, and GPUs are extremely expensive. By leveraging Karpenter, we can dynamically scale GPU nodes up or down based on demand, ensuring cost efficiency and optimal resource utilization.
Section 3: Node Autoscaling with Karpenter
Karpenter is an open-source node lifecycle management solution for Kubernetes. It automates node provisioning and deprovisioning based on pod scheduling needs, enabling efficient scaling and cost optimization.
- Dynamic Node Provisioning: Automatically provision nodes based on demand.
- Optimizes resource utilization: matches node capacity with workload needs.
- Reduces operating costs: Minimizes unnecessary resource usage.
- Improves cluster efficiency: Increases overall performance and responsiveness.
Why use Karpenter for dynamic scaling
- Dynamic scaling: Automatically adjusts the number of nodes based on workload demand.
- Cost optimization: Ensures resources are provisioned only when needed, reducing overhead.
- Efficient resource management: Karpenter tracks pods that cannot be scheduled due to lack of resources, reviews their requirements, provisions nodes to accommodate them, schedules the pods, and decommissions the nodes when they are no longer needed.
Installing Karpenter:
#Install Karpenter using Helm:
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version "${KARPENTER_VERSION}" \
  --namespace "${KARPENTER_NAMESPACE}" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi

#Verify Karpenter installation:
kubectl get pod -n kube-system | grep -i karpenter
karpenter-7df6c54cc-rsv8s   1/1   Running   2 (10d ago)   53d
karpenter-7df6c54cc-zrl9n   1/1   Running   0             53d
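The Helm command above assumes the referenced environment variables are already set, for example as follows (the version shown is a placeholder; the namespace matches the pods shown above and the cluster name reuses the one in the bootstrap script later in this post):

export KARPENTER_NAMESPACE="kube-system"
export KARPENTER_VERSION="0.37.0"               # placeholder; substitute the version you have validated
export CLUSTER_NAME="nextgen-learninglab-eks"   # cluster name used in the bootstrap script below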
Configuring Karpenter with NodePools and NodeClasses:
Karpenter can be configured with NodePools and NodeClasses to automate node provisioning and scaling based on the specific needs of your workloads.
- Karpenter NodePool: A NodePool is a custom resource that defines a set of nodes with shared specifications and constraints in a Kubernetes cluster. Karpenter uses NodePools to dynamically manage and scale node resources based on workload requirements.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: g4-nodepool
spec:
  template:
    metadata:
      labels:
        nvidia.com/gpu: "true"
    spec:
      taints:
        - effect: NoSchedule
          key: nvidia.com/gpu
          value: "true"
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g4dn.xlarge"]
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: g4-nodeclass
  limits:
    cpu: 1000
  disruption:
    expireAfter: 120m
    consolidationPolicy: WhenUnderutilized
- Karpenter NodeClass: NodeClasses are configurations that define the characteristics and parameters of the nodes that Karpenter can provision in a Kubernetes cluster. A NodeClass specifies basic infrastructure details for nodes, such as instance types, launch template configurations, and cloud-provider-specific settings.
Note: The userData section contains scripts that bootstrap the EC2 instance, including pulling a TensorFlow GPU Docker image and configuring the instance to join the Kubernetes cluster.
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: g4-nodeclass
spec:
  amiFamily: AL2
  launchTemplate:
    name: "ack_nodegroup_template_new"
    version: "7"
  role: "KarpenterNodeRole"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "nextgen-learninglab"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "nextgen-learninglab"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 10000
        encrypted: true
        deleteOnTermination: true
        throughput: 125
  tags:
    Name: Learninglab-Staging-Auto-GPU-Node
  userData: |
    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="//"
    --//
    Content-Type: text/x-shellscript; charset="us-ascii"
    set -ex
    sudo ctr -n=k8s.io image pull docker.io/tensorflow/tensorflow:2.12.0-gpu
    --//
    Content-Type: text/x-shellscript; charset="us-ascii"
    B64_CLUSTER_CA=" "
    API_SERVER_URL=""
    /etc/eks/bootstrap.sh nextgen-learninglab-eks --kubelet-extra-args '--node-labels=eks.amazonaws.com/capacityType=ON_DEMAND --pod-max-pids=32768 --max-pods=110' --b64-cluster-ca $B64_CLUSTER_CA --apiserver-endpoint $API_SERVER_URL --use-max-pods false
    --//
    Content-Type: text/x-shellscript; charset="us-ascii"
    KUBELET_CONFIG=/etc/kubernetes/kubelet/kubelet-config.json
    echo "$(jq ".podPidsLimit=32768" $KUBELET_CONFIG)" > $KUBELET_CONFIG
    --//
    Content-Type: text/x-shellscript; charset="us-ascii"
    systemctl stop kubelet
    systemctl daemon-reload
    systemctl start kubelet
    --//--
In this scenario, each node (for example, ip-10-20-23-199.us-west-1.compute.internal) can host up to three pods. If the deployment is scaled to add more pods, resources will be insufficient, causing the new pods to remain in a pending state.
Karpenter monitors these unschedulable pods and evaluates their resource requirements to act accordingly. A NodeClaim is created against the NodePool, and Karpenter then provisions a node on demand.
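To observe this behavior, the pending pods and the resulting NodeClaims can be inspected with standard kubectl commands (output omitted here; resource names will vary by environment):

# Pods waiting for a GPU node
kubectl get pods --field-selector=status.phase=Pending

# NodeClaims created by Karpenter from the NodePool
kubectl get nodeclaims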
Conclusion: Effective management of GPU resources in Kubernetes
With the increasing demand for GPU-based workloads in Kubernetes, managing GPU resources effectively is essential. Combining the NVIDIA device plugin, time slicing, and Karpenter provides a powerful approach to managing, optimizing, and scaling GPU resources in a Kubernetes cluster, delivering high performance with efficient resource utilization. This solution was implemented to host GPU-enabled pilot learning labs on Developer.cisco.com/learning, providing a GPU-powered learning experience.