Co-Authored by Jonathan Kinred and Paul Myjavec
In this article, we’ll set up KubeVirt with Cluster Autoscaler on EKS, using bare metal nodes to host the KubeVirt VMs.
Required Base Knowledge
This article explains how to make several software systems work together, but introducing each one in detail is outside its scope. Thus, you must already:
- Know how to administer a Kubernetes cluster;
- Be familiar with AWS, specifically IAM and EKS; and
- Have some experience with KubeVirt.
Companion Code
All the code used in this article may also be found at github.com/relaxdiego/kubevirt-cas-baremetal.
Set Up the Cluster
Shared environment variables
First let’s set some environment variables:
# The name of the EKS cluster we're going to create
export RD_CLUSTER_NAME=my-cluster
# The region where we will create the cluster
export RD_REGION=us-west-2
# Kubernetes version to use
export RD_K8S_VERSION=1.27
# The name of the keypair that we're going to inject into the nodes. You
# must create this ahead of time in the correct region.
export RD_EC2_KEYPAIR_NAME=eks-my-cluster
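Every later command reads these variables, so it can help to fail early when one is missing. The following is a minimal sketch; check_rd_env is a hypothetical helper name and is not part of the companion code.

```shell
# Hypothetical helper (not from the companion repo): report any of the
# article's RD_* variables that are unset or empty.
check_rd_env() {
  missing=0
  for name in RD_CLUSTER_NAME RD_REGION RD_K8S_VERSION RD_EC2_KEYPAIR_NAME; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "missing: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}
```

Running check_rd_env right after the exports returns a non-zero status and names each missing variable on stderr.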
Prepare the cluster.yaml file
Using eksctl, prepare an EKS cluster config:
eksctl create cluster \
  --dry-run \
  --name=${RD_CLUSTER_NAME} \
  --nodegroup-name ng-infra \
  --node-type m5.xlarge \
  --nodes 2 \
  --nodes-min 2 \
  --nodes-max 2 \
  --node-labels workload=infra \
  --region=${RD_REGION} \
  --ssh-access \
  --ssh-public-key ${RD_EC2_KEYPAIR_NAME} \
  --version ${RD_K8S_VERSION} \
  --vpc-nat-mode HighlyAvailable \
  --with-oidc \
  > cluster.yaml
--dry-run means the command will not actually create the cluster but will instead output a config to stdout, which we then write to cluster.yaml.
Open the file and look at what it has produced.
For more info on the schema used by cluster.yaml, see the Config file schema page from eksctl.io.
This cluster will start out with a node group that we will use to host our “infra” services. This is why we are using the cheaper m5.xlarge rather than a bare metal instance type. However, we also need to ensure that none of our VMs will ever be scheduled on these nodes. Thus we need to taint them. In the generated cluster.yaml file, append the following taint to the only node group in the managedNodeGroups list:
managedNodeGroups:
  - amiFamily: AmazonLinux2
    ...
    taints:
      - key: CriticalAddonsOnly
        effect: NoSchedule
Create the cluster
We can now create the cluster:
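A minimal sketch of this step, assuming cluster.yaml is the file generated by the dry run above. create_cluster is a hypothetical wrapper name; the eksctl call inside it is the actual step.

```shell
# Sketch: create the cluster from the config file produced by the dry run.
# create_cluster is a hypothetical wrapper; it only adds a file check.
create_cluster() {
  config_file="${1:-cluster.yaml}"
  if [ ! -f "${config_file}" ]; then
    echo "config file not found: ${config_file}" >&2
    return 1
  fi
  eksctl create cluster --config-file "${config_file}"
}
```

Calling create_cluster cluster.yaml then hands the edited config, including the taint, to eksctl.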
Example output:
2023-08-20 07:59:14 [ℹ] eksctl version ...
2023-08-20 07:59:14 [ℹ] using region us-west-2 ...
2023-08-20 07:59:14 [ℹ] subnets for us-west-2a ...
2023-08-20 07:59:14 [ℹ] subnets for us-west-2b ...
2023-08-20 07:59:14 [ℹ] subnets for us-west-2c ...
...
2023-08-20 08:14:06 [ℹ] kubectl command should work with ...
2023-08-20 08:14:06 [✔] EKS cluster "my-cluster" in "us-west-2" is ready
Once the command is done, you should be able to query the Kubernetes API. For example, list the nodes with kubectl get nodes:
Example output:
NAME STATUS ROLES AGE VERSION
ip-XXX.compute.internal Ready <none> 32m v1.27.4-eks-2d98532
ip-YYY.compute.internal Ready <none> 32m v1.27.4-eks-2d98532
Create the Node Groups
As per this section of the Cluster Autoscaler docs:
If you’re using Persistent Volumes, your deployment needs to run in the same AZ as where the EBS volume is, otherwise the pod scheduling could fail if it is scheduled in a different AZ and cannot find the EBS volume. To overcome this, either use a single AZ ASG for this use case, or an ASG-per-AZ while enabling --balance-similar-node-groups.
Based on the above, we will create a node group for each of the availability zones that was declared in cluster.yaml so that the Cluster Autoscaler will always bring up a node in the AZ where a VM’s EBS-backed PV is located.
To do that, we will first prepare a template that we can then feed to envsubst. Save the following in node-group.yaml.template:
---
# See: Config File Schema <https://eksctl.io/usage/schema/>
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ${RD_CLUSTER_NAME}
  region: ${RD_REGION}
managedNodeGroups:
  - name: ng-${EKS_AZ}-c5-metal
    amiFamily: AmazonLinux2
    instanceType: c5.metal
    availabilityZones:
      - ${EKS_AZ}
    desiredCapacity: 1
    maxSize: 3
    minSize: 0
    labels:
      alpha.eksctl.io/cluster-name: ${RD_CLUSTER_NAME}
      alpha.eksctl.io/nodegroup-name: ng-${EKS_AZ}-c5-metal
      workload: vm
    privateNetworking: false
    ssh:
      allow: true
      publicKeyPath: ${RD_EC2_KEYPAIR_NAME}
    volumeSize: 500
    volumeIOPS: 10000
    volumeThroughput: 750
    volumeType: gp3
    tags:
      alpha.eksctl.io/nodegroup-name: ng-${EKS_AZ}-c5-metal
      alpha.eksctl.io/nodegroup-type: managed
      k8s.io/cluster-autoscaler/${RD_CLUSTER_NAME}: owned
      k8s.io/cluster-autoscaler/enabled: "true"
      # The following tags help CAS determine that this node group is able
      # to satisfy the label and resource requirements of the KubeVirt VMs.
      # See: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#auto-discovery-setup
      k8s.io/cluster-autoscaler/node-template/resources/devices.kubevirt.io/kvm: "1"
      k8s.io/cluster-autoscaler/node-template/resources/devices.kubevirt.io/tun: "1"
      k8s.io/cluster-autoscaler/node-template/resources/devices.kubevirt.io/vhost-net: "1"
      k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage: 50M
      k8s.io/cluster-autoscaler/node-template/label/kubevirt.io/schedulable: "true"
The last few tags bear additional emphasis. They are required because when a virtual machine is created, it will have the following requirements:
requests:
  devices.kubevirt.io/kvm: 1
  devices.kubevirt.io/tun: 1
  devices.kubevirt.io/vhost-net: 1
  ephemeral-storage: 50M
nodeSelectors: kubevirt.io/schedulable=true
However, at least when scaling from zero for the first time, CAS will have no knowledge of this information unless the correct AWS tags are added to the node group. This is why we have the following added to the managed node group’s tags:
k8s.io/cluster-autoscaler/node-template/resources/devices.kubevirt.io/kvm: "1"
k8s.io/cluster-autoscaler/node-template/resources/devices.kubevirt.io/tun: "1"
k8s.io/cluster-autoscaler/node-template/resources/devices.kubevirt.io/vhost-net: "1"
k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage: 50M
k8s.io/cluster-autoscaler/node-template/label/kubevirt.io/schedulable: "true"
For more information on these tags, see Auto-Discovery Setup.
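To make the mapping from a VM requirement to a tag explicit, here is a small sketch that prints the tag for a given resource or node label. The two function names are hypothetical; the tag prefixes are the ones already used in the node group template above.

```shell
# Sketch: print the node group tag that advertises a resource or a label
# to Cluster Autoscaler. Function names are hypothetical helpers.
cas_resource_tag() {
  printf 'k8s.io/cluster-autoscaler/node-template/resources/%s: "%s"\n' "$1" "$2"
}

cas_label_tag() {
  printf 'k8s.io/cluster-autoscaler/node-template/label/%s: "%s"\n' "$1" "$2"
}

# One tag per entry in the VM's requests, plus one for its node selector.
cas_resource_tag devices.kubevirt.io/kvm 1
cas_resource_tag devices.kubevirt.io/tun 1
cas_resource_tag devices.kubevirt.io/vhost-net 1
cas_resource_tag ephemeral-storage 50M
cas_label_tag kubevirt.io/schedulable true
```

Each resource request becomes a node-template/resources tag and each node selector becomes a node-template/label tag, which is what lets CAS reason about a node group that currently has zero nodes.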
Create the VM Node Groups
We can now create the node groups, one per availability zone:
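A minimal sketch of this step, assuming node-group.yaml.template is in the current directory and the RD_* variables are exported. create_vm_node_groups is a hypothetical helper; pass it the zones listed under availabilityZones in cluster.yaml. Writing each rendered file to disk is a choice made here so that the result can be inspected before eksctl runs.

```shell
# Sketch: render node-group.yaml.template once per availability zone and
# create the resulting node group. create_vm_node_groups is hypothetical.
create_vm_node_groups() {
  for az in "$@"; do
    rendered="node-group-${az}.yaml"
    # Substitute only the variables the template uses; EKS_AZ is set just
    # for this one envsubst invocation.
    EKS_AZ="${az}" envsubst \
      '${RD_CLUSTER_NAME} ${RD_REGION} ${RD_EC2_KEYPAIR_NAME} ${EKS_AZ}' \
      < node-group.yaml.template > "${rendered}" || return 1
    eksctl create nodegroup --config-file "${rendered}" || return 1
  done
}
```

For example, create_vm_node_groups us-west-2a us-west-2b us-west-2c would create three c5.metal node groups, assuming those are the zones in cluster.yaml.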
Deploy KubeVirt
The following was adapted from KubeVirt quickstart with cloud providers.
Deploy the KubeVirt operator:
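A minimal sketch of this step, assuming you install the operator manifest published with a KubeVirt GitHub release. deploy_kubevirt_operator is a hypothetical helper, and the release tag you pass (for example v1.0.0) is an assumption; choose the release you actually want.

```shell
# Sketch: install the KubeVirt operator for a given release tag.
# deploy_kubevirt_operator is a hypothetical helper name.
deploy_kubevirt_operator() {
  version="${1:-}"
  if [ -z "${version}" ]; then
    echo "usage: deploy_kubevirt_operator <kubevirt-release-tag>" >&2
    return 1
  fi
  kubectl create -f \
    "https://github.com/kubevirt/kubevirt/releases/download/${version}/kubevirt-operator.yaml"
}
```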
So that the operator will know how to deploy KubeVirt, let’s add the KubeVirt resource:
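A minimal sketch of such a resource, written to a file so it can be reviewed before it is applied. The file name kubevirt-cr.yaml is arbitrary, and the manifest is reduced to the placement settings discussed here; the companion repo may set additional fields.

```shell
# Sketch: write a KubeVirt custom resource whose infra components tolerate
# the CriticalAddonsOnly taint on the infra nodes.
cat > kubevirt-cr.yaml <<'EOF'
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  infra:
    nodePlacement:
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
EOF

# Apply it once the operator is running:
#   kubectl apply -f kubevirt-cr.yaml
```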
Notice how we are specifically configuring KubeVirt itself to tolerate the CriticalAddonsOnly taint. This is so that the KubeVirt services themselves can be scheduled on the infra nodes instead of the bare metal nodes, which we want to scale down to zero when there are no VMs.
Wait until KubeVirt is in a Deployed state:
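A minimal sketch of the wait, assuming the KubeVirt resource is named kubevirt in the kubevirt namespace. wait_for_kubevirt is a hypothetical helper, and the 5-second interval and default of 60 attempts are arbitrary choices.

```shell
# Sketch: poll the KubeVirt resource until its phase is Deployed.
# wait_for_kubevirt is hypothetical; the first argument caps the attempts.
wait_for_kubevirt() {
  attempts="${1:-60}"
  while [ "${attempts}" -gt 0 ]; do
    phase="$(kubectl get kubevirt.kubevirt.io/kubevirt -n kubevirt \
      -o jsonpath='{.status.phase}' 2>/dev/null)"
    if [ "${phase}" = "Deployed" ]; then
      echo "${phase}"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 5
  done
  echo "KubeVirt did not reach the Deployed phase in time" >&2
  return 1
}
```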
Example output:
Deployed
Double check that all KubeVirt components are healthy:
Example output:
NAME READY STATUS RESTARTS AGE
pod/virt-api-674467958c-5chhj 1/1 Running 0 98d
pod/virt-api-674467958c-wzcmk 1/1 Running 0 5d
pod/virt-controller-6768977b-49wwb 1/1 Running 0 98d
pod/virt-controller-6768977b-6pfcm 1/1 Running 0 5d
pod/virt-handler-4hztq 1/1 Running 0 5d
pod/virt-handler-x98x5 1/1 Running 0 98d
pod/virt-operator-85f65df79b-lg8xb 1/1 Running 0 5d
pod/virt-operator-85f65df79b-rp8p5 1/1 Running 0 98d
Deploy a VM to test
The following is copied from kubevirt.io.
First create a secret from your public key:
Next, create the VM:
Check that the test VM is running, for example with kubectl get vm:
Example output:
NAME AGE STATUS READY
testvm 30s Running True
Delete the VM:
Set Up Cluster Autoscaler
Prepare the permissions for Cluster Autoscaler
So that CAS can set the desired capacity of each node group dynamically, we must grant it limited access to certain AWS resources. The first step to this is to define the IAM policy.
This section is based on the “Create an IAM policy and role” section of the AWS Autoscaling documentation.
Create the cluster-specific policy document
Prepare the policy document by rendering the following file.
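A sketch of what such a rendered document can look like. It is an assumption based on the commonly documented Cluster Autoscaler policy, with the write actions limited to resources tagged as owned by this cluster; compare it with the file in the companion repo before using it. render_cas_policy is a hypothetical helper.

```shell
# Sketch: print a cluster-specific IAM policy document for CAS.
# Assumption: statements follow the commonly documented CAS policy.
render_cas_policy() {
  cluster_name="${1:-}"
  if [ -z "${cluster_name}" ]; then
    echo "usage: render_cas_policy <cluster-name>" >&2
    return 1
  fi
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/k8s.io/cluster-autoscaler/${cluster_name}": "owned"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeScalingActivities",
        "autoscaling:DescribeTags",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeLaunchTemplateVersions",
        "eks:DescribeNodegroup"
      ],
      "Resource": "*"
    }
  ]
}
EOF
}
```

For example, render_cas_policy "${RD_CLUSTER_NAME}" > policy.json writes the document to a file; the condition key matches the k8s.io/cluster-autoscaler/<cluster name> tag set on the VM node groups.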
The above should be enough for CAS to do its job. Next, create the policy:
IMPORTANT: Take note of the returned policy ARN. You will need that below.
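One way to avoid copying the ARN by hand is to have the AWS CLI print only that field. This is a sketch; create_cas_policy, the policy name, and the file path are all placeholders you choose.

```shell
# Sketch: create the IAM policy and print only its ARN.
# create_cas_policy is a hypothetical helper: <policy-name> <policy-file>.
create_cas_policy() {
  policy_name="$1"
  policy_file="$2"
  aws iam create-policy \
    --policy-name "${policy_name}" \
    --policy-document "file://${policy_file}" \
    --query 'Policy.Arn' \
    --output text
}
```

Capturing the output in a variable of your choosing, for example with policy_arn="$(create_cas_policy my-policy policy.json)", keeps the ARN available for the next step.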
Create the IAM role and k8s service account pair
The Cluster Autoscaler needs a service account in the k8s cluster that’s associated with an IAM role that consumes the policy document we created in the previous section. This is normally a two-step process but can be created in a single command using eksctl:
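A minimal sketch of that command, assuming the service account is named cluster-autoscaler and lives in kube-system. create_cas_service_account is a hypothetical wrapper that takes the cluster name and the policy ARN noted earlier.

```shell
# Sketch: create the IAM role and the annotated service account in one go.
# create_cas_service_account is hypothetical: <cluster-name> <policy-arn>.
create_cas_service_account() {
  cluster_name="$1"
  policy_arn="$2"
  eksctl create iamserviceaccount \
    --cluster "${cluster_name}" \
    --namespace kube-system \
    --name cluster-autoscaler \
    --attach-policy-arn "${policy_arn}" \
    --override-existing-serviceaccounts \
    --approve
}
```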
For more information on what eksctl is doing under the covers, see How It Works from the eksctl documentation for IAM Roles for Service Accounts.
Double check that the cluster-autoscaler service account has been correctly annotated with the IAM role that was created by eksctl in the same step:
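A minimal sketch of the check, assuming the service account is cluster-autoscaler in kube-system. check_cas_role_annotation is a hypothetical helper that prints the annotation and fails when it does not look like an IAM role ARN.

```shell
# Sketch: read the eks.amazonaws.com/role-arn annotation and sanity-check it.
# check_cas_role_annotation is a hypothetical helper name.
check_cas_role_annotation() {
  role_arn="$(kubectl get serviceaccount cluster-autoscaler -n kube-system \
    -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}')"
  case "${role_arn}" in
    arn:*:iam::*:role/*)
      echo "${role_arn}"
      ;;
    *)
      echo "service account is not annotated with an IAM role" >&2
      return 1
      ;;
  esac
}
```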
Example output:
arn:aws:iam::365499461711:role/eksctl-my-cluster-addon-iamserviceaccount-...
Check from the AWS Console if the above role contains the policy that we created earlier.
Deploy Cluster Autoscaler
First, find the most recent Cluster Autoscaler version that has the same MAJOR and MINOR version as the Kubernetes cluster you’re deploying to.
Get the kube cluster’s version:
Example output:
v1.27.4-eks-2d98532
Choose the appropriate version for CAS. You can get the latest Cluster Autoscaler versions from its GitHub Releases Page.
Example:
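A minimal sketch of the matching rule: reduce the server version to its MAJOR.MINOR part, then pick the newest Cluster Autoscaler release that starts with the same prefix. k8s_major_minor is a hypothetical helper.

```shell
# Sketch: reduce a server version such as v1.27.4-eks-2d98532 to the
# MAJOR.MINOR part that the CAS release should share.
k8s_major_minor() {
  version="${1#v}"        # drop the leading "v"
  major="${version%%.*}"  # text before the first dot
  rest="${version#*.}"    # text after the first dot
  minor="${rest%%.*}"     # text before the next dot
  echo "${major}.${minor}"
}

k8s_major_minor v1.27.4-eks-2d98532
```

For the version shown in the example output, this prints 1.27, so a Cluster Autoscaler 1.27.x release is the one to look for.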
Next, deploy Cluster Autoscaler using the deployment template that I prepared in the companion repo.
Check the cluster autoscaler status:
Example output:
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/cluster-autoscaler 1/1 1 1 4m1s
NAME READY STATUS RESTARTS AGE
pod/cluster-autoscaler-6c58bd6d89-v8wbn 1/1 Running 0 60s
Tail the cluster-autoscaler pod’s logs (for example with kubectl logs -f deployment/cluster-autoscaler -n kube-system) to see what’s happening:
Below are example log entries from Cluster Autoscaler terminating an unneeded node:
node ip-XXXX.YYYY.compute.internal may be removed
...
ip-XXXX.YYYY.compute.internal was unneeded for 1m3.743475455s
Once the timeout has been reached (the --scale-down-unneeded-time flag, default: 10 minutes), CAS will scale down the group:
Scale-down: removing empty node ip-XXXX.YYYY.compute.internal
Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", ...
Successfully added ToBeDeletedTaint on node ip-XXXX.YYYY.compute.internal
Terminating EC2 instance: i-ZZZZ
DeleteInstances was called: ...
For more information on how Cluster Autoscaler scales down a node group, see How does scale-down work? from the project’s FAQ.
When you try to get the list of nodes, you should see the bare metal nodes tainted such that they are no longer schedulable:
NAME STATUS ROLES AGE VERSION
ip-XXXX Ready,SchedulingDisabled <none> 70m v1.27.3-eks-a5565ad
ip-XXXX Ready,SchedulingDisabled <none> 70m v1.27.3-eks-a5565ad
ip-XXXX Ready,SchedulingDisabled <none> 70m v1.27.3-eks-a5565ad
ip-XXXX Ready <none> 112m v1.27.3-eks-a5565ad
ip-XXXX Ready <none> 112m v1.27.3-eks-a5565ad
In a few more minutes, the nodes will be deleted.
To try the scale-up, just deploy a VM. Cluster Autoscaler should then log entries similar to the following:
Expanding Node Group eks-ng-eacf8ebb ...
Best option to resize: eks-ng-eacf8ebb
Estimated 1 nodes needed in eks-ng-eacf8ebb
Final scale-up plan: [{eks-ng-eacf8ebb 0->1 (max: 3)}]
Scale-up: setting group eks-ng-eacf8ebb size to 1
Setting asg eks-ng-eacf8ebb size to 1
Done
At this point you should have a working, auto-scaling EKS cluster that can host VMs on bare metal nodes. If you have any questions, ask them here.