Version: 5.17.0

How to Update RKE2

Structsure's RKE2 cluster implementation relies upon an AMI with the desired RKE2 version pre-installed. By updating this AMI and replacing the nodes in the cluster, the cluster can be upgraded from one version of RKE2 to another.

AMI Update

The Terragrunt rke2-cluster module offers several ways to specify the AMI used when creating the cluster. By default, the rke2-cluster module uses a search filter to find the latest AMI matching a customizable set of search criteria. If this functionality was used when the cluster was initially deployed, the configuration may not need to change in order to upgrade the AMI; in this case, simply import the updated AMI (if necessary) and run terragrunt apply to generate an updated launch template version using the latest AMI.
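For reference, a search-filter configuration might look something like the following in env.hcl. The input names ami_name_filter and ami_owner are hypothetical placeholders; check the rke2-cluster module's inputs for the exact variable names it accepts.

    locals {
      cluster_inputs = {
        # Hypothetical input names; confirm against the rke2-cluster
        # module's variables before using.
        ami_name_filter = "structsure-rke2-*"  # name pattern for candidate AMIs
        ami_owner       = "123456789012"       # account ID that owns the AMIs
      }
    }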

If the cluster was created using a specific AMI ID, then in addition to the aforementioned steps, the configuration will also need to be updated to reference the new AMI ID. The AMI ID is typically specified in your env.hcl file like so:

locals {
  cluster_inputs = {
    ami_id = "ami-0123456789abcdef0"
  }
}

Replace the AMI ID specified here with the desired AMI ID.
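If the ID of the newly imported AMI is not at hand, a query along the following lines can retrieve it. The name pattern here is an assumption; adjust it to match your AMI naming convention.

    aws ec2 describe-images \
      --owners self \
      --filters "Name=name,Values=structsure-rke2-*" \
      --query 'sort_by(Images, &CreationDate)[-1].[ImageId,Name]' \
      --output text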

Terragrunt Apply

Once the necessary configuration values have been updated, run terragrunt init and terragrunt apply through the same process (and using the same values) as when the RKE2 cluster was initially created. This creates a new launch template version and applies the change.

In the change set for the terragrunt apply, new launch template versions will be generated for both the control plane and agent nodes using the new AMI ID.
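As a sketch, assuming the cluster's Terragrunt configuration lives in a directory like the one below (the path is illustrative):

    cd environments/prod/rke2-cluster   # illustrative path; use your cluster's directory
    terragrunt init
    terragrunt plan    # optional: review the new launch template versions first
    terragrunt apply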

Node Rotation

Finally, once the terragrunt apply has completed, safely rotate the control plane and agent nodes for the cluster. Generally, control plane nodes should be rotated first, then agent nodes.

In a production RKE2-based Structsure deployment, a cluster will usually have at least three control plane nodes to provide high availability. It is important to rotate these control plane nodes one at a time: each control plane node also runs a member of the embedded etcd cluster, and taking more than one down at once risks losing etcd quorum and destabilizing the cluster. The following node rotation procedure should be performed on each control plane node in turn:

  • Drain as many pods as possible away from the node:

    kubectl drain --ignore-daemonsets --delete-emptydir-data NODE_NAME
  • Terminate the node. In the AWS Console, this can be done by going to EC2 -> Instance State -> Terminate Instance.

  • Wait for the node to drop out of the node list. Use the command kubectl get nodes to monitor this.

  • The autoscaling group for the cluster should automatically create a new node to replace the terminated node. Wait for the new node to appear and report Ready status in kubectl get nodes.

  • Use the command kubectl get pods -A --field-selector spec.nodeName=NODE to verify that pods on the replacement node have started successfully.

Once the first node has been replaced, wait a few minutes for the cluster to stabilize, then proceed to replace the next control plane node.
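For convenience, the per-node steps above can be condensed into the following command sequence. This is a sketch: NODE_NAME is the Kubernetes node name, INSTANCE_ID is the corresponding EC2 instance ID, and terminating the instance via the AWS CLI is an alternative to the console steps described above.

    # Drain workloads off the node (DaemonSet pods remain by design)
    kubectl drain --ignore-daemonsets --delete-emptydir-data NODE_NAME

    # Terminate the backing EC2 instance; the autoscaling group will replace it
    aws ec2 terminate-instances --instance-ids INSTANCE_ID

    # Watch the old node drop out and the replacement register and report Ready
    kubectl get nodes --watch

    # Confirm that pods on the replacement node have started successfully
    kubectl get pods -A --field-selector spec.nodeName=NEW_NODE_NAME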

Once all control plane nodes have been rotated, the agent nodes may be rotated using the same procedure. Draining pods from the agent nodes will generally take longer than from control plane nodes. Depending on resource availability in the cluster, it may be possible to rotate more than one agent node at a time, although rotating one node at a time is the safest course of action. If too many agent nodes are rotated at once, pods may be scheduled on the control plane nodes, negatively impacting cluster stability.
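To spot-check for workload pods that have drifted onto the control plane during agent rotation, the control plane nodes can be identified by the node-role.kubernetes.io/control-plane label that RKE2 applies, then inspected individually:

    # List the control plane nodes
    kubectl get nodes -l node-role.kubernetes.io/control-plane=true

    # Check which pods are running on a given control plane node
    kubectl get pods -A --field-selector spec.nodeName=CONTROL_PLANE_NODE_NAME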