Upgrading Kubernetes

This guide covers the process of upgrading your Kubernetes cluster to a newer version in Krutrim Kubernetes Service.

Overview

Upgrading a Kubernetes cluster in KKS is a two-phase process:

  • Control Plane Upgrade: Upgrades the Kubernetes control plane components

  • Node Group Upgrade: Upgrades worker nodes in each node group individually

Important: These are separate operations. Upgrading the cluster version only upgrades the control plane. You must upgrade each node group separately to complete the cluster upgrade.

How Kubernetes Version Upgrade Works

Phase 1: Control Plane Upgrade

When you upgrade the Kubernetes version:

Control Plane Upgrade (Automatic)

✓ API Server upgraded to new version
✓ Controller Manager upgraded
✓ Scheduler upgraded
✓ etcd compatibility verified

Worker Nodes: Still running OLD version

After control plane upgrade:

  • ✅ Control plane runs the new Kubernetes version

  • ⚠️ Worker nodes still run the old version

  • ✅ Cluster remains operational (Kubernetes supports version skew)

  • ⚠️ You must upgrade node groups to complete the process

Phase 2: Node Group Upgrade

After upgrading the control plane, you must upgrade each node group:

Rolling update ensures:

  • No downtime for properly configured workloads

  • Pods are rescheduled to healthy nodes

  • One node upgraded at a time

  • Cluster capacity maintained during upgrade

Prerequisites

Before upgrading your Kubernetes cluster:

Check Version Compatibility

  • ✅ You can only upgrade to the next minor version (e.g., 1.27 → 1.28)

  • ❌ Cannot skip versions (e.g., 1.27 → 1.29)

  • ✅ Control plane must be upgraded before node groups

  • ✅ Check available versions in Krutrim platform

Review Release Notes

  • Review Kubernetes release notes for the target version

  • Check for deprecated APIs or breaking changes

  • Verify your applications are compatible with the new version

Backup Critical Data

  • Back up any critical application data

  • Document the current cluster configuration

  • Record the current cluster state (node versions, running workloads)

Check Cluster Health
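A quick pre-upgrade health check can be done with kubectl (illustrative commands; run them against the cluster you plan to upgrade):

```shell
kubectl get nodes                                            # all nodes should be Ready
kubectl get pods -A --field-selector=status.phase!=Running   # surface unhealthy pods (completed Jobs also appear here)
kubectl get pdb -A                                           # PDBs showing 0 allowed disruptions will block node drains later
```

Resolve any NotReady nodes or crash-looping pods before starting the upgrade.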

Upgrading the Control Plane

Step 1: Initiate Control Plane Upgrade

Upgrade the cluster's Kubernetes version through the Krutrim platform.
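Before triggering the upgrade, it can help to record the current versions so you can confirm the change afterwards:

```shell
kubectl version      # note the current Server Version
kubectl get nodes    # note the current node versions (VERSION column)
```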

Step 2: Monitor Control Plane Upgrade
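Besides watching the cluster status in the Krutrim platform, you can poll the API server version from kubectl. Note that the API server may be briefly unreachable while its components restart:

```shell
# Repeat until the Server Version reports the target release
kubectl version | grep -i server
```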

Step 3: Verify Control Plane Upgrade

After control plane upgrade:

  • ✅ Control plane is now running the new version

  • ⚠️ Node groups still need to be upgraded

  • ✅ Cluster is functional with version skew
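The state above can be confirmed with kubectl:

```shell
kubectl version      # Server Version should show the new release
kubectl get nodes    # VERSION column still shows the old release — expected at this stage
```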

Upgrading Node Groups

Critical: Prepare for Node Group Upgrades

Before upgrading each node group, verify that its workloads can tolerate node replacement and pod rescheduling.

Ensure Pods Can Be Rescheduled

Common issues:

  • PDB with minAvailable: 100% will block draining

  • Not enough replicas to satisfy PDB during drain

  • Single-replica deployments without PDB

Solution example (adjust PDB):
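A sketch of a PDB that always leaves room for evictions (the names `my-app-pdb` and `my-app` are hypothetical placeholders for your own workload):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb        # hypothetical name
spec:
  maxUnavailable: 1       # always allows one pod at a time to be evicted
  selector:
    matchLabels:
      app: my-app
```

Unlike `minAvailable: 100%`, `maxUnavailable: 1` never blocks a drain outright.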

Move Critical Workloads (If Necessary)

For critical single-replica workloads or workloads that cannot tolerate disruption:
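Two illustrative approaches, assuming the workload is a Deployment (`<name>` and `<node-name>` are placeholders):

```shell
# Add replicas ahead of time so evictions never drop below required capacity
kubectl scale deployment <name> --replicas=3

# Or cordon the node first and restart the workload so new pods land elsewhere
kubectl cordon <node-name>
kubectl rollout restart deployment <name>
```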

Check Node Drain Blockers
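You can simulate a drain to surface blockers without actually evicting anything:

```shell
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --dry-run=client
```

Any errors or warnings printed here (PDB violations, bare pods, pods with local storage) are the same issues that would stall the real upgrade.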

Step 1: Upgrade Node Groups One by One

Important: Upgrade node groups one at a time to maintain cluster stability.

Recommended upgrade order:

  1. Non-critical node groups first (development, testing)

  2. General workload node groups (application nodes)

  3. Critical node groups last (production, stateful workloads)

Step 2: Upgrade Process for Each Node Group

Step 3: Monitor Node Group Upgrade

During the upgrade, the platform performs a rolling update:

Example output:
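Partway through a rolling node group upgrade, `kubectl get nodes` shows a mix of versions, with the node currently being replaced cordoned. An illustrative snapshot (node names and versions are hypothetical):

```
NAME                STATUS                     ROLES    AGE   VERSION
nodegroup-1-abc12   Ready                      <none>   3m    v1.28.4
nodegroup-1-def34   Ready,SchedulingDisabled   <none>   45d   v1.27.6
nodegroup-1-ghi56   Ready                      <none>   45d   v1.27.6
```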

Per-node process:

  1. New node with updated version is created

  2. New node joins cluster and becomes Ready

  3. Old node is cordoned (no new pods scheduled)

  4. Old node is drained (pods evicted gracefully)

  5. Old node is removed after successful drain

  6. Process repeats for next node

Step 4: Handle Stuck Node Upgrades

Symptoms:

  • Node group upgrade stuck in UPGRADING state

  • Old node stuck in "Draining" state

  • Node group upgrade not progressing

Cause: Old node cannot be drained due to:

  • PodDisruptionBudget blocking drain

  • Pods with emptyDir volumes

  • Bare pods (no controller)

  • Pods with local storage

Diagnosis:
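Illustrative commands to find what is blocking the drain (`<node-name>` is a placeholder):

```shell
kubectl get pods -A --field-selector spec.nodeName=<node-name>   # what is still running on the node
kubectl get pdb -A                                               # a PDB with ALLOWED DISRUPTIONS of 0 blocks eviction
kubectl get events -A --sort-by=.lastTimestamp | grep -i evict   # recent eviction failures
```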

Resolution options:

Option 1: Fix PodDisruptionBudget
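Assuming the blocking PDB uses `minAvailable`, lower it so at least one pod can be evicted (placeholder names):

```shell
kubectl patch pdb <pdb-name> -n <namespace> --type merge -p '{"spec":{"minAvailable":1}}'
```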

Option 2: Scale Up Application
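Adding replicas gives the PDB headroom to tolerate an eviction (placeholder names):

```shell
kubectl scale deployment <name> -n <namespace> --replicas=3
```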

Option 3: Delete Blocking Pods (Careful!)
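Deleting a pod directly (as opposed to evicting it) bypasses its PDB, so only do this if the workload can tolerate losing the pod:

```shell
kubectl delete pod <pod-name> -n <namespace>
```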

Option 4: Contact Support

Step 5: Verify Node Group Upgrade

After each node group upgrade completes:
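Verify with kubectl before moving on to the next node group:

```shell
kubectl get nodes -o wide    # nodes in the group Ready and on the new version
kubectl get pods -A -o wide  # workloads rescheduled and Running
```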

Step 6: Repeat for Remaining Node Groups

Repeat the previous steps for each remaining node group until all node groups are upgraded.

Best Practices for Smooth Upgrades

Do's

  • Always Upgrade Control Plane First

    • Control plane must be at the same or newer version than nodes

    • Node groups cannot be newer than control plane

  • Upgrade Node Groups One at a Time

    • Wait for each node group upgrade to complete

    • Verify workloads are healthy before proceeding

    • Maintain cluster stability

  • Prepare Your Workloads

    • Ensure multiple replicas for critical services

    • Configure appropriate PodDisruptionBudgets

    • Use Deployments/StatefulSets instead of bare pods

Example PDB:
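A sketch of a PDB suited to a multi-replica service (the names `web-pdb` and `web` are hypothetical; `minAvailable: 2` assumes at least 3 replicas):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb           # hypothetical name
spec:
  minAvailable: 2         # keep at least 2 pods running during drains
  selector:
    matchLabels:
      app: web
```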

  • Test Node Drainability Before Upgrade

  • Monitor During Upgrade

  • Schedule Upgrades During Maintenance Windows

    • Plan upgrades during low-traffic periods

    • Notify users of potential brief disruptions

    • Have rollback plan ready

  • Upgrade Non-Production Clusters First

    • Test upgrade process in dev/staging

    • Identify potential issues before production

    • Validate application compatibility

Don'ts

  • Don't Skip Kubernetes Versions

    • ❌ Cannot upgrade 1.27 → 1.29

    • ✅ Must upgrade 1.27 → 1.28 → 1.29

  • Don't Upgrade Multiple Node Groups Simultaneously

    • Can cause cluster instability

    • Harder to troubleshoot issues

    • May exceed resource limits

  • Don't Ignore PodDisruptionBudgets

    • PDBs can block node draining

    • Review and adjust PDBs before upgrade

    • Ensure PDBs allow at least some disruption

  • Don't Use Bare Pods in Production

    • Bare pods are deleted during drain (not rescheduled)

    • Always use Deployments, StatefulSets, or DaemonSets

    • Controllers ensure pods are recreated

  • Don't Upgrade Without Testing

    • Test upgrade in non-production first

    • Verify application compatibility

    • Check for deprecated APIs

  • Don't Forget About Version Skew

    • Control plane and nodes can differ by 1 minor version

    • Don't leave nodes on old version indefinitely

    • Complete all node group upgrades within reasonable time

  • Don't Ignore Failed Drains

    • Investigate why drain failed

    • Fix underlying issue

    • Don't force drain without understanding impact

Troubleshooting Upgrade Issues

Control Plane Upgrade Stuck

Symptoms:

  • Cluster stuck in UPGRADING state

  • Control plane upgrade not completing

Solution:

  • Check cluster status in Krutrim platform

  • Review error messages

  • Contact Krutrim support with cluster ID

Node Group Upgrade Not Starting

Symptoms:

  • Node group remains in current version

  • No new nodes being created

Possible Causes:

  • Control plane not upgraded yet

  • Invalid target version

  • Insufficient quotas

Solution:
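A few illustrative checks before contacting support:

```shell
kubectl version | grep -i server   # confirm the control plane is already on the target version
kubectl get nodes                  # confirm whether any new nodes are appearing
kubectl get events -A --sort-by=.lastTimestamp | tail -n 20   # recent cluster-level warnings
```

Also verify in the Krutrim platform that your quota allows the temporary extra node created during the rolling update.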

Pods Failing After Upgrade

Symptoms:

  • Pods in CrashLoopBackOff after upgrade

  • Services not working correctly

Possible Causes:

  • Application incompatible with new Kubernetes version

  • Deprecated APIs removed

  • Configuration issues

Solution:
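Illustrative triage commands (placeholder names):

```shell
kubectl get pods -A | grep -vE 'Running|Completed'   # list unhealthy pods
kubectl describe pod <pod-name> -n <namespace>       # check Events for scheduling or image errors
kubectl logs <pod-name> -n <namespace> --previous    # logs from the crashed container
```

If logs show requests to removed API versions, update the affected manifests to the replacement APIs and redeploy.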

Node Stuck in NotReady After Upgrade

Symptoms:

  • New node stuck in NotReady state

  • Node not joining cluster properly

Solution:
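Illustrative diagnosis (placeholder node name):

```shell
kubectl describe node <node-name>   # inspect Conditions and Events for the failure reason
kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=<node-name>   # check CNI/system pods on the node
```

If the node does not recover, it can usually be replaced by scaling the node group; contact Krutrim support if the replacement also fails to join.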

Version Skew Policy

Kubernetes supports running control plane and nodes at different versions (within limits):

Supported Version Skew:

  • kube-controller-manager and kube-scheduler: up to one minor version older than kube-apiserver

  • kubelet (worker nodes): up to three minor versions older than kube-apiserver (two in releases before 1.28)

  • kubectl: within one minor version (older or newer) of kube-apiserver

Recommendations:

  • Upgrade control plane first

  • Upgrade all node groups within 1-2 weeks

  • Don't leave node groups more than 1 version behind

  • Complete upgrades before next version release

Rollback Considerations

Important: Kubernetes upgrades are typically one-way operations.

Control Plane Rollback:

  • Not typically supported

  • May require cluster restore from backup

  • Contact Krutrim support for assistance

Node Group Rollback:

  • Can create new node group with old version

  • Migrate workloads to old version node group

  • Remove upgraded node group

Prevention is Better:

  • Test upgrades in non-production first

  • Verify application compatibility

  • Have rollback plan documented

  • Take backups before upgrading

Post-Upgrade Tasks

After completing the upgrade:

Verify Cluster Health
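Confirm the whole cluster landed on the new version and workloads are healthy:

```shell
kubectl get nodes -o wide                                    # all nodes Ready, all on the new version
kubectl get pods -A --field-selector=status.phase!=Running   # anything listed needs a look (completed Jobs show as Succeeded)
```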

Update Documentation

  • Document the upgrade date and version

  • Note any issues encountered and resolutions

  • Update cluster documentation with new version

Update Client Tools
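Check your local kubectl, which per the Kubernetes skew policy should stay within one minor version of the API server:

```shell
kubectl version --client   # compare against the cluster's Server Version
```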

Review Deprecated APIs

  • Check for deprecated API warnings

  • Update manifests to use newer APIs

  • Test applications thoroughly

Monitor Cluster

  • Monitor cluster performance

  • Watch for any unusual behavior

  • Check application metrics and logs

Additional Resources

  • Kubernetes Release Notes: https://kubernetes.io/releases/

  • Krutrim Documentation: Check platform docs for version upgrade procedures

  • Version Skew Policy: https://kubernetes.io/releases/version-skew-policy/
