Managing Node Groups

Node groups are collections of worker nodes that run your containerized applications. This guide covers creating, scaling, and managing node groups in your Kubernetes cluster.

What is a Node Group?

A node group is a set of Kubernetes worker nodes with identical configuration:

  • Instance Type: CPU, memory, and other hardware specifications

  • Disk Size: Storage capacity for each node

  • Scaling: Minimum, maximum, and desired node count

  • Labels: Key-value pairs for workload scheduling

  • Taints: Restrictions on which pods can be scheduled

  • Network: Subnet configuration

Why Node Groups Matter

Node groups allow you to:

  • Separate Workloads: System services vs. applications

  • Optimize Resources: Different instance types for different needs

  • Control Costs: Scale and size appropriately

  • Isolate Workloads: Use taints and labels for pod placement

Critical: Untainted Node Groups

At least one node group in the cluster must have no taints so that essential system components (such as Cilium and CoreDNS) can be scheduled. Create this untainted node group first and confirm its nodes are Ready before adding tainted node groups.

Recommended First Node Group:
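
This guide does not prescribe exact values, but as a rough sketch (field names and numbers below are illustrative, not the exact console or API schema), a sensible first node group is an untainted, general-purpose group:

```yaml
# Illustrative shape only - adapt names and values to your environment
name: general-nodes        # descriptive name, per the naming rules below
flavor: 2vcpu-4gb          # at least 2 vCPU / 4 GB for general workloads
diskSize: 80               # GB (the default)
scaling:
  minSize: 2
  maxSize: 5
  desiredSize: 2
labels:
  node-role: general       # optional, used for pod scheduling
taints: []                 # none - system components must be able to run here
nodeRepair: true           # recommended for production
```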

Creating a Node Group

Node Group Configuration

When creating a node group, you'll need to configure the following settings.

Basic Settings

Node Group Name

Choose a descriptive name.

Naming Rules:

  • Maximum 100 characters

  • Lowercase alphanumeric characters, hyphens (-), and dots (.)

  • Must start and end with alphanumeric characters

  • No consecutive dots (..) or hyphens (--)

Examples:

  • ✅ Good: general-nodes, app-workers, gpu-nodes-prod

  • ❌ Bad: ng1, nodes, test

Instance Type (Flavor)

Select the compute resources for your nodes.

Available Instance Types:

Choosing Instance Type:

Match the instance type to the workload it will run. For general workloads, use at least 2vcpu-4gb, since cluster components themselves need CPU and memory; avoid over-provisioning and adjust after monitoring actual usage.

Disk Configuration

Configure root disk size for each node.

Disk Size Examples:

  • Minimum: 50 GB

  • Default: 80 GB (if not specified)

  • Recommended:

    • General nodes: 80-100 GB

    • Application nodes: 100-200 GB

    • Image-heavy workloads: 200+ GB

What uses disk space?

  • Container images pulled to the node

  • Container and system logs

  • Ephemeral volumes (emptyDir) and writable container layers

  • Kubernetes components and the operating system

Planning Disk Size:
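
Size the root disk for the container images, logs, and ephemeral data your workloads actually use, and leave headroom. Once nodes are running, a quick way to sanity-check sizing with standard kubectl (node names are your own):

```bash
# Allocatable ephemeral storage (disk Kubernetes can hand out to pods) per node
kubectl get nodes -o custom-columns='NODE:.metadata.name,DISK:.status.allocatable.ephemeral-storage'

# Node conditions - look for DiskPressure=True
kubectl describe node <node-name> | grep -A 8 'Conditions:'
```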

Scaling Configuration

Define how your node group scales.

Scaling Configuration

Min Size: Minimum number of nodes (always running)

  • Cannot be less than 0

  • Should be at least 1 for production

  • Can be 0 for dev/test environments (but node group creation needs at least 1)

Max Size: Maximum number of nodes (limit for scaling)

  • Must be >= Min Size

  • Set based on maximum expected load

  • Consider account quota limits

Desired Size: Target number of nodes (current goal)

  • Must be between Min Size and Max Size

  • Can be adjusted later

  • Cluster autoscaler can modify this

Scaling Examples:

  • Production application nodes: for example, Min 3 / Max 10 / Desired 3 (a steady baseline with room to scale under load)

  • Dev/test nodes: for example, Min 1 / Max 3 / Desired 1 (a small footprint with some room to grow)

Validation Rules:

  • 0 ≤ Min Size ≤ Desired Size ≤ Max Size

  • Max Size must also fit within your account quota

Network Configuration

Subnet Selection

Subnet KRN (Required):

  • Select the subnet where node network interfaces will be created

  • Must be in the same VPC as the cluster

  • Ensure sufficient IP addresses available

IP Address Planning:

Each node consumes at least one IP address from the selected subnet (plus more depending on how the CNI assigns pod addresses). Plan for the node group's Max Size and leave headroom for rolling updates, which temporarily add an extra node.

Labels Configuration (Optional)

Labels are key-value pairs used for pod scheduling.

Common Label Patterns:

  • Purpose: node-role: general, node-role: gpu

  • Environment: environment: production, environment: staging

  • Team or owner: team: platform

Keys and values are up to you; keep the schema consistent and documented.

Usage Example:
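
As an example (the label key and value are whatever you configured on the node group), a pod can be steered onto labeled nodes with a nodeSelector:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-pod
spec:
  nodeSelector:
    node-role: app-workers   # must match a label on the target node group's nodes
  containers:
    - name: app
      image: nginx:1.25      # example image
```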

Taints Configuration (Optional)

Taints restrict which pods can be scheduled on nodes. Pods must have matching tolerations.

Taint Structure:
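
A taint consists of a key, an optional value, and an effect, commonly written as key=value:effect. For example (the key and value here are illustrative):

```yaml
# Taint as specified on a node group
key: dedicated
value: gpu
effect: NoSchedule
# The same taint applied to a single node with kubectl:
#   kubectl taint nodes <node-name> dedicated=gpu:NoSchedule
```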

Taint Effects:

  • NoSchedule:

    • Hard requirement

    • Pods without toleration will NOT be scheduled

    • Existing pods not affected

  • PreferNoSchedule:

    • Soft requirement

    • System tries to avoid scheduling

    • Will schedule if no other option

  • NoExecute:

    • Evicts running pods without toleration

    • Prevents new pods from being scheduled

    • Use with caution!

Common Taint Scenarios:

Scenario: Dedicated GPU Nodes
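
A sketch of this pattern, with illustrative key, value, and label names: taint the GPU node group so only tolerating pods are scheduled there, then give GPU workloads the matching toleration plus a nodeSelector:

```yaml
# Taint on the GPU node group: dedicated=gpu:NoSchedule

# Pod spec fragment for a GPU workload
spec:
  tolerations:
    - key: dedicated
      operator: Equal
      value: gpu
      effect: NoSchedule
  nodeSelector:
    node-role: gpu           # example label set on the GPU node group
```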

Scenario: High-Priority Production Nodes
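
Similarly (values illustrative), taint the production node group and add the toleration only to production workloads so other pods stay off those nodes:

```yaml
# Taint on the production node group: environment=production:NoSchedule

# Pod spec fragment for a production workload
spec:
  tolerations:
    - key: environment
      operator: Equal
      value: production
      effect: NoSchedule
```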

Scenario: General Purpose Nodes (NO TAINTS!)

Leave the taint list empty. System components and any pod without special tolerations can schedule here; every cluster needs at least one such node group.

Remote Access Configuration (Optional)

Enable SSH access to nodes for debugging:

SSH Key Selection:

  • Choose from your existing SSH keys

  • Required for SSH access to nodes

  • Recommended for troubleshooting

Security Groups:

  • Select security groups for SSH access

  • Restrict SSH access to specific IPs/networks

  • Follow security best practices

When to Enable:

  • ✅ Development/testing environments

  • ✅ Troubleshooting scenarios

  • ⚠️ Production (only if necessary with strict security)

Node Repair Configuration (Optional)

Automatic node health monitoring and repair:

Node Repair Configuration:

What it does:

  • Monitors node health

  • Detects failed or unhealthy nodes

  • Automatically replaces unhealthy nodes

  • Helps maintain cluster availability

When to enable:

  • ✅ Production clusters (recommended)

  • ✅ Critical workloads

  • ✅ Long-running clusters

When to disable:

  • Development environments

  • Short-lived clusters

  • Manual node management preference

Creating the Node Group

After configuring all settings, review them before submitting.

Verification Checklist:

  • Name and instance type

  • Scaling configuration

  • Labels and taints

  • Network settings

Submit the node group configuration to begin creation.

Node Group Lifecycle

Creation Process

CREATING → SCALINGUP → RUNNING

Timeline:

  • Initial setup: 1–2 minutes

  • Node provisioning: 3–5 minutes per node

  • Kubernetes join: 1–2 minutes per node

  • Total: 5–10 minutes for 2–3 nodes

What's happening:

  1. OpenStack instance creation: instances are created for the new nodes.

  2. Network configuration: network interfaces and security settings are applied.

  3. Kubernetes components installation: Kubernetes components are installed on the nodes.

  4. Cluster join: nodes join the cluster and become Ready once initialized.

Monitoring Creation:
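
You can watch nodes register and become Ready with standard kubectl:

```bash
# Watch nodes appear and transition from NotReady to Ready
kubectl get nodes -w

# More detail: internal IPs, OS image, kubelet version
kubectl get nodes -o wide
```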

Node Group States

  • CREATING: Initial setup in progress

  • SCALINGUP: Adding nodes

  • RUNNING: Operational and healthy

  • SCALINGDOWN: Removing nodes

  • UPDATING: Configuration or version update

  • FAILED: Operation failed (check error message)

  • PENDING_DELETE: Deletion initiated

  • DELETING: Removal in progress

Scaling Node Groups

Manual Scaling

Update desired size to scale your node group:

Configuration Update:

  • Modify the Desired Size parameter

  • Submit the configuration change

Scaling Up (Desired > Current): new nodes are provisioned, join the cluster, and become Ready after a few minutes.

Scaling Down (Desired < Current): excess nodes are cordoned, drained, and then removed.

Important: Nodes are drained before removal. Ensure:

  • PodDisruptionBudgets allow pods to be evicted

  • Remaining nodes have enough capacity for the rescheduled pods

  • Workloads are not bare pods (pods without a controller), which are not rescheduled automatically

Automatic Scaling

If Cluster Autoscaler add-on is installed:

How it works:

  1. Scale Up: Pods cannot be scheduled → Add nodes

  2. Scale Down: Nodes underutilized → Remove nodes

Configuration:

  • Autoscaler respects min/max size

  • Can modify desired size automatically

  • Checks for unschedulable pods (see the example after this list)

  • Monitors node utilization
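
To see what the autoscaler is reacting to, look for pods stuck in Pending and the scheduler events explaining why (standard kubectl; substitute your own pod and namespace names):

```bash
# Pods the scheduler could not place
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Why a specific pod is unschedulable (look for "Insufficient cpu" / "Insufficient memory" events)
kubectl describe pod <pod-name> -n <namespace>
```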

Scaling Behavior:

Scale-up is triggered when pods are Pending because no node has room for them; scale-down removes nodes that stay underutilized and whose pods can run elsewhere. The autoscaler never goes below Min Size or above Max Size.

Updating Node Groups

Update Configuration

Update scaling or repair settings for your node group.

Updatable Settings:

  • ✅ Min size

  • ✅ Max size

  • ✅ Desired size

  • ✅ Node repair configuration

What cannot be updated:

  • ❌ Instance type (create new node group)

  • ❌ Disk size (create new node group)

  • ❌ Labels (recreate nodes)

  • ❌ Taints (recreate nodes)

  • ❌ Subnet (create new node group)

Update Kubernetes Version

Keep node group version aligned with cluster.

Rolling update process:

  1. New node creation: a new node is created with the new Kubernetes version.

  2. New node joins: the new node joins the cluster and becomes Ready.

  3. Cordon old node: the old node is cordoned (no new pods scheduled).

  4. Drain old node: the old node is drained and its pods are rescheduled.

  5. Delete old node: the old node is deleted.

  6. Repeat: the process repeats for the next node until all nodes are updated.

Timeline: ~5–10 minutes per node
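
The cordon and drain steps above correspond to standard kubectl operations; if you ever need to reproduce them manually, a typical sketch is:

```bash
# Stop new pods from being scheduled on the node
kubectl cordon <node-name>

# Evict pods so they are rescheduled on other nodes
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Re-enable scheduling if you keep the node
kubectl uncordon <node-name>
```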

Important Considerations:

  • The subnet needs at least one free IP address for the temporary extra node

  • PodDisruptionBudgets must allow pods to be drained, or the update will stall

  • Expect brief disruption as pods are rescheduled between nodes

Node Group Best Practices

✅ Do's

  • Create Untainted Nodes First

    • Start with nodes that have NO taints

    • Ensure essential components can schedule

    • Wait for nodes to be Ready before creating tainted nodes

  • Separate Workloads

    • System components: Dedicated untainted nodes

    • Applications: Separate node groups by purpose

    • Specialized: GPU, high-memory, etc.

  • Plan Capacity

    • Set appropriate min/max for each node group

    • Consider peak load in max size

    • Allow headroom for updates

  • Use Meaningful Labels

    • Label nodes by purpose, environment, team

    • Document label schema

    • Use labels for pod scheduling

  • Configure Node Repair

    • Enable for production node groups

    • Improves reliability

    • Reduces manual intervention

  • Right-Size Instances

    • Match instance type to workload

    • Don't over-provision

    • Monitor and adjust

❌ Don'ts

  • Don't Taint All Nodes

    • Always have untainted nodes for essential components

    • Cilium and CoreDNS need untainted nodes

  • Don't Under-Size Nodes

    • Minimum 2vcpu-4gb for general workloads

    • Cluster components need resources

  • Don't Forget Disk Space

    • Plan for images, logs, temp storage

    • Monitor disk usage

    • Increase if nodes run out of space

  • Don't Set Min = Max

    • Allow scaling flexibility

    • Use autoscaler for efficiency

    • Unless fixed size is required

  • Don't Block Draining

    • Avoid aggressive PodDisruptionBudgets (see the example after this list)

    • Plan for node updates

    • Allow graceful termination
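
For example, a PodDisruptionBudget that still leaves room for node drains (name and labels are illustrative) limits disruption without blocking eviction entirely:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  maxUnavailable: 1          # drains may evict one pod at a time; requires replicas > 1
  selector:
    matchLabels:
      app: web               # example label on the protected pods
```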

Common Node Group Patterns

Pattern: Standard Three-Tier

Node Group Planning Example:

  • general-nodes: untainted, runs system components and shared services

  • app-workers: labeled for application workloads

  • gpu-nodes-prod: tainted and labeled, reserved for specialized workloads (for example, GPU)

Pattern: Environment Separation

Run separate node groups per environment (for example, production and staging), using labels and taints so that workloads from one environment do not land on the other's nodes.

Troubleshooting Node Groups

Node Group Stuck in CREATING

Symptoms:

  • Node group status remains CREATING

  • No nodes appearing in cluster

Possible Causes:

  • Cluster not in PROVISIONED state

  • Insufficient subnet IPs

  • Invalid configuration

  • OpenStack quota exceeded

Solution:

  • Confirm the cluster is in the PROVISIONED state before creating node groups

  • Check that the selected subnet has enough free IP addresses for the requested nodes

  • Review the node group's error message and correct any invalid settings

  • Check OpenStack and account quotas, and request an increase if needed

Nodes Not Reaching Ready State

Symptoms:

  • Node group status SCALINGUP

  • Nodes in NotReady state

Possible Causes:

  • CNI not installed

  • All nodes have taints (system pods can't schedule)

  • Network connectivity issues

Solution:
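
Typical first checks with standard kubectl (node names are your own):

```bash
# Which nodes are NotReady, and why (see Conditions and Events)
kubectl get nodes
kubectl describe node <node-name>

# Are the CNI and other system pods Running, or stuck in Pending?
kubectl get pods -n kube-system -o wide
```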

Scaling Down Stuck

Symptoms:

  • Node group stuck in SCALINGDOWN

  • Desired size < current size but nodes not removed

Possible Causes:

  • PodDisruptionBudget preventing drain

  • Pods with local storage

  • Pods without controller (bare pods)

Solution:
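
To find what is blocking the drain (standard kubectl; substitute the name of the node being removed):

```bash
# PodDisruptionBudgets that may be preventing evictions
kubectl get pdb --all-namespaces

# Pods still running on the node that is being removed
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
```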

Additional Resources
