Managing Node Groups

Node groups are collections of worker nodes that run your containerized applications. This guide covers creating, scaling, and managing node groups in your Kubernetes cluster.

What is a Node Group?

A node group is a set of Kubernetes worker nodes with identical configuration:

Instance Type: CPU, memory, and other hardware specifications
Disk Size: Storage capacity for each node
Scaling: Minimum, maximum, and desired node count
Labels: Key-value pairs for workload scheduling
Taints: Restrictions on which pods can be scheduled
Network: Subnet configuration

Why Node Groups Matter

Node groups allow you to:

Separate Workloads: System services vs. applications
Optimize Resources: Different instance types for different needs
Control Costs: Scale and size appropriately
Isolate Workloads: Use taints and labels for pod placement

Critical: Untainted Node Groups

Always create a node group with at least 1–2 untainted nodes before creating any specialized (tainted) node groups.

Essential cluster components require untainted nodes:

CoreDNS: DNS resolution for services and pods
Cilium (or your CNI): Pod networking

Without untainted nodes:

Essential add-ons cannot be scheduled
Cluster will not function properly
Pods cannot start or communicate
DNS resolution will fail

Recommended First Node Group:

Name: general-nodes
Purpose: Run cluster components and general workloads
Instance Type: 2vcpu-4gb or larger
Scaling:
  Min Size: 1
  Max Size: 3
  Desired Size: 2
Taints: None (this is critical!)
Labels:
  workload-type: general

Creating a Node Group

Node Group Configuration

When creating a node group, you'll need to configure the following settings.

Basic Settings

Node Group Name

Choose a descriptive name.

Naming Rules:

Maximum 100 characters
Lowercase alphanumeric characters, hyphens (-), and dots (.)
Must start and end with alphanumeric characters
No consecutive dots (..) or hyphens (--)

Examples:

✅ Good: general-nodes, app-workers, gpu-nodes-prod
❌ Bad (valid, but not descriptive): ng1, nodes, test
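
The naming rules above can be checked locally before a configuration is submitted. The following is a minimal Python sketch that encodes only the four rules listed in this guide; the function name `is_valid_node_group_name` is hypothetical and not part of any platform API, and the platform may enforce further rules that are not modeled here.

```python
import re

# Lowercase alphanumerics, hyphens, and dots; must start and end alphanumeric.
_NAME_PATTERN = re.compile(r"[a-z0-9](?:[a-z0-9.-]*[a-z0-9])?")
MAX_NAME_LENGTH = 100


def is_valid_node_group_name(name: str) -> bool:
    """Return True if the name satisfies the naming rules listed above."""
    if len(name) > MAX_NAME_LENGTH:
        return False
    # No consecutive dots or consecutive hyphens.
    if ".." in name or "--" in name:
        return False
    return _NAME_PATTERN.fullmatch(name) is not None
```

Note that short names such as ng1 pass this check: they are discouraged because they are not descriptive, not because they are invalid.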

Instance Type (Flavor)

Select the compute resources for your nodes.

Available Instance Types:

Small:
  - 2vcpu-4gb: Good for light workloads, development
  - 2vcpu-8gb: Development, small applications

Medium:
  - 4vcpu-8gb: Standard applications
  - 4vcpu-16gb: Memory-intensive applications

Large:
  - 8vcpu-16gb: Production workloads
  - 8vcpu-32gb: Large applications, databases

Choosing Instance Type:

General Nodes:
  ├─ Minimum: 2vcpu-4gb
  └─ Recommended: 2vcpu-8gb

Application Nodes:
  ├─ Development: 2vcpu-4gb to 4vcpu-8gb
  ├─ Production: 4vcpu-8gb to 8vcpu-16gb
  └─ High Performance: 8vcpu-16gb or larger

Specialized Nodes:
  ├─ Memory-intensive: Choose high memory ratio
  ├─ CPU-intensive: Choose high CPU count
  └─ GPU workloads: GPU-enabled instances

Disk Configuration

Configure root disk size for each node.

Disk Size Examples (Min Size and Max Size are node counts):

General Purpose:
  - Min Size: 1
  - Max Size: 5
  - Disk: 80-100 GB

Application Nodes:
  - Min Size: 2
  - Max Size: 10
  - Disk: 100-200 GB

Data Processing:
  - Min Size: 1
  - Max Size: 5
  - Disk: 200-500 GB

Minimum: 50 GB
Default: 80 GB (if not specified)
Recommended:
- General nodes: 80-100 GB
- Application nodes: 100-200 GB
- Image-heavy workloads: 200+ GB

What uses disk space?

Node Disk Usage:
├─ Operating system: ~10 GB
├─ Container images: 20-50 GB (varies)
├─ Container logs: 5-10 GB
├─ kubelet working directory: 10-20 GB
└─ Available for emptyDir volumes: Remaining space
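
To see how much space a given disk size leaves for emptyDir volumes, the breakdown above can be added up. This is a rough Python sketch that uses only the estimates from the list above (treating the operating system as a fixed 10 GB); the names `DISK_USAGE_GB` and `emptydir_headroom_gb` are illustrative, and real usage depends on your images and log volume.

```python
from typing import Tuple

# (low, high) estimates in GB, taken from the disk usage breakdown above.
DISK_USAGE_GB = {
    "operating_system": (10, 10),
    "container_images": (20, 50),
    "container_logs": (5, 10),
    "kubelet_working_dir": (10, 20),
}


def emptydir_headroom_gb(disk_size_gb: int) -> Tuple[int, int]:
    """Return (worst_case, best_case) GB left over for emptyDir volumes."""
    low_total = sum(low for low, _ in DISK_USAGE_GB.values())
    high_total = sum(high for _, high in DISK_USAGE_GB.values())
    return disk_size_gb - high_total, disk_size_gb - low_total
```

A negative worst case means the high-end estimates alone would exceed the disk, which is a hint to choose a larger size for image-heavy workloads.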

Planning Disk Size:

# Small node group (minimal images)
diskSize: 80

# Standard node group (moderate images)
diskSize: 100

# Large images or many containers
diskSize: 200

# Machine learning / data processing
diskSize: 500

Scaling Configuration

Define how your node group scales.

Min Size: Minimum number of nodes (always running)

Cannot be less than 0
Should be at least 1 for production
Can be 0 for dev/test environments (but a node group must be created with at least 1 node; scale down to 0 afterwards)

Max Size: Maximum number of nodes (limit for scaling)

Must be >= Min Size
Set based on maximum expected load
Consider account quota limits

Desired Size: Target number of nodes (current goal)

Must be between Min Size and Max Size
Can be adjusted later
Cluster autoscaler can modify this

Scaling Examples:

# General nodes (stable, always running)
minSize: 2
maxSize: 3
desiredSize: 2

# Application nodes (can scale)
minSize: 2
maxSize: 10
desiredSize: 3

# Batch processing (scale to zero when idle)
minSize: 0
maxSize: 20
desiredSize: 1  # A node group can't be created with 0 nodes; scale down to 0 later

# High availability production
minSize: 3
maxSize: 10
desiredSize: 5

Validation Rules:

minSize ≤ desiredSize ≤ maxSize
All values must be non-negative integers
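
The validation rules above are simple enough to check before submitting a configuration. The sketch below is one possible Python implementation; the function name `validate_scaling` and the error messages are hypothetical, and the platform's own validation may be stricter.

```python
from typing import List


def validate_scaling(min_size: int, max_size: int, desired_size: int) -> List[str]:
    """Return a list of rule violations; an empty list means the values are valid."""
    errors = []
    values = (("minSize", min_size), ("maxSize", max_size), ("desiredSize", desired_size))
    for label, value in values:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            errors.append(f"{label} must be a non-negative integer")
    if errors:
        return errors
    if min_size > max_size:
        errors.append("minSize must be <= maxSize")
    if not (min_size <= desired_size <= max_size):
        errors.append("desiredSize must be between minSize and maxSize")
    return errors
```

For example, `validate_scaling(2, 10, 3)` returns an empty list, while `validate_scaling(2, 3, 5)` reports that the desired size is out of range.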

Network Configuration

Subnet Selection

Subnet KRN (Required):

Select the subnet where node network interfaces will be created
Must be in the same VPC as the cluster
Ensure sufficient IP addresses available

IP Address Planning:

Each node requires:
├─ 1 IP for primary network interface
└─ Additional IPs for pods (from Pod CIDR, not subnet)

Example:
10 nodes in subnet 10.0.1.0/24 (254 usable IPs):
├─ 10 IPs for nodes
├─ 244 IPs remaining
└─ Plan for growth and other resources
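
The arithmetic in the example above can be reproduced with Python's standard `ipaddress` module. This is a minimal sketch: the function name `remaining_subnet_ips` is illustrative, and it only subtracts the network and broadcast addresses. Treat the result as an upper bound, since the platform may reserve additional addresses in each subnet (for example for a gateway), which is not modeled here.

```python
import ipaddress


def remaining_subnet_ips(cidr: str, node_count: int) -> int:
    """Return how many usable subnet IPs remain after allocating one per node."""
    network = ipaddress.ip_network(cidr)
    # Exclude the network and broadcast addresses.
    usable = max(network.num_addresses - 2, 0)
    return usable - node_count
```

Pod IPs are not counted because, as noted above, they come from the Pod CIDR rather than the subnet.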

Labels Configuration (Optional)

Labels are key-value pairs used for pod scheduling.

Common Label Patterns:

# Node role identification
node-role: system
node-role: application
node-role: database

# Environment separation
environment: production
environment: staging

# Workload type
workload-type: compute-intensive
workload-type: memory-intensive
workload-type: gpu

# Team or project
team: platform
team: data-science
project: web-app

Usage Example:

# In node group configuration
labels:
  node-role: application
  environment: production
  team: platform

# In pod specification
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  nodeSelector:
    node-role: application
    environment: production
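
A nodeSelector matches a node only if every key-value pair in the selector is present in the node's labels. The Python sketch below illustrates that rule; it is a simplified model for reasoning about placement, not the scheduler's implementation, and the function name `node_matches_selector` is hypothetical.

```python
from typing import Dict


def node_matches_selector(node_labels: Dict[str, str], node_selector: Dict[str, str]) -> bool:
    """Return True if the node carries every label required by the selector."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())
```

With the labels from the usage example, a node labeled node-role: application, environment: production, team: platform matches the pod's selector; extra labels on the node (here, team) do not prevent a match.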

Taints Configuration (Optional)

Taints restrict which pods can be scheduled on nodes. Pods must have matching tolerations.

Taint Structure:

key: taint-key
value: taint-value
effect: NoSchedule|PreferNoSchedule|NoExecute

Taint Effects:

NoSchedule:
- Hard requirement
- Pods without toleration will NOT be scheduled
- Existing pods not affected
PreferNoSchedule:
- Soft requirement
- System tries to avoid scheduling
- Will schedule if no other option
NoExecute:
- Evicts running pods without toleration
- Prevents new pods from being scheduled
- Use with caution!

Common Taint Scenarios:

Scenario: Dedicated GPU Nodes

# Node group taint
key: workload
value: gpu
effect: NoSchedule

# Pod toleration (only GPU workloads)
tolerations:
- key: workload
  operator: Equal
  value: gpu
  effect: NoSchedule

Scenario: High-Priority Production Nodes

# Node group taint
key: environment
value: production
effect: NoSchedule

# Pod toleration (production pods only)
tolerations:
- key: environment
  operator: Equal
  value: production
  effect: NoSchedule

Scenario: General Purpose Nodes (NO TAINTS!)

# General node group
taints: []  # Empty - no taints

# Allows:
├─ CoreDNS to schedule
├─ Cilium to schedule
└─ All other pods without tolerations

DO NOT add taints to your first node group!

❌ If all nodes are tainted, system pods cannot schedule and the cluster becomes non-functional.
✅ Ensure at least 1–2 nodes without taints so system pods can schedule and the cluster remains functional.
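
To make the interaction between taints and tolerations concrete, the following Python sketch models the standard Kubernetes matching rules: a toleration matches a taint when the keys match, the effect matches (or the toleration leaves the effect empty), and either the operator is Exists or the values are equal. It is a simplified illustration, not the scheduler's code; the function names are hypothetical, and PreferNoSchedule is treated as non-blocking because it is a soft requirement.

```python
from typing import Dict, List


def toleration_matches(toleration: Dict[str, str], taint: Dict[str, str]) -> bool:
    """Return True if a single toleration tolerates a single taint."""
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key")
    if not key:
        # An empty key is only meaningful with Exists and matches every taint key.
        if operator != "Exists":
            return False
    elif key != taint["key"]:
        return False
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def can_schedule(pod_tolerations: List[Dict[str, str]], node_taints: List[Dict[str, str]]) -> bool:
    """Return True if no hard taint on the node is left untolerated."""
    for taint in node_taints:
        if taint["effect"] == "PreferNoSchedule":
            continue  # Soft requirement: does not block scheduling.
        if not any(toleration_matches(t, taint) for t in pod_tolerations):
            return False
    return True
```

A pod with no tolerations, such as a typical system pod, can only be scheduled when `node_taints` contains no hard taints, which is why at least one untainted node group is required.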

Remote Access Configuration (Optional)

Enable SSH access to nodes for debugging:

SSH Key Selection:

Choose from your existing SSH keys
Required for SSH access to nodes
Recommended for troubleshooting

Security Groups:

Select security groups for SSH access
Restrict SSH access to specific IPs/networks
Follow security best practices

When to Enable:

✅ Development/testing environments
✅ Troubleshooting scenarios
⚠️ Production (only if necessary, and with strict security controls)

Node Repair Configuration (Optional)

Automatic node health monitoring and repair:

Node Repair Configuration:

enabled: true  # Enable automatic node repair

What it does:

Monitors node health
Detects failed or unhealthy nodes
Automatically replaces unhealthy nodes
Helps maintain cluster availability

When to enable:

✅ Production clusters (recommended)
✅ Critical workloads
✅ Long-running clusters

When to disable:

Development environments
Short-lived clusters
Manual node management preference

Creating the Node Group

After configuring all settings, review them before submitting:

Name and instance type
Scaling configuration
Labels and taints
Network settings

Verification Checklist:

For first node group: NO taints configured
Scaling values are valid (min ≤ desired ≤ max)
Subnet has sufficient IPs
Instance type matches workload needs

Submit the node group configuration to begin creation.

Node Group Lifecycle

Creation Process

CREATING → SCALINGUP → RUNNING

Timeline:

Initial setup: 1–2 minutes
Node provisioning: 3–5 minutes per node
Kubernetes join: 1–2 minutes per node
Total: 5–10 minutes for 2–3 nodes

What's happening:

OpenStack instance creation

OpenStack instances are created for the nodes.

Network configuration

Network interfaces and security settings are applied.

Kubernetes components installation

Kubernetes components are installed on the new nodes.

Nodes join the cluster

Nodes join the cluster and become Ready once initialized.

Monitoring Creation:

# Monitor node group status
# Status: CREATING → SCALINGUP → RUNNING

# Watch nodes joining cluster
kubectl get nodes -w

# Check that pods are being scheduled onto the new nodes
kubectl get pods -A -o wide

Node Group States

CREATING: Initial setup in progress
SCALINGUP: Adding nodes
RUNNING: Operational and healthy
SCALINGDOWN: Removing nodes
UPDATING: Configuration or version update
FAILED: Operation failed (check error message)
PENDING_DELETE: Deletion initiated
DELETING: Removal in progress
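
When scripting against node group status, it can help to separate states that indicate an operation is still in progress from states where the node group has settled. The Python sketch below uses the state names listed above; the grouping into in-progress and settled states is an interpretation of their descriptions, and the helper names are hypothetical rather than part of any platform API.

```python
from enum import Enum


class NodeGroupState(str, Enum):
    CREATING = "CREATING"
    SCALINGUP = "SCALINGUP"
    RUNNING = "RUNNING"
    SCALINGDOWN = "SCALINGDOWN"
    UPDATING = "UPDATING"
    FAILED = "FAILED"
    PENDING_DELETE = "PENDING_DELETE"
    DELETING = "DELETING"


# States in which an operation is still running (assumed from the descriptions above).
IN_PROGRESS = {
    NodeGroupState.CREATING,
    NodeGroupState.SCALINGUP,
    NodeGroupState.SCALINGDOWN,
    NodeGroupState.UPDATING,
    NodeGroupState.PENDING_DELETE,
    NodeGroupState.DELETING,
}


def is_settled(state: str) -> bool:
    """Return True once no operation is in progress (RUNNING or FAILED)."""
    return NodeGroupState(state) not in IN_PROGRESS


def has_failed(state: str) -> bool:
    """Return True if the last operation failed and the error message should be checked."""
    return NodeGroupState(state) is NodeGroupState.FAILED
```

A polling loop could wait until `is_settled` returns True and then use `has_failed` to decide whether to inspect the error message.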

Scaling Node Groups

Manual Scaling

Update desired size to scale your node group:

Configuration Update:

Modify the Desired Size parameter
Submit the configuration change

Scaling Up (Desired > Current):

RUNNING → SCALINGUP → RUNNING
Timeline: 3–5 minutes per new node

Scaling Down (Desired < Current):

RUNNING → SCALINGDOWN → RUNNING
Timeline: 2–4 minutes per removed node

Important: Nodes are drained before removal. Ensure:

Workloads can be rescheduled
No PodDisruptionBudgets block the drain
No local data will be lost

Automatic Scaling

If the Cluster Autoscaler add-on is installed, it adjusts the node group size automatically:

How it works:

Scale Up: Pods cannot be scheduled → Add nodes
Scale Down: Nodes underutilized → Remove nodes

Configuration:

Autoscaler respects min/max size
Can modify desired size automatically
Checks for un-schedulable pods
Monitors node utilization

Scaling Behavior:

Pod pending (no resources):
├─ Cluster Autoscaler detects
├─ Checks: current < max size
├─ Increases desired size
└─ New nodes provisioned

Node underutilized (< 50% for 10min):
├─ Cluster Autoscaler detects
├─ Checks: current > min size
├─ Drains node safely
└─ Decreases desired size
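
The two decision paths above can be summarized as a small function. This Python sketch is a deliberately simplified model using the 50% and 10-minute figures from the diagram; the real Cluster Autoscaler evaluates each node individually, simulates scheduling, and can add several nodes at once. The function name and parameters are hypothetical.

```python
SCALE_DOWN_UTILIZATION = 0.5   # Below 50% utilization counts as underutilized.
SCALE_DOWN_IDLE_MINUTES = 10   # Must stay underutilized for this long.


def next_desired_size(
    current: int,
    min_size: int,
    max_size: int,
    has_unschedulable_pods: bool,
    utilization: float,
    minutes_underutilized: float,
) -> int:
    """Return the desired size after one simplified autoscaler evaluation."""
    if has_unschedulable_pods:
        # Scale up by one node, but never beyond max size.
        return current + 1 if current < max_size else current
    underutilized = (
        utilization < SCALE_DOWN_UTILIZATION
        and minutes_underutilized >= SCALE_DOWN_IDLE_MINUTES
    )
    if underutilized and current > min_size:
        # Scale down by one node, but never below min size.
        return current - 1
    return current
```

Pending pods take priority in this sketch: a node group is never scaled down while pods are waiting for capacity.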

Updating Node Groups

Update Configuration

Update scaling or repair settings for your node group.

Updatable Settings:

✅ Min size
✅ Max size
✅ Desired size
✅ Node repair configuration

What cannot be updated:

❌ Instance type (create new node group)
❌ Disk size (create new node group)
❌ Labels (recreate nodes)
❌ Taints (recreate nodes)
❌ Subnet (create new node group)

Update Kubernetes Version

Keep the node group's Kubernetes version aligned with the cluster's version.

Rolling update process:

New node creation

New node is created with the new Kubernetes version.

New node joins

New node joins the cluster and becomes Ready.

Cordoning old node

Old node is cordoned (no new pods scheduled).

Draining old node

Old node is drained and pods are rescheduled.

Deleting old node

Old node is deleted.

Repeat

Repeat for the next node until all are updated.

Timeline: ~5–10 minutes per node
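
The rolling update sequence and timeline above can be sketched in Python to estimate duration and the extra capacity needed. This is an illustration only: the helper names and the "replacement for ..." step labels are hypothetical, and it assumes nodes are replaced strictly one at a time, as described in the steps above.

```python
from typing import List, Tuple

MINUTES_PER_NODE = (5, 10)  # From the timeline above.


def rolling_update_plan(node_names: List[str]) -> List[str]:
    """Return the ordered steps for replacing each node one at a time."""
    steps = []
    for name in node_names:
        steps.extend([
            f"create replacement for {name}",
            f"wait for replacement for {name} to be Ready",
            f"cordon {name}",
            f"drain {name}",
            f"delete {name}",
        ])
    return steps


def estimated_minutes(node_count: int) -> Tuple[int, int]:
    """Return the (low, high) total duration estimate in minutes."""
    low, high = MINUTES_PER_NODE
    return low * node_count, high * node_count


def peak_node_count(node_count: int) -> int:
    """Return the maximum number of nodes that exist at once during the update."""
    # The replacement is created before the old node is deleted.
    return node_count + 1 if node_count > 0 else 0
```

Because the new node is created before the old one is removed, the subnet and account quota need room for one node more than the group's current size while the update runs.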

Important Considerations:

Ensure workloads can be rescheduled
Check PodDisruptionBudgets allow rolling update
Verify sufficient capacity during update
Monitor application health during update

Node Group Best Practices

✅ Do's

Create Untainted Nodes First
- Start with nodes that have NO taints
- Ensure essential components can schedule
- Wait for nodes to be Ready before creating tainted nodes
Separate Workloads
- System components: Dedicated untainted nodes
- Applications: Separate node groups by purpose
- Specialized: GPU, high-memory, etc.
Plan Capacity
- Set appropriate min/max for each node group
- Consider peak load in max size
- Allow headroom for updates
Use Meaningful Labels
- Label nodes by purpose, environment, team
- Document label schema
- Use labels for pod scheduling
Configure Node Repair
- Enable for production node groups
- Improves reliability
- Reduces manual intervention
Right-Size Instances
- Match instance type to workload
- Don't over-provision
- Monitor and adjust

❌ Don'ts

Don't Taint All Nodes
- Always have untainted nodes for essential components
- Cilium and CoreDNS need untainted nodes
Don't Under-Size Nodes
- Minimum 2vcpu-4gb for general workloads
- Cluster components need resources
Don't Forget Disk Space
- Plan for images, logs, temp storage
- Monitor disk usage
- Increase if nodes run out of space
Don't Set Min = Max
- Allow scaling flexibility
- Use autoscaler for efficiency
- Unless fixed size is required
Don't Block Draining
- Avoid aggressive PodDisruptionBudgets
- Plan for node updates
- Allow graceful termination

Common Node Group Patterns

Pattern: Standard Three-Tier

Node Group Planning Example:

# General nodes - always running
general-nodes:
  instanceType: 4vcpu-8gb
  minSize: 2
  maxSize: 5
  desiredSize: 2
  taints: []

# Application nodes - scalable
app-nodes:
  instanceType: 4vcpu-16gb
  diskSize: 150
  minSize: 3
  maxSize: 10
  desiredSize: 5
  taints: []
  labels:
    node-role: application
    workload-type: general

# Batch processing - can scale to zero
batch-nodes:
  instanceType: 8vcpu-32gb
  diskSize: 200
  minSize: 0
  maxSize: 20
  desiredSize: 2
  taints:
  - key: workload
    value: batch
    effect: NoSchedule
  labels:
    node-role: batch
    workload-type: batch-processing

Pattern: Environment Separation

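# Both node groups below are tainted. Keep an untainted general node group
# alongside them so that CoreDNS and the CNI can schedule.

# Production - highly available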
prod-nodes:
  instanceType: 8vcpu-16gb
  diskSize: 200
  minSize: 5
  maxSize: 20
  desiredSize: 10
  taints:
  - key: environment
    value: production
    effect: NoSchedule
  labels:
    environment: production
    tier: application

# Staging - moderate resources
staging-nodes:
  instanceType: 4vcpu-8gb
  diskSize: 100
  minSize: 2
  maxSize: 5
  desiredSize: 3
  taints:
  - key: environment
    value: staging
    effect: NoSchedule
  labels:
    environment: staging
    tier: application
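
Whatever pattern you choose, the set of node groups as a whole must include at least one group without taints. The Python sketch below checks a plan for that; it assumes the plan has been loaded into a dictionary shaped like the examples above (node group name mapped to its settings), and the function name `untainted_groups` is hypothetical.

```python
from typing import Any, Dict, List


def untainted_groups(node_groups: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return the names of node groups that define no taints."""
    return [
        name
        for name, config in node_groups.items()
        # A missing key, None, or an empty list all mean "no taints".
        if not config.get("taints")
    ]


# Example plan: one general group plus a tainted batch group.
plan = {
    "general-nodes": {"minSize": 2, "maxSize": 5, "taints": []},
    "batch-nodes": {
        "minSize": 0,
        "maxSize": 20,
        "taints": [{"key": "workload", "value": "batch", "effect": "NoSchedule"}],
    },
}

if not untainted_groups(plan):
    raise SystemExit("No untainted node group: system pods would have nowhere to schedule")
```

A plan that contains only tainted groups, such as a production group and a staging group on their own, returns an empty list and should be extended with a general node group.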

Troubleshooting Node Groups

Node Group Stuck in CREATING

Symptoms:

Node group status remains CREATING
No nodes appearing in cluster

Possible Causes:

Cluster not in PROVISIONED state
Insufficient subnet IPs
Invalid configuration
OpenStack quota exceeded

Solution:

# Check cluster status: it should be PROVISIONED
# Check that the subnet has available IPs: verify subnet capacity in the VPC service
# Check account quota: verify the instance quota in OpenStack
# Review the error message: check the node group details for the specific error

Nodes Not Reaching Ready State

Symptoms:

Node group status SCALINGUP
Nodes in NotReady state

Possible Causes:

CNI not installed
All nodes have taints (system pods can't schedule)
Network connectivity issues

Solution:

# Check node status
kubectl get nodes

# Check why NotReady
kubectl describe node <node-name>

# Check CNI pods
kubectl get pods -n kube-system -l k8s-app=cilium

# Ensure you have untainted nodes so that system pods can schedule
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

Scaling Down Stuck

Symptoms:

Node group stuck in SCALINGDOWN
Desired size < current size but nodes not removed

Possible Causes:

PodDisruptionBudget preventing drain
Pods with local storage
Pods without controller (bare pods)

Solution:

# Check what's preventing drain
kubectl get pods -A --field-selector spec.nodeName=<node-name>

# Check PodDisruptionBudgets
kubectl get pdb -A

# Manually drain if needed (understand impact!)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
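
Two of the causes listed above, bare pods and pods with local storage, can be spotted in the JSON form of the pod listing (the same `kubectl get pods` command with `-o json` added). The Python sketch below inspects such a parsed pod list; the function name `find_drain_blockers` is hypothetical, and it only looks at ownerReferences and emptyDir volumes, so it does not detect PodDisruptionBudget conflicts.

```python
from typing import Any, Dict, List


def find_drain_blockers(pod_list: Dict[str, Any]) -> Dict[str, List[str]]:
    """Return pods on a node that commonly prevent a drain from completing."""
    bare_pods = []
    emptydir_pods = []
    for pod in pod_list.get("items", []):
        metadata = pod.get("metadata", {})
        ref = f"{metadata.get('namespace', 'default')}/{metadata.get('name')}"
        # Pods without an owner are not recreated elsewhere after eviction.
        if not metadata.get("ownerReferences"):
            bare_pods.append(ref)
        # emptyDir data is deleted when the pod leaves the node.
        volumes = pod.get("spec", {}).get("volumes") or []
        if any("emptyDir" in volume for volume in volumes):
            emptydir_pods.append(ref)
    return {"bare_pods": bare_pods, "emptydir_pods": emptydir_pods}
```

To use it, load the command output with `json.loads` and pass the resulting dictionary to the function. Pods reported under emptydir_pods are the ones affected by the `--delete-emptydir-data` flag shown above.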

Additional Resources

Installing Add-ons - Install CNI and other essential add-ons
Creating Cluster Guide - Cluster setup process
Troubleshooting Guide - Common issues and solutions

Last updated 2 months ago
