# Managing Nodegroups

Node groups are collections of worker nodes that run your containerized applications. This guide covers creating, scaling, and managing node groups in your Kubernetes cluster.

## What is a Node Group?

A node group is a set of Kubernetes worker nodes with identical configuration:

* **Instance Type**: CPU, memory, and other hardware specifications
* **Disk Size**: Storage capacity for each node
* **Scaling**: Minimum, maximum, and desired node count
* **Labels**: Key-value pairs for workload scheduling
* **Taints**: Restrictions on which pods can be scheduled
* **Network**: Subnet configuration

## Why Node Groups Matter

Node groups allow you to:

* **Separate Workloads**: System services vs. applications
* **Optimize Resources**: Different instance types for different needs
* **Control Costs**: Scale and size appropriately
* **Isolate Workloads**: Use taints and labels for pod placement

## Critical: Untainted Node Groups

{% hint style="danger" %}
Always create at least 1–2 nodes without taints before creating any specialized node groups.

Essential cluster components require untainted nodes:

* CoreDNS: DNS resolution for services and pods
* Cilium (or your CNI): Pod networking

Without untainted nodes:

* Essential add-ons cannot be scheduled
* Cluster will not function properly
* Pods cannot start or communicate
* DNS resolution will fail
  {% endhint %}

Recommended First Node Group:

```yaml
Name: general-nodes
Purpose: Run cluster components and general workloads
Instance Type: 2vcpu-4gb or larger
Scaling:
  Min Size: 1
  Max Size: 3
  Desired Size: 2
Taints: None (this is critical!)
Labels:
  workload-type: general
```

## Creating a Node Group

### Node Group Configuration

When creating a node group, you'll need to configure the following settings.

### Basic Settings

#### Node Group Name

Choose a descriptive name.

Naming Rules:

* Maximum 100 characters
* Lowercase alphanumeric characters, hyphens (-), and dots (.)
* Must start and end with alphanumeric characters
* No consecutive dots (..) or hyphens (--)

Examples:

* ✅ Good: `general-nodes`, `app-workers`, `gpu-nodes-prod`
* ❌ Bad (valid, but not descriptive): `ng1`, `nodes`, `test`
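
The naming rules above can be checked locally before submitting a configuration. The following is a minimal sketch; `is_valid_nodegroup_name` is a hypothetical helper written for illustration, not a platform command, and it checks only the four rules listed above (not whether a name is descriptive).

```bash
# Returns 0 if the name satisfies the naming rules above, 1 otherwise.
# Hypothetical helper for illustration only.
is_valid_nodegroup_name() {
  local name="$1"
  local LC_ALL=C   # make [a-z] match ASCII lowercase letters only

  # Rule: maximum 100 characters (an empty name is also rejected)
  [ -n "$name" ] && [ "${#name}" -le 100 ] || return 1

  # Rule: lowercase alphanumerics, hyphens, and dots;
  # the first and last characters must be alphanumeric
  local pattern='^[a-z0-9]([a-z0-9.-]*[a-z0-9])?$'
  [[ "$name" =~ $pattern ]] || return 1

  # Rule: no consecutive dots or hyphens
  case "$name" in
    *..*|*--*) return 1 ;;
  esac
  return 0
}
```

For example, `is_valid_nodegroup_name "gpu-nodes-prod"` succeeds, while `is_valid_nodegroup_name "app--workers"` fails because of the consecutive hyphens.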

#### Instance Type (Flavor)

Select the compute resources for your nodes.

Available Instance Types:

```yaml
Small:
  - 2vcpu-4gb: Good for light workloads, development
  - 2vcpu-8gb: Development, small applications

Medium:
  - 4vcpu-8gb: Standard applications
  - 4vcpu-16gb: Memory-intensive applications

Large:
  - 8vcpu-16gb: Production workloads
  - 8vcpu-32gb: Large applications, databases
```

Choosing Instance Type:

```
General Nodes:
  ├─ Minimum: 2vcpu-4gb
  └─ Recommended: 2vcpu-8gb

Application Nodes:
  ├─ Development: 2vcpu-4gb to 4vcpu-8gb
  ├─ Production: 4vcpu-8gb to 8vcpu-16gb
  └─ High Performance: 8vcpu-16gb or larger

Specialized Nodes:
  ├─ Memory-intensive: Choose high memory ratio
  ├─ CPU-intensive: Choose high CPU count
  └─ GPU workloads: GPU-enabled instances
```

### Disk Configuration

Configure root disk size for each node.

Disk size examples, shown with typical scaling limits (Min Size and Max Size are node counts, not disk sizes):

```
General Purpose:
  - Min Size: 1
  - Max Size: 5
  - Disk: 80-100 GB

Application Nodes:
  - Min Size: 2
  - Max Size: 10
  - Disk: 100-200 GB

Data Processing:
  - Min Size: 1
  - Max Size: 5
  - Disk: 200-500 GB
```

* **Minimum**: 50 GB
* **Default**: 80 GB (if not specified)
* **Recommended**:
  * General nodes: 80-100 GB
  * Application nodes: 100-200 GB
  * Image-heavy workloads: 200+ GB

What uses disk space?

```
Node Disk Usage:
├─ Operating system: ~10 GB
├─ Container images: 20-50 GB (varies)
├─ Container logs: 5-10 GB
├─ kubelet working directory: 10-20 GB
└─ Available for EmptyDir volumes: Remaining space
```

Planning Disk Size:

```yaml
# Small node group (minimal images)
diskSize: 80

# Standard node group (moderate images)
diskSize: 100

# Large images or many containers
diskSize: 200

# Machine learning / data processing
diskSize: 500
```
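
As a rough worked example, the disk usage breakdown above can be turned into a sizing estimate. This is only a sketch: `estimate_disk_gb` is a hypothetical helper, and taking the upper end of each range (10 GB logs, 20 GB kubelet working directory) is an assumption, not a platform rule.

```bash
# Hypothetical helper: rough root-disk estimate in GB.
# Args: expected container image footprint (GB), optional EmptyDir headroom (GB).
estimate_disk_gb() {
  local images_gb="$1" emptydir_gb="${2:-0}"
  # Assumed fixed costs: upper ends of the ranges in the breakdown above
  local os_gb=10 logs_gb=10 kubelet_gb=20
  local total=$(( os_gb + images_gb + logs_gb + kubelet_gb + emptydir_gb ))
  # Never go below the 50 GB minimum disk size
  if (( total < 50 )); then
    total=50
  fi
  echo "$total"
}

estimate_disk_gb 50        # 10 + 50 + 10 + 20 = 90
estimate_disk_gb 50 100    # adds 100 GB of EmptyDir headroom = 190
```

Round the result up to the nearest recommended size for the node group type.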

### Scaling Configuration

Define how your node group scales.

#### Scaling Parameters

**Min Size**: Minimum number of nodes (always running)

* Cannot be less than 0
* Should be at least 1 for production
* Can be 0 for dev/test environments (but creating the node group requires at least 1 node)

**Max Size**: Maximum number of nodes (limit for scaling)

* Must be >= Min Size
* Set based on maximum expected load
* Consider account quota limits

**Desired Size**: Target number of nodes (current goal)

* Must be between Min Size and Max Size
* Can be adjusted later
* Cluster autoscaler can modify this

Scaling Examples:

```yaml
# General nodes (stable, always running)
minSize: 2
maxSize: 3
desiredSize: 2

# Application nodes (can scale)
minSize: 2
maxSize: 10
desiredSize: 3

# Batch processing (scale to zero when idle)
minSize: 0
maxSize: 20
desiredSize: 0  # Target when idle; creation needs at least 1 node, so scale down afterwards

# High availability production
minSize: 3
maxSize: 10
desiredSize: 5
```

Validation Rules:

```
minSize ≤ desiredSize ≤ maxSize
All values must be non-negative integers
```
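
The validation rules above can be expressed as a small pre-flight check. This is a minimal sketch; `validate_scaling` is a hypothetical helper, and the platform performs its own validation when the configuration is submitted.

```bash
# Hypothetical helper: check minSize <= desiredSize <= maxSize.
# Args: minSize desiredSize maxSize
validate_scaling() {
  local min="$1" desired="$2" max="$3" v
  for v in "$min" "$desired" "$max"; do
    # All values must be non-negative integers
    if ! [[ "$v" =~ ^[0-9]+$ ]]; then
      echo "error: '$v' is not a non-negative integer" >&2
      return 1
    fi
  done
  # 10# forces base 10 so values such as 08 are not read as octal
  if (( 10#$min > 10#$desired || 10#$desired > 10#$max )); then
    echo "error: need minSize <= desiredSize <= maxSize" >&2
    return 1
  fi
}
```

For example, `validate_scaling 2 3 10` succeeds, while `validate_scaling 3 2 10` fails because the desired size is below the minimum.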

### Network Configuration

#### Subnet Selection

**Subnet KRN** (Required):

* Select the subnet where node network interfaces will be created
* Must be in the same VPC as the cluster
* Ensure sufficient IP addresses are available

IP Address Planning:

```
Each node requires:
├─ 1 IP for primary network interface
└─ Additional IPs for pods (from Pod CIDR, not subnet)

Example:
10 nodes in subnet 10.0.1.0/24 (254 usable IPs):
├─ 10 IPs for nodes
├─ 244 IPs remaining
└─ Plan for growth and other resources
```
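
The arithmetic in the example above generalizes to other subnet sizes. The sketch below uses a hypothetical helper, `remaining_subnet_ips`, and assumes that only the network and broadcast addresses are unusable, as in the example; the platform may reserve additional addresses.

```bash
# Hypothetical helper: IPs left in a subnet after placing nodes.
# Args: subnet prefix length (valid up to /30), number of nodes.
remaining_subnet_ips() {
  local prefix="$1" nodes="$2"
  # Usable addresses: 2^(32 - prefix) minus network and broadcast
  local usable=$(( (1 << (32 - prefix)) - 2 ))
  echo $(( usable - nodes ))
}

remaining_subnet_ips 24 10   # 254 usable - 10 nodes = 244
```

Compare the result against the node group's Max Size, not only its current size, so that scaling up does not exhaust the subnet.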

### Labels Configuration (Optional)

Labels are key-value pairs used for pod scheduling.

Common Label Patterns:

```yaml
# Node role identification
node-role: system
node-role: application
node-role: database

# Environment separation
environment: production
environment: staging

# Workload type
workload-type: compute-intensive
workload-type: memory-intensive
workload-type: gpu

# Team or project
team: platform
team: data-science
project: web-app
```

Usage Example:

```yaml
# In node group configuration
labels:
  node-role: application
  environment: production
  team: platform
```

```yaml
# In pod specification
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  nodeSelector:
    node-role: application
    environment: production
```
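
To confirm which nodes a `nodeSelector` would match, the same labels can be passed to kubectl's `-l` selector flag. A minimal sketch, where `nodes_with_label` is a hypothetical wrapper:

```bash
# Hypothetical wrapper: list nodes matching a label selector.
# Arg: selector in kubectl syntax, e.g. "key=value" or "k1=v1,k2=v2".
nodes_with_label() {
  kubectl get nodes -l "$1" -o name
}

# Example:
#   nodes_with_label "node-role=application,environment=production"
```

An empty result means a pod using that `nodeSelector` would stay Pending.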

### Taints Configuration (Optional)

Taints restrict which pods can be scheduled on nodes. Only pods with a matching toleration can be scheduled onto a tainted node.

Taint Structure:

```yaml
key: taint-key
value: taint-value
effect: NoSchedule  # one of: NoSchedule, PreferNoSchedule, NoExecute
```

Taint Effects:

* NoSchedule:
  * Hard requirement
  * Pods without a matching toleration will NOT be scheduled
  * Existing pods are not affected
* PreferNoSchedule:
  * Soft requirement
  * System tries to avoid scheduling
  * Will schedule if no other option
* NoExecute:
  * Evicts running pods without toleration
  * Prevents new pods from being scheduled
  * Use with caution!

Common Taint Scenarios:

Scenario: Dedicated GPU Nodes

```yaml
# Node group taint
key: workload
value: gpu
effect: NoSchedule

# Pod toleration (only GPU workloads)
tolerations:
- key: workload
  operator: Equal
  value: gpu
  effect: NoSchedule
```

Scenario: High-Priority Production Nodes

```yaml
# Node group taint
key: environment
value: production
effect: NoSchedule

# Pod toleration (production pods only)
tolerations:
- key: environment
  operator: Equal
  value: production
  effect: NoSchedule
```

Scenario: General Purpose Nodes (NO TAINTS!)

```yaml
# General node group
taints: []  # Empty - no taints

# Allows:
#   - CoreDNS to schedule
#   - Cilium to schedule
#   - All other pods without tolerations
```

{% hint style="warning" %}
DO NOT add taints to your first node group!

* ❌ If all nodes are tainted, system pods cannot schedule and the cluster becomes non-functional.
* ✅ Ensure at least 1–2 nodes without taints so system pods can schedule and the cluster remains functional.
  {% endhint %}
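
One way to check that untainted nodes exist is to list each node's taint keys. The sketch below uses a hypothetical helper, `untainted_nodes`; it relies on kubectl printing `<none>` in a custom column when the field is absent. Note that nodes that are not yet Ready also carry automatic `node.kubernetes.io/*` taints, so run it once nodes have joined.

```bash
# Hypothetical helper: print the names of nodes that carry no taints.
untainted_nodes() {
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key' |
    awk '$2 == "<none>" { print $1 }'
}
```

If the output is empty, create an untainted node group before adding specialized ones.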

### Remote Access Configuration (Optional)

Enable SSH access to nodes for debugging.

SSH Key Selection:

* Choose from your existing SSH keys
* Required for SSH access to nodes
* Recommended for troubleshooting

Security Groups:

* Select security groups for SSH access
* Restrict SSH access to specific IPs/networks
* Follow security best practices

When to Enable:

* ✅ Development/testing environments
* ✅ Troubleshooting scenarios
* ⚠️ Production (only if necessary with strict security)

### Node Repair Configuration (Optional)

Node repair monitors node health and replaces unhealthy nodes automatically.

Node Repair Configuration:

```yaml
enabled: true  # Enable automatic node repair
```

What it does:

* Monitors node health
* Detects failed or unhealthy nodes
* Automatically replaces unhealthy nodes
* Helps maintain cluster availability

When to enable:

* ✅ Production clusters (recommended)
* ✅ Critical workloads
* ✅ Long-running clusters

When to disable:

* Development environments
* Short-lived clusters
* Manual node management preference

### Creating the Node Group

Before submitting, confirm that you have configured:

* Name and instance type
* Scaling configuration
* Labels and taints
* Network settings

Verification Checklist:

* [ ] For first node group: NO taints configured
* [ ] Scaling values are valid (min ≤ desired ≤ max)
* [ ] Subnet has sufficient IPs
* [ ] Instance type matches workload needs

Submit the node group configuration to begin creation.

## Node Group Lifecycle

### Creation Process

CREATING → SCALINGUP → RUNNING

Timeline:

* Initial setup: 1–2 minutes
* Node provisioning: 3–5 minutes per node
* Kubernetes join: 1–2 minutes per node
* Total: 5–10 minutes for 2–3 nodes

What's happening:

{% stepper %}
{% step %}

### OpenStack instance creation

OpenStack instances are created for the nodes.
{% endstep %}

{% step %}

### Network configuration

Network interfaces and security settings are applied.
{% endstep %}

{% step %}

### Kubernetes components installation

Kubernetes components are installed on the new nodes.
{% endstep %}

{% step %}

### Nodes join the cluster

Nodes join the cluster and become Ready once initialized.
{% endstep %}
{% endstepper %}

Monitoring Creation:

```bash
# Monitor node group status
# Status: CREATING → SCALINGUP → RUNNING

# Watch nodes joining cluster
kubectl get nodes -w

# Check node group pods
kubectl get pods -A -o wide
```

### Node Group States

* CREATING: Initial setup in progress
* SCALINGUP: Adding nodes
* RUNNING: Operational and healthy
* SCALINGDOWN: Removing nodes
* UPDATING: Configuration or version update
* FAILED: Operation failed (check error message)
* PENDING\_DELETE: Deletion initiated
* DELETING: Removal in progress

## Scaling Node Groups

### Manual Scaling

Update desired size to scale your node group:

Configuration Update:

* Modify the Desired Size parameter
* Submit the configuration change

Scaling Up (Desired > Current):

```
RUNNING → SCALINGUP → RUNNING
Timeline: 3–5 minutes per new node
```

Scaling Down (Desired < Current):

```
RUNNING → SCALINGDOWN → RUNNING
Timeline: 2–4 minutes per removed node
```

Important: Nodes are drained before removal. Ensure:

* [ ] Workloads can be rescheduled
* [ ] No PodDisruptionBudgets block the drain
* [ ] No local data will be lost
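
Before scaling down, it can help to look for PodDisruptionBudgets that currently allow no disruptions, since these can stall a drain. A minimal sketch, where `blocking_pdbs` is a hypothetical helper that reads the standard `status.disruptionsAllowed` field:

```bash
# Hypothetical helper: list PDBs (namespace/name) that allow zero disruptions.
blocking_pdbs() {
  kubectl get pdb -A --no-headers \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed' |
    awk '$3 == "0" { print $1 "/" $2 }'
}
```

Any entry in the output protects pods that cannot be evicted right now; scale up the affected workload or relax the budget before reducing the desired size.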

### Automatic Scaling

If the Cluster Autoscaler add-on is installed, it adjusts node counts automatically.

How it works:

1. Scale Up: Pods cannot be scheduled → Add nodes
2. Scale Down: Nodes underutilized → Remove nodes

Configuration:

* Autoscaler respects min/max size
* Can modify desired size automatically
* Checks for unschedulable pods
* Monitors node utilization

Scaling Behavior:

```
Pod pending (no resources):
├─ Cluster Autoscaler detects
├─ Checks: current < max size
├─ Increases desired size
└─ New nodes provisioned

Node underutilized (< 50% for 10min):
├─ Cluster Autoscaler detects
├─ Checks: current > min size
├─ Drains node safely
└─ Decreases desired size
```
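
The two flows above can be summarized as a single decision function. This is an illustrative sketch only: `autoscaler_decision` is hypothetical, takes integer inputs, and the real Cluster Autoscaler weighs many more factors. Skipping scale-down while pods are pending is an assumption made here for clarity.

```bash
# Hypothetical decision function mirroring the flows above.
# Args: pending_pods current_nodes min_size max_size utilization_percent low_minutes
autoscaler_decision() {
  local pending="$1" current="$2" min="$3" max="$4" util="$5" low_minutes="$6"
  if (( pending > 0 && current < max )); then
    # Pods cannot be scheduled and there is room to grow
    echo "scale-up"
  elif (( pending == 0 && util < 50 && low_minutes >= 10 && current > min )); then
    # Underutilized for long enough and still above the minimum
    echo "scale-down"
  else
    echo "no-change"
  fi
}
```

For example, `autoscaler_decision 3 10 2 10 80 0` yields `no-change`: pods are pending, but the node group is already at its maximum size.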

## Updating Node Groups

### Update Configuration

Update scaling or repair settings for your node group.

Updatable Settings:

* ✅ Min size
* ✅ Max size
* ✅ Desired size
* ✅ Node repair configuration

What cannot be updated:

* ❌ Instance type (create new node group)
* ❌ Disk size (create new node group)
* ❌ Labels (recreate nodes)
* ❌ Taints (recreate nodes)
* ❌ Subnet (create new node group)

### Update Kubernetes Version

Keep the node group's Kubernetes version aligned with the cluster's version.

Rolling update process:

{% stepper %}
{% step %}

### New node creation

New node is created with the new Kubernetes version.
{% endstep %}

{% step %}

### New node joins

New node joins the cluster and becomes Ready.
{% endstep %}

{% step %}

### Cordoning old node

Old node is cordoned (no new pods scheduled).
{% endstep %}

{% step %}

### Draining old node

Old node is drained and pods are rescheduled.
{% endstep %}

{% step %}

### Deleting old node

Old node is deleted.
{% endstep %}

{% step %}

### Repeat

Repeat for the next node until all are updated.
{% endstep %}
{% endstepper %}

Timeline: \~5–10 minutes per node

Important Considerations:

* [ ] Ensure workloads can be rescheduled
* [ ] Check PodDisruptionBudgets allow rolling update
* [ ] Verify sufficient capacity during update
* [ ] Monitor application health during update
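
To follow a rolling update, you can list the nodes that still report the old kubelet version. A minimal sketch; `nodes_not_on_version` is a hypothetical helper, and the version string in the example is a placeholder.

```bash
# Hypothetical helper: list nodes whose kubelet version differs from the target.
# Arg: target version exactly as reported by the kubelet, e.g. "v1.30.1" (placeholder).
nodes_not_on_version() {
  local target="$1"
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion' |
    awk -v target="$target" '$2 != target { print $1 }'
}
```

The update is complete for the node group once this prints nothing for its nodes.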

## Node Group Best Practices

### ✅ Do's

* Create Untainted Nodes First
  * Start with nodes that have NO taints
  * Ensure essential components can schedule
  * Wait for nodes to be Ready before creating tainted nodes
* Separate Workloads
  * System components: Dedicated untainted nodes
  * Applications: Separate node groups by purpose
  * Specialized: GPU, high-memory, etc.
* Plan Capacity
  * Set appropriate min/max for each node group
  * Consider peak load in max size
  * Allow headroom for updates
* Use Meaningful Labels
  * Label nodes by purpose, environment, team
  * Document label schema
  * Use labels for pod scheduling
* Configure Node Repair
  * Enable for production node groups
  * Improves reliability
  * Reduces manual intervention
* Right-Size Instances
  * Match instance type to workload
  * Don't over-provision
  * Monitor and adjust

### ❌ Don'ts

* Don't Taint All Nodes
  * Always have untainted nodes for essential components
  * Cilium and CoreDNS need untainted nodes
* Don't Under-Size Nodes
  * Minimum 2vcpu-4gb for general workloads
  * Cluster components need resources
* Don't Forget Disk Space
  * Plan for images, logs, temp storage
  * Monitor disk usage
  * Increase if nodes run out of space
* Don't Set Min = Max
  * Allow scaling flexibility
  * Use autoscaler for efficiency
  * Unless fixed size is required
* Don't Block Draining
  * Avoid aggressive PodDisruptionBudgets
  * Plan for node updates
  * Allow graceful termination

## Common Node Group Patterns

### Pattern: Standard Three-Tier

Node Group Planning Example:

```yaml
# General nodes - always running
general-nodes:
  instanceType: 4vcpu-8gb
  minSize: 2
  maxSize: 5
  desiredSize: 2
  taints: []

# Application nodes - scalable
app-nodes:
  instanceType: 4vcpu-16gb
  diskSize: 150
  minSize: 3
  maxSize: 10
  desiredSize: 5
  taints: []
  labels:
    node-role: application
    workload-type: general

# Batch processing - can scale to zero
batch-nodes:
  instanceType: 8vcpu-32gb
  diskSize: 200
  minSize: 0
  maxSize: 20
  desiredSize: 2
  taints:
  - key: workload
    value: batch
    effect: NoSchedule
  labels:
    node-role: batch
    workload-type: batch-processing
```

### Pattern: Environment Separation

```yaml
# Assumes an untainted general node group already exists for cluster components
# Production - highly available
prod-nodes:
  instanceType: 8vcpu-16gb
  diskSize: 200
  minSize: 5
  maxSize: 20
  desiredSize: 10
  taints:
  - key: environment
    value: production
    effect: NoSchedule
  labels:
    environment: production
    tier: application

# Staging - moderate resources
staging-nodes:
  instanceType: 4vcpu-8gb
  diskSize: 100
  minSize: 2
  maxSize: 5
  desiredSize: 3
  taints:
  - key: environment
    value: staging
    effect: NoSchedule
  labels:
    environment: staging
    tier: application
```

## Troubleshooting Node Groups

<details>

<summary>Node Group Stuck in CREATING</summary>

Symptoms:

* Node group status remains CREATING
* No nodes appearing in cluster

Possible Causes:

* Cluster not in PROVISIONED state
* Insufficient subnet IPs
* Invalid configuration
* OpenStack quota exceeded

Solution:

```bash
# Check cluster status (it should be PROVISIONED)

# Check that the subnet has available IPs:
# verify subnet capacity in the VPC service

# Check account quota:
# verify instance quota in OpenStack

# Review the error message:
# check node group details for the specific error
```

</details>

<details>

<summary>Nodes Not Reaching Ready State</summary>

Symptoms:

* Node group status SCALINGUP
* Nodes in NotReady state

Possible Causes:

* CNI not installed
* All nodes have taints (system pods can't schedule)
* Network connectivity issues

Solution:

```bash
# Check node status
kubectl get nodes

# Check why NotReady
kubectl describe node <node-name>

# Check CNI pods
kubectl get pods -n kube-system -l k8s-app=cilium

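# Ensure you have untainted nodes so system pods can schedule
# (untainted nodes show <none>; NotReady nodes also carry automatic node.kubernetes.io/* taints)
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'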
```

</details>

<details>

<summary>Scaling Down Stuck</summary>

Symptoms:

* Node group stuck in SCALINGDOWN
* Desired size < current size but nodes not removed

Possible Causes:

* PodDisruptionBudget preventing drain
* Pods with local storage
* Pods without controller (bare pods)

Solution:

```bash
# Check what's preventing drain
kubectl get pods -A --field-selector spec.nodeName=<node-name>

# Check PodDisruptionBudgets
kubectl get pdb -A

# Manually drain if needed (understand impact!)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```

</details>

## Additional Resources

* [Installing Add-ons](https://docs.cloud.olakrutrim.com/basics/core-infrastructure/krutrim-kubernetes-system/installing-addons) - Install CNI and other essential add-ons
* [Creating Cluster Guide](https://docs.cloud.olakrutrim.com/basics/core-infrastructure/krutrim-kubernetes-system/creating-cluster) - Cluster setup process
* Troubleshooting Guide - Common issues and solutions

