Auto Healing
1. Overview
Auto Healing is a feature that automatically monitors the health of worker nodes in a cluster. When a node enters an unhealthy state (lost connection, node hang, unresponsive, prolonged NotReady...), the system automatically deletes the faulty node and creates a replacement β the entire process requires no manual intervention.
VKS allows users to customize several Auto Healing parameters via the Portal: enable/disable, unhealthy node detection thresholds, and safety limits to protect the cluster.
Note for IAM-user accounts:
Before configuring Auto Healing, IAM-user accounts must be granted the permissions below (an admin should grant these once before use):
UpdateAutoHealingConfig
Update Auto Healing configuration
Create new cluster / Update a running cluster
2. Key highlights
Runs 24/7, no manual intervention required. Faulty nodes are automatically detected and replaced even outside business hours.
Flexible enable/disable. Can be temporarily disabled during maintenance or manual node inspection to avoid conflicts with your operations.
Cluster protection mechanism. When too many nodes fail simultaneously β typically a sign of a widespread infrastructure incident β the system halts replacements to prevent mass deletion and total capacity loss.
False positive prevention. A node must maintain an error state continuously for a sufficient duration (Timeout) before being considered unhealthy, preventing false triggers when a node experiences only a transient issue (network flap, brief pause due to memory reclamation...).
Supports both starting and running nodes. New nodes have a default 30-minute window (
nodeStartupTimeout) to become ready β if they fail to come up within that time, they are also considered candidates for replacement.Simple configuration, suitable for any scale. Supports both absolute values (
2) and percentages (30%), usable from small clusters to large production environments.
3. How it works

4. Configurable parameters
VKS allows configuration of the following 4 parameters via the Portal:
4.1. Enable / Disable Auto Healing
What it does: Master switch for the entire feature. When disabled, the system still monitors node health but does not automatically replace nodes.
When to enable: Recommended by default for production and any cluster requiring high availability.
When to disable:
Actively inspecting/debugging nodes manually and don't want the system to intervene.
During planned maintenance: updating nodes your own way, changing node configuration, migrating workloads.
Short-lived test environments that don't require availability guarantees.
When Auto Healing is disabled, faulty nodes persist until someone handles them manually. Remember to re-enable it immediately after maintenance β this is the most important protection layer for your cluster.
4.2. Timeout
What it does: The duration a node must continuously maintain an error state before being flagged for replacement. This mechanism filters transient issues such as network flaps, brief disconnections, pauses due to memory reclamation, or temporary CPU throttling.
When to increase: Unstable network infrastructure where nodes often experience brief disconnections before recovering on their own. Increase Timeout to give nodes more time to self-heal.
When to decrease: Production clusters that need to respond quickly to failures, accepting the risk of early node replacement to restore capacity faster.
Recommendations:
Standard production
5 minutes (default)
Unstable infrastructure
10β15 minutes
The recommended minimum is 5 minutes. Below that, nodes may flip between states (NotReady β Ready within tens of seconds) for legitimate reasons β not always a real incident. A Timeout that is too low leads to unnecessary node replacements, wastes resources, and disrupts workloads.
4.3. Max Unhealthy
What it does: The maximum number of unhealthy nodes the system still allows automatic replacement for. This is a protection mechanism β when the number of faulty nodes exceeds the threshold, replacements stop for the entire cluster.
Rationale: Multiple simultaneous node failures are typically a sign of a widespread infrastructure incident (full cluster network loss, storage failure, cloud provider failure). In that case, mass replacement does not fix the root cause and instead drains the cluster of all capacity, prolonging the outage.
When to increase: Large clusters (> 20 nodes) where workloads can tolerate many simultaneous node losses.
When to decrease: Critical production clusters where each replacement carries a brief disruption risk for pods running on the deleted node.
Recommendations:
Small cluster (β€ 3 nodes)
1
Standard production
40%
Large cluster, auto-scaling
30% (percentage scales with cluster size)
Stability-first cluster
1 (replace one node at a time)
Examples:
Cluster of 10 nodes,
Max Unhealthy = 2: 1β2 faulty nodes β replacement proceeds normally. β₯ 3 faulty nodes β replacement stops,RemediationRestrictedevent is raised.Cluster of 50 nodes,
Max Unhealthy = 40%: allows replacement when β€ 20 nodes are unhealthy.
4.4. Unhealthy Range
What it does: Similar to Max Unhealthy but allows configuring both a lower and upper threshold. Replacement only occurs when the number of unhealthy nodes falls within the range [min β max].
Key difference from Max Unhealthy:
Upper threshold (
max): Same as Max Unhealthy β stops replacement when too many nodes fail.Lower threshold (
min): Skips replacement when too few nodes fail β avoids overreacting to isolated incidents.
When to use: Clusters where nodes frequently flap up and down, or workloads that should only trigger replacement when multiple nodes fail simultaneously (a single faulty node may self-recover).
Example:
Cluster of 20 nodes, Unhealthy Range = [2-5]:
1 faulty node β no replacement (may be temporary β let the node self-recover or inspect manually).
2β5 faulty nodes β replacement proceeds normally.
β₯ 6 faulty nodes β replacement stops (suspected infrastructure incident).
Comparison:
Unhealthy node count
Max Unhealthy = 3
Unhealthy Range = [2-5]
1
Replace
Skip (below min threshold)
3
Replace
Replace
5
Skip (exceeds max)
Replace
6
Skip
Skip (exceeds max)
Parameter summary
Enable / Disable
on/off
On
Master switch for Auto Healing
Timeout
minutes
5 minutes
Filters false positives before marking a node unhealthy
Max Unhealthy
number or %
100%
Upper threshold β protection mechanism for replacements
Unhealthy Range
[min-max]
β
Lower and upper threshold (takes priority over Max Unhealthy)
5. Recommended configuration by cluster type
Dev / Test (1β3 nodes)
On
5 min
1
β
Small production (3β10 nodes)
On
5 min
40%
β
Medium production (10β20 nodes)
On
5 min
30%
or [2-6]
Large production (> 20 nodes)
On
5 min
30%
or [2-N] (N β 30% of total nodes)
Stability-first (databases, stateful workloads, critical workloads)
On
10 min
1
β
Under maintenance
Off
β
β
β
Guidance on choosing values:
Small clusters: Set
Max Unhealthy = 1to replace nodes one at a time, avoiding simultaneous deletion that drains all capacity.Large clusters: Use percentages so the threshold scales automatically as the cluster grows or shrinks.
Sensitive workloads: Increase Timeout to 10 minutes for a higher buffer, reducing the risk of replacing a node during a transient incident the node could have recovered from on its own.
6. Configuration guide
6.1. Configure Auto Healing when creating a new Cluster
Step 1: Open the Create Cluster page
Log in to VNG Cloud Console
Select VKS β Clusters β Create Cluster
Step 2: Configure Auto Healing
Locate the Auto Healing section in the Create Cluster form
Enable Auto Healing (enabled by default)
Fill in the parameters as needed:
Max Unhealthy
40%
Stop replacement when more than 40% of nodes are faulty
Timeout
5
Wait 5 minutes before marking a node as unhealthy

Click Create to create the cluster
If you skip the Auto Healing section, the system automatically enables it with defaults: Max Unhealthy = 100%, Timeout = 5 minutes.
6.2. Update Auto Healing configuration for a running Cluster
Step 1: Open Cluster details
Select VKS β Clusters
Click the name of the cluster you want to update

Step 2: Edit the Auto Healing configuration
IAM-user accounts require the
UpdateAutoHealingConfigpermission β see Note for IAM-user accounts at the top of this page.
Locate the Auto Healing row on the Cluster details page β click Edit
Update the parameters as needed

Click Save to apply the changes
Configuration changes take effect immediately β no cluster or node restart required.
7. Auto Healing Events
During operation, Auto Healing emits several Kubernetes Events on the cluster to report status. You can observe these events via the Portal or when viewing node/cluster details.
DetectedUnhealthy
Warning
A node has begun showing signs of failure (lost connection, NotReady...) and the system is waiting for the full Timeout to confirm. This is an early warning β no replacement yet, the node may still self-recover.
MachineMarkedUnhealthy
Warning
The node has maintained an error state for the full Timeout and has been officially marked as unhealthy. The system is preparing to delete the node and create a replacement.
RemediationRestricted
Warning
The number of faulty nodes has exceeded Max Unhealthy or falls outside Unhealthy Range. The system has paused replacements for the entire cluster to avoid mass deletion β typically a sign of a widespread infrastructure incident requiring manual inspection.
8. Important notes
The cluster halting replacements is correct behavior, not a bug. When the number of faulty nodes exceeds the configured threshold, the system intentionally stops replacing and waits for an operator to intervene rather than performing mass deletions.
nodeStartupTimeoutdefaults to 30 minutes. New nodes have 30 minutes to start up and join the cluster. If still not ready after that time β marked as faulty and will be replaced. This value is managed by the system and cannot be configured via the Portal.Auto Healing does not preserve Pod data. When a node is replaced, pods running on it are also deleted. Workloads that need to persist data must use
PersistentVolume; workloads requiring high availability must have multiple replicas spread across multiple nodes.Configuration changes take effect immediately β no cluster or node restart required.
Auto Healing does not replace a monitoring system. This is a self-healing mechanism at the node level; you still need monitoring tools (metrics, logs, alerts) to detect issues at the application or infrastructure layer that node status alone cannot reflect.
When a node fails, the system creates a replacement. Ensure you have sufficient credit and resource quota to provision the new node.
9. Frequently asked questions
When should I disable Auto Healing?
When you are actively intervening on nodes manually: inspection/debugging, planned maintenance, infrastructure checks. Re-enable it immediately after you're done to keep the cluster protected β leaving it off for long periods removes the cluster's ability to self-heal.
My node is unhealthy but I don't see any replacement. Why?
Three common causes:
Timeout not yet elapsed β the node must maintain an error state continuously for the full Timeout (default 5 minutes) before being flagged. Wait a bit longer.
Safety threshold exceeded β the number of unhealthy nodes has exceeded
Max Unhealthyor falls outsideUnhealthy Range. The system has paused replacements β check the threshold configuration and theRemediationRestrictedevent.Auto Healing is disabled β check the toggle on the Portal.
If a node is replaced, will Pod data be lost?
Pods running directly on the node will be deleted along with the node. To preserve data, use PersistentVolume (storage decoupled from the node lifecycle). To maintain availability, deploy workloads with multiple replicas so that when one replica is lost, the remaining replicas continue serving traffic.
What configuration is right for my cluster?
Refer to the recommendations table in Section 5. Pick the row that matches your cluster's scale and workload characteristics, then fine-tune based on real-world experience.
Why does the Portal show fewer "healthy" nodes than the total node count even though the cluster is running fine?
Nodes in the initialization phase (not yet past nodeStartupTimeout) are not yet counted as "healthy" but are also not considered "unhealthy" β they are in a transitional state. The system waits for the node to finish starting up; only after 30 minutes of not becoming ready is it marked unhealthy and replaced. This is normal behavior, not a bug.
Last updated

