
galal-hussein (Collaborator) commented Dec 30, 2025

This PR refactors the startup command, which is now divided into three functions:

1- single_server()

This is used when k3k starts a cluster with only one node. It needs special handling because if the server's IP changes (i.e. the pod is deleted and restarted), the cluster quorum has to be reset.
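
A minimal sketch of that decision, not the PR's actual code: the function name `single_server_plan` and the `server/db/etcd` directory layout are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: decide how a single-server pod should start.
# The "server/db/etcd" layout under the data dir is an assumption.
single_server_plan() {
	data_dir="$1"
	if [ -d "${data_dir}/server/db/etcd" ]; then
		# Existing datastore found: the pod may have come back with a
		# new IP, so a cluster reset is needed before a normal start.
		echo "reset"
	else
		# No previous data: nothing to reset.
		echo "fresh"
	fi
}
```

In a real startup script the "reset" branch would presumably run `k3s server --cluster-reset` before the normal start, which matches the "Performing cluster-reset" line in the logs.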

2- ha_server()

This is used when a server starts in HA mode. We check whether it is the first server to start: if so, we start it with the init config; otherwise, we start it with the normal server config.
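
A rough sketch of that branch, assuming the first server can be recognised by a pod name ending in `-0` (as in `k3k-testcluster-server-0` from the logs). The PR may detect the first server differently; `ha_role` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: pick the config a HA server pod should start with.
# Assumption: the pod name ends in an ordinal and ordinal 0 is the first server.
ha_role() {
	pod_name="$1"
	# Strip everything up to the last "-" to get the ordinal suffix.
	ordinal="${pod_name##*-}"
	if [ "$ordinal" = "0" ]; then
		echo "init"
	else
		echo "join"
	fi
}
```

For example, `ha_role k3k-testcluster-server-1` selects the normal server config, while the `-0` pod gets the init config.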

3- safe_mode()

If the pod changes IP (the server pod gets recreated), k3k needs to handle the situation, because some services (like the network policy controller) require the node IP to be correct; see k3s-io/k3s#12844. The current safe mode disables the network policy controller and starts a temporary server until the node IP is corrected by the kubelet. Once it is corrected, we exit safe mode gracefully, allowing the pod to start normally.
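
One way the "has the IP changed?" check could look, assuming the previous pod IP is recorded in a file (the logs mention "Adding pod IP file"). The function name `needs_safe_mode` and the file format (a single line holding the IP) are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) only when a previously recorded
# pod IP exists and differs from the current one.
needs_safe_mode() {
	ip_file="$1"
	pod_ip="$2"
	# No recorded IP: first start, so there is nothing to patch.
	[ -f "$ip_file" ] || return 1
	recorded_ip=$(cat "$ip_file")
	[ "$recorded_ip" != "$pod_ip" ]
}
```

When the check succeeds, the temporary server would be started with the network policy controller disabled (k3s exposes this as `--disable-network-policy`) until the node reports the new IP.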

		# Start the loop to wait for the node IP to change
		info "Waiting for Node IP to update to ${POD_IP}."
		count=0
		until kubectl get nodes -o wide 2>/dev/null | grep -q "${POD_IP}"; do
			# Abort if the temporary safe-mode server is no longer running
			if ! kill -0 "$PID" 2>/dev/null; then
				fatal "safe Mode K3s process died unexpectedly!"
			fi
			sleep 2
			count=$((count+1))

			# Give up after roughly two minutes (60 polls, 2 seconds apart)
			if [ "$count" -gt 60 ]; then
				fatal "timed out waiting for node to change IP from $CURRENT_IP to $POD_IP"
			fi
		done

		info "Node IP is set to ${POD_IP} successfully. Stopping Safe Mode process..."
		# Stop the temporary server and reap it before the normal start
		kill "$PID"
		wait "$PID" 2>/dev/null || true
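
A possible tightening of the match in the loop above, offered as a suggestion rather than something the PR does: a plain `grep -q "${POD_IP}"` treats the dots as wildcards and also matches a longer address that merely starts with the same digits. The helper name `node_has_ip` is hypothetical; it reads the `kubectl get nodes -o wide` output on stdin.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if stdin contains the IP as a whole token.
#   -F  treat the IP as a fixed string, so "." is not a regex wildcard
#   -w  require word boundaries, so 10.42.0.21 does not match 10.42.0.218
node_has_ip() {
	grep -qwF -- "$1"
}
```

It could then be used as `kubectl get nodes -o wide 2>/dev/null | node_has_ip "${POD_IP}"` in the `until` condition.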

Logs

These are example logs from different scenarios:

1- single server cluster pod being deleted and recreated:

[INFO] [Thu Jan  1 13:22:31 2026] Starting single node setup...
[INFO] [Thu Jan  1 13:22:31 2026] Existing data found in single node setup. Performing cluster-reset to ensure quorum...
[INFO] [Thu Jan  1 13:22:52 2026] Cluster reset complete. Removing Reset flag file.
[INFO] [Thu Jan  1 13:22:52 2026] Starting K3s in Safe Mode (Network Policy Disabled) to patch Node IP from 10.42.0.216 to 10.42.0.218
[INFO] [Thu Jan  1 13:22:52 2026] Waiting for Node IP to update to 10.42.0.218.
[INFO] [Thu Jan  1 13:24:33 2026] Node IP is set to 10.42.0.218 successfully. Stopping Safe Mode process...
[INFO] [Thu Jan  1 13:24:33 2026] Adding pod IP file.
time="2026-01-01T13:24:33Z" level=info msg="Starting k3s v1.32.8+k3s1 (fe896f7e)"
time="2026-01-01T13:24:33Z" level=info msg="Managed etcd cluster bootstrap already complete and initialized"

2- HA 1st server pod gets deleted and recreated

[INFO] [Thu Jan  1 13:30:36 2026] Starting pod k3k-testcluster-server-0 in HA node setup
[INFO] [Thu Jan  1 13:30:36 2026] Starting K3s in Safe Mode (Network Policy Disabled) to patch Node IP from 10.42.0.220 to 10.42.0.225
[INFO] [Thu Jan  1 13:30:36 2026] Waiting for Node IP to update to 10.42.0.225.
[INFO] [Thu Jan  1 13:30:50 2026] Node IP is set to 10.42.0.225 successfully. Stopping Safe Mode process...
[INFO] [Thu Jan  1 13:30:50 2026] Adding pod IP file.
time="2026-01-01T13:30:50Z" level=info msg="Starting k3s v1.32.11+k3s1 (81195088)"
time="2026-01-01T13:30:50Z" level=info msg="Managed etcd cluster bootstrap already complete and initialized"
time="2026-01-01T13:30:50Z" level=info msg="Reconciling bootstrap data between datastore and disk"
time="2026-01-01T13:30:50Z" level=info msg="Successfully reconciled with remote datastore"
time="2026-01-01T13:30:51Z" level=info msg="Password verified locally for node k3k-testcluster-server-0"
time="2026-01-01T13:30:51Z" level=info msg="certificate CN=k3k-testcluster-server-0 signed by CN=k3s-server-ca@1767273937: notBefore=2026-01-01 12:30:51 +0000 UTC notAfter=2027-01-01 12:30:51 +0000 UTC"

3- HA second or third server pod gets deleted and recreated

[INFO] [Thu Jan  1 13:33:42 2026] Starting pod k3k-testcluster-server-1 in HA node setup
[INFO] [Thu Jan  1 13:33:42 2026] Starting K3s in Safe Mode (Network Policy Disabled) to patch Node IP from 10.42.0.222 to 10.42.0.226
[INFO] [Thu Jan  1 13:33:42 2026] Waiting for Node IP to update to 10.42.0.226.
[INFO] [Thu Jan  1 13:34:50 2026] Node IP is set to 10.42.0.226 successfully. Stopping Safe Mode process...
[INFO] [Thu Jan  1 13:34:50 2026] Adding pod IP file.
time="2026-01-01T13:34:50Z" level=info msg="Starting k3s v1.32.11+k3s1 (81195088)"
time="2026-01-01T13:34:50Z" level=info msg="Managed etcd cluster bootstrap already complete and initialized"
time="2026-01-01T13:34:50Z" level=info msg="Reconciling bootstrap data between datastore and disk"
time="2026-01-01T13:34:50Z" level=info msg="Successfully reconciled with remote datastore"
time="2026-01-01T13:34:51Z" level=info msg="Password verified locally for node k3k-testcluster-server-1"
time="2026-01-01T13:34:51Z" level=info msg="certificate CN=k3k-testcluster-server-1 signed by CN=k3s-server-ca@1767273937: notBefore=2026-01-01 12:34:51 +0000 UTC notAfter=2027-01-01 12:34:51 +0000 UTC"

Signed-off-by: galal-hussein <[email protected]>
codecov-commenter commented Dec 30, 2025

Codecov Report

❌ Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 56.01%. Comparing base (93025d3) to head (46e0e31).
⚠️ Report is 5 commits behind head on main.

Files with missing lines            Patch %   Lines
pkg/controller/cluster/client.go    0.00%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #598      +/-   ##
==========================================
- Coverage   59.02%   56.01%   -3.02%     
==========================================
  Files          56       55       -1     
  Lines        5316     5297      -19     
==========================================
- Hits         3138     2967     -171     
- Misses       1893     2035     +142     
- Partials      285      295      +10     
Flag          Coverage Δ
cli           53.93% <71.42%> (+0.19%) ⬆️
controller    50.21% <85.71%> (-6.86%) ⬇️
e2e           50.21% <85.71%> (-6.86%) ⬇️
unit          37.70% <57.14%> (+1.67%) ⬆️


galal-hussein changed the title from "Patch node ip when server pod restarts" to "Refactor startup command to wait for node IP changes" on Jan 1, 2026
enrichman (Collaborator) left a comment

Just a couple of minor nits, but looks good, thanks!

Signed-off-by: galal-hussein <[email protected]>
enrichman (Collaborator) left a comment

LGTM, thanks! 👏

galal-hussein merged commit a871917 into rancher:main on Jan 9, 2026
9 checks passed

3 participants