
Commit fdf8136

Add troubleshooting of guest cluster LB IP is not reachable (#909)
* Add troubleshooting of guest cluster LB IP is not reachable * Update per review comments Signed-off-by: Jian Wang <jian.wang@suse.com>
1 parent b47668b commit fdf8136

7 files changed

Lines changed: 383 additions & 1 deletion


docs/rancher/cloud-provider.md

Lines changed: 3 additions & 1 deletion
@@ -393,7 +393,9 @@ Harvester's built-in load balancer offers both **DHCP** and **Pool** modes, and
393393
394394
:::note
395395
396-
Modifying the `IPAM` mode isn't allowed. You must create a new service if you intend to change the `IPAM` mode.
396+
- Modifying the `IPAM` mode isn't allowed. You must create a new service if you intend to change the `IPAM` mode.
397+
398+
- If the load balancer IP is not reachable, refer to [Guest Cluster Loadbalancer IP is not reachable](../troubleshooting/rancher.md#guest-cluster-loadbalancer-ip-is-not-reachable).
397399
398400
:::
399401

docs/troubleshooting/rancher.md

Lines changed: 95 additions & 0 deletions
@@ -83,3 +83,98 @@ Related issues:
8383
8484
- Harvester: [#7105](https://github.com/harvester/harvester/issues/7105) and [#7284](https://github.com/harvester/harvester/issues/7284)
8585
- Rancher: [#45628](https://github.com/rancher/rancher/issues/45628)
86+
87+
## Guest Cluster Loadbalancer IP is not reachable
88+
89+
### Issue Description
90+
91+
1. Create a new [guest cluster](../rancher/node/rke2-cluster.md#create-rke2-kubernetes-cluster) with the default `Container Network: Calico` and the default `Cloud Provider: Harvester`.
92+
93+
1. Deploy `nginx` on the new guest cluster by running `kubectl apply -f https://k8s.io/examples/application/deployment.yaml`.
94+
95+
1. Create a [Load Balancer](../rancher/cloud-provider.md#load-balancer-support) service that selects the nginx pods as its backend.
96+
97+
1. The service becomes ready with an IP allocated from the DHCP server or an IP pool, but the page might fail to load when you click the service link.
98+
99+
![](/img/v1.5/troubleshooting/gc-lb-is-not-reachable.png)
100+
101+
### Root Cause
102+
103+
In the following example, the IP of the guest cluster node (a Harvester VM) is `10.115.1.46`. Later, a new load balancer IP `10.115.6.200` is added to a new interface such as `vip-fd8c28ce (@enp1s0)`. However, the `calico` controller takes over the load balancer IP, which makes it unreachable. To confirm this, open a shell session on the node using its original IP and run the following command.
104+
105+
```sh
106+
$ ip -d link show dev vxlan.calico
107+
44: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
108+
link/ether 66:a7:41:00:1d:ba brd ff:ff:ff:ff:ff:ff promiscuity 0 allmulti 0 minmtu 68 maxmtu 65535
109+
info: Using default fan map value (33)
110+
vxlan id 4096 local 10.115.6.200 dev vip-8a928fa0 srcport 0 0 dstport 4789 nolearning ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 tso_max_size 65536 tso_max_segs 65535 gro_max_size 65536
111+
112+
The IP 10.115.6.200 is from the vip-* interface.
113+
114+
```
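
The check above can also be scripted. The following is a minimal sketch, not part of Harvester or Calico: the helper names `vxlan_underlay` and `is_vip_dev` are hypothetical, and the sketch assumes the `ip -d link show` output contains the `local <address> dev <interface>` fields shown in the example.

```sh
# Hypothetical helper: reads `ip -d link show dev vxlan.calico` output on stdin
# and prints the underlay address and device that the VXLAN interface is bound to.
vxlan_underlay() {
  awk '{
    for (i = 1; i < NF; i++) {
      if ($i == "local") addr = $(i + 1)
      if ($i == "dev")   dev  = $(i + 1)
    }
  }
  END {
    if (addr == "" || dev == "") exit 1
    print addr, dev
  }'
}

# Hypothetical helper: succeeds when the device name looks like a
# load balancer VIP interface (vip-*).
is_vip_dev() {
  case "$1" in
    vip-*) return 0 ;;
    *)     return 1 ;;
  esac
}

# Usage on a guest cluster node:
#   set -- $(ip -d link show dev vxlan.calico | vxlan_underlay)
#   is_vip_dev "$2" && echo "vxlan.calico is bound to VIP $1 on $2"
```

If the device reported for `vxlan.calico` matches `vip-*`, the node is likely affected by the issue described in this section.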
115+
116+
### Affected versions
117+
118+
[IP autodetection](https://github.com/projectcalico/calico/blob/aaee80d6e09254dc8c045136c9b31114b5aea9a9/node/pkg/lifecycle/startup/autodetection/autodetection_methods.go#L30) has been available since Calico `v3.22` or even earlier, with `first-found` as the default method.
119+
120+
SUSE RKE2 [`v1.29`](https://www.suse.com/suse-rke2/support-matrix/all-supported-versions/rke2-v1-29/) ships Calico `v3.29.2`, and [`v1.35`](https://www.suse.com/suse-rke2/support-matrix/all-supported-versions/rke2-v1-35/) ships Calico `v3.31.2`.
121+
122+
As a result, most recent RKE2 clusters that use `Calico` as the CNI and `Harvester-cloud-provider` to provide `LoadBalancer` type services might be affected by this issue.
123+
124+
### Workaround
125+
126+
#### For newly created cluster
127+
128+
When creating a new cluster in `Rancher Manager`, click **Add-on: Calico** to open the YAML configuration window, then add the following two lines under `.installation.calicoNetwork`.
129+
130+
![](/img/v1.5/troubleshooting/change-calico-install-params.png)
131+
132+
```yaml
133+
installation:
134+
  backend: VXLAN
135+
  calicoNetwork:
136+
    bgp: Disabled
137+
    nodeAddressAutodetectionV4: # add this line
138+
      skipInterface: vip.* # add this line
139+
```
140+
141+
With this setting, the `calico` controller no longer takes over the load balancer IP.
142+
143+
#### For existing clusters
144+
145+
Run `kubectl edit installation`, go to the `.spec.calicoNetwork.nodeAddressAutodetectionV4` section, remove any existing line such as `firstFound: true`, add the new line `skipInterface: vip.*`, and save.
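
For a non-interactive alternative to the edit described here, the same change might be expressed as a JSON merge patch, where a `null` value removes a field. This is a sketch under assumptions: the helper names are hypothetical, and the `Installation` resource is assumed to be named `default`. Verify the name with `kubectl get installation` before applying anything.

```sh
# Hypothetical helper: prints a JSON merge patch that drops `firstFound`
# and sets `skipInterface` to the given regular expression.
autodetect_patch() {
  printf '{"spec":{"calicoNetwork":{"nodeAddressAutodetectionV4":{"firstFound":null,"skipInterface":"%s"}}}}\n' "$1"
}

# Hypothetical wrapper. Assumption: the Installation resource is named `default`.
apply_autodetect_patch() {
  kubectl patch installation default --type merge -p "$(autodetect_patch 'vip.*')"
}

# Usage:
#   autodetect_patch 'vip.*'     # review the patch first
#   apply_autodetect_patch       # then apply it
```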
146+
147+
Wait about 2 minutes for the rolling update of the `calico-system/calico-node` daemonset to complete. The updated pods then use the node IP for VXLAN.
148+
149+
Run the following command to check whether the `vxlan.calico` interface uses the node IP, such as `10.115.1.46`, instead of the VIP.
150+
151+
```sh
152+
$ ip -d link show dev vxlan.calico
153+
154+
45: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
155+
link/ether 66:a7:41:00:1d:ba brd ff:ff:ff:ff:ff:ff promiscuity 0 allmulti 0 minmtu 68 maxmtu 65535
156+
info: Using default fan map value (33)
157+
vxlan id 4096 local 10.115.1.46 dev enp1s0 srcport 0 0 dstport 4789 nolearning ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 tso_max_size 65536 tso_max_segs 65535 gro_max_size 65536
158+
159+
```
160+
161+
If it still uses the VIP, check the `tigera-operator` pod log for the keyword `failed calling webhook`.
162+
163+
```sh
164+
$ kubectl -n tigera-operator logs tigera-operator-8566d6db5c-wfjkt
165+
...
166+
{"level":"error","ts":"2025-12-18T09:06:37Z","msg":"Reconciler error","controller":"tigera-installation-controller","object":{"name":"periodic-5m0s-reconcile-event"},"namespace":"","name":"periodic-5m0s-reconcile-event","reconcileID":"bae9d2da-a4bf-4d8b-89b8-c8a23a96f351","error":"Internal error occurred: failed calling webhook \"rancher.cattle.io.namespaces\": failed to call webhook: Post \"https://rancher-webhook.cattle-system.svc:443/v1/webhook/validation/namespaces?timeout=10s\": context deadline exceeded"...}
167+
```
168+
169+
If that is the case, update the `calico-system/calico-node` daemonset directly and add the following environment variable to the container.
170+
171+
```yaml
172+
- name: IP_AUTODETECTION_METHOD
173+
  value: skip-interface=vip.*
174+
```
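
Instead of editing the daemonset manifest by hand, the environment variable might be set with `kubectl set env`. This is only a sketch: the function name is hypothetical, and it assumes the target container in the daemonset is named `calico-node`.

```sh
# Hypothetical wrapper around `kubectl set env`.
# Assumption: the container to update is named `calico-node`.
set_autodetect_env() {
  kubectl -n calico-system set env daemonset/calico-node \
    -c calico-node 'IP_AUTODETECTION_METHOD=skip-interface=vip.*'
}

# Usage:
#   set_autodetect_env
#   kubectl -n calico-system rollout status daemonset/calico-node
```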
175+
176+
Wait about 2 minutes and check the `vxlan.calico` interface again. Once it no longer takes over the VIP, the VIP remains reachable.
177+
178+
### Related Issues
179+
180+
[#8072](https://github.com/harvester/harvester/issues/8072) and [#9767](https://github.com/harvester/harvester/issues/9767)

versioned_docs/version-v1.5/troubleshooting/rancher.md

Lines changed: 95 additions & 0 deletions
@@ -83,3 +83,98 @@ Related issues:
8383
8484
- Harvester: [#7105](https://github.com/harvester/harvester/issues/7105) and [#7284](https://github.com/harvester/harvester/issues/7284)
8585
- Rancher: [#45628](https://github.com/rancher/rancher/issues/45628)
86+
87+
## Guest Cluster Loadbalancer IP is not reachable
88+
89+
### Issue Description
90+
91+
1. Create a new [guest cluster](../rancher/node/rke2-cluster.md#create-rke2-kubernetes-cluster) with the default `Container Network: Calico` and the default `Cloud Provider: Harvester`.
92+
93+
1. Deploy `nginx` on the new guest cluster by running `kubectl apply -f https://k8s.io/examples/application/deployment.yaml`.
94+
95+
1. Create a [Load Balancer](../rancher/cloud-provider.md#load-balancer-support) service that selects the nginx pods as its backend.
96+
97+
1. The service becomes ready with an IP allocated from the DHCP server or an IP pool, but the page might fail to load when you click the service link.
98+
99+
![](/img/v1.5/troubleshooting/gc-lb-is-not-reachable.png)
100+
101+
### Root Cause
102+
103+
In the following example, the IP of the guest cluster node (a Harvester VM) is `10.115.1.46`. Later, a new load balancer IP `10.115.6.200` is added to a new interface such as `vip-fd8c28ce (@enp1s0)`. However, the `calico` controller takes over the load balancer IP, which makes it unreachable. To confirm this, open a shell session on the node using its original IP and run the following command.
104+
105+
```sh
106+
$ ip -d link show dev vxlan.calico
107+
44: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
108+
link/ether 66:a7:41:00:1d:ba brd ff:ff:ff:ff:ff:ff promiscuity 0 allmulti 0 minmtu 68 maxmtu 65535
109+
info: Using default fan map value (33)
110+
vxlan id 4096 local 10.115.6.200 dev vip-8a928fa0 srcport 0 0 dstport 4789 nolearning ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 tso_max_size 65536 tso_max_segs 65535 gro_max_size 65536
111+
112+
The IP 10.115.6.200 is from the vip-* interface.
113+
114+
```
115+
116+
### Affected versions
117+
118+
[IP autodetection](https://github.com/projectcalico/calico/blob/aaee80d6e09254dc8c045136c9b31114b5aea9a9/node/pkg/lifecycle/startup/autodetection/autodetection_methods.go#L30) has been available since Calico `v3.22` or even earlier, with `first-found` as the default method.
119+
120+
SUSE RKE2 [`v1.29`](https://www.suse.com/suse-rke2/support-matrix/all-supported-versions/rke2-v1-29/) ships Calico `v3.29.2`, and [`v1.35`](https://www.suse.com/suse-rke2/support-matrix/all-supported-versions/rke2-v1-35/) ships Calico `v3.31.2`.
121+
122+
As a result, most recent RKE2 clusters that use `Calico` as the CNI and `Harvester-cloud-provider` to provide `LoadBalancer` type services might be affected by this issue.
123+
124+
### Workaround
125+
126+
#### For newly created cluster
127+
128+
When creating a new cluster in `Rancher Manager`, click **Add-on: Calico** to open the YAML configuration window, then add the following two lines under `.installation.calicoNetwork`.
129+
130+
![](/img/v1.5/troubleshooting/change-calico-install-params.png)
131+
132+
```yaml
133+
installation:
134+
  backend: VXLAN
135+
  calicoNetwork:
136+
    bgp: Disabled
137+
    nodeAddressAutodetectionV4: # add this line
138+
      skipInterface: vip.* # add this line
139+
```
140+
141+
With this setting, the `calico` controller no longer takes over the load balancer IP.
142+
143+
#### For existing clusters
144+
145+
Run `kubectl edit installation`, go to the `.spec.calicoNetwork.nodeAddressAutodetectionV4` section, remove any existing line such as `firstFound: true`, add the new line `skipInterface: vip.*`, and save.
146+
147+
Wait about 2 minutes for the rolling update of the `calico-system/calico-node` daemonset to complete. The updated pods then use the node IP for VXLAN.
148+
149+
Run the following command to check whether the `vxlan.calico` interface uses the node IP, such as `10.115.1.46`, instead of the VIP.
150+
151+
```sh
152+
$ ip -d link show dev vxlan.calico
153+
154+
45: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
155+
link/ether 66:a7:41:00:1d:ba brd ff:ff:ff:ff:ff:ff promiscuity 0 allmulti 0 minmtu 68 maxmtu 65535
156+
info: Using default fan map value (33)
157+
vxlan id 4096 local 10.115.1.46 dev enp1s0 srcport 0 0 dstport 4789 nolearning ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 tso_max_size 65536 tso_max_segs 65535 gro_max_size 65536
158+
159+
```
160+
161+
If it still uses the VIP, check the `tigera-operator` pod log for the keyword `failed calling webhook`.
162+
163+
```sh
164+
$ kubectl -n tigera-operator logs tigera-operator-8566d6db5c-wfjkt
165+
...
166+
{"level":"error","ts":"2025-12-18T09:06:37Z","msg":"Reconciler error","controller":"tigera-installation-controller","object":{"name":"periodic-5m0s-reconcile-event"},"namespace":"","name":"periodic-5m0s-reconcile-event","reconcileID":"bae9d2da-a4bf-4d8b-89b8-c8a23a96f351","error":"Internal error occurred: failed calling webhook \"rancher.cattle.io.namespaces\": failed to call webhook: Post \"https://rancher-webhook.cattle-system.svc:443/v1/webhook/validation/namespaces?timeout=10s\": context deadline exceeded"...}
167+
```
168+
169+
If that is the case, update the `calico-system/calico-node` daemonset directly and add the following environment variable to the container.
170+
171+
```yaml
172+
- name: IP_AUTODETECTION_METHOD
173+
  value: skip-interface=vip.*
174+
```
175+
176+
Wait about 2 minutes and check the `vxlan.calico` interface again. Once it no longer takes over the VIP, the VIP remains reachable.
177+
178+
### Related Issues
179+
180+
[#8072](https://github.com/harvester/harvester/issues/8072) and [#9767](https://github.com/harvester/harvester/issues/9767)

versioned_docs/version-v1.6/troubleshooting/rancher.md

Lines changed: 95 additions & 0 deletions
@@ -83,3 +83,98 @@ Related issues:
8383
8484
- Harvester: [#7105](https://github.com/harvester/harvester/issues/7105) and [#7284](https://github.com/harvester/harvester/issues/7284)
8585
- Rancher: [#45628](https://github.com/rancher/rancher/issues/45628)
86+
87+
## Guest Cluster Loadbalancer IP is not reachable
88+
89+
### Issue Description
90+
91+
1. Create a new [guest cluster](../rancher/node/rke2-cluster.md#create-rke2-kubernetes-cluster) with the default `Container Network: Calico` and the default `Cloud Provider: Harvester`.
92+
93+
1. Deploy `nginx` on the new guest cluster by running `kubectl apply -f https://k8s.io/examples/application/deployment.yaml`.
94+
95+
1. Create a [Load Balancer](../rancher/cloud-provider.md#load-balancer-support) service that selects the nginx pods as its backend.
96+
97+
1. The service becomes ready with an IP allocated from the DHCP server or an IP pool, but the page might fail to load when you click the service link.
98+
99+
![](/img/v1.5/troubleshooting/gc-lb-is-not-reachable.png)
100+
101+
### Root Cause
102+
103+
In the following example, the IP of the guest cluster node (a Harvester VM) is `10.115.1.46`. Later, a new load balancer IP `10.115.6.200` is added to a new interface such as `vip-fd8c28ce (@enp1s0)`. However, the `calico` controller takes over the load balancer IP, which makes it unreachable. To confirm this, open a shell session on the node using its original IP and run the following command.
104+
105+
```sh
106+
$ ip -d link show dev vxlan.calico
107+
44: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
108+
link/ether 66:a7:41:00:1d:ba brd ff:ff:ff:ff:ff:ff promiscuity 0 allmulti 0 minmtu 68 maxmtu 65535
109+
info: Using default fan map value (33)
110+
vxlan id 4096 local 10.115.6.200 dev vip-8a928fa0 srcport 0 0 dstport 4789 nolearning ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 tso_max_size 65536 tso_max_segs 65535 gro_max_size 65536
111+
112+
The IP 10.115.6.200 is from the vip-* interface.
113+
114+
```
115+
116+
### Affected versions
117+
118+
[IP autodetection](https://github.com/projectcalico/calico/blob/aaee80d6e09254dc8c045136c9b31114b5aea9a9/node/pkg/lifecycle/startup/autodetection/autodetection_methods.go#L30) has been available since Calico `v3.22` or even earlier, with `first-found` as the default method.
119+
120+
SUSE RKE2 [`v1.29`](https://www.suse.com/suse-rke2/support-matrix/all-supported-versions/rke2-v1-29/) ships Calico `v3.29.2`, and [`v1.35`](https://www.suse.com/suse-rke2/support-matrix/all-supported-versions/rke2-v1-35/) ships Calico `v3.31.2`.
121+
122+
As a result, most recent RKE2 clusters that use `Calico` as the CNI and `Harvester-cloud-provider` to provide `LoadBalancer` type services might be affected by this issue.
123+
124+
### Workaround
125+
126+
#### For newly created cluster
127+
128+
When creating a new cluster in `Rancher Manager`, click **Add-on: Calico** to open the YAML configuration window, then add the following two lines under `.installation.calicoNetwork`.
129+
130+
![](/img/v1.5/troubleshooting/change-calico-install-params.png)
131+
132+
```yaml
133+
installation:
134+
  backend: VXLAN
135+
  calicoNetwork:
136+
    bgp: Disabled
137+
    nodeAddressAutodetectionV4: # add this line
138+
      skipInterface: vip.* # add this line
139+
```
140+
141+
With this setting, the `calico` controller no longer takes over the load balancer IP.
142+
143+
#### For existing clusters
144+
145+
Run `kubectl edit installation`, go to the `.spec.calicoNetwork.nodeAddressAutodetectionV4` section, remove any existing line such as `firstFound: true`, add the new line `skipInterface: vip.*`, and save.
146+
147+
Wait about 2 minutes for the rolling update of the `calico-system/calico-node` daemonset to complete. The updated pods then use the node IP for VXLAN.
148+
149+
Run the following command to check whether the `vxlan.calico` interface uses the node IP, such as `10.115.1.46`, instead of the VIP.
150+
151+
```sh
152+
$ ip -d link show dev vxlan.calico
153+
154+
45: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
155+
link/ether 66:a7:41:00:1d:ba brd ff:ff:ff:ff:ff:ff promiscuity 0 allmulti 0 minmtu 68 maxmtu 65535
156+
info: Using default fan map value (33)
157+
vxlan id 4096 local 10.115.1.46 dev enp1s0 srcport 0 0 dstport 4789 nolearning ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 tso_max_size 65536 tso_max_segs 65535 gro_max_size 65536
158+
159+
```
160+
161+
If it still uses the VIP, check the `tigera-operator` pod log for the keyword `failed calling webhook`.
162+
163+
```sh
164+
$ kubectl -n tigera-operator logs tigera-operator-8566d6db5c-wfjkt
165+
...
166+
{"level":"error","ts":"2025-12-18T09:06:37Z","msg":"Reconciler error","controller":"tigera-installation-controller","object":{"name":"periodic-5m0s-reconcile-event"},"namespace":"","name":"periodic-5m0s-reconcile-event","reconcileID":"bae9d2da-a4bf-4d8b-89b8-c8a23a96f351","error":"Internal error occurred: failed calling webhook \"rancher.cattle.io.namespaces\": failed to call webhook: Post \"https://rancher-webhook.cattle-system.svc:443/v1/webhook/validation/namespaces?timeout=10s\": context deadline exceeded"...}
167+
```
168+
169+
If that is the case, update the `calico-system/calico-node` daemonset directly and add the following environment variable to the container.
170+
171+
```yaml
172+
- name: IP_AUTODETECTION_METHOD
173+
  value: skip-interface=vip.*
174+
```
175+
176+
Wait about 2 minutes and check the `vxlan.calico` interface again. Once it no longer takes over the VIP, the VIP remains reachable.
177+
178+
### Related Issues
179+
180+
[#8072](https://github.com/harvester/harvester/issues/8072) and [#9767](https://github.com/harvester/harvester/issues/9767)
