Operational Defect Database


VMware | 93411

Calico-node, kube-proxy, and Antrea pods may fail intermittently on Photon 3

Last update date:

4/2/2024

Affected products:

vSphere

Affected releases:

7.0, 8.0

Fixed releases:

No fixed releases provided.

Description:

Symptoms

Intermittently, calico-node pods in Kubernetes clusters on Photon 3 may start failing readiness checks. The pods report:

calico/node is not ready: felix is not ready: readiness probe reporting 503

with the following events:

"Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused"

"Liveness probe failed: calico/node is not ready: Felix is not live: Get "http://localhost:9099/liveness": dial tcp 127.0.0.1:9099: connect: connection refused"

"Readiness probe failed: 2023-06-07 08:53:26.150 [INFO][775] confd/health.go 180: Number of node(s) with BGP peering established = 8"

The calico-node pod logs contain errors such as:

[PANIC][9342] felix/table.go 769: iptables-legacy-save command failed after retries ipVersion=0x4 table="raw"
panic: (*logrus.Entry) 0xc000284b90

Kube-proxy logs on affected nodes also show errors like:

2023-07-07T14:09:28.076615713Z stderr F E0707 14:09:28.076532 1 proxier.go:859] "Failed to ensure chain exists" err="error creating chain \"KUBE-EXTERNAL-SERVICES\": exit status 3: iptables v1.8.2 (legacy): can't initialize iptables table `filter': No child processes\nPerhaps iptables or your kernel needs to be upgraded.\n" table=filter chain=KUBE-EXTERNAL-SERVICES
2023-07-07T14:09:28.076640119Z stderr F I0707 14:09:28.076553 1 proxier.go:851] "Sync failed" retryingTime="30s"

Antrea agent logs on affected nodes also show errors like:

F0326 07:31:50.801001 1 main.go:53] Error running agent: failed to start NPL agent: error when initializing NodePortLocal port table: initialization of NPL iptables rules failed: error checking if chain ANTREA-NODE-PORT-LOCAL exists in table nat: running [/usr/sbin/iptables -t nat -S ANTREA-NODE-PORT-LOCAL 1 --wait]: exit status 3: iptables v1.8.3 (legacy): can't initialize iptables table `nat': No child processes
Perhaps iptables or your kernel needs to be upgraded.
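The following is a minimal sketch of how these symptoms can be confirmed from a machine with a kubeconfig for the affected workload cluster. The k8s-app label selectors and the placeholder pod names are assumptions; adjust them to match the cluster.

# List calico-node pods and check their readiness
kubectl get pods -n kube-system -l k8s-app=calico-node -o wide
# Review probe-failure events for a failing pod (placeholder name)
kubectl describe pod -n kube-system <calico-node-pod> | grep -i "probe failed"
# Look for the iptables-legacy-save panic in the calico-node logs
kubectl logs -n kube-system <calico-node-pod> | grep -i "iptables-legacy-save"
# Check kube-proxy logs for the iptables initialization error
kubectl logs -n kube-system <kube-proxy-pod> | grep -i "can't initialize iptables table"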

Cause

While the exact conditions that trigger the failure have not been determined, the failure has been identified as an issue with the bpfilter module in Linux kernel versions prior to 5.2-rc2.
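As a quick check on a suspect node (for example over SSH, as in the workaround below), the running kernel version and the state of the bpfilter module can be inspected. This is a sketch; exact output will vary by node image.

# Report the running kernel version (the issue affects kernels prior to 5.2-rc2)
uname -r
# Show whether the bpfilter module is currently loaded
lsmod | grep bpfilter || echo "bpfilter is not loaded"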

Impact / Risks

Pod networking will fail on affected nodes.

Resolution

This is fixed in the following TKRs and newer:

v1.26.13---vmware.1-fips.1-tkg.3 for vSphere 8.x
v1.26.12---vmware.2-fips.1-tkg.2 for vSphere 7.x
v1.27.6---vmware.1-fips.1-tkg.1 for vSphere 7.x
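As a sketch, the Tanzu Kubernetes releases available in the environment can be listed against the Supervisor cluster to confirm that one of the fixed TKRs is present. The resource names below assume the standard vSphere with Tanzu APIs; the namespace and cluster names are placeholders.

# List the Tanzu Kubernetes releases available on the Supervisor cluster
kubectl get tanzukubernetesreleases
# Check the release currently used by an existing cluster in a given Supervisor namespace
kubectl get tanzukubernetescluster -n <namespace> <cluster>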

Workaround

1. Save the following script to a file named vsphere_7_disable-bpfilter.sh OR vsphere_8_disable-bpfilter.sh (depending on version), or download the attached script, and place it on a machine that has a kubeconfig for the Supervisor cluster and L3 access to the nodes to be remediated.

NOTE: The difference between the attached scripts is the vspheremachine reference for vSphere 8.x vs. the wcpmachine reference for vSphere 7.x (the script below uses the vspheremachine reference).

#!/bin/bash

# Check if namespace and cluster arguments are provided
if [ $# -ne 2 ]; then
  echo "Usage: $0 <namespace> <cluster>"
  exit 1
fi

# Set the namespace and cluster variables
NAMESPACE=$1
CLUSTER=$2

# Retrieve the SSH private key from the secret and write it to a temporary file
PRIVATE_KEY_FILE=$(mktemp)
kubectl get secret -n "$NAMESPACE" "$CLUSTER-ssh" --template='{{index .data "ssh-privatekey" | base64decode}}' >"$PRIVATE_KEY_FILE"
chmod 600 "$PRIVATE_KEY_FILE"

# Embedded script to run on each node
SCRIPT_TO_RUN=$(
  cat <<'END_SCRIPT'
#!/bin/bash
echo "Running bpfilter remediation script on node: $(hostname)"
echo Unloading bpfilter module
sudo modprobe -r bpfilter
echo Disabling bpfilter module
echo "blacklist bpfilter" | sudo tee /etc/modprobe.d/disable-bpfilter.conf >/dev/null
echo "install bpfilter /bin/true" | sudo tee -a /etc/modprobe.d/disable-bpfilter.conf >/dev/null
sudo systemctl restart systemd-modules-load.service
echo "Testing disablement"
sudo modprobe -n -v bpfilter
sudo lsmod | grep bpfilter || echo "bpfilter is not loaded"
END_SCRIPT
)

# Get the list of node names using kubectl and --template
NODE_NAMES=$(kubectl get vspheremachine -n "$NAMESPACE" -l "cluster.x-k8s.io/cluster-name=$CLUSTER" --template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')

# Iterate over each node
for NODE_NAME in $NODE_NAMES; do
  # Get the node's IP address using kubectl and --template
  NODE_IP=$(kubectl get vspheremachine -n "$NAMESPACE" "$NODE_NAME" --template='{{.status.vmIp}}')

  # SSH into the node using the private key file, ignore host key checking, and run the embedded script
  echo "Running script on node: $NODE_NAME"
  ssh -i "$PRIVATE_KEY_FILE" -o StrictHostKeyChecking=no "vmware-system-user@$NODE_IP" "bash -s" <<<"$SCRIPT_TO_RUN"

  # Add any additional commands you want to run on each node here
  # ...

  echo "Finished running script on node: $NODE_NAME"
done

# Remove the temporary private key file
rm "$PRIVATE_KEY_FILE"

2. Make the script executable (change the filename to reflect the downloaded version):

# chmod +x ./vsphere_7_disable-bpfilter.sh

3. Run the script to disable the bpfilter module, passing the Supervisor namespace and the cluster name (change the filename to reflect the downloaded version):

# ./vsphere_7_disable-bpfilter.sh <namespace> <cluster>

Node reboots can also temporarily resolve the problem.
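After the script has run against all nodes (or after a node reboot), recovery of the affected pods can be confirmed. This is a sketch, assuming the standard k8s-app labels on the calico-node and kube-proxy pods in the workload cluster and a placeholder node IP.

# Confirm the previously failing pods return to Running/Ready
kubectl get pods -n kube-system -l k8s-app=calico-node
kubectl get pods -n kube-system -l k8s-app=kube-proxy
# Confirm bpfilter remains disabled on a remediated node
ssh vmware-system-user@<node-ip> "lsmod | grep bpfilter || echo bpfilter is not loaded"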
