Operational Defect Database

BugZero updated this defect 34 days ago.

VMware | 97654

NSX Bare Metal Edge NIC flapping

Last update date:

4/16/2024

Affected products:

NSX

Affected releases:

4.1

Fixed releases:

No fixed releases provided.

Description:

Symptoms

NSX 4.1.1 and above.

Traffic on the Bare Metal Edge experiences datapath disruption.

The NSX UI may report "Edge NIC Transmit Queue Overflow" alarms.

The Edge syslog shows a very high rate of TX hang detection, each followed by a NIC reset:

2024-04-05T17:14:42.461Z Edge NSX 13757 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth2 queue 3 TX hang detected
2024-04-05T17:14:42.461Z Edge NSX 13757 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth2 reset successfully
2024-04-05T17:15:02.461Z Edge NSX 13757 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth2 queue 1 TX hang detected
2024-04-05T17:15:02.461Z Edge NSX 13757 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth2 reset successfully
2024-04-05T17:15:22.461Z Edge NSX 13757 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth2 queue 7 TX hang detected
2024-04-05T17:15:22.461Z Edge NSX 13757 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth2 reset successfully
2024-04-05T17:15:42.461Z Edge NSX 13757 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth2 queue 2 TX hang detected
2024-04-05T17:15:42.461Z Edge NSX 13757 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth2 reset successfully
2024-04-05T17:16:02.461Z Edge NSX 13757 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth2 queue 1 TX hang detected
2024-04-05T17:16:02.461Z Edge NSX 13757 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth2 reset successfully
2024-04-05T17:16:22.461Z Edge NSX 13757 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth2 queue 4 TX hang detected
2024-04-05T17:16:22.461Z Edge NSX 13757 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth2 reset successfully
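To judge how often an Edge is hitting this pattern, the syslog messages in the symptoms can be tallied per NIC and queue. The following is a minimal sketch, not a VMware tool: the function name `summarize_tx_hangs` and the regular expression are assumptions derived only from the message text quoted in this article, and other NSX builds may word these messages differently.

```python
import re
from collections import Counter

# Assumed message shapes, taken from the syslog excerpts in this article:
#   "NIC fp-eth2 queue 3 TX hang detected"
#   "NIC fp-eth2 reset successfully"
EVENT_RE = re.compile(
    r"NIC (?P<nic>fp-eth\d+) "
    r"(?:queue (?P<queue>\d+) TX hang detected|reset successfully)"
)


def summarize_tx_hangs(lines):
    """Count TX hang detections per (NIC, queue) and resets per NIC.

    `lines` is any iterable of syslog lines (a list or an open file).
    Lines that match neither message shape are skipped.
    """
    hangs = Counter()
    resets = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match is None:
            continue
        nic = match.group("nic")
        if match.group("queue") is not None:
            # Hang message: the queue number is present.
            hangs[(nic, int(match.group("queue")))] += 1
        else:
            # Reset message: only the NIC name is present.
            resets[nic] += 1
    return hangs, resets
```

Passing an open handle on a copy of /var/log/syslog to `summarize_tx_hangs` returns two counters. A reset count that roughly tracks the hang count on a Bare Metal Edge is consistent with the behaviour described here, while hang messages with no matching resets are what the article says to expect after the workaround is applied.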

Cause

NSX 4.1.1 introduced a check that resets an Edge NIC when a TX hang condition is detected. This mechanism works as designed on Edge VM. On Bare Metal Edge, it may incorrectly diagnose a TX hang condition, resulting in frequent NIC resets.

Resolution

This is a known issue impacting NSX Bare Metal Edge.

Workaround

To work around the issue immediately, disable the NIC reset feature. On the Bare Metal Edge, as the root user:

# edge-appctl -t /var/run/vmware/edge/dpd.ctl stats/hung_nic_reset disable

Note: After applying the workaround, the Bare Metal Edge will continue to log the following messages in syslog, which can be safely ignored.

"edge_nic_transmit_queue_overflow" alarm with processed packet count as 0. This can be safely ignored.

2024-04-05T19:58:43.372Z Edge1 NSX 9458 - [nsx@6876 comp="nsx-edge" s2comp="nsx-monitoring" entId="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" tid="9909" level="FATAL" eventState="On" eventFeatureName="edge_health" eventSev="critical" eventType="edge_nic_transmit_queue_overflow"] Edge NIC fp-eth2 transmit queue 15 has overflowed by 100.000000% on Edge node xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx. The missed packet count is 15855 and processed packet count is 0.

"NIC fp-ethX queue X TX hang detected" messages. This can be safely ignored.

var/log/syslog:2024-04-05T19:44:09.497Z Edge1 NSX 9458 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats43" level="WARN"] NIC fp-eth0 queue 1 TX hang detected

The NSX UI may also report "Edge NIC Transmit Queue Overflow" alarms. These can be safely ignored, or suppressed if required.

This change does not persist across a reboot or datapath restart. For a persistent workaround, install the script attached to this KB.

Script Installation

1) Copy the 2 scripts to a location on the Edge, placing both scripts in the same folder, e.g.

# mkdir /image/disable_nic_hung_check
# ls -lt
-rw-r--r-- 1 root root 804 Apr 10 04:48 cron_helper.sh
-rwxr-xr-x 1 root root 4579 Apr 10 04:32 disable_nic_hung_check.py

2) Validate that the md5 of both scripts matches these outputs

# md5sum disable_nic_hung_check.py
20257cba75944db1a4424fd582a35f8e disable_nic_hung_check.py
# md5sum cron_helper.sh
39793e102e35f2bb212f31e3ffa6096b cron_helper.sh

3) Install the script

# cd /image/disable_nic_hung_check
# sh cron_helper.sh
no crontab for root
no crontab for root

This will copy the python script to a permanent location and create two cron jobs.

4) Confirm installation

The file now exists in the permanent location

# ls -lt /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py
-rwxr-xr-x 1 root root 4579 Apr 10 04:49 /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py

Two cron jobs have been created

# crontab -l
* * * * * /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py
* * * * * sleep 30; /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py

Operational Validation

Cron is running

# grep CRON.*disable /var/log/syslog
2024-04-10T10:44:01.662Z edge01.corp.local CRON 3870920 - - (root) CMD (/opt/vmware/nsx-edge/bin/disable_nic_hung_check.py)
2024-04-10T10:44:01.538Z edge01.corp.local CRON 3870919 - - (root) CMD (sleep 30; /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py)
2024-04-10T10:45:01.432Z edge01.corp.local CRON 3871473 - - (root) CMD (sleep 30; /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py)
2024-04-10T10:45:01.073Z edge01.corp.local CRON 3871483 - - (root) CMD (/opt/vmware/nsx-edge/bin/disable_nic_hung_check.py)
2024-04-10T10:46:01.837Z edge01.corp.local CRON 3871979 - - (root) CMD (/opt/vmware/nsx-edge/bin/disable_nic_hung_check.py)
2024-04-10T10:46:01.762Z edge01.corp.local CRON 3871980 - - (root) CMD (sleep 30; /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py)

If the script detects a reboot or datapath service restart, it will disable the feature and log to /var/log/syslog

2024-04-10T10:44:01.803Z edge01 NSX 3870922 - [nsx@6876 comp="nsx-edge" subcomp="disable-nic-hung" username="root" level="INFO"] Datapathd bootup/restart detected. Disabled NIC TX hung reset feature...

The node will continue to log the "edge_nic_transmit_queue_overflow" and "TX hang detected" messages after the script is applied. The NSX UI may continue to report "Edge NIC Transmit Queue Overflow" alarms. These can be safely ignored.

Script uninstallation

1) Validate that the cron entries are present

# crontab -l
* * * * * /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py
* * * * * sleep 30; /opt/vmware/nsx-edge/bin/disable_nic_hung_check.py

2) In the example output in step 1), the only crontab entries are for the disable_nic_hung_check.py workaround script, so all can be removed with one command

# crontab -r
# crontab -l
no crontab for root

If other crontab entries are present, crontab -r should not be used, as it will delete all of them. Instead, use crontab -e to delete the 2 entries relating to the disable_nic_hung_check.py script. crontab -e opens a vi editor, where the "dd" command deletes the current line and :wq saves and quits.
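
When root's crontab holds other jobs, the selective removal described for uninstallation amounts to dropping only the lines that mention the workaround script. The following is a minimal sketch of that filtering step under stated assumptions: `strip_workaround_entries` is a hypothetical helper, not part of the scripts attached to the KB, and it only transforms text. It does not read or write the crontab itself.

```python
def strip_workaround_entries(crontab_text, marker="disable_nic_hung_check.py"):
    """Return crontab text with every line mentioning `marker` removed.

    `crontab_text` is the text printed by `crontab -l`. All other lines,
    including comments and blank lines, are kept in their original order.
    """
    kept = [line for line in crontab_text.splitlines() if marker not in line]
    if not kept:
        # Nothing left: the crontab contained only workaround entries.
        return ""
    return "\n".join(kept) + "\n"
```

The returned text is what root's crontab should contain after uninstallation. An empty result corresponds to the case in the article where crontab -r is safe to use, because the workaround entries were the only ones present.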
