Operational Defect Database

BugZero updated this defect 44 days ago.

VMware | 91147

NSX Edge datapath service not working due to larger AppHA packets.

Last update date:

4/5/2024

Affected products:

NSX Transformers

Affected releases:

No affected releases provided.

Fixed releases:

No fixed releases provided.

Description:

Symptoms

BFD tunnels on the Edge are down.

Edge TEP is not reachable.

Larger AppHA packets can be verified by checking syslog for entries like the following, where the reported packet size (here 1976b) is larger than 1472b. Any size above 1472b is a problem.

/var/log# grep "AppHA-tx-Bridge" syslog
2023-02-13T23:15:12.799Z 10-172-23-51 NSX 17 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="db-config" level="INFO"] AppHA-tx-Bridge(00085,00000): ANNO.REQ.0000000000:0000000000,peer=0c0f0304-9315-11ed-8c11-0050568d1c4b,1976b
2023-02-13T23:15:12.803Z 10-172-23-51 NSX 17 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="db-config" level="INFO"] AppHA-tx-Bridge(00086,00000): ANNO.REQ.0000000000:0000000000,peer=0c0f0304-9315-11ed-8c11-0050568d1c4b,1976b
2023-02-13T23:15:13.121Z 10-172-23-51 NSX 17 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="db-config" level="INFO"] AppHA-tx-Bridge(00087,00000): ANNO.REQ.0000000000:0000000000,peer=0c0f0304-9315-11ed-8c11-0050568d1c4b,1976b
2023-02-13T23:15:13.160Z 10-172-23-51 NSX 17 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="db-config" level="INFO"] AppHA-tx-Bridge(00088,00000): ANNO.REQ.0000000000:0000000000,peer=0c0f0304-9315-11ed-8c11-0050568d1c4b,1976b

As an externally visible symptom, Edge datapath commands such as "get logical-routers" return no response.

2022-10-23T22:45:42.329Z <edge FQDN> NSX 6534 - [nsx@6876 comp="nsx-edge" subcomp="cli" username="admin" level="INFO"] CMD: get logical-routers

After the command is run, the following error is logged in /var/log/syslog on the Edge:

2022-10-23T22:45:42.444603+00:00 <edge FQDN> NSX 6536 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="edge-appctl" s2comp="unixctl" level="WARN"] failed to connect to /var/run/vmware/edge/dpd.ctl

dp-ipc threads in a blocked state can be seen in /var/log/syslog, and the blocked time keeps increasing. For example, in the log lines below, the blocked time reported for thread urcu2 grows from 4000 ms to 8000 ms to 16000 ms.

2022-10-23T21:51:05.725Z <Edge FQDN> NSX 4468 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="urcu2" level="WARN"] blocked 4000 ms waiting for dp-ipc31 to quiesce
2022-10-23T21:51:09.724Z <Edge FQDN> NSX 4468 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="urcu2" level="WARN"] blocked 8000 ms waiting for dp-ipc31 to quiesce
2022-10-23T21:51:17.725Z <Edge FQDN> NSX 4468 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="urcu2" level="WARN"] blocked 16000 ms waiting for dp-ipc31 to quiesce
2022-10-23T21:51:24.979Z <Edge FQDN> NSX 4468 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="dp-si-purge5" level="WARN"] blocked 1000 ms waiting for dp-ipc31 to quiesce
2022-10-23T21:51:25.978Z <Edge FQDN> NSX 4468 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="dp-si-purge5" level="WARN"] blocked 2000 ms waiting for dp-ipc31 to quiesce
2022-10-23T21:51:27.978Z <Edge FQDN> NSX 4468 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="dp-si-purge5" level="WARN"] blocked 4000 ms waiting for dp-ipc31 to quiesce
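
The two syslog checks described above (AppHA packets larger than 1472b, and blocked times that keep growing while waiting for a dp-ipc thread) could be automated with a small script. The following is only a sketch that assumes the log line layout matches the samples in this article; the function names and regular expressions are hypothetical and are not part of any NSX tooling.

```python
import re

# Threshold from the symptom description: sizes above 1472b are a problem.
MAX_APPHA_BYTES = 1472

# Assumed layout, e.g. "AppHA-tx-Bridge(00085,00000): ANNO.REQ...,peer=<uuid>,1976b"
APPHA_RE = re.compile(
    r"AppHA-tx-Bridge\(\d+,\d+\):.*?peer=(?P<peer>[0-9a-f-]+),(?P<size>\d+)b"
)

# Assumed layout, e.g. 'tname="urcu2" ... blocked 4000 ms waiting for dp-ipc31 to quiesce'
BLOCKED_RE = re.compile(
    r'tname="(?P<tname>[^"]+)".*?blocked (?P<ms>\d+) ms '
    r"waiting for (?P<target>dp-ipc\d+) to quiesce"
)


def oversized_appha(lines):
    """Return (peer, size) for each AppHA-tx-Bridge entry above the threshold."""
    hits = []
    for line in lines:
        for m in APPHA_RE.finditer(line):
            size = int(m.group("size"))
            if size > MAX_APPHA_BYTES:
                hits.append((m.group("peer"), size))
    return hits


def growing_blocked_waits(lines):
    """Return {(tname, target): [ms, ...]} where the blocked time strictly grows."""
    seen = {}
    for line in lines:
        for m in BLOCKED_RE.finditer(line):
            key = (m.group("tname"), m.group("target"))
            seen.setdefault(key, []).append(int(m.group("ms")))
    return {
        key: values
        for key, values in seen.items()
        if len(values) > 1 and all(a < b for a, b in zip(values, values[1:]))
    }
```

Both functions take any iterable of log lines, for example an open file handle on /var/log/syslog. A non-empty result from either one is a hint to compare the Edge against the symptoms above, not a confirmed diagnosis.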

Cause

Large AppHA packets, which are used to exchange Bridge service HA status, reached the top of the retransmit timer heap and caused the bfd thread to enter a busy loop in which it processed the same AppHA packet repeatedly while holding the bfd lock. The CLI then becomes blocked, because the config thread also needs the bfd lock to process AppHA-related configuration.

Impact / Risks

Datapath failure on the NSX Edge devices.

Resolution

This issue is resolved in VMware NSX-T 3.2.3 (build number 21703624).
This issue is resolved in VMware NSX-T 4.0.2 (build number 20598727).
This issue is resolved in VMware NSX-T 4.1.1 (build number 21332673).
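
When checking many Edges, the fixed releases listed above can be encoded as a small lookup. This is a minimal sketch under one explicit assumption: later maintenance releases on the same release line (for example a 3.2.x version after 3.2.3) are treated as containing the fix, which this article does not state. The table and function names are hypothetical.

```python
# Fixed releases from the Resolution section:
# release line -> (first fixed version, build number)
FIXED_RELEASES = {
    (3, 2): ((3, 2, 3), 21703624),
    (4, 0): ((4, 0, 2), 20598727),
    (4, 1): ((4, 1, 1), 21332673),
}


def has_fix(version):
    """Return True or False for a listed release line, None for an unlisted one.

    Assumption (not from the article): versions at or after the first fixed
    version on the same release line carry the fix.
    """
    parts = tuple(int(p) for p in version.split(".")[:3])
    parts = parts + (0,) * (3 - len(parts))  # pad "4.1" to (4, 1, 0)
    entry = FIXED_RELEASES.get(parts[:2])
    if entry is None:
        return None  # release line not covered by this article
    first_fixed, _build = entry
    return parts >= first_fixed
```

A result of None means the article gives no information for that release line, so the version should be checked against vendor release notes rather than assumed safe or affected.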

Workaround

Put the affected Edge into Maintenance Mode and reboot it.

Related Information

This issue can be reproduced by increasing the size of the bridge service AppHA packets above 1472b, then toggling the Connected state of the vNICs of the Edge VM in vCenter. The bridge AppHA packet size can be artificially increased by adding Transport Zones to the Edges.
