Hewlett Packard Enterprise | a00134327en_us

Advisory: HPE Cray CSM 1.4.0 and 1.4.1 NCN ARP Cache Settings

Last update date:

1/12/2024

Affected products:

Cray Shasta System Solutions

HPE Cray Supercomputing EX

HPE Cray supercomputers

Affected releases:

No affected releases provided.

Fixed releases:

No fixed releases provided.

Description:

Info

CSM 1.4 removed some ARP cache settings from the non-compute node (NCN) images, which results in the kernel default values being used on these nodes. These default values are inadequate for any system of non-trivial size:

net.ipv4.neigh.default.gc_thresh1=128
net.ipv4.neigh.default.gc_thresh2=512
net.ipv4.neigh.default.gc_thresh3=1024

Symptoms

The following symptoms may appear when these insufficient kernel values are in use:

Pods for services such as Weave may be in a CrashLoopBackOff state.

Some worker nodes may not be able to ping other worker nodes in the cluster.

CFS pods may report connectivity issues to service endpoints such as rgw-vip.local or api-gw-service-nmn.local.

Some worker nodes may be logging "neighbor: arp_cache: neighbor table overflow!" messages to dmesg and /var/log/messages.
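To check whether a node is affected, the thresholds in effect can be compared with the recommended minimums given in the Resolution section, and the log can be searched for the overflow message. The following is a minimal sketch, not an HPE-provided tool. It assumes the standard Linux procfs location for the thresholds; the SYSCTL_DIR and LOG_FILE variables and both function names are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Sketch: report the ARP cache thresholds in effect and count overflow messages.
# SYSCTL_DIR and LOG_FILE are hypothetical overrides; the defaults are the
# standard procfs directory and the log file named in the advisory.
SYSCTL_DIR="${SYSCTL_DIR:-/proc/sys/net/ipv4/neigh/default}"
LOG_FILE="${LOG_FILE:-/var/log/messages}"

# Print each threshold, flagging values below the advisory's recommended minimum.
report_thresholds() {
    dir=$1
    for pair in gc_thresh1:2048 gc_thresh2:4096 gc_thresh3:8192; do
        name=${pair%%:*}
        min=${pair##*:}
        if [ -r "$dir/$name" ]; then
            cur=$(cat "$dir/$name")
            if [ "$cur" -lt "$min" ]; then
                echo "$name=$cur LOW (recommended minimum $min)"
            else
                echo "$name=$cur ok"
            fi
        else
            echo "$name unreadable"
        fi
    done
}

# Count log lines that contain the overflow message quoted in the advisory.
count_overflows() {
    if [ -r "$1" ]; then
        # grep exits non-zero when nothing matches; it still prints 0.
        grep -c 'neighbor table overflow' "$1" || true
    else
        echo 0
    fi
}

report_thresholds "$SYSCTL_DIR"
echo "overflow messages in $LOG_FILE: $(count_overflows "$LOG_FILE")"
```

Reading /var/log/messages typically requires root. A count of zero does not rule the issue out, since older messages may have been rotated out of the log.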

Scope

This issue affects CSM versions 1.4.0 and 1.4.1.

Resolution

In order to resolve the issue, increase the size of the ARP cache to the recommended minimum on all NCNs:

sysctl -w net.ipv4.neigh.default.gc_thresh1=2048
sysctl -w net.ipv4.neigh.default.gc_thresh2=4096
sysctl -w net.ipv4.neigh.default.gc_thresh3=8192

Sites may wish to add these values to /etc/sysctl.conf to ensure the settings persist through a reboot.

This issue will be resolved in CSM 1.4.2 and above by a CFS Ansible play that applies these settings during NCN personalization and image customization and allows for tuning of these values. CSM 1.5 will also incorporate these defaults back into the NCN images.

Related Information

Sites with a large number of compute nodes or high-speed network interfaces may need further ARP cache tuning. See the HPE Slingshot Operations Guide for more information.
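The persistence step mentioned in the Resolution could be sketched as follows. This is an illustration under stated assumptions rather than the method the CSM 1.4.2 Ansible play uses: the advisory names only /etc/sysctl.conf, and the function name arp_cache_settings and any output path passed to the script are hypothetical. With no argument the script only prints the settings, so nothing on the node changes until a path is supplied.

```shell
#!/bin/sh
# Sketch: generate persistent sysctl settings for the recommended minimums.
# The function name and the optional output path are hypothetical choices.

# Emit the three settings in sysctl.conf syntax.
arp_cache_settings() {
    cat <<'EOF'
# ARP cache minimums from HPE advisory a00134327en_us
net.ipv4.neigh.default.gc_thresh1 = 2048
net.ipv4.neigh.default.gc_thresh2 = 4096
net.ipv4.neigh.default.gc_thresh3 = 8192
EOF
}

# With no argument, print the settings; with a path, write them to that file.
if [ -n "${1:-}" ]; then
    arp_cache_settings > "$1"
    echo "wrote $1; load it with: sysctl -p $1"
else
    arp_cache_settings
fi
```

The printed lines can be appended to /etc/sysctl.conf as the advisory suggests. On distributions that read /etc/sysctl.d/, writing them to a drop-in file there is an alternative, which this advisory does not itself describe. Either way, run sysctl -p against the file as root to load the values without a reboot.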
