1/12/2024
Cray Shasta System Solutions
HPE Cray Supercomputing EX
HPE Cray supercomputers
CSM 1.4 removed some ARP cache settings from the non-compute node (NCN) images, so these nodes fall back to the kernel default values. The defaults are inadequate for any system of non-trivial size:
    net.ipv4.neigh.default.gc_thresh1=128
    net.ipv4.neigh.default.gc_thresh2=512
    net.ipv4.neigh.default.gc_thresh3=1024
Symptoms
The following symptoms may appear when these insufficient kernel values are in use:
    Pods for services such as Weave may be in a CrashLoopBackOff state.
    Some worker nodes may be unable to ping other worker nodes in the cluster.
    CFS pods may report connectivity issues to service endpoints such as rgw-vip.local or api-gw-service-nmn.local.
    Some worker nodes may log "neighbor: arp_cache: neighbor table overflow!" messages to dmesg and /var/log/messages.
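One way to check a node for the values and the log message described above is sketched below. This is only an illustration: the function names `count_overflow` and `show_thresholds` are hypothetical, the script assumes the thresholds are readable under /proc/sys/net/ipv4/neigh/default/, and `dmesg` may require root on some systems (in which case the count is reported as 0).

```shell
#!/bin/sh
# Hypothetical helper: count "neighbor table overflow" lines read from stdin.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the exit status clean.
count_overflow() {
    grep -c 'neighbor table overflow' || true
}

# Hypothetical helper: print the current value of each ARP cache threshold,
# or "unreadable" if the corresponding /proc file is not available.
show_thresholds() {
    for name in gc_thresh1 gc_thresh2 gc_thresh3; do
        file=/proc/sys/net/ipv4/neigh/default/$name
        if [ -r "$file" ]; then
            echo "net.ipv4.neigh.default.$name=$(cat "$file")"
        else
            echo "net.ipv4.neigh.default.$name=unreadable"
        fi
    done
}

show_thresholds
# Errors from dmesg (for example, missing privileges) are suppressed,
# so an inaccessible kernel log is reported as a count of 0.
echo "overflow messages in dmesg: $(dmesg 2>/dev/null | count_overflow)"
```

A node still using the kernel defaults would report 128, 512, and 1024 for the three thresholds.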
This issue affects CSM versions 1.4.0 and 1.4.1.
To resolve the issue, increase the size of the ARP cache to the recommended minimum on all NCNs:
    sysctl -w net.ipv4.neigh.default.gc_thresh1=2048
    sysctl -w net.ipv4.neigh.default.gc_thresh2=4096
    sysctl -w net.ipv4.neigh.default.gc_thresh3=8192
Sites may wish to add these values to /etc/sysctl.conf so that the settings persist across reboots.
This issue will be resolved in CSM 1.4.2 and above by a CFS Ansible play that applies these settings during NCN personalization and image customization and allows the values to be tuned. CSM 1.5 will also incorporate these defaults back into the NCN images.
Related Information
Sites with a large number of compute nodes or high-speed network interfaces may need further ARP cache tuning. See the HPE Slingshot Operations Guide for more information.
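The suggestion to add the recommended values to /etc/sysctl.conf could be scripted roughly as follows. This is a minimal sketch, not an HPE-provided procedure: the function names `persist_setting` and `persist_arp_settings` are hypothetical, and the script assumes a plain `key=value` file. It removes any existing line for a key before appending the new value, so it can be run more than once without creating duplicates.

```shell
#!/bin/sh
# Hypothetical helper: persist_setting FILE KEY VALUE
# Drops any existing line that sets KEY, then appends KEY=VALUE.
# Note: dots in KEY are treated as regex wildcards by grep, which is
# harmless for the sysctl names used here.
persist_setting() {
    file=$1 key=$2 value=$3
    tmp=$(mktemp) || return 1
    # Keep every line that does not already set KEY (the file may not exist yet).
    if [ -f "$file" ]; then
        grep -v "^[[:space:]]*$key[[:space:]]*=" "$file" > "$tmp"
    fi
    echo "$key=$value" >> "$tmp"
    # Write back in place so the file's ownership and mode are preserved.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Hypothetical helper: write the three recommended minimums to FILE.
persist_arp_settings() {
    persist_setting "$1" net.ipv4.neigh.default.gc_thresh1 2048 &&
    persist_setting "$1" net.ipv4.neigh.default.gc_thresh2 4096 &&
    persist_setting "$1" net.ipv4.neigh.default.gc_thresh3 8192
}
```

On an NCN this might be invoked as root with `persist_arp_settings /etc/sysctl.conf`, followed by `sysctl -p /etc/sysctl.conf` to load the file immediately. Nothing is written until the function is called, so the script can be reviewed or tried against a scratch file first.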