Hewlett Packard Enterprise | a00138636en_us

Advisory: HPE Cray Links Fail Between Nvidia ConnectX-6 and Slingshot Using Optical Cables after Upgrading

Last update date:

4/5/2024

Affected products:

HPE Cray Supercomputing EX

HPE Cray supercomputers

HPE Slingshot for HPC Clusters

Affected releases:

No affected releases provided.

Fixed releases:

No fixed releases provided.

Description:

Info

Links between an Nvidia ConnectX-6 NIC and the Slingshot switch do not come up over Hisense Active Optical Cables (AOC) after upgrading ClusterStor from 4.x to 6.x with Nvidia ConnectX-6 firmware 20.32.1010. The cable vendor confirmed that a batch of cables was programmed incorrectly: these AOCs were programmed as copper (Cu). HPE has developed a script to reprogram them correctly as AOC.

Scope

This advisory applies to all HPE Cray EX Supercomputer and HPE Cray Supercomputer systems with Nvidia ConnectX-6 NIC firmware version 20.32.1010.

Resolution

HPE has developed a script, hisense_qsfp_update_b192.sh, that can be run on site. The script checks for incorrectly programmed cables and fixes them by programming the Extended Specification Compliance code (SFF specification SFF-2024), byte 192, to 0x80 with code 0x33, which denotes an Active Optical Cable with 50GAUI, 100GAUI-2, or 200GAUI-4 C2M. Cables in the affected batch currently have byte 192 set to 0x40, which is the value for Cu cables. The script detects and reprograms the fields specified above so that the cable identifies as an AOC.

Download: https://downloads.hpe.com/pub/softlib2/software1/cd/p1078951391/v246275/hisense_qsfp_update_b192.sh

FAQ:

1. Can this be run in a production environment or does it need a maintenance cycle?
A maintenance cycle is needed. However, because the affected link is not coming up, it is not part of the fabric, so we do not expect production to be affected.

2. Are there any impacts from running this script?
The script updates a byte field in the cable headshell EEPROM. There is no impact other than setting the cable EEPROM correctly so that the CX6 card recognizes the cable.

3. What is the best mechanism to reset the link?
We recommend rebooting the server.

To resolve the issue, run the hisense_qsfp_update_b192.sh bash script to detect and reprogram the cable as AOC, then reboot the server. The link will then come up.
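The byte-192 check described in this section can be illustrated with a short sketch. This is not the HPE script and it does not read or write any hardware. It only shows the decision logic, using the two values quoted in the advisory (0x40 for Cu, 0x33 for AOC). The function names and the idea of passing in a raw EEPROM dump as a bytes object are assumptions made for illustration, and the 0x80 value the advisory also mentions is not modeled here.

```python
# Illustrative only: NOT the HPE script; it never touches a cable or NIC.
# Offsets and values are the ones quoted in the advisory text.
EXT_COMPLIANCE_OFFSET = 192  # byte offset named in the advisory
CODE_CU = 0x40               # value reported on the mis-programmed batch (Cu)
CODE_AOC = 0x33              # value given for AOC with 50GAUI/100GAUI-2/200GAUI-4 C2M


def classify_byte_192(value: int) -> str:
    """Classify a byte-192 value as 'needs_reprogram', 'ok', or 'unknown'."""
    if value == CODE_CU:
        return "needs_reprogram"
    if value == CODE_AOC:
        return "ok"
    return "unknown"


def check_eeprom(dump: bytes) -> str:
    """Classify a cable from a raw EEPROM dump.

    Hypothetical input format: a bytes object indexed by the same byte
    offsets the advisory uses, so dump[192] is the byte of interest.
    """
    if len(dump) <= EXT_COMPLIANCE_OFFSET:
        raise ValueError("dump too short to contain byte 192")
    return classify_byte_192(dump[EXT_COMPLIANCE_OFFSET])


if __name__ == "__main__":
    # Build a fake dump that mimics a cable from the affected batch.
    sample = bytearray(256)
    sample[EXT_COMPLIANCE_OFFSET] = CODE_CU
    print(check_eeprom(bytes(sample)))
```

Values other than the two named in the advisory are reported as "unknown" rather than treated as faulty, since the advisory only describes the one mis-programmed batch.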
