BugZero found this defect 18 days ago.

Hewlett Packard Enterprise | a00139153en_us

Advisory: HPE ProLiant XL645d Gen10 Plus Server - Only One GPU Will Be Detected When Performing a Driver Query Using ROCm-smi or NVIDIA-smi When Running System ROM Version A48 3.00_01-26-2024 (or Later)

Last update date:

5/2/2024

Affected products:

HPE Apollo 6500 Gen10 Plus System

Affected releases:

No affected releases provided.

Fixed releases:

No fixed releases provided.

Description:

Info

For any HPE ProLiant XL645d Gen10 Plus server running System ROM version A48 3.00_01-26-2024 (or later) and configured with AMD MI210 GPUs or NVIDIA HGX A100 SXM4 40GB/80GB GPUs, only one GPU is detected when performing a driver query using ROCm-smi or NVIDIA-smi. As a result, ROCm-smi or NVIDIA-smi cannot monitor or manage the GPUs that are not detected. The BIOS/Platform Configuration (RBSU) and the operating system still detect all GPUs.

[Example: the issue as seen when configured with AMD MI210 GPUs]

[Example: the issue as seen when configured with NVIDIA HGX A100 SXM4 40GB/80GB GPUs]

The NVIDIA System Management Interface (NVIDIA-smi) is a command-line utility, built on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices. It allows administrators to query the GPU device state and, with the appropriate privileges, to modify it.

AMD ROCm is an open-source software stack designed for GPU computation. It consists of a collection of drivers, development tools, and APIs that enable GPU programming from the low-level kernel to end-user applications. The AMD ROCm System Management Interface (ROCm-smi) provides clock and temperature management for ROCm-enabled systems.
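The mismatch described above (the operating system sees every GPU, the driver query tool sees only one) can be checked by comparing two counts. The following is a minimal sketch, not an HPE-provided tool. It assumes a Linux host where the output of `lspci -nn` and `nvidia-smi -L` has already been captured as text. The function names are hypothetical, and the PCI class-code filter is an assumption that may need adjusting for a given system. Parsing ROCm-smi output is not shown.

```python
import re

# PCI vendor IDs: 10de is NVIDIA, 1002 is AMD.
GPU_VENDOR_IDS = {"10de", "1002"}
# Assumption: GPUs report one of these PCI class codes
# (VGA controller, 3D controller, display controller).
GPU_CLASS_CODES = {"0300", "0302", "0380"}

# Matches the "[class]:" and "[vendor:device]" fields of an `lspci -nn` line.
_LSPCI_LINE = re.compile(
    r"\[(?P<cls>[0-9a-f]{4})\]:.*\[(?P<vendor>[0-9a-f]{4}):[0-9a-f]{4}\]"
)
# Matches one device line of `nvidia-smi -L`, for example "GPU 0: ...".
_NVSMI_LINE = re.compile(r"^GPU \d+:")


def count_gpus_in_lspci(lspci_nn_output: str) -> int:
    """Count NVIDIA/AMD GPU functions listed by `lspci -nn`."""
    count = 0
    for line in lspci_nn_output.splitlines():
        match = _LSPCI_LINE.search(line)
        if (
            match
            and match.group("cls") in GPU_CLASS_CODES
            and match.group("vendor") in GPU_VENDOR_IDS
        ):
            count += 1
    return count


def count_gpus_in_nvidia_smi_list(nvidia_smi_l_output: str) -> int:
    """Count the devices listed by `nvidia-smi -L`."""
    return sum(
        1
        for line in nvidia_smi_l_output.splitlines()
        if _NVSMI_LINE.match(line.strip())
    )


def undetected_gpu_count(os_count: int, tool_count: int) -> int:
    """Number of GPUs the OS enumerates but the driver tool does not report."""
    return max(os_count - tool_count, 0)
```

On an affected server with four GPUs, the first count would be expected to be 4 and the second 1, so `undetected_gpu_count` would return 3. A result of 0 means both views agree.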

Scope

This issue affects any HPE ProLiant XL645d Gen10 Plus server that is running System ROM version A48 3.00_01-26-2024 (or later) and is configured with AMD MI210 GPUs or NVIDIA HGX A100 SXM4 40GB/80GB GPUs.
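The System ROM condition in the scope statement can be expressed as a version comparison. The sketch below assumes version strings in the form used in this advisory (family, version, underscore, build date as MM-DD-YYYY, optional revision letter in parentheses). The names `RomVersion`, `parse_rom_version`, and `is_affected_rom` are hypothetical. It treats every A48 ROM at or after 3.00_01-26-2024 as affected, which matches the advisory only while no fixed release has been published.

```python
import re
from datetime import date
from typing import NamedTuple, Tuple


class RomVersion(NamedTuple):
    family: str
    version: Tuple[int, int]
    build_date: date


# Example inputs: "A48 3.00_01-26-2024", "A48 3.00_01-26-2024(B)"
_ROM_RE = re.compile(
    r"^(?P<family>[A-Z]\d+)\s+"
    r"(?P<major>\d+)\.(?P<minor>\d+)_"
    r"(?P<month>\d{2})-(?P<day>\d{2})-(?P<year>\d{4})"
    r"(?:\([A-Z]\))?$"
)


def parse_rom_version(text: str) -> RomVersion:
    """Parse a System ROM string such as 'A48 3.00_01-26-2024(B)'."""
    match = _ROM_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized System ROM version: {text!r}")
    return RomVersion(
        family=match.group("family"),
        version=(int(match.group("major")), int(match.group("minor"))),
        build_date=date(
            int(match.group("year")),
            int(match.group("month")),
            int(match.group("day")),
        ),
    )


# First affected System ROM named in the advisory.
FIRST_AFFECTED = parse_rom_version("A48 3.00_01-26-2024")


def is_affected_rom(text: str) -> bool:
    """True if the ROM is in the A48 family and at or after the first affected build."""
    rom = parse_rom_version(text)
    return rom.family == FIRST_AFFECTED.family and (
        (rom.version, rom.build_date)
        >= (FIRST_AFFECTED.version, FIRST_AFFECTED.build_date)
    )
```

A ROM that satisfies this check is only part of the scope: the server must also be configured with one of the listed GPU types.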

Resolution

This issue is under investigation. This advisory will be updated when additional information becomes available.

If this issue has already occurred, downgrade the System ROM to version A48 2.90_10-27-2023.

Note: System ROM versions 3.00_01-26-2024 and 3.00_01-26-2024(B) are considered Recommended. For a list of System ROM fixes, refer to the above link.
