Hewlett Packard Enterprise | a00136839en_us

Advisory: HPE Cray XD665 Chassis Using Internal Radiator May Experience GPU Cooling Performance Reduction

Last update date:

1/5/2024

Affected products:

No affected products provided.

Affected releases:

No affected releases provided.

Fixed releases:

No fixed releases provided.

Description:

Info

XD665 systems with the air cooling solution, using the internal radiator (P59027-002), will eventually deplete the coolant reservoir included within the radiator. This slow, invisible loss of coolant will in time degrade GPU cooling for the affected system. Evaluation of the worst-case conditions suggests the coolant reservoir is sufficient for at least 3.5 years of continuous use. If GPU cooling performance degrades significantly after that period, the radiator (P59027-002) should be replaced with the spare radiator (P63181-001).

Scope

This issue only affects XD665 systems with the air cooling solution, using the internal radiator (P59027-002). Systems with direct liquid cooling on both the CPUs and GPUs will not be affected. Regardless of firmware, operating system, applications, drivers, or utilities, an XD665 system with an internal radiator is still subject to this potential issue.

Resolution

To resolve this issue, first confirm the conditions that indicate it.

1. Identify a low coolant scenario as follows:

a) First, ensure the system has been in use for at least 3.5 years by checking the radiator CT label (see Image 1).

Image 1 - CT Label Location

Image 2 - Radiator CT Label

b) If the CT code indicates the radiator is more than 3.5 years old, continue with step (c). Otherwise, any thermal issues are likely caused by another source.

c) Verify that the GPU pumps are fully functional by checking the sensors tab of the BMC webpage.

Image 3 - BMC Webpage

d) Also on the sensors tab of the BMC, if a system at idle shows all four GPUs at temperatures greater than 50°C, this could be a sign of insufficient cooling.

e) If the system is experiencing GPU throttling across all four GPUs while the data center ambient temperature, shown as the "Tray Inlet" sensor on the BMC sensors tab, is below 30°C, this could also be a sign of insufficient cooling.

2. Replace the radiator (P59027-002) with the spare radiator (P63181-001) if a resolution criterion under step 1 is met.
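
The checks in step 1 can be read as a small decision procedure. The sketch below is one possible interpretation, not an HPE tool: the `Readings` type, its field names, and the `low_coolant_suspected` function are hypothetical, and all values are assumed to be entered by hand from the CT label and the BMC sensors tab. It assumes that steps (b) and (c) are prerequisites and that either (d) or (e) is then enough to suspect low coolant.

```python
from dataclasses import dataclass
from typing import Sequence

# Thresholds taken from the advisory text.
MIN_RADIATOR_AGE_YEARS = 3.5   # step (b): radiator more than 3.5 years old
IDLE_GPU_TEMP_LIMIT_C = 50.0   # step (d): all GPUs above this at idle
AMBIENT_LIMIT_C = 30.0         # step (e): "Tray Inlet" below this
GPU_COUNT = 4                  # the advisory refers to all four GPUs


@dataclass
class Readings:
    """Hypothetical container for values read manually by the operator."""
    radiator_age_years: float        # worked out from the radiator CT label
    gpu_pumps_ok: bool               # step (c): pumps fully functional
    system_idle: bool                # whether the readings were taken at idle
    gpu_temps_c: Sequence[float]     # one temperature per GPU
    gpus_throttling: Sequence[bool]  # one flag per GPU
    tray_inlet_c: float              # "Tray Inlet" sensor (ambient)


def low_coolant_suspected(r: Readings) -> bool:
    """Return True if the readings match the low coolant pattern."""
    # Step (b): a younger radiator points to another cause.
    if r.radiator_age_years <= MIN_RADIATOR_AGE_YEARS:
        return False
    # Step (c): assumed here that a pump fault must be ruled out first.
    if not r.gpu_pumps_ok:
        return False
    # Step (d): all four GPUs hotter than 50 °C while the system is idle.
    hot_at_idle = (
        r.system_idle
        and len(r.gpu_temps_c) == GPU_COUNT
        and all(t > IDLE_GPU_TEMP_LIMIT_C for t in r.gpu_temps_c)
    )
    # Step (e): all four GPUs throttling although ambient is below 30 °C.
    throttling_in_cool_room = (
        len(r.gpus_throttling) == GPU_COUNT
        and all(r.gpus_throttling)
        and r.tray_inlet_c < AMBIENT_LIMIT_C
    )
    return hot_at_idle or throttling_in_cool_room
```

A result of `True` corresponds to step 2 (radiator replacement). A result of `False` only means the pattern described here was not matched, so other thermal causes should still be investigated.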
