Hewlett Packard Enterprise | a00138717en_us

Advisory: HPE Compute Scale-up Server 3200 - System May Encounter an HWERR_BIOS_HALT_DETECTED Condition During the OS Crashdump Process

Last update date:

4/10/2024

Affected products:

HPE Compute Scale-up Server 3200

Affected releases:

No affected releases provided.

Fixed releases:

No fixed releases provided.

Description:

Info

An HPE Compute Scale-up Server 3200 running a RHEL or SLES operating system may intermittently encounter an HWERR_BIOS_HALT_DETECTED condition during the OS Crashdump process, resulting in failure to capture a valid OS crashdump. The issue can also be generated when the "power nmi" command is issued from the RMC, and/or the "sysrq" command is issued from the OS. When this occurs, the system will log a CAE MCA alert event and an RCU dump will be generated before rebooting.

This is NOT a hardware failure. It occurs because Linux is mapping a larger address range (1 GB) rather than the smaller address range requested. This allows the processor to speculate past the end of real RAM memory (specified in the memory map) into reserved space defined by the BIOS. Whenever this reserved memory space is violated, firmware will flag a FATAL error condition, resulting in HWERR_BIOS_HALT_DETECTED.

Failure footprint example:

audit.log:

2024-02-29T00:08:11Z||1.10.342-20231206_161054|2001|Admin|1236483|CLI|power nmi pnum=0

CAE:

18 2024-02-29 02:32:20Z 307 System Hardware 0 Fatal Machine Check Abort (MCA) observed on the system
8 2024-02-29 00:08:34Z 307 System Hardware 0 Fatal Machine Check Abort (MCA) observed on the system

RCU at 00:10:27 on Thursday February 29 2024:

FRU Summary for Superdome 3200 System Serial Number 5UF349HX6D

NASID  LOCATION     DEVICE                         SCORE  Reason for Failure Event
-----  -----------  -----------------------------  -----  ------------------------
0x00   r001u01b     UNKNOWN                        47     BIOS Halt detected. Check uvcon dump for r001u01b
0x04   r001u06b     UNKNOWN                        47     BIOS Halt detected. Check uvcon dump for r001u06b
0x08   r001u11b     UNKNOWN                        47     BIOS Halt detected. Check uvcon dump for r001u11b
0x0C   r001u16b     UNKNOWN                        47     BIOS Halt detected. Check uvcon dump for r001u16b
0x0F   r001u16b1h1  SKT3 on H3 MLB sn PR1331C8JJ   41     Hub LH2_KTC detected Incoming snoop is incorrectly accessing Directory Memory.
0x0F   r001u16b1h1  SKT3 on H3 MLB sn PR1331C8JJ   19     Hub LH3_KTC detected Incoming snoop is incorrectly accessing Directory Memory.
0x0F   r001u16b1h1  SKT3 on H3 MLB sn PR1331C8JJ   17     Hub LH0_KTC detected Incoming snoop is incorrectly accessing Directory Memory.
0x0F   r001u16b1h1  SKT3 on H3 MLB sn PR1331C8JJ   15     Hub LH1_KTC detected Incoming snoop is incorrectly accessing Directory Memory.

IEL:

RMC cli 'power nmi' issued:

445688 2024-02-28 02:52:45Z OS dbwnhl02 0 Info (2) 0000000000000000 DCD_OS_BOOT_COMPLETE
~
445691 2024-02-29 00:08:15Z MFW r001u01c 0 Info (2) 0000000000000000 CLI_NPAR_POWER_NMI_START
445692 2024-02-29 00:08:18Z MFW r001u01b 0 Major(1) 0000000000000000 NMI_INITIATED
445693 2024-02-29 00:08:18Z MFW r001u01c 0 Info (2) 0000000000000000 CLI_NPAR_POWER_NMI_SUCCESS
445696 2024-02-29 00:08:19Z MFW r001u01c/ELS 0 Info (2) 0001001000000003 ELSD_PROCESS_CPER
445697 2024-02-29 00:08:19Z MFW r001u01c/ELS 0 Info (2) 0110000000000001 ELSD_WRITE_BUNDLE
445698 2024-02-29 00:08:19Z MFW r001u01c/ELS 0 Info (2) 0001001000000003 ELSD_PROCESS_CPER
445699 2024-02-29 00:08:19Z MFW r001u01c/ELS 0 Info (2) 0110000000000001 ELSD_UPDATE_BUNDLE
445701 2024-02-29 00:08:20Z MFW r001u01b 0 *WARN (3) 0000000000000000 HWERR_BIOS_HALT_DETECTED
445702 2024-02-29 00:08:20Z MFW r001u01b 0 *WARN (3) 0000000000000000 HWERR_BIOS_HALT_DETECTED

'sysrq' issued from OS:

501586 2024-02-29 02:26:16Z OS dbwnhl02 0 Info (2) 0000000000000000 DCD_OS_BOOT_COMPLETE
~
501601 2024-02-29 02:31:38Z MFW r001u16b 0 *WARN (3) 0000000000000000 HWERR_BIOS_HALT_DETECTED
501602 2024-02-29 02:31:38Z MFW r001u16b 0 *WARN (3) 0000000000000000 HWERR_BIOS_HALT_DETECTED

OS console:

2024-02-29 02:30:31Z Red Hat Enterprise Linux 8.8 (Ootpa)
2024-02-29 02:30:31Z Kernel 4.18.0-477.36.1.el8_8.x86_64 on an x86_64
2024-02-29 02:30:31Z
2024-02-29 02:30:31Z Activate the web console with: systemctl enable --now cockpit.socket
2024-02-29 02:30:31Z
2024-02-29 02:30:31Z dbwnhl02 login: [ 582.670017] sysrq: SysRq : Trigger a crash
2024-02-29 02:31:34Z [ 582.674179] Kernel panic - not syncing: sysrq triggered crash
2024-02-29 02:31:34Z [ 582.674179]
2024-02-29 02:31:34Z [ 582.681461] CPU: 2 PID: 34604 Comm: bash Kdump: loaded Tainted: G O --------- - - 4.18.0-477.36.1.el8_8.x86_64 #1
2024-02-29 02:31:34Z [ 582.693113] Hardware name: HPE Compute Scale-up Server 3200/Compute Scale-up Server 3200, BIOS Bundle:1.10.342-20231206_161054 SFW:009.010.108.000.2312042
2024-02-29 02:31:34Z [ 582.707019] Call Trace:
2024-02-29 02:31:34Z [ 582.709490] dump_stack+0x41/0x60
2024-02-29 02:31:34Z [ 582.712849] panic+0xe7/0x2ac
2024-02-29 02:31:35Z [ 582.715859] ? printk+0x58/0x73
2024-02-29 02:31:35Z [ 582.719040] sysrq_handle_crash+0x11/0x20
2024-02-29 02:31:35Z [ 582.723097] __handle_sysrq.cold.13+0x48/0xff
2024-02-29 02:31:35Z [ 582.727491] write_sysrq_trigger+0x2b/0x40
2024-02-29 02:31:35Z [ 582.731627] proc_reg_write+0x39/0x60
2024-02-29 02:31:35Z [ 582.735333] vfs_write+0xa5/0x1b0
2024-02-29 02:31:35Z [ 582.738691] ksys_write+0x4f/0xb0
2024-02-29 02:31:35Z [ 582.742042] do_syscall_64+0x5b/0x1b0
2024-02-29 02:31:35Z [ 582.745745] entry_SYSCALL_64_after_hwframe+0x61/0xc6
2024-02-29 02:31:35Z [ 582.750847] RIP: 0033:0x7f7da53eca28
2024-02-29 02:31:35Z [ 582.754460] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 15 4d 2a 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
2024-02-29 02:31:35Z [ 582.773343] RSP: 002b:00007ffd7d9f7528 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
2024-02-29 02:31:35Z [ 582.780958] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f7da53eca28
2024-02-29 02:31:35Z [ 582.788134] RDX: 0000000000000002 RSI: 0000560f5b807d50 RDI: 0000000000000001
2024-02-29 02:31:35Z [ 582.795310] RBP: 0000560f5b807d50 R08: 000000000000000a R09: 00007f7da544cae0
2024-02-29 02:31:35Z [ 582.802488] R10: 000000000000000a R11: 0000000000000246 R12: 00007f7da568d6e0
2024-02-29 02:31:35Z [ 582.809665] R13: 0000000000000002 R14: 00007f7da5688860 R15: 0000000000000002
2024-02-29 02:31:38Z ******** [20240229.023138Z] PDHC r001u16b: HWERR: BIOS HALT detected!
2024-02-29 02:31:38Z ******** [20240229.023138Z] PDHC r001u16b: HWERR: BIOS HALT detected!
2024-02-29 02:31:38Z ******** [20240229.023138Z] PDHC r001u16b: HWERR: BIOS HALT detected!
2024-02-29 02:31:38Z ******** [20240229.023138Z] PDHC r001u16b: HWERR: BIOS HALT detected!
2024-02-29 02:31:38Z ******** [20240229.023138Z] PDHC r001u11b: HWERR: BIOS HALT detected!
2024-02-29 02:31:38Z ******** [20240229.023138Z] PDHC r001u11b: HWERR: BIOS HALT detected!
2024-02-29 02:31:38Z ******** [20240229.023138Z] PDHC r001u11b: HWERR: BIOS HALT detected!
2024-02-29 02:31:38Z ******** [20240229.023138Z] PDHC r001u11b: HWERR: BIOS HALT detected!
2024-02-29 02:31:38Z ******** [20240229.023138Z] PDHC r001u06b: HWERR: BIOS HALT detected!
2024-02-29 02:31:38Z ******** [20240229.023138Z] PDHC r001u06b: HWERR: BIOS HALT detected!
2024-02-29 02:31:38Z ******** [20240229.023138Z] PDHC r001u06b: HWERR: BIOS HALT detected!
2024-02-29 02:31:38Z ******** [20240229.023138Z] PDHC r001u06b: HWERR: BIOS HALT detected!
2024-02-29 02:31:39Z ******** [20240229.023139Z] PDHC r001u01b: HWERR: BIOS HALT detected!
2024-02-29 02:31:39Z ******** [20240229.023139Z] PDHC r001u01b: HWERR: BIOS HALT detected!
2024-02-29 02:31:39Z ******** [20240229.023139Z] PDHC r001u01b: HWERR: BIOS HALT detected!
2024-02-29 02:31:39Z ******** [20240229.023139Z] PDHC r001u01b: HWERR: BIOS HALT detected!
2024-02-29 02:31:40Z ******** [20240229.023140Z] NOTICE: Firmware initiating fatal error logging and recovery, do not power off or power cycle the partition.
2024-02-29 02:33:33Z ******** [20240229.023333Z] NOTICE: Firmware initiated rcu which can take several minutes, please do not reset or power cycle.
2024-02-29 02:43:24Z ******** [20240229.024324Z] NOTICE: Firmware has completed rcu. Please do not power off or power cycle the partition.
2024-02-29 02:45:14Z ******** [20240229.024514Z] NOTICE: Firmware initiating power cycle of partition.
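
The cause given under Info (a 1 GB mapping replacing a smaller requested range, so that the mapping reaches into BIOS reserved space) can be illustrated with a little address arithmetic. The following Python sketch is a simplified model only, not kernel code: the function names and every address in it are hypothetical, chosen so that a small range ending at the top of RAM sits directly below a reserved region.

```python
from typing import Tuple

GIB = 1 << 30  # size of one 1 GiB mapping

Range = Tuple[int, int]  # half-open address range [start, end)


def round_to_gib(start: int, end: int) -> Range:
    """Expand [start, end) outward to 1 GiB boundaries."""
    mapped_start = start & ~(GIB - 1)          # round start down
    mapped_end = (end + GIB - 1) & ~(GIB - 1)  # round end up
    return mapped_start, mapped_end


def overlaps(a: Range, b: Range) -> bool:
    """Return True if the two half-open ranges share any address."""
    return a[0] < b[1] and b[0] < a[1]


# Hypothetical layout (not taken from the advisory): a 2 MiB range at the
# end of usable RAM, followed immediately by a BIOS reserved region.
REQUESTED: Range = (0x1_7FC0_0000, 0x1_7FE0_0000)
RESERVED: Range = (0x1_7FE0_0000, 0x1_8000_0000)

if __name__ == "__main__":
    mapped = round_to_gib(*REQUESTED)
    print(f"requested: {REQUESTED[0]:#x}-{REQUESTED[1]:#x}")
    print(f"mapped:    {mapped[0]:#x}-{mapped[1]:#x}")
    print("requested range touches reserved space:", overlaps(REQUESTED, RESERVED))
    print("1 GiB mapping touches reserved space:  ", overlaps(mapped, RESERVED))
```

In this made-up layout the requested range stops exactly where the reserved region begins, while the range rounded out to 1 GiB boundaries covers it. That is the situation in which speculative accesses through the larger mapping could reach reserved memory.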

Scope

Any HPE Compute Scale-up Server 3200 running a RHEL or SLES operating system.

Resolution

As a workaround, add "nokaslr" to the kernel boot command line. Disabling KASLR (kernel address space layout randomization) prevents kernel structures from being placed at random locations in memory. Random placement near the start or end of a memory range (in the address map) can lead the kernel to map outside the memory range, potentially intruding into the BIOS reserved region and causing a violation that may cause the system to become unresponsive. A solution will be incorporated into future RHEL and SLES releases. This advisory will be updated when additional solutions become available.
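
After rebooting with the workaround in place, it may be worth confirming that the running kernel actually received the parameter. The sketch below is one possible check, not part of the HPE advisory: it reads /proc/cmdline, the standard Linux file holding the boot command line, and looks for "nokaslr" as a standalone parameter. The function names are hypothetical.

```python
from pathlib import Path


def has_nokaslr(cmdline: str) -> bool:
    """Return True if 'nokaslr' appears as a standalone kernel parameter."""
    return "nokaslr" in cmdline.split()


def running_kernel_has_nokaslr(path: str = "/proc/cmdline") -> bool:
    """Check the command line of the currently running kernel (Linux only)."""
    return has_nokaslr(Path(path).read_text())


if __name__ == "__main__":
    if running_kernel_has_nokaslr():
        print("nokaslr is present on the kernel command line")
    else:
        print("nokaslr is NOT present; the workaround is not active")
```

Splitting the command line on whitespace rather than doing a substring search avoids a false match on a longer parameter that merely contains the text "nokaslr".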
