Hewlett Packard Enterprise | a00133701en_us

Advisory: HPE B-series Switches - Gen 7 Platforms May Experience CRC Errors, Port Faults or a Disruptive Reboot in Response to Severe Congestion and Activation of the Oversubscription Management Behavior of the Traff

Last update date:

2/28/2024

Affected products:

HPE Storage Fibre Channel Switch B-series SN6700B

HPE Storage Fibre Channel Switch B-series SN6750B

HPE Storage SAN Director Switch

Affected releases:

No affected releases provided.

Fixed releases:

No fixed releases provided.

Description:

Info

HPE SN8700B Director Switches, HPE SN6700B Fibre Channel (FC) Switches and HPE SN6750B FC Switches running Fabric OS (FOS) v9.1.x prior to v9.1.1c, or FOS v9.2.0, can experience unexpected errors if they encounter an overflow of frame buffers while attempting to manage and re-route oversubscribed flows in response to a severe congestion event. If the number of frames overruns the buffer used to manage oversubscription handling, the excess frames can be missed during Traffic Optimizer handling. These excess frames can then be overwritten by other frames, leading to frame Cyclic Redundancy Check (CRC) errors, or even port faults if the header information is overwritten. Under severe congestion, managing these overflow/excess frames can block other FOS daemons, which can result in watchdog time-outs. Critical daemons that time out will cause a High Availability (HA) fail-over or a disruptive switch reboot.

In addition to the frame overflow issue, SN8700B directors that previously operated on FOS v9.0.x and were later upgraded to FOS v9.1.x could encounter “verify” errors after HA fail-overs (including those caused by firmware upgrades to later versions of v9.1.x). Multiple “verify” error messages are observed during oversubscription management by Traffic Optimizer due to a detected conflict in port programming, which is created when some ports, but not all ports, are reset while at v9.1.x. The conflict is between the congestion management programming on ports that have not been reset since running v9.0.x and the programming on ports that were reset, and later encountered congestion management, while at v9.1.x. It can appear after an HA fail-over event.

Only HPE Gen 7 SN8700B directors with an HPE SN8700B 64Gb 48-port FC Switch Blade and/or SN8700B 32Gb 48-port FC Switch Blade installed are at risk of encountering both the overflow and “verify” errors.
HPE SN8600B 32Gb 48-port FC Blades and HPE SN8600B 32Gb 64-port FC Blades installed in Gen 7 SN8700B directors are not at risk of encountering either incident. HPE Gen 7 SN6700B and SN6750B switches are only at risk of encountering the buffer overflow event; they are not exposed to the “verify” error condition.

To be at risk, the fabric must also experience severe congestion resulting in oversubscription management by Traffic Optimizer. The following RASlog message will be observed if this level of response was ever encountered:

[TO-1006], 1011618/1002267, FID 128, INFO, Switch_100, Flows destined to dev02 device have been moved to PG_OVER_SUBSCRIPTION_4G_16G PG., cfs_ctrlr.c, line: 1470, comp:cfsd, ltime:2023/05/17-06:15:33:923058

NOTE: The oversubscription management action by Traffic Optimizer only exists in FOS v9.1.x and later. Gen 7 products running FOS v9.0.x are not at risk of either the “buffer overflow” or the “verify” condition. Gen 6 platforms are also not at risk of either condition.

Switches that still use the default setting of 28 buffers for F_Port credit are not at risk of the “buffer overflow” condition. Only Gen 7 platforms running FOS v9.1.x prior to v9.1.1c on which the F_Port credit value of one or more ports has been customized to a value higher than the default 28 buffers are at risk of the “buffer overflow” condition. For the buffer overflow condition to occur, in addition to a period of severe congestion, the F-ports on the HPE SN8700B Director Switch, HPE SN6700B FC Switch or HPE SN6750B FC Switch must have been reconfigured from the default 28 buffers to a greater number of buffers.

Any Gen 7 director or switch whose maximum F-Port buffer counts have been increased above the FOS default values is potentially at risk, and any SN8700B director that was previously running FOS v9.0.x could be at risk of encountering “verify” errors.
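The release ranges named in this advisory (FOS v9.1.x prior to v9.1.1c, and v9.2.0) can be expressed as a small check. The following is a minimal Python sketch, not an HPE tool: the function names are hypothetical, and it assumes version strings of the simple form "v9.1.1c" (three numeric fields plus an optional single patch letter). Other FOS version string formats are not handled.

```python
import re

# Assumed format: optional "v", three numeric fields, optional patch letter.
_FOS_VERSION = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)([a-z]?)$")


def parse_fos_version(text):
    """Split a string such as 'v9.1.1c' into (9, 1, 1, 'c')."""
    match = _FOS_VERSION.match(text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized FOS version string: {text!r}")
    major, minor, maintenance, letter = match.groups()
    return int(major), int(minor), int(maintenance), letter


def is_exposed_release(text):
    """True for the releases this advisory names as exposed.

    Exposed: any v9.1.x earlier than v9.1.1c, and v9.2.0 itself.
    Not exposed: v9.0.x (no oversubscription management action),
    v9.1.1c and later, v9.2.0a and later.
    """
    version = parse_fos_version(text)
    if version[:2] == (9, 1):
        # Tuple comparison orders 9.1.0 < 9.1.1 < 9.1.1a < 9.1.1b < 9.1.1c.
        return version < (9, 1, 1, "c")
    return version == (9, 2, 0, "")
```

A release check alone does not establish risk; the platform, blade type, F-Port buffer configuration and congestion conditions described in this advisory also apply.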
In both cases, Traffic Optimizer must also attempt to manage routing of frames in response to an oversubscription event caused during a period of severe congestion.

To determine which directors and switches might be at risk, use the “portbuffershow” command to view the Buffer Usage. If the total buffer usage for ports on the same ASIC/chip that are also zoned together adds up to more than 256 buffers, the Gen 7 switch is considered at risk of a buffer overrun should a severe congestion event require oversubscription management from Traffic Optimizer. The incident will not be encountered on every oversubscription management event, because the number of buffers being managed at the time of the event needs to exceed 256 while Traffic Optimizer is managing oversubscription; however, being configured to potentially handle more than 256 buffers puts the switch at risk.

In the example output shown above, if all 8 F-ports are in one zone together, the switch is at risk of a frame buffer overflow while Traffic Optimizer is managing an oversubscription condition, as the total buffer usage count in this example is 360. However, in the following example where the F-Ports are not all zoned together, this switch would not be at risk, as the two zones total 232 buffers (ports 0, 1, 3, 4, 5 & 6) and 128 buffers (ports 8 & 10) respectively.

Issue the following CLI command from the maintenance account to view the buffers used by each port.

The maximum number of ports utilized for oversubscription management is 8. If more than 8 ports from the same ASIC/chip are zoned together, total the 8 ports with the highest Buffer Usage values to determine risk.

NOTE: Gen 7 directors and switches that have never had their F-Port buffer counts changed from the default are not at risk of this frame buffer overflow issue. The default setting for Max/Reserved Buffers is 28 for 64G ports on Gen 7 products.
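The buffer arithmetic described above can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the function names are hypothetical, the caller has already grouped ports that share an ASIC/chip and a zone, and the per-port values below are invented so that they match the totals quoted in this advisory (360, 232 and 128). They are not taken from real “portbuffershow” output.

```python
OVERSUBSCRIPTION_BUFFER_LIMIT = 256  # threshold stated in the advisory
MAX_MANAGED_PORTS = 8                # ports used for oversubscription management


def zone_buffer_total(buffer_usage_by_port):
    """Sum the Buffer Usage values, counting only the 8 highest ports."""
    highest = sorted(buffer_usage_by_port.values(), reverse=True)
    return sum(highest[:MAX_MANAGED_PORTS])


def is_at_risk(buffer_usage_by_port):
    """True when the zoned ports on one ASIC can exceed 256 buffers."""
    return zone_buffer_total(buffer_usage_by_port) > OVERSUBSCRIPTION_BUFFER_LIMIT


# Hypothetical per-port Buffer Usage values (port number -> buffers).
zone_a = {0: 28, 1: 28, 3: 28, 4: 28, 5: 60, 6: 60}  # totals 232
zone_b = {8: 64, 10: 64}                              # totals 128
one_big_zone = {**zone_a, **zone_b}                   # totals 360

print(is_at_risk(one_big_zone))  # True: 360 > 256
print(is_at_risk(zone_a))        # False: 232 <= 256
print(is_at_risk(zone_b))        # False: 128 <= 256
```

With every port left at the 28-buffer default, 8 ports total 224, so the check returns False for any default configuration.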
If the Max/Reserved Buffers value has never been increased from the default, every port will show a maximum Buffer Usage count of 28 and will never encounter the buffer overflow issue. Even with 8 ports zoned together, the total maximum Buffer Usage is only 224 buffers if the ports are still configured to the 28-buffer default value.

In addition to the buffer overflow issue, SN8700B directors could also be at risk of “verify” error messages if the following conditions are met in this order:

1. The SN8700B director was previously running on FOS v9.0.x.
2. The director is then upgraded to FOS v9.1.x.
3. The director then has F-ports that log out and log in while at the v9.1.x version.
4. The director then encounters an oversubscription event that requires management from Traffic Optimizer.
5. The director then performs an HA fail-over (a firmware upgrade causes a fail-over to happen).
6. The director encounters another oversubscription event that requires management from Traffic Optimizer.

SN8700B directors that meet all of these conditions, in the specified sequence, could be at risk of encountering “verify” errors during oversubscription management from Traffic Optimizer.

SN8700B directors that have only ever run FOS v9.1.x firmware are not at risk of the “verify” error, as only the v9.1 programming model is being used for all ports. Gen 7 directors must have previously run FOS v9.0.x in order to be susceptible to this issue.

SN8700B directors that have been cold-booted/power cycled while running FOS v9.1.x firmware are also not at risk of the “verify” error, as all ports will use the v9.1 programming after the reboot.
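Because the “verify” conditions only matter when they occur in the stated order, the check is an ordered-subsequence test over a director's event history. The sketch below is one way to model it in Python. The event names and the idea of keeping a history as a list of strings are assumptions made for illustration; FOS does not expose such a list.

```python
# Hypothetical event labels, in the order the advisory requires.
VERIFY_SEQUENCE = (
    "ran_fos_9_0",
    "upgraded_to_fos_9_1",
    "f_port_relogin",
    "oversubscription_event",
    "ha_failover",
    "oversubscription_event",
)
COLD_BOOT = "cold_boot_on_fos_9_1"


def matches_verify_sequence(history):
    """True if the history (a list) contains the six conditions in order.

    A cold boot on FOS v9.1.x puts all ports on the v9.1 programming,
    so only events after the most recent cold boot are considered.
    """
    if COLD_BOOT in history:
        last_cold_boot = len(history) - 1 - history[::-1].index(COLD_BOOT)
        history = history[last_cold_boot + 1:]
    # "step in remaining" advances the iterator, which enforces ordering.
    remaining = iter(history)
    return all(step in remaining for step in VERIFY_SEQUENCE)
```

Unrelated events may appear between the required ones; the test only requires that the six conditions occur in the listed order.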
Gen 7 directors and switches that have encountered an oversubscription management event will observe the following Traffic Optimizer RASlog:

[TO-1006], 1011618/1002267, FID 128, INFO, Switch_100, Flows destined to b1a02 device have been moved to PG_OVER_SUBSCRIPTION_4G_16G PG., cfs_ctrlr.c, line: 1470, comp:cfsd, ltime:2023/05/17-06:15:33:923058

Additional symptoms that could appear due to these identified issues:

Large counts of CRC errors on a link may be observed that are not fixed with optic/cable replacement.
Frames may be discarded, and credit on a link can be lost.
Ports may be faulted, and the ASIC may halt and be faulted.
A director may observe an unexpected HA fail-over or even a cold restart of the director.
Switches may observe a cold restart.
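To check saved RASlog output for the [TO-1006] message quoted above, a short scan such as the following Python sketch may help. It assumes the log has already been exported to text lines and that each message follows the wording of the sample in this advisory; the function name is hypothetical and the pattern has not been validated against other FOS releases.

```python
import re

# Pattern derived from the sample [TO-1006] message in this advisory.
TO_1006 = re.compile(
    r"\[TO-1006\].*?Flows destined to (?P<device>\S+) device "
    r"have been moved to (?P<group>\S+) PG\."
)


def find_oversubscription_moves(lines):
    """Return (device, performance group) pairs from [TO-1006] messages."""
    moves = []
    for line in lines:
        match = TO_1006.search(line)
        if match:
            moves.append((match["device"], match["group"]))
    return moves


sample = (
    "[TO-1006], 1011618/1002267, FID 128, INFO, Switch_100, "
    "Flows destined to b1a02 device have been moved to "
    "PG_OVER_SUBSCRIPTION_4G_16G PG., cfs_ctrlr.c, line: 1470, "
    "comp:cfsd, ltime:2023/05/17-06:15:33:923058"
)
print(find_oversubscription_moves([sample]))
# [('b1a02', 'PG_OVER_SUBSCRIPTION_4G_16G')]
```

An empty result only means the message was not found in the lines supplied; it does not show that the switch has never performed oversubscription management.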

Scope

The following switches are affected by this issue:

HPE B-series SN6700B Fibre Channel Switch
HPE B-series SN6750B Fibre Channel Switch
HPE SN8700B 4-slot Power Pack+ Director Switch
HPE SN8700B 8-slot Power Pack+ Director Switch

Resolution

FOS v9.1.1c (HPE targeted release July 14, 2023), v9.2.0a (HPE targeted release August 2023) and later provide a resolution for both issues described in this advisory.

Gen 7 directors or switches still running a version of FOS v9.0.x could be “at risk” of encountering the issues described if upgraded to a FOS v9.1.x version prior to v9.1.1c. It is recommended to wait for the release of FOS v9.1.1c before upgrading.

Gen 7 directors and switches that are currently operating on a v9.1.x or v9.2.0 release, and are determined to be at risk, should implement the workaround. Deactivating the Traffic Optimizer oversubscription management action will prevent both the buffer overrun and “verify” errors from occurring.

Any Gen 7 director or switch that has already encountered the “buffer overflow” event will need to perform a cold restart to fully recover from the event:

Directors: Slot power off/on the impacted port blade
Switches: Reboot (cold restart) the switch

Option 1: Perform the reboot action shown above, and then implement the workaround to disable the oversubscription management action from within Traffic Optimizer.
Option 2: Upgrade to a version of FOS with the solution and then perform the reboot action shown above.

Upgrading to a version of FOS with the solution will prevent the “buffer overflow” event from happening, but once the condition has been encountered, only a cold restart of the ASIC will resolve it. Upgrading to a version of FOS with the solution will prevent, and automatically recover from, the “verify” error condition without any further action.

After upgrading to a version of FOS that contains the solution, a check of internal memory will be performed to determine whether the director or switch has previously encountered the event and requires a reboot to recover from the error condition.
The following RASlog will be displayed should the condition be detected after upgrading FOS to a version with the solution:

2023/06/01-17:07:50 (GMT), [C5-1057], 5, SLOT 2 | CHASSIS, CRITICAL, Switch_3, S10,C0: HW ASIC Chip is in an inconsistent state = 0x1002.

If the above RASlog is observed after upgrading FOS, then the director or switch encountered the “buffer overflow” condition prior to the upgrade and will need to perform a cold restart to fully recover from the incident:

Directors: Slot power off/on the impacted port blade
Switches: Reboot (cold restart) the switch

Workaround

“At risk” directors and switches can disable the Traffic Optimizer oversubscription management action. Issue the following CLI command from the maintenance account to disable the oversubscription management action behavior within Traffic Optimizer:

maintenance> serviceexec trafoptdebug --enableosclassification 0

NOTE: The maintenance command needs to be run on all Logical Switches in the chassis.
NOTE: The setting will be persistent across fail-overs and power cycles.

After upgrading to v9.1.1c or v9.2.0a, issue the following CLI command from the maintenance account to re-enable the oversubscription management action behavior within Traffic Optimizer:

maintenance> serviceexec trafoptdebug --enableosclassification 1

NOTE: The maintenance command needs to be run on all Logical Switches in the chassis.

RECEIVE PROACTIVE UPDATES: Receive support alerts (such as Customer Advisories), as well as updates on drivers, software, firmware, and customer replaceable components, proactively in your e-mail through HPE Support Alerts. Sign up for Support Alerts at the following URL: HPE Email Preference Center
