VSS Recovery mode

Unfortunately this blog has passed away (don’t worry, it was peaceful), but just like Jesus we have a resurrection: go take a peek at the future at https://theworldsgonemad.net/2018/vss-recovery-mode/

If you have Dual-Active Detection (DAD) enabled and all the VSL links between the two chassis fail, the standby switch will go active. The current active switch is informed of this over the DAD links and goes into recovery mode to stop a split-brain (active-active) situation from occurring.

21:28:01.153 GMT: %VSLP-2-VSL_DOWN:   All VSL links went down while switch is in ACTIVE role
21:28:01.185 GMT: %FASTHELLO-2-FH_DOWN:  Fast-Hello interface Gi1/2/12 lost dual-active detection capability
21:28:01.187 GMT: %FASTHELLO-2-FH_DOWN:  Fast-Hello interface Gi1/3/12 lost dual-active detection capability
21:28:01.201 GMT: %SW_DA-1-DETECTION:  detected dual-active condition
21:28:01.201 GMT: %SW_DA-1-RECOVERY: Dual-active condition detected: Starting recovery-mode, all non-VSL interfaces have been shut down
21:28:01.624 GMT: %C4K_REDUNDANCY-3-COMMUNICATION: Communication with the peer Supervisor has been lost
21:28:01.651 GMT: %FASTHELLO-2-FH_DOWN:  Fast-Hello interface Gi2/2/12 lost dual-active detection capability
21:28:01.651 GMT: %FASTHELLO-2-FH_DOWN:  Fast-Hello interface Gi2/3/12 lost dual-active detection capability
21:28:01.713 GMT: %C4K_REDUNDANCY-3-SIMPLEX_MODE: The peer Supervisor has been lost
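For reference, fast-hello DAD on a Catalyst 4500-E VSS is enabled in two places: under the virtual switch domain, and on each DAD link. This is a minimal sketch, assuming a placeholder domain number of 100 and the fast-hello ports from the log above; check the syntax against the VSS configuration guide for your IOS release.

```
! Domain number 100 is a placeholder; use your own VSS domain
switch virtual domain 100
 ! Use fast-hello as the dual-active detection method
 dual-active detection fast-hello
 exit
! Fast-hello ports from the log above, two on each chassis
interface GigabitEthernet1/2/12
 dual-active fast-hello
 exit
interface GigabitEthernet1/3/12
 dual-active fast-hello
 exit
interface GigabitEthernet2/2/12
 dual-active fast-hello
 exit
interface GigabitEthernet2/3/12
 dual-active fast-hello
 exit
```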

When the switch goes into recovery mode, all ports except the VSL ports are shut down (specific links, such as a management port, can be excluded from being shut down). The console prompt changes to (recovery-mode)#
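The ports that stay up in recovery mode are set as an exclusion list under the virtual switch domain. This is a minimal sketch, assuming domain 100 and two hypothetical management ports (Gi1/4/48 and Gi2/4/48). As I understand Cisco's guidance, excluded ports should be routed physical ports and cannot be VSL or fast-hello ports.

```
switch virtual domain 100
 ! Hypothetical management ports, one per chassis, left up in recovery mode
 dual-active exclude interface GigabitEthernet1/4/48
 dual-active exclude interface GigabitEthernet2/4/48
 exit
```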

While in recovery mode, avoid config changes (don’t even type conf t). Entering config mode marks the config as modified, and the switch will then need manual intervention to rejoin the VSS (saving its config and reloading it so it comes back as the standby).
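Show commands don't touch the config, so stick to those while investigating. A few that should be available on a 4500-E VSS (exact keywords and output may vary by release):

```
! Read-only checks; none of these modify the configuration
show switch virtual
show switch virtual link
show switch virtual dual-active summary
show switch virtual dual-active fast-hello
```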

When the VSL ports come back up, the switch in recovery mode reloads itself and comes back as the standby chassis with all its ports up.

SW-4506E-VSS01(recovery-mode)#
22:51:10.702 GMT: %C4K_IOSINTF-5-LMPHWSESSIONSTATE: Lmp HW session UP on slot 1 port 2.
22:51:10.800 GMT: %C4K_IOSINTF-5-LMPHWSESSIONSTATE: Lmp HW session UP on slot 1 port 1.
22:51:12.927 GMT: %LINK-3-UPDOWN: Interface TenGigabitEthernet1/1/1, changed state to up
22:51:12.928 GMT: %LINK-3-UPDOWN: Interface TenGigabitEthernet1/1/2, changed state to up
22:51:26.699 GMT: %VSLP-5-VSL_UP:  Ready for control traffic
22:51:29.699 GMT: %SW_DA-1-VSL_RECOVERED: VSL has recovered during dual-active situation: Reloading switch 1
22:51:29.720 GMT: %VSLP-5-RRP_MSG: Role change from Active to Standby and hence need to reload
22:51:29.720 GMT: %VSLP-5-RRP_MSG: Reloading the system...%Unable to initiate reload in peer.
22:51:30.589 GMT: %RF-5-RF_RELOAD: Shelf reload. Reason: dual-active
22:51:31.563 GMT: %SYS-5-RELOAD: Reload requested by VS. Reload Reason: dual-active.
22:51:31.607 GMT: %SYS-3-LOGGER_FLUSHED: System was paused for 00:00:01 to ensure console debugging output.

<Sat Feb 10 22:51:32 2018> Message from sysmgr: Reason Code:[3] Reset Reason:Reset/Reload requested by [console]. [Reload command]

A successful recovery should show the following message near the end of the bootup process.

Initializing as Virtual Switch STANDBY processor

***********************************
*       STANDBY SUPERVISOR        *
*     REDUNDANCY mode is SSO      *
*        Continue bootup          *
***********************************

If the switch in recovery mode has unsaved configuration, it will not reload automatically once the VSL links are restored.

SW-4506E-VSS01(recovery-mode)#
21:30:49.901 GMT: %VSLP-5-VSL_UP:  Ready for control traffic
21:30:53.909 GMT: %VSLP-5-RRP_MSG: Role change from Active to Standby and hence need to reload
21:30:53.909 GMT: %VSLP-5-RRP_UNSAVED_CONFIG: Ignoring system reload since there are unsaved configurations.
Please save the relevant configurations
21:30:53.909 GMT: %VSLP-5-RRP_MSG: Use 'redundancy reload shelf' to bring this switch to its preferred STANDBY role

In this situation you must save the running config and reload manually. Only config changes made to the VSL ports on this switch are kept; all other config changes are discarded, as the node reboots as the VSS standby.

SW-4506E-VSS01(recovery-mode)#redundancy reload shelf
System configuration has been modified. Save? [yes/no]: yes
Reload the entire shelf [confirm]
Preparing to reload this shelf

SW-4506E-VSS01(recovery-mode)#%Unable to initiate reload in peer.
Feb 10 2018 21:41:18.755 GMT: %RF-5-RF_RELOAD: Shelf reload. Reason: Reload Shelf CLI
Feb 10 2018 21:41:19.742 GMT: %SYS-5-RELOAD: Reload requested by console. Reload Reason: Reload Shelf CLI.
<Sat Feb 10 21:41:20 2018> Message from sysmgr: Reason Code:[3] Reset Reason:Reset/Reload requested by [console]. [Reload command]

After the recovery (once the VSL is restored and the switch has rebooted), the new active switch's configuration overwrites the configuration on its peer (the old-active switch) when that peer becomes the hot-standby switch.
Changes made on the new active switch therefore do not need to be matched on the old-active switch, because the old-active switch's configuration (now the hot standby) will be overwritten.
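Once the old-active switch has rebooted and rejoined, you can quickly confirm that it came back as a hot standby in SSO rather than falling back to RPR. This is a sketch; check the keywords on your release.

```
! Expect the peer to show as standby hot, with SSO as the operating redundancy mode
show redundancy states
show switch virtual redundancy
```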

Mismatched configurations

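If the configuration on the active and standby devices differs after the reboot, the standby will keep rebooting and never come back up as part of the VSS.
It goes through the whole bootup process, but soon after “Initializing as Virtual Switch STANDBY processor” it fails the redundancy mode checks and reboots. Note that the log below blames an IOS version mismatch even though both supervisors report 15.2(1)E2. The standby then falls back to RPR mode, which a virtual switch cannot run in, so it resets.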

Initializing as Virtual Switch STANDBY processor

22:02:41.980: %C4K_IOSVSLENCR-3-VSLPMKKEYSTOREERROR: Failed to open PMK keystore file.
22:02:49.857: %C4K_IOSMODPORTMAN-4-POWERSUPPLYBAD: Power supply 2 has failed or been turned off
22:03:35.394: %C4K_IOSINTF-5-LMPHWSESSIONSTATE: Lmp HW session UP on slot 1 port 2.
22:03:35.406: %C4K_IOSINTF-5-LMPHWSESSIONSTATE: Lmp HW session UP on slot 1 port 1.
22:03:51.393: %VSLP-5-VSL_UP:  Ready for control traffic
22:03:54.408: %VSLP-5-RRP_ROLE_RESOLVED: Role resolved as STANDBY by VSLP
22:04:29.793: %C4K_REDUNDANCY-2-IOS_VERSION_CHECK_FAIL: STANDBY:IOS version mismatch. Active supervisor version is 15.2(1)E2 (cat4500e-UNIVERSALK9-M). Standby supervisor version is 15.2(1)E2 (cat4500e-UNIVERSALK9-M). Redundancy feature may not work as expected.
22:04:29.793: %C4K_REDUNDANCY-2-NON_SYMMETRICAL_REDUNDANT_SYSTEM: STANDBY:STANDBY supervisor will operate in fallback redundancy mode rpr.
22:04:33.087: %C4K_REDUNDANCY-3-COMMUNICATION: STANDBY:Communication with the peer Supervisor has been established
22:04:34.356: %C4K_REDUNDANCY-2-VS_REBOOT_ON_RPR_FALLBACK: STANDBY:Supervisor in virtual-switch configuration cannot operate in redundancy mode RPR, will be reset
22:04:35.184: %RF-5-RF_RELOAD: STANDBY:Self Reload. Reason: Virtual-switch fallback to RPR
22:04:35.686: %SYS-5-RELOAD: STANDBY:Reload requested by Platform redundancy manager. Reload Reason: Virtual-switch fallback to RPR.
22:04:35 2018> Message from sysmgr: Reason Code:[3] Reset Reason:Reset/Reload requested by [console]. [Reload command]

If you have this issue, isolate the switch from the network by unplugging all of its links, including DAD (this stops an active-active situation from causing network disruption). Removing the DAD links allows it to come back online, as it no longer sees the other VSS member. Once it is back online, compare the config on each switch, correct the problem, save the configuration on both and reboot the standby.
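One way to do the comparison is to capture the config from each switch's console and diff the two captures offline. This sketch uses the switch 1 fast-hello ports from the earlier log as a starting point; substitute the Gi2/x/12 ports when checking switch 2.

```
! Full config, to diff against the other switch's capture
show running-config
! The DAD (fast-hello) ports are a good first place to look
show running-config interface GigabitEthernet1/2/12
show running-config interface GigabitEthernet1/3/12
```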

When I had this problem, the cause was that we had just configured the DAD links and, for some reason, one of the switches had cdp enable on one of its DAD links. Therefore, when configuring DAD links it is best to default the interfaces and make sure their configs are the same before applying dual-active fast-hello (on the 4500 the port needs to be a switchport).

(config)# default interface gi 1/2/12
(config)# interface gi 1/2/12
(config-if)# dual-active fast-hello
WARNING: Interface GigabitEthernet1/2/12 placed in restricted config mode. All extraneous configs removed!

One thought on “VSS Recovery mode”

  1. Another option if the configs are out of sync is to save the good running config to external flash (disk0:somefile.cfg), move the flash to the other switch, copy disk0:somefile.cfg to startup-config and reload.

    I’ve tested similar scenarios on a quad VSS, moving Sw2 to standalone and then back to quad VSS. Once in virtual mode again, it was much less error-prone to copy the original active VSS config to startup on Sw2 and reload again.
