I’ve recently been dealing with an interesting issue in my lab, the result of deciding to upgrade the entire infrastructure from vSphere 5.1 to vSphere 5.5 GA.
I’m lucky to have a pair of HP MSA P2000 units in my lab, connected to (amongst other boxes) some HP hosts – these items are the focus of this post. The MSAs and hosts were in the configuration as follows:
- MSAs paired over L2 network, each connected to a different /24 network. No QoS applied to either network.
- Hardware fully patched to the latest HP revisions.
- MSAs paired together using HP RemoteSnap replication software (for testing VMware Site Recovery Manager).
- Snapshots taken and replicated from ‘primary’ to ‘secondary’ every 30 minutes.
- Single vDisk defined per site, each with a single defined volume.
- Both sites have snap pools, as required by the RemoteSnap software.
At the time this occurred, I was in the process of upgrading the infrastructure to 5.5. vCenter, SSO, Inventory and the DBs had all been upgraded successfully, but the ESXi hosts were still on 5.1 (build 799733).
The first sign of trouble came from one of the ESXi hosts. A PSOD! I hadn’t seen one of those in a while, and rebooting brought the host back to an apparently working state. Oh – all the VMs in the cluster were showing as disconnected. Strange. So, I checked the datastores – none mounted! On checking the iSCSI software initiator, the configuration was correct, so all was well there. Next thing – check the storage.
This was where things got interesting. HP MSAs have two administrative interfaces: the HP Storage Management Utility (SMU – web interface) and the HP CLI (you guessed it – SSH). Both interfaces are available on both storage management controllers – leaving four possible administrative touch points per storage chassis. On connecting to these interfaces, only one would work – SSH to controller 2. A few status commands run via the HP CLI interface quickly showed the health of the unit.
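As a rough sketch, these are the sort of health-check commands I mean, run from the one SSH session that still responded (exact command output and event filters vary between MSA firmware releases, so treat this as illustrative rather than a transcript):

```
# show system                    -- overall unit status and health
# show controllers               -- state of controller A and controller B
# show events last 50            -- recent event log entries, including crashes
```

Between them, these showed the unit was unhealthy and one controller was unreachable.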
So, the controllers have crashed! What to do?
- ESXi host has PSOD, now rebooted.
- VMs disconnected, because no storage mounted.
- Storage unresponsive to management access.
With no management access to the MSA and no other options left, I set about a fix.
To get out of this, three things needed to happen: get the storage back online, find out what caused the crash, and prevent the issue from recurring.
1. Getting The Storage Online:
The only option for regaining access to the controllers was to restart the MSA – a full power cycle. A graceful shutdown isn’t possible without management access, so a hard reset was the only option available. The process I followed was:
- Put ESXi hosts into Maintenance Mode (to prevent further damage to the LUNs in case of residual connection, and to prevent locking or disk resignature issues).
- Remove the MSA power cables.
- Reconnect after 5 minutes.
- Power-on the ESXi hosts.
Once the storage was back online, management access via all four routes was re-established, and I could immediately get into the SMU. From there, I was able to check the vDisks and confirm there was no damage to the volumes and that the integrity of the data was intact.
After the storage returned and the ESXi hosts were powered-on, I was able to recheck the iSCSI datastores were mounted – they were, and the VMs all reconnected to vCenter.
To find out what caused this issue, I needed to look at the logs. Via the SMU, there is an option to download a log bundle, similar to the VMware support log bundle available with vSphere. Looking through these logs and the alerts / errors on the SMU, I found a couple of interesting entries:
```
A106522 2014-01-26 15:40:25 107 ERROR Critical Error: Fault Type: Page Fault p1: 0x0281D8B, p2: 0x0283C68, p3: 0x028517F, p4: 0x028528D CThr:dms_failove
B2054   2014-01-27 12:46:58 107 ERROR Critical Error: Fault Type: Page Fault p1: 0x0281D8B, p2: 0x0283C68, p3: 0x028517F, p4: 0x028528D CThr:dms_failove
```
OK. Now it was time to talk to HP! They told me this was down to VAAI: erroneous snapshots had been orphaned inside the MSA, and VAAI hardware calls from the ESXi hosts against them were causing BOTH controllers to fail. (I had assumed redundant controllers in a storage unit would prevent this type of thing, but the issue relates to a single vDisk snap pool, which both controllers talk to…)
So now a decision needed to be taken. Do I continue using RemoteSnap, or do I give vSphere Replication in 5.5 a go? I decided on the latter.
So what steps did I take to make sure the issue doesn’t come back? There were several, detailed below.
Disable VAAI on ESXi 5.1, so the commands aren’t sent to crash the controllers again.
- Log in to vCenter. (You can also do this with ESXCLI.)
- Select the Host > Configuration > Advanced Settings.
- Set DataMover.HardwareAcceleratedMove to '0'
- Set DataMover.HardwareAcceleratedInit to '0'
- Set VMFS3.HardwareAcceleratedLocking to '0'
- (No host reboot is needed for these changes.)
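For reference, the same three settings can be changed from the ESXi shell with esxcli instead of the vSphere Client – a sketch of the ESXi 5.x syntax, run once per host:

```shell
# Disable the three VAAI primitives on an ESXi 5.x host (no reboot required)
esxcli system settings advanced set -o /DataMover/HardwareAcceleratedMove -i 0
esxcli system settings advanced set -o /DataMover/HardwareAcceleratedInit -i 0
esxcli system settings advanced set -o /VMFS3/HardwareAcceleratedLocking -i 0

# Verify the new values took effect
esxcli system settings advanced list -o /DataMover/HardwareAcceleratedMove
esxcli system settings advanced list -o /DataMover/HardwareAcceleratedInit
esxcli system settings advanced list -o /VMFS3/HardwareAcceleratedLocking
```

Setting these to 1 later re-enables VAAI, should the array firmware be fixed.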
Make changes to the vDisks to remove replication. (Note: these steps need to be performed on both sites!) The HP CLI manual from HP can be found here: HP CLI Reference Manual
- Login to the SMU.
- Delete the replication schedule, to prevent the snapshots from being taken and sent to the remote system.
- Login to a controller via HP CLI with administrative / change credentials.
- Delete any replication snapshots. (Use ‘show snapshots’ and ‘delete snapshots’ commands). Note: if these fail, you can use the ‘force’ option.
- Reset the master replication volume to a standard volume. (Use ‘convert master-to-std’ command).
- Delete the snap pools associated with the volume. (Use ‘delete snap-pool’ command).
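Strung together, the CLI side of that teardown looks roughly like this. A sketch only: the volume and snap-pool names are hypothetical placeholders, and exact argument syntax varies slightly between MSA firmware releases, so confirm names with the show commands first:

```
# show snapshots                        -- list replication snapshots by name
# delete snapshots VD01_V001_s0042      -- add the 'force' option if this fails
# convert master-to-std VD01_V001       -- turn the master volume back into a standard volume
# delete snap-pool VD01_SP001           -- remove the now-unused snap pool
```

Remember this has to be repeated on the secondary system as well.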
Once all these have been completed, the volume should now be accessible as a standard iSCSI volume, ready to be mapped to ESXi hosts in the normal way. Hope this was helpful!