HP P2000, Controller Crashes & ESXi PSOD!

I’ve recently been dealing with an interesting issue in my lab, as a result of deciding to upgrade the entire infrastructure to vSphere 5.5GA from vSphere 5.1.

I’m lucky to have a pair of HP MSA P2000 units in my lab, connected to (amongst other boxes) some HP hosts – these are the focus of this post. The MSAs and hosts were configured as follows:

  • MSAs paired over L2 network, each connected to a different /24 network. No QoS applied to either network.
  • Hardware fully patched to the latest HP revisions.
  • MSAs paired together using HP RemoteSnap replication software (for testing VMware Site Recovery Manager).
  • Snapshots taken and replicated from ‘primary’ to ‘secondary’ every 30 minutes.
  • Single vDisk defined per site, each with a single defined volume.
  • Both sites have snap pools, as required by the RemoteSnap software.

At the time of this occurring, I was in the process of upgrading the infrastructure to 5.5. vCenter, SSO, Inventory and the DBs had all been upgraded successfully, but the ESXi hosts were still on 5.1 (build 799733).

Symptoms:

The first sign of trouble came from one of the ESXi hosts: a PSOD! Not seen one of those in a while, so I rebooted, which brought the host back to a nominally working state. Oh – and all the VMs in the cluster were showing as disconnected. Strange. So, I checked the datastores – none mounted! On checking the iSCSI software initiator, the configuration was correct, so all was well there. Next thing – check the storage.
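As an aside, the same checks can be made from the ESXi shell if the vSphere Client is playing up. A rough sketch: ‘esxcli storage filesystem list’ shows which VMFS datastores are mounted, ‘esxcli iscsi adapter list’ confirms the software initiator is enabled, and ‘esxcli iscsi session list’ shows whether any iSCSI sessions are actually established to the array.

esxcli storage filesystem list

esxcli iscsi adapter list

esxcli iscsi session list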

This was where things got interesting. HP MSAs have 2 administrative interfaces: the HP Storage Management Utility (SMU – web interface) and the HP CLI (you guessed it – SSH). Both interfaces are available on both storage management controllers – leaving 4 possible administrative touch points per storage chassis. On connecting to these interfaces, only 1 would work – SSH to controller 2. Running the following useful commands via the HP CLI interface quickly shows the health of the unit:

show system

show controllers

show vdisks

show volumes

show snapshots

So, the controllers have crashed! What to do?

Status:

  1. ESXi host has PSOD, now rebooted.
  2. VMs disconnected, because no storage mounted.
  3. Storage unresponsive to management access.

With no options left and no management access to the MSA, I set about a fix.

Resolution:

To get out of this, 3 things needed to happen: get the storage back online, find out what caused the crashes, and prevent the issue from recurring.

1. Getting The Storage Online:

The only option for getting access back to the controllers was to restart the MSA – a full power cycle, from cold. A graceful shutdown wasn’t possible without management access, so a hard reset was the only route available. The process I followed was:

  1. Put ESXi hosts into Maintenance Mode (to prevent further damage to the LUNs in case of residual connection, and to prevent locking or disk resignature issues – see the command sketch after this list).
  2. Remove the MSA power cables.
  3. Reconnect after 5 minutes.
  4. Power-on the ESXi hosts.

Once the storage was back online, management access via all 4 routes was re-established, and I could immediately get into the SMU. From there, I was able to check the vDisks and confirm that there was no damage to the volumes and that the integrity of the data was intact.

After the storage returned and the ESXi hosts were powered-on, I was able to confirm that the iSCSI datastores were mounted – they were – and the VMs all reconnected to vCenter.

Troubleshooting:

To find out what happened to cause this issue, I needed to look at the logs. Via the SMU, there is an option to download a log bundle, similar to the VMware support log bundle available with vSphere. Looking through these logs and the alerts / errors on the SMU, I found a couple of interesting alerts:

A106522    2014-01-26 15:40:25  107   ERROR          Critical Error: Fault Type: Page Fault  p1: 0x0281D8B, p2: 0x0283C68, p3: 0x028517F, p4: 0x028528D   CThr:dms_failove

B2054      2014-01-27 12:46:58  107   ERROR          Critical Error: Fault Type: Page Fault  p1: 0x0281D8B, p2: 0x0283C68, p3: 0x028517F, p4: 0x028528D   CThr:dms_failove

OK. Now it was time to talk to HP! They told me that this is down to VAAI crashing the controllers: erroneous snapshots had been orphaned inside the MSA, and the VAAI hardware calls from ESXi were causing BOTH controllers to fail. (I had thought that redundant controllers in a storage unit would prevent this type of thing, but the issue relates to a single vDisk snap pool, which both controllers talk to…)

Decision Time:

So now a decision needed to be made. Do I continue with RemoteSnap, or do I give vSphere Replication in 5.5 a go? I decided on the latter.

Remedial Actions:

So what steps did I take to make sure the issues don’t come back? There are several, detailed below.

Disable VAAI on ESXi 5.1, so the offload commands that crash the controllers are no longer sent.

  1. Login to vCenter. (You can also do this with ESXCLI – see the sketch after this list.)
  2. Select the Host > Configuration > Advanced Settings.
  3. Set DataMover.HardwareAcceleratedMove to '0'
  4. Set DataMover.HardwareAcceleratedInit to '0'
  5. Set VMFS3.HardwareAcceleratedLocking to '0'
  6. (No host reboot is needed for these changes).
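For anyone preferring the command line, the same three settings can be flipped with ESXCLI on each host. A sketch, using the standard advanced option paths:

esxcli system settings advanced set --option /DataMover/HardwareAcceleratedMove --int-value 0

esxcli system settings advanced set --option /DataMover/HardwareAcceleratedInit --int-value 0

esxcli system settings advanced set --option /VMFS3/HardwareAcceleratedLocking --int-value 0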

Make changes to the vDisks to remove replication. (Note: these need to be done on both sites! A sketch of the CLI sequence follows the list below.) The HP CLI Reference Manual can be found here: HP CLI Reference Manual

  1. Login to the SMU.
  2. Delete the replication schedule, to prevent the snapshots from being taken and sent to the remote system.
  3. Login to a controller via HP CLI with administrative / change credentials.
  4. Delete any replication snapshots. (Use ‘show snapshots’ and ‘delete snapshots’ commands). Note: if these fail, you can use the ‘force’ option.
  5. Reset the master replication volume to a standard volume. (Use ‘convert master-to-std’ command).
  6. Delete the snap pools associated with the volume. (Use ‘delete snap-pool’ command).

Once all these have been completed, the volume should now be accessible as a standard iSCSI volume, ready to be mapped to ESXi hosts in the normal way. Hope this was helpful!

Synology DS1513+ Released

The Synology DS1512 has been a popular choice for many home labs in recent years. I hoped that the company’s raft of recent product updates would reach this model eventually. Well, my wish was granted as Synology have announced the DS1513+.

There are a few modifications to note. The one that stands out the most at first glance is the doubling of LAN capability.  The DS1513+ boasts no fewer than 4 RJ45 ports. That does seem like quite a lot. It does open up some interesting possibilities though…

The full specifications for the DS1513+ can be found here.

Home Lab Build: vSpecialistLabs v2

So, time to update the home labs information, and yes, this time I may have overdone it a little in one or two areas.

I spent most of Sunday rebuilding my home lab (christened some time ago as vSpecialistLabs v2), adding some elements, changing and tweaking some hardware, and removing other hardware I wasn’t using at the time.

Essentially, I’ve ended-up with a home lab that comprises the following aspects:

  • Multi-site VM configuration, with multi-host clusters at both sites.
  • iSCSI shared storage for main ‘production’ site.
  • vSphere Replication to backup ‘DR’ site.
  • Managed networking.

Below is a picture of my home lab set-up, and you can immediately see where I may have gone OTT – screens! For some reason, I love to have screen real estate.


The components of the lab / set-up are as follows:

  1. Servers:
    1. Server 1: IBM S5520HC chassis with 2 x E5520 2.26GHz, 24GB RAM, 1TB SATA, H/W iSCSI & Dual 1GB NICs.
    2. Server 2: As server 1 above.
    3. Server 3: HP NL35l MicroServer with 8GB RAM.
    4. Server 4: As server 3 above.
    5. Main PC. Desktop PC from Servers Plus. (Updated range can be found here). Intel i7-2700 Quad core @ 3.50GHz, 32GB RAM, 2 x OCZ Vertex 4 SSDs and 2TB SATA, X64 Windows 7 Pro. Eizo CE210W (main monitor) plus Dell E177FP (second monitor).
  2. Storage:
    1. 8 TB QNAP 459 Pro II NAS. (4 x 2TB drives in 2 RAIDs).
    2. Iomega external 1TB USB/FW disk.
  3. Networking:
    1. HP 1910-16G Managed gigabit switch.
    2. HP 1410-8G Un-managed gigabit switch.
  4. Accessories:
    1. Belkin Soho 4-port VGA KVM, with bluetooth USB keyboard – for all servers.

I will add more information about how the lab grows and is configured – especially in light of the revision required to update my VCAP certification to v5. Things to note:

  • The cabling is far from finished! I’m still on connectivity at the moment – looking pretty is next phase.
  • Power configuration is top of the list. Running this from multi-plugs is not ideal (at least they aren’t daisy-chained!). The servers and PCs are all connected to surge protector PDUs.
  • For my PC, iMac and laptops I use Synergy across all clients for a single KVM view. For the servers, I use the Belkin KVM and separate keyboard.
  • I’m not a specific network focused bod, but I am looking at expanding the lab in the near future into the Cisco arena, for CCENT certification and beyond.

In the meantime, please feel free to ask questions or comment on my set-up, I’m always looking for ways to improve!

Nutanix Bloggers’ Session 08/10/2012

I was invited to a briefing by the vendor Nutanix on Monday at VMworld. Now, there are a lot of new and recent startups in the storage space, and keeping a handle on them all could occupy my time completely, so I did hesitate to accept the invitation at first.

I had heard some good things about Nutanix from other bloggers though and, after looking at their website, I was intrigued to find out a little more. Along with a few other bloggers, I found my way to the Tryp Apolo hotel in Barcelona, where we were greeted by a number of Nutanix employees from EMEA and the US, along with London VMUG’s very own Jane Rimmer.

Perhaps now is a good time to explain what it is that Nutanix do. They claim to be a software company but their software is only available on their hardware. I would perhaps think of them more as a storage solutions company. Anyway, that’s semantics.

Nutanix’s product aims to provide a full virtualization platform that performs consistently well, scales linearly and, most importantly, does not require any shared storage. That’s right, no shared storage. No SAN.

Each node (host) is a fairly standard x64 architecture server with dual processors. Presently each node comes equipped with 320GB of PCIe SSD (Fusion-io), 300GB of SATA SSD and 5TB of SATA HDDs. Each node also has 1x10GbE and 2x1GbE networking connections. Nodes are manufactured in blocks of 4, and each node has VMware ESXi pre-installed on it.

Aside from combining the hardware, Nutanix’s secret sauce is in how that local storage is presented to ESXi. When the nodes are clustered, the available storage is combined and presented as a VMFS datastore to all of the hosts in the cluster. VMs provisioned on a host will have their files stored locally, although it will appear as if they are being stored on a shared datastore when viewed through the vSphere Client. Behind the scenes, the Nutanix software actually replicates those files to other hosts within the cluster (imagine that there are more hosts than shown below – this was just a quick diagram that I knocked up).

The fact that the datastore is presented to all hosts means that vMotion and HA both work as intended. If a VM ends up on another host, Nutanix will move that VM’s files to the correct host in the background, completely transparently.

With respect to scaling, Nutanix say that you can just add blocks to an existing deployment. As each node has its own storage, it should have more than adequate storage performance to handle the VM load placed on it. Clever stuff, but does it really work and does it really scale?

Being the diligent bloggers that we are, we asked plenty of questions, and Nutanix seemed to have all of the right answers. For me, the idea of scaling in that way is perfect for a growing business. More established enterprises may be too heavily invested in existing technologies to consider it though. Technically it’s a clever solution too, no doubt about that, but perhaps they may need to introduce a few more sizing options for the hosts over time, or open the software up to being used on other hardware platforms.

After that, Nutanix gave us some insights into the future development of their product. I can’t go into details unfortunately but I look forward to seeing how they progress.

Thanks to Jane and Nutanix for organising the session (and the drinks afterwards) and talking with us all.

QNAP VAAI Details

I did promise to pop back to QNAP’s stand at VMworld Europe when I posted yesterday about them introducing VAAI across their range of storage appliances. True to my word, I popped in for a chat.

As a reminder, VAAI (vStorage APIs for Array Integration) enables ESXi hosts to offload specific virtual machine and storage management operations to compliant storage hardware – basically taking some of the storage load from the hosts and letting the storage hardware handle it.

Now, whilst the functionality will be available across their range of products with release 3.8, it seems likely that they are only going to certify it on the x79 series. It will work on all of their current and past models, however. The features to be implemented are:

  • Block Zeroing – used during the creation of vmdk disk files
  • Block Copy – used when deploying and cloning VMs / templates. Rather than the ESXi host copying vmdk files from the storage and re-writing them back, the copy is performed by the storage hardware.
  • Hardware accelerated locking – (aka Atomic Test & Set) used during the creation and locking of files on a volume
  • vSphere Client Integration – allows provisioning and management of datastores from within the vSphere client
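Once 3.8 arrives, it should be easy enough to confirm whether an ESXi host actually sees those primitives as supported on a QNAP-backed device. A quick check from the ESXi shell, which reports the ATS, Clone, Zero and Delete status per device:

esxcli storage core device vaai status get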

QNAP said that 3.8 will be available sometime in November, although their website makes no mention of it currently. I did ask about other features, such as VASA (vStorage APIs for Storage Awareness), but there’s no word on those yet. Personally, I suspect they knew a little more than they were letting on.