Concept Explained: Running Multiple vSphere Replication Appliances

Playing with vSphere Replication in my lab recently (following my joyous experiences with my HP MSA P2000, detailed here), I came across a small gotcha that might catch out those who, like me, learn by playing with things whilst reading up on the technical deployment aspects alongside.

vSphere Replication in vSphere 5.5 has been enhanced over the v5.0/5.1 releases to enable multiple replication appliances to be deployed and managed by the same vCenter instance. vSphere 5.5 allows up to 10 replication appliances per vCenter, selectable when configuring a VM for replication or available for auto-selection.

The gotcha comes when registering subsequent appliances to a vCenter after the initial replication connection has been established between a remote and a local vCenter.

Concept

When configuring vSphere Replication, a single appliance (shown as ‘Embedded’ in the Web Client) is listed under vCenter Inventory > vSphere Replication > (vCenter) > Manage > vSphere Replication > Replication Servers. Additional servers can be deployed (up to a maximum of 10) using the following model:

(Image: vSR-Concept – vSphere Replication multi-appliance model)

Note: There is a single ‘embedded’ or master appliance, which runs the VRMS and an embedded database for the vCenter. This tracks replication for all the VMs in the vCenter instance. Additional appliances registered to the vCenter do not maintain their own databases – instead they are secondary (or slave) to the embedded or master appliance. (The dependency arrows might not be 100% correct in terms of direct communication, but the diagram is just there to demonstrate the principle that secondary appliances don’t maintain their own databases, using the embedded DB in the master appliance instead.)

Gotcha!

When deploying a second, third, or fifth appliance to a vCenter, remember:

  • DON’T deploy a second vSphere Replication appliance from the original OVA file downloaded from VMware.
  • DO use the Replication Servers screen in Web Client to register additional appliances to the existing vCenter server.
  • DO use the ‘Add On’ package downloaded alongside the original OVA file from VMware for deployment of additional appliances, not the original OVA file itself.

IF you do deploy a second complete OVA master and register it with the vCenter Server, this will corrupt the vSphere Replication entry in the vCenter Inventory, and the connection will need to be removed and reset before the two appliances can connect and communicate for replication.

HP P2000, Controller Crashes & ESXi PSOD!

I’ve recently been dealing with an interesting issue in my lab, as a result of deciding to upgrade the entire infrastructure from vSphere 5.1 to vSphere 5.5 GA.

I’m lucky to have a pair of HP MSA P2000 units in my lab, connected to (amongst other boxes) some HP hosts – these are the focus of this post. The MSAs and hosts were configured as follows:

  • MSAs paired over L2 network, each connected to a different /24 network. No QoS applied to either network.
  • Hardware fully patched to the latest HP revisions.
  • MSAs paired together using HP RemoteSnap replication software (for testing VMware Site Recovery Manager).
  • Snapshots taken and replicated from ‘primary’ to ‘secondary’ every 30 minutes.
  • Single vDisk defined per site, each with a single defined volume.
  • Both sites have snap pools, as required by the RemoteSnap software.

At the time of this occurring, I was in the process of upgrading the infrastructure to 5.5. vCenter, SSO, the Inventory Service and the DBs had all been upgraded successfully, but the ESXi hosts were still on 5.1 (build 799733).

Symptoms:

The first sign of a problem came from one of the ESXi hosts: a PSOD! Not seen one of those in a while, so I rebooted the host, which brought it back to working status. Oh – and all the VMs in the cluster were showing as disconnected. Strange. So I checked the datastores – none were mounted! On checking the iSCSI software initiator, the configuration was correct, so all was well there. Next thing – check the storage.
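
As an aside, the same initiator check can be done from the host’s shell rather than the vSphere Client. On ESXi 5.x, something like the following shows the software iSCSI adapter and its sessions – a quick sanity check that the host side is configured and talking to the array:

# List the iSCSI adapters configured on the host
esxcli iscsi adapter list

# List the current iSCSI sessions to the storage targets
esxcli iscsi session list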

This was where things got interesting. HP MSAs have 2 administrative interfaces: the HP Storage Management Utility (SMU – web interface) and the HP CLI (you guessed it – SSH). Both interfaces are available on both storage management controllers – leaving 4 possible administrative touch points per storage chassis. On connecting to these interfaces, only 1 would work – SSH to controller 2. Running the following useful commands via the HP CLI interface quickly shows the health of the unit:

show system

show controllers

show vdisks

show volumes

show snapshots

So, the controllers have crashed! What to do?

Status:

  1. ESXi host has PSOD, now rebooted.
  2. VMs disconnected, because no storage mounted.
  3. Storage unresponsive to management access.

With no options left and no management access to the MSA, I set about a fix.

Resolution:

To get out of this, 3 things need to happen: get the storage back online, find out what caused the crashes, and prevent the issue from recurring.

1. Getting The Storage Online:

The only option for getting access back to the controllers was to restart the MSA – a full power cycle. A graceful shutdown wasn’t possible without management access, so a hard reset was the only option available. The process I followed was:

  1. Put ESXi hosts into Maintenance Mode (to prevent further damage to the LUNs in case of residual connection, and to prevent locking or disk resignature issues).
  2. Remove the MSA power cables.
  3. Reconnect after 5 minutes.
  4. Power-on the ESXi hosts.

Once the storage was back online, management access via all 4 routes was re-established, and I could immediately gain access to the SMU. Once the SMU was available, I was able to check the vDisks and make sure there was no damage to the volumes and that the integrity of the data was intact.

After the storage returned and the ESXi hosts were powered-on, I was able to recheck that the iSCSI datastores were mounted – they were, and the VMs all reconnected to vCenter.
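
If you prefer to confirm this from the host’s shell, a rescan and a filesystem listing on ESXi 5.x does the job – a sketch along these lines:

# Rescan all storage adapters so the host picks the LUNs back up
esxcli storage core adapter rescan --all

# List mounted filesystems – the VMFS datastores should show as mounted again
esxcli storage filesystem list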

Troubleshooting:

To find out what had caused this issue, I needed to look at the logs. Via the SMU, there is an option to download a log bundle, similar to the VMware support log bundle available with vSphere. Looking through these logs and the alerts/errors on the SMU, I found a couple of interesting alerts:

A106522    2014-01-26 15:40:25  107   ERROR          Critical Error: Fault Type: Page Fault  p1: 0x0281D8B, p2: 0x0283C68, p3: 0x028517F, p4: 0x028528D   CThr:dms_failove

B2054      2014-01-27 12:46:58  107   ERROR          Critical Error: Fault Type: Page Fault  p1: 0x0281D8B, p2: 0x0283C68, p3: 0x028517F, p4: 0x028528D   CThr:dms_failove

OK. Now it’s time to talk to HP! They told me this was down to VAAI crashing the controllers: erroneous snapshots had been orphaned inside the MSA, and hardware calls from ESXi VAAI were causing BOTH controllers to fail. (I had thought that redundant controllers in a storage unit would prevent this type of thing, but the issue relates to a single vDisk snap pool, which both controllers talk to.)

Decision Time:

So now a decision needed to be made. Do I continue using RemoteSnap, or do I give vSphere Replication in 5.5 a go? I decided on the latter.

Remedial Actions:

So what steps did I take to make sure the issues don’t come back? There were several, detailed below.

Disable VAAI on ESXi 5.1, so the commands that crashed the controllers aren’t sent again.

  1. Login to vCenter. (You can also do this with ESXCLI – see the sketch after this list.)
  2. Select the Host > Configuration > Advanced Settings.
  3. Set DataMover.HardwareAcceleratedMove to '0'
  4. Set DataMover.HardwareAcceleratedInit to '0'
  5. Set VMFS3.HardwareAcceleratedLocking to '0'
  6. (No host reboot is needed for these changes).
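
For reference, the ESXCLI route looks something like this when run on each ESXi 5.x host (a sketch – the option paths match the advanced settings above):

# Disable the three VAAI primitives (0 = disabled, 1 = enabled)
esxcli system settings advanced set --int-value 0 --option /DataMover/HardwareAcceleratedMove
esxcli system settings advanced set --int-value 0 --option /DataMover/HardwareAcceleratedInit
esxcli system settings advanced set --int-value 0 --option /VMFS3/HardwareAcceleratedLocking

# Confirm the current values
esxcli system settings advanced list --option /DataMover/HardwareAcceleratedMove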

Make changes to the vDisks to remove replication. (Note: these need to be done on both sites!) The HP CLI Reference Manual can be found here: HP CLI Reference Manual. A worked example of the CLI steps follows the list below.

  1. Login to the SMU.
  2. Delete the replication schedule, to prevent the snapshots from being taken and sent to the remote system.
  3. Login to a controller via HP CLI with administrative / change credentials.
  4. Delete any replication snapshots. (Use ‘show snapshots’ and ‘delete snapshots’ commands). Note: if these fail, you can use the ‘force’ option.
  5. Reset the master replication volume to a standard volume. (Use ‘convert master-to-std’ command).
  6. Delete the snap pools associated with the volume. (Use ‘delete snap-pool’ command).
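
Pulled together, the HP CLI portion looks something like this (a sketch – the snapshot, volume and snap-pool names are placeholders for whatever the show snapshots and show volumes commands report on your system):

# List and delete the replication snapshots (append the force option if a delete fails)
show snapshots
delete snapshots RepSnap1,RepSnap2

# Convert the replication master volume back to a standard volume
convert master-to-std MasterVol1

# Remove the snap pool that backed the replication snapshots
delete snap-pool SnapPool1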

Once all these have been completed, the volume should now be accessible as a standard iSCSI volume, ready to be mapped to ESXi hosts in the normal way. Hope this was helpful!

Improving vSphere Web Client Performance

With the (now maturing) VMware vSphere 5.5 release, more and more operations (but not all – yet) are being migrated to the vSphere Web Client.

All of the vSphere 5.1 features remain fully available in the vSphere 5.5 .Net Windows client (the traditional client), along with the Site Recovery Manager and Update Manager administration functions, but any new vSphere 5.5 features are only available via the Web Client.

Lots is made of the performance of the Web Client, and having used it in my home lab and now in Production environments, I can see why some users report a perceived performance lag in the Web Client compared to the Windows client (population of menus, general navigation etc.). First off, a direct comparison of similar tasks shows the Web Client is slower than the Windows client, but there are a couple of things you can do as an administrator to improve the situation.

  1. Use a local browser on a server via a jumpstation if connecting to the infrastructure remotely. It might sound obvious, but with the Web Client using Flash, if you are connecting over home broadband, a VPN or a WAN link to your DC, then shortening the path between the browser and the vCenter server improves performance significantly.
  2. Change the Flash settings of your browser. Because of the Web Client’s reliance on Flash, there are some settings that can help improve the performance of the Flash plug-in within the browser. Changing the ‘Local Website Storage’ setting increases the temporary storage available to Flash from the default 100KB to something higher and more performant. The default is set low intentionally for Flash security reasons, rather than anything specific to the vSphere Web Client. Fortunately, Adobe gives a simple live view of the Flash settings for your browser – to enable simple updating of the required setting.
    1. Visit:  http://www.macromedia.com/support/documentation/en/flashplayer/help/settings_manager07.html
    2. In the live view box, select your vCenter server (either by DNS or IP address) – see image below.
    3. Change the settings slider from 100kb up to 10MB or unlimited (mine is set to Unlimited).
    4. Close the website and browser session.
    5. Reload the Web Client. Is performance better? It might be with usage…..
  3. Another tip is to change the Tomcat configuration on the vCenter server. VMware has a KB published on this, where they talk about the ‘Small’, ‘Medium’ and ‘Large’ infrastructure sizes we see at installation time. The change is about raising the JVM heap size to 3GB (usually for large installations), as this then impacts the vFabric tc Server on which the vCenter services run. I have used this a couple of times for customers who have seen performance degradation in their vCenter Web Clients – a sketch of the heap change follows below the image.

(Image: flashwebsettings – Flash Website Storage Settings panel)
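
For the Tomcat change in tip 3, the adjustment boils down to raising the maximum Java heap for the tc Server service to 3GB. The exact file and property names depend on your vCenter version and platform, so treat the snippet below as illustrative only and follow the VMware KB for your build – the wrapper.conf location is an assumption based on my own Windows installs:

# Illustrative only – the Java Service Wrapper config (wrapper.conf) for the Web Client
# service on a Windows vCenter install; raise the maximum JVM heap to 3GB.
wrapper.java.maxmemory=3072
# ...or, where the heap is passed directly as a JVM argument:
# -Xmx3072m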

Hopefully these tips are useful – and the performance of your vSphere Web Clients improves as a result!

Update: Apparently VMware Support may use blogging sites to forward information to customers! Item 2 in the list above is also listed on the virtuallyGhetto blog of William Lam (Twitter: @lamw).

vCloud Director 5.1 to 5.5 Cell Upgrade ‘cpio: chown failed’

Upgrading my lab environment from vCloud Director v5.1 to v5.5, I came across an interesting error whilst upgrading the cells. My lab has the following vCD configuration:

  • 2 x RHEL 6.2 Cells
  • 1 x RHEL 6.2 NFS Server
  • 2 x vShield load balancer instances
  • 1 x Windows 2008 R2 DB server running SQL Server 2005

The upgrade process was:

  1. Quiesce the cell using the Cell Management Tool commands – see the sketch after this list. (Upgrade Guide)
  2. Upload the vCD .BIN file to the /install directory of the cell (using WinSCP or similar).
  3. Change the execution parameters for the vCD .BIN file. (Upgrade Guide)
  4. Run the installation .BIN file. (Upgrade Guide)
  5. Confirm the existing v5.1 cell instance can be upgraded.
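
For step 1, the quiesce and shutdown are run with the cell-management-tool on the cell itself – something like the following (a sketch; substitute your own vCD administrator account):

# Check the job count on the cell, then quiesce it so no new jobs are accepted
/opt/vmware/vcloud-director/bin/cell-management-tool -u administrator cell --status
/opt/vmware/vcloud-director/bin/cell-management-tool -u administrator cell --quiesce true

# Once the active job count reaches zero, shut the cell down cleanly before upgrading
/opt/vmware/vcloud-director/bin/cell-management-tool -u administrator cell --shutdown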

This is where the interesting error came in: ‘error: unpacking of archive failed on file /opt/vmware/vcloud-director/data/transfer: cpio: chown failed – invalid argument’.

(Image: vcd5.5cellupgrade – cell upgrade error output)

Now, because my ‘transfer’ folder is actually an NFS share exported from a third server that doesn’t host a vCD cell, I did a little digging around. I found references to 2 main things – no_root_squash, and the NFS version used for the export mount. On my NFS server, the export was already set with the (rw,no_root_squash) parameters, but I rebooted both the cell and the NFS server anyway. The other suggestion was that there are potential issues with NFSv4 mounts. So, I changed the mount version in /etc/fstab to NFSv3 using the following fstab line entry:

 <NFS Server IP>:nfs    /opt/vmware/vcloud-director/data/transfer/    nfs    rw,vers=3    0    0

Save the changes to /etc/fstab, reboot the cell, and retry the cell upgrade using the .BIN file from earlier.
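
For completeness, the export on the NFS server side looked something like this in /etc/exports (the path and subnet here are placeholders for my lab values):

# /etc/exports on the NFS server – transfer share exported read/write, root squashing disabled
/nfs    192.168.1.0/24(rw,no_root_squash)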

With the export set as NFS v3, the upgrade should be successful and the cell upgrade can proceed.

IT Disaster Recovery Preparedness Benchmark

Disaster Recovery (and Business Continuity) were sometimes an afterthought even as recently as a few years ago. When I started out in IT the attitude was usually similar to that of an ostrich burying its head in the sand. Thankfully times have clearly moved on.

Yesterday a press-release was brought to my attention that I’d like to share. It concerns a new research advisory council that has been created to help provide IT professionals (and, by extension, businesses) with a reflective measure of how prepared they are to handle Disaster Recovery situations. The DRP Council, as it’s known, have launched an online survey that takes just a few minutes to complete:

(Image: drpb)

As recent cyber-attacks and natural disaster events have shown, the need for IT disaster recovery preparedness has never been greater. However, research indicates that less than half of all companies have a disaster recovery plan in place, and even fewer have actually tested their plans to see if they will work as expected.

This need to uncover the value of disaster recovery planning and testing, as well as gain a better understanding of DR best practices to make preparedness more cost-effective and efficient, was the driving force behind the recently created Disaster Recovery Preparedness (DRP) Council. Formed by IT business, government and academic leaders to address these issues, its mission is to increase DR preparedness awareness and improve DR practices.

The DRP Council has developed an online Disaster Recovery Preparedness Benchmark (DRPB) Survey.  The survey is designed to give business continuity, disaster recovery, compliance audit and risk management professionals a measure of their own preparedness in recovering critical IT systems running in virtual environments.

Founding members of the DRP Council include:

  • Steve Kahan, Council Chairman, PHD Virtual
  • Dave Simpson, Sr. Analyst, 451 Group
  • Bilal Hashmi, Sr. Systems Engineer, Verizon
  • Michael Sink, Director Data Center Technologies, University of South Florida
  • Steve Lambropoulos, University of South Florida
  • Darren Hirons, Principal Systems Engineer, UK Health & Social Information Centre
  • Trystan Trenberth, CEO and Managing Director, Trenberth LTD
  • Riaan Hamman, CTO, Puleng Technologies
  • Carlos Escapa, Council Research Director, PHD Virtual
  • Anita DuBose, Council Research Director, PHD Virtual

“Users can now benchmark their own disaster recovery preparedness and find out real answers on how they would be able to get their IT systems up and running within a realistic time-frame to meet stringent business requirements,” said Steve Kahan, Chairman of the DRP Council. “Just 10 minutes of their time will provide them with some immediate feedback and a benchmark score that rates your DR preparedness with other companies that have participated.”

“I am unsure if our current best practices are the best or most efficient ways to deliver our SLA,” said Darren Hirons, Principal Systems Engineer, UK Health & Social Information Centre. “Learning about best practices through the Disaster Recovery Preparedness Benchmark could help us learn new ways to shorten the SLAs and deliver better service to our businesses.”

The DRPB survey provides a benchmarking score from 0-100 that measures the implementation of IT disaster recovery best practices. DRPB benchmarking scores parallel the grading system familiar to most students in North America, whereby a score of 90-100 is an “A” or superior grade; 80-89 is a “B” or above average grade; 70-79 is a “C” or average grade; and 60-69 is a “D” or unsatisfactory grade. Below 60 rates as an “F”, or failing grade.

Supporting Resources

Disaster recovery Preparedness Council:  http://drbenchmark.org/about-us/our-council/

Disaster Recovery Benchmark Test:  http://drbenchmark.org/benchmark-survey/survey-overview/

About the Disaster Recovery Preparedness Council

The DRPC is an independent research group engaged in IT disaster recovery management, research, and benchmarking in order to deliver practical guidance for how to improve Business Continuity and Disaster Recovery. www.drbenchmark.org

As a consultant, I don’t have anything but lab environments of my own that I can base responses on. If you manage a production environment though, I’d urge you to take a few minutes to complete the survey.

Cheers!