Manual vCNS / vShield Edge HA Little Gem!

Recently, I have been doing lots with vCNS and manual creation / manipulation of vShield Edge devices (posts coming soon). One tiny little thing drove me crazy and prompted me to write this quick Little Gem – ‘Edge HA’ sat on my to-do list and gloated at me…

When creating a manual vShield Edge device in vCNS, there is the usual opportunity to create a pair of appliances and run them in High Availability mode. Trouble is, the options for deployment are limited and not very clear. (This might be clear / obvious to some, but it wasn’t to me!)

When creating an HA pair in the vShield Manager console (editing the Edge device in question under Settings – HA Configuration), there are few options: essentially ‘Enabled’ or ‘Disabled’, vNIC, Declared Dead Time and Management IPs. Here’s where my confusion lay: Management IPs. So many questions…!

The Management IPs option gives little away: two IP entry boxes, and the note text: ‘You can specify a pair of IPs (in CIDR format) with /30 subnet. Management IPs must not overlap with any vnic subnets.’

OK, so I need Management IPs to manually create an HA pair. What /30 address range do I need to specify? Can the IP range share an existing vNIC, or does the Edge device need another interface or uplink? Where do I define the /30 addresses? Do they need their own VLANs? Must I create a whole new private address range specifically for the HA heartbeat? Like I said – so many questions. Scouring the documentation and Googling ‘vShield Edge Management IPs’ produced no helpful results. So – to the LAB!

Turns out, you don’t need Management IPs at all. Simply change the HA Status to ‘Enable’, select a vNIC to support HA heartbeat, and add a second Edge appliance via the green plus symbol (it will prompt for the parameters) to deploy the HA pair! When both report as ‘Deployed’, HA is configured and your Edge device is protected.
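
If you’re doing this programmatically rather than through the vShield Manager UI, the same settings can be pushed at the vShield Manager REST API. This is only a rough sketch from my lab notes: the endpoint path, XML element names, hostname, credentials and edge ID below are illustrative placeholders, and the exact API version differs between vCNS / vShield releases, so check the API guide for your build.

# Rough sketch only - endpoint path and XML schema are illustrative and vary by
# vCNS / vShield version; verify against the vShield API guide before using.
# Assumes the vShield Manager certificate is trusted by the client.
$vsm  = "https://vshield-manager.lab.local"    # placeholder vShield Manager address
$edge = "edge-1"                               # placeholder Edge ID
$cred = Get-Credential                         # vShield Manager admin credentials

# HA config mirroring the UI: enabled, heartbeat vNIC, dead time - no Management IPs
$body = @"
<highAvailability>
  <enabled>true</enabled>
  <vnic>1</vnic>
  <declareDeadTime>15</declareDeadTime>
</highAvailability>
"@

Invoke-RestMethod -Method Put -Uri "$vsm/api/3.0/edges/$edge/highavailability/config" `
    -Credential $cred -ContentType "application/xml" -Body $body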

Sigh. Like I said. This might seem obvious to some, but it wasn’t to me. ‘Edge HA’ is no longer on my to-do list!

HA Failover Errors in vSphere

In a production system running vSphere 5.1, I’ve noticed that on occasion an error appears at cluster level about ‘HA initiating virtual machine failover’.


Checking the host and vCenter logs shows no host reboots or isolation. In this case, fortunately, there is a simple fix that seems to rid the cluster of the spurious error message.

Disable HA on the cluster (right-click the cluster > Edit Settings > uncheck the HA checkbox). The hosts will automatically disable HA as part of the cluster settings, with no effect on the running VMs.

Re-enable HA on the cluster (reverse the settings above). The cluster will hold an election, elect a new HA master and configure the remaining hosts as slaves.
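
If you’d rather not click through the UI, the same disable / re-enable cycle is a couple of lines of PowerCLI. A quick sketch – the vCenter and cluster names are placeholders for your own environment:

# Rough PowerCLI equivalent of the UI steps above
Connect-VIServer vcenter.lab.local            # placeholder vCenter name

# Step 1: disable HA on the cluster - running VMs are unaffected
Get-Cluster "Production-Cluster" | Set-Cluster -HAEnabled:$false -Confirm:$false

# Step 2: re-enable HA - the cluster holds an election and reconfigures the hosts
Get-Cluster "Production-Cluster" | Set-Cluster -HAEnabled:$true -Confirm:$false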

Voila! The error should now disappear and your cluster should be back to full health.

Finding HA Primary Nodes

A question came up in yesterday’s “Chad’s Choice” webcast about choosing which hosts in a cluster would be configured as HA primary nodes. I’m not going to go into any great detail here about what HA primary nodes are, because there is a more comprehensive article on HA freely available over on Duncan Epping’s Yellow Bricks blog.

The short answer to whether or not you can choose HA primary nodes is a simple “no”. It’s not possible.

Things are rarely simple though. Technically it is possible (again see Duncan’s HA deepdive page for details) but, and this is important, manually choosing HA primaries is not supported – even experimentally.

The good news though for anyone who wants to know which hosts are their HA primaries is that there is now a dead simple way to find out. As of PowerCLI 4.1.1 there is a nice new cmdlet available. Getting a list of HA primaries is as simple as:

Get-HAPrimaryVMHost -Cluster <Cluster Name>

It’s not the speediest of cmdlets, but it does work.
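
If you want to sweep every cluster in one go, something along these lines works (the vCenter name is a placeholder):

# List the HA primaries for every cluster in the vCenter (PowerCLI 4.1.1 or later)
Connect-VIServer vcenter.lab.local            # placeholder vCenter name

foreach ($cluster in Get-Cluster) {
    Write-Host "Cluster: $($cluster.Name)"
    Get-HAPrimaryVMHost -Cluster $cluster | Select-Object -ExpandProperty Name
}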

HA Agent on ESX-HOST in cluster CLUSTER-NAME has an error

If I wanted to, this could be a very big post all about configuring HA correctly. But I don’t want to reinvent the wheel. Instead I just want to share my experiences with this error:

HA Agent on ESX-HOST in cluster CLUSTER-NAME has an error

Odds are that you will eventually see this one pop up in vCenter for one of your ESX 3.x hosts. If you’re not sure what it means, the translation is basically that the host displaying the error could fail and the VMs running on it probably won’t get restarted automatically on another host. Essentially, HA is broken on the host.
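
While you dig into the root cause, the quickest thing to try is the same ‘Reconfigure for HA’ task you get by right-clicking the host in the VI Client. It can also be kicked off from PowerCLI via the underlying ReconfigureHostForDAS API method – a rough sketch, with the vCenter and host names as placeholders:

# Rough sketch: trigger 'Reconfigure for HA' on the affected host from PowerCLI
Connect-VIServer vcenter.lab.local            # placeholder vCenter name

$vmhost = Get-VMHost "esx-host.lab.local"     # placeholder for the host showing the error
$vmhost.ExtensionData.ReconfigureHostForDAS() # same as the right-click task; a _Task variant also exists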