vCAC 6.0.1, Inaccessible Tenants, and Missing Identity Stores

With vCAC 6.0.x, there is a bug in the SSO appliance where several symptoms all present at the same time:

  • Authentication to AD or LDAP identity stores fails, returning the user to the blank authentication screen.
  • When logged in to the default tenant as administrator (usually ‘administrator@vsphere.local’), accessing tenant identity stores results in a ‘System Exception’ error.
  • Tenant Admins cannot add or edit identity stores.

This is a documented bug, as listed in VMware KB Article 2075011, and at the time of writing there is a workaround.

The issue as documented is that the administrator account in the default tenant expires 90 days after the appliance is implemented. I came across this issue and for a while struggled with the syntax of the commands required to complete the workaround. So, here are the steps in detail that should work for others implementing this same fix.

Note: Each highlighted command needs to be typed as a single line, with a return at the end to complete the command entry.

1. SSH to the SSO server IP address. Authenticate as the SSO Root User.

2. Reset the account control flag by issuing the following commands:

/opt/likewise/bin/ldapmodify -H ldap://localhost:389 -x -D "cn=administrator,cn=users,dc=vsphere,dc=local" -W <<EOF

When you enter this command, you are not returned to the usual root prompt, but to a simple ‘>’ prompt – this is what stumped me for a while. At that prompt, enter the following commands. (Note: replace the tenant_name instances in the commands below with the name of your own tenant.)

dn: cn=tenantadmin,cn=users,dc=tenant_name

At the > prompt, enter:

changetype: modify

At the > prompt, enter:

replace: userAccountControl

At the > prompt, enter:

userAccountControl: 0

At the > prompt, enter:

EOF

You will be prompted for the LDAP password. Enter the password for the default tenant administrator (usually ‘administrator@vsphere.local’).

Once authenticated, the message ‘Response: modifying entry "cn=tenantadmin,cn=users,dc=tenant_name"’ is displayed, and you are returned to the usual root prompt.
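
For reference, the whole of step 2 can be entered in one go as a here-document. This is simply the same set of lines shown above strung together, with tenant_name still a placeholder for your own tenant:

/opt/likewise/bin/ldapmodify -H ldap://localhost:389 -x -D "cn=administrator,cn=users,dc=vsphere,dc=local" -W <<EOF
dn: cn=tenantadmin,cn=users,dc=tenant_name
changetype: modify
replace: userAccountControl
userAccountControl: 0
EOF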

3. Disable password expiration by adding the tenant administrator account to the DCAdmins group, using the following commands:

/opt/likewise/bin/ldapmodify -H ldap://localhost:389 -x -D "cn=administrator,cn=users,dc=vsphere,dc=local" -W <<EOF

When you enter this command, you are not returned to the usual root prompt, but to a simple ‘>’ prompt – this is what stumped me for a while. At that prompt, enter the following commands:

dn: cn=DCAdmins,cn=builtin,dc=vsphere,dc=local

At the > prompt, enter:

changetype: modify

At the > prompt, enter:

add: member

At the > prompt, enter:

member: cn=administrator,cn=users,dc=tenant_name

At the > prompt, enter:

EOF

You will be prompted for the LDAP password. Enter the password for the default tenant administrator (usually ‘administrator@vsphere.local’).

Once authenticated, the message ‘Response: modifying entry "cn=DCAdmins,cn=builtin,dc=vsphere,dc=local"’ is displayed, and you are returned to the usual root prompt.
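
Again for reference, here is the whole of step 3 as a single here-document – the same lines shown above, with tenant_name a placeholder for your own tenant:

/opt/likewise/bin/ldapmodify -H ldap://localhost:389 -x -D "cn=administrator,cn=users,dc=vsphere,dc=local" -W <<EOF
dn: cn=DCAdmins,cn=builtin,dc=vsphere,dc=local
changetype: modify
add: member
member: cn=administrator,cn=users,dc=tenant_name
EOF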

4. Retry vCAC login to either the default or user tenants – the problem should be resolved and the login should work as normal.

Concept Explained: Running Multiple vSphere Replication Appliances

Playing with vSphere Replication in my lab recently (following my joyous experiences with my HP MSA P2000 detailed here), I came across a small gotcha that might catch out those who (like me) learn by playing with a product whilst reading about the technical aspects of its deployment alongside.

vSphere Replication in vSphere 5.5 has been enhanced over the v5.0/5.1 releases to enable multiple replication appliances to be deployed and managed by the same vCenter instance. vSphere 5.5 allows up to 10 replication appliances per vCenter, selectable when configuring a VM for replication or available for auto-selection.

The gotcha comes when registering subsequent appliances to a vCenter after initial replication connection has been established between a remote and a local vCenter.

Concept

When configuring vSphere Replication, a single appliance (listed as ‘Embedded’ in the Web Client) is shown under vCenter Inventory > vSphere Replication > (vCenter) > Manage > vSphere Replication > Replication Servers. Additional servers can be deployed (up to a maximum of 10) using the following model:

[Image: vSR-Concept – vSphere Replication multi-appliance concept diagram]

Note: There is a single ‘embedded’ or a master appliance, which maintains the VRMS and an embedded database for the vCenter. This tracks replication for all the VMs for the vCenter instance. Additional appliances registered to the vCenter do not maintain their own databases – instead they are slave or secondary to the embedded or master appliance. (The dependency arrows might not be 100% correct in terms of direct communication, but this is just to demonstrate the principle that secondary appliances don’t maintain their own databases, instead using the embedded DB in the primary or master appliance).

Gotcha!

When deploying a second, third, or fifth appliance to a vCenter, remember:

  • DON’T deploy a second vSphere Replication appliance from the original OVA file downloaded from VMware.
  • DO use the Replication Servers screen in Web Client to register additional appliances to the existing vCenter server.
  • DO use the ‘Add On’ package downloaded alongside the original OVA file from VMware for deployment of additional appliances, not the original OVA file itself.

If you do deploy a second complete OVA master and register it with the vCenter Server, this will corrupt the entry in the vCenter Inventory for vSphere Replication, and the connection will need to be removed and reset before the two appliances can connect and communicate for the purposes of replication.

HP P2000, Controller Crashes & ESXi PSOD!

I’ve recently been dealing with an interesting issue in my lab, as a result of deciding to upgrade the entire infrastructure to vSphere 5.5GA from vSphere 5.1.

I’m lucky to have a pair of HP MSA P2000 units in my lab, connected to (amongst other boxes) some HP hosts – these items are the focus of this post. The MSAs and hosts were in the configuration as follows:

  • MSAs paired over L2 network, each connected to a different /24 network. No QoS applied to either network.
  • Hardware fully patched to the latest HP revisions.
  • MSAs paired together using HP RemoteSnap replication software (for testing VMware Site Recovery Manager).
  • Snapshots taken and replicated from ‘primary’ to ‘secondary’ every 30 minutes.
  • Single vDisk defined per site, each with a single defined volume.
  • Both sites have snap pools, as required by the RemoteSnap software.

At the time of this occurring, I was in the process of upgrading the infrastructure to 5.5. vCenter, SSO, the Inventory Service and the DBs had all been upgraded successfully, but the ESXi hosts were still on 5.1 (build 799733).

Symptoms:

The first time I noticed a problem was when one of the ESXi hosts hit a PSOD! Not seen one of those in a while, so rebooting the host brought it back to a de facto working status. Oh – all the VMs in the cluster were showing as disconnected. Strange. So, I checked the datastores – none mounted! On checking the iSCSI software initiator, the configuration was correct, so all was well there. Next thing – check the storage.

This was where things got interesting. HP MSAs have 2 administrative interfaces: the HP Storage Management Utility (SMU – a web interface) and the HP CLI (you guessed it – SSH). These interfaces are both available on both storage management controllers – leaving 4 possible administrative touch points per storage chassis. On connecting to these interfaces, only 1 would work – SSH to controller 2. Running the following useful commands via the HP CLI quickly shows the health of the unit:

show system

show controllers

show vdisks

show volumes

show snapshots

So, the controllers have crashed! What to do?

Status:

  1. ESXi host has PSOD, now rebooted.
  2. VMs disconnected, because no storage mounted.
  3. Storage unresponsive to management access.

With no options left and no management access to the MSA, I set about a fix.

Resolution:

To get out of this, 3 things needed to happen: get the storage back online, find out what caused the crash, and prevent the issue from recurring in the future.

1. Getting The Storage Online:

The only option for getting access back to the controllers was to restart the MSA from scratch – removing power completely. A graceful shutdown isn’t possible without management access, so a hard reset was the only option available. The process I followed was:

  1. Put ESXi hosts into Maintenance Mode (to prevent further damage to the LUNs in case of residual connection, and to prevent locking or disk resignature issues).
  2. Remove the MSA power cables.
  3. Reconnect after 5 minutes.
  4. Power-on the ESXi hosts.

Once the storage was back online, management access to all 4 routes was re-established, and I could immediately gain access to the SMU. Once the SMU was available, I was able to check the vDisks and make sure there was no damage to the volumes, and the integrity of the data was intact.

After the storage returned and the ESXi hosts were powered-on, I was able to recheck the iSCSI datastores were mounted – they were, and the VMs all reconnected to vCenter.

Troubleshooting:

To find out what happened to cause this issue, I needed to look at the logs. Via the SMU, there is an option to download a log bundle similar to the VMware Support Log bundle available with vSphere. Looking through these logs and the alerts / errors on the SMU, I find a couple of interesting alerts:

A106522    2014-01-26 15:40:25  107   ERROR          Critical Error: Fault Type: Page Fault  p1: 0x0281D8B, p2: 0x0283C68, p3: 0x028517F, p4: 0x028528D   CThr:dms_failove

B2054      2014-01-27 12:46:58  107   ERROR          Critical Error: Fault Type: Page Fault  p1: 0x0281D8B, p2: 0x0283C68, p3: 0x028517F, p4: 0x028528D   CThr:dms_failove

OK. Now it’s time to talk to HP! They tell me that this is to do with VAAI crashing the controllers because of erroneous snapshots that have been orphaned inside the MSA, and hardware calls from ESXi VAAI are causing BOTH the controllers to fail. (I was thinking that redundant controllers in a storage unit would prevent this type of thing, but the issue relates to a single vDisk snap pool, which both the controllers talk to…..)

Decision Time:

So now a decision needs to be taken. Do I continue with using RemoteSnap, or do I give vSphere Replication in 5.5 a go? I decide on the latter.

Remedial Actions:

So what steps did I take to make sure the issues don’t come back? Well, there are several, detailed below.

Disable VAAI on ESXi 5.1, so the commands aren’t sent to crash the controllers again.

  1. Login to vCenter. (You can also do this with ESXCLI – see the sketch after this list.)
  2. Select the Host > Configuration > Advanced Settings.
  3. Set DataMover.HardwareAcceleratedMove to '0'
  4. Set DataMover.HardwareAcceleratedInit to '0'
  5. Set VMFS3.HardwareAcceleratedLocking to '0'
  6. (No host reboot is needed for these changes.)
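
For the ESXCLI route mentioned in step 1, the equivalent per-host commands would look like this (a sketch using the same three advanced options; run against each host in the cluster):

esxcli system settings advanced set --option=/DataMover/HardwareAcceleratedMove --int-value=0
esxcli system settings advanced set --option=/DataMover/HardwareAcceleratedInit --int-value=0
esxcli system settings advanced set --option=/VMFS3/HardwareAcceleratedLocking --int-value=0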

Make changes to the vDisks to remove replication. (Note: These changes need to be made on both sites!) The HP CLI manual from HP can be found here: HP CLI Reference Manual. A sketch of the CLI commands follows the list below.

  1. Login to the SMU.
  2. Delete the replication schedule, to prevent the snapshots from being taken and sent to the remote system.
  3. Login to a controller via HP CLI with administrative / change credentials.
  4. Delete any replication snapshots. (Use ‘show snapshots’ and ‘delete snapshots’ commands). Note: if these fail, you can use the ‘force’ option.
  5. Reset the master replication volume to a standard volume. (Use ‘convert master-to-std’ command).
  6. Delete the snap pools associated with the volume. (Use ‘delete snap-pool’ command).
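
Putting steps 4–6 together, the CLI sequence looks roughly like this (the names in angle brackets are placeholders for your own replication snapshot, master volume, and snap pool):

show snapshots
delete snapshots <replication-snapshot-name>
convert master-to-std <master-volume-name>
delete snap-pool <snap-pool-name>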

Once all these have been completed, the volume should now be accessible as a standard iSCSI volume, ready to be mapped to ESXi hosts in the normal way. Hope this was helpful!

Improving vSphere Web Client Performance

With the release of the (now maturing) VMware vSphere 5.5 release, more and more operations (but not all – yet) are being migrated to the vSphere Web Client.

The vSphere 5.1 features are all fully available in the vSphere 5.5 .Net Windows client (the traditional client), along with the Site Recovery Manager and Update Manager administration functions, but any new vSphere 5.5 features are only available via the web client.

Lots is made of the performance of the web client, and having used it in my home lab and now in Production environments, I can see why some users report a perceived performance lag in the web client compared to the Windows client (population of menus, general navigation etc.). First off, a direct comparison of similar tasks shows the Web Client is slower than the Windows client, but there are a couple of things you can do as an administrator to improve the situation.

  1. Use a local browser on a server via a jump station if connecting to the infrastructure remotely. It might sound obvious, but with the Web Client using Flash, if you are connecting over home broadband, a VPN or a WAN link to your DC, then shortening the path between the browser and the vCenter server improves performance significantly.
  2. Change the Flash settings of your browser. Because of the Web Client’s reliance on Flash, there are some settings that can help improve the performance of the Flash plug-in within the browser. Changing the ‘Local Website Storage’ setting increases the temporary storage available to Flash from the default 100 KB to something higher and more performant. This setting is set low intentionally for Flash security reasons, rather than specifically for the vSphere Web Client. Fortunately, Adobe provides a simple live view of the Flash settings for your browser, to enable simple updating of the required setting.
    1. Visit:  http://www.macromedia.com/support/documentation/en/flashplayer/help/settings_manager07.html
    2. In the live view box, select your vCenter server (either by DNS or IP address) – see image below.
    3. Change the settings slider from 100 KB up to 10 MB or Unlimited (mine is set to Unlimited).
    4. Close the website and browser session.
    5. Reload the Web Client. Is performance better? It might be with usage…..
  3. Another tip is to change the Tomcat configuration on the vCenter server. VMware has a KB published on this, where they talk about the ‘Small’, ‘Medium’, and ‘Large’ infrastructure sizes we see at installation time. The change involves increasing the JVM heap size to 3GB (usually for large installations), which then impacts the vFabric tc Server on which the vCenter server is based. I have used this a couple of times for customers who have seen performance degradation on their vSphere Web Clients. A rough sketch of the heap change follows the image below.

[Image: flashwebsettings – Adobe Flash Player website storage settings panel]
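
For item 3, the underlying change is simply a larger JVM maximum heap for the vFabric tc Server on which vCenter is based. The exact file and property names vary by vCenter version and build, so treat the following as an illustration only (it assumes a standard Java Service Wrapper style wrapper.conf – check the relevant VMware KB for the correct location on your system):

# illustrative wrapper.conf entry – raise the maximum heap to 3 GB
wrapper.java.maxmemory=3072
# or, where the heap is set directly as a JVM option:
# -Xmx3072m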

Hopefully these tips are useful – and the performance of your vSphere Web Clients improves as a result!

Update: Apparently VMware Support may use blogging sites to forward information to customers! Item 2 in the list above is also listed on the virtuallyGhetto blog of William Lam (Twitter: @lamw).

vCloud Director 5.1 to 5.5 Cell Upgrade ‘cpio: chown failed’

Upgrading my lab environment from vCloud Director v5.1 to v5.5, I came across an interesting error whilst upgrading the cells. My lab has the following vCD configuration:

  • 2 x RHEL 6.2 Cells
  • 1 x RHEL 6.2 NFS Server
  • 2 x vShield load balancer instances
  • 1 x Windows 2008 R2 DB server running SQL Server 2005

The upgrade process was:

  1. Quiesce the cell using the Cell Management Tool commands (Upgrade Guide) – see the sketch after this list.
  2. Upload the vCD .BIN file to the /install directory of the cell (using WinSCP or similar).
  3. Change the execution parameters for the vCD .BIN file. (Upgrade Guide)
  4. Run the installation .BIN file. (Upgrade Guide)
  5. Confirm the existing v5.1 cell instance can be upgraded.
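
For step 1, the quiesce and shutdown commands look something like this (a sketch – ‘administrator’ is a placeholder for your vCD system administrator account, and the tool prompts for the password):

/opt/vmware/vcloud-director/bin/cell-management-tool -u administrator cell --status
/opt/vmware/vcloud-director/bin/cell-management-tool -u administrator cell --quiesce true
/opt/vmware/vcloud-director/bin/cell-management-tool -u administrator cell --shutdown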

This is where the interesting error came in. The error: ‘error: unpacking of archive failed on file /opt/vmware/vcloud-director/data/transfer: cpio: chown failed – invalid argument’.

[Image: vcd5.5cellupgrade – screenshot of the ‘cpio: chown failed’ error during the cell upgrade]

Now, because my ‘transfer’ folder is actually an NFS share exported from a third server that doesn’t host a vCD cell, I did a little digging around. I found references to 2 main things – no_root_squash and the version of the NFS export itself. On my NFS server, the export was already set with the (rw,no_root_squash) parameters, but I rebooted both the cell and the NFS server anyway. The other idea was that there were potential issues with NFSv4 exports. So, I changed the mount version in /etc/fstab to NFSv3 using the following fstab line entry:

 <NFS Server IP>:nfs    /opt/vmware/vcloud-director/data/transfer/    nfs    rw,vers=3    0    0

Save the changes to /etc/fstab, reboot the cell, and retry the cell upgrade using the .BIN file from earlier.
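
For completeness, the matching export on the NFS server side would look something like this (a sketch – the exported path and the client network are placeholders for your own environment):

# /etc/exports on the NFS server
/nfs    192.168.0.0/24(rw,no_root_squash,sync)

# re-read the exports file after editing
exportfs -ra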

With the export set as NFS v3, the upgrade should be successful and the cell upgrade can proceed.