Brents IT Blog

Random thoughts by an IT GOAT


VMware and SANs - Major VMware Flaw

Many, many shops around the globe are using VMware now and have been for a while.  A few months ago I posted a blog entry about a flaw in VMware version 4.0 that required you to update to version 4.0 Update 1 to fix it.  Well, I don't think it's really fixed.  So here is the story of the last few days of my life.  If you have a SAN and use VMware, I highly suggest you read it and take note.

Day 1 Sunday - Received notification from our SAN system that a hard drive in one of the SAN units had failed.  No big deal, it happens.

Day 2 Monday - Told the junior admin to call HP and order a new drive for the LeftHand SAN unit with the issue.

Day 3 Tuesday - Still no hard drive.  It was supposed to arrive within 4 hours, and there was nothing as of noon.  4pm rolls around, and a second drive in the same SAN unit kicks offline.  At that moment, 7 LUNs go offline.  The LUNs that go offline hold duplicated data that does not need to be kept online and can be restored from backup or from an online live copy (reporting/archive data from live databases).

After half an hour, 2 of my VMware hosts are reporting as "not responding".  I call VMware; they stare at it for a bit, then decide to disconnect a host and bring it back.  The host doesn't come back.  By now half my virtuals are showing as not responding.  VMware brings on a storage specialist, who determines that the secondary LUNs that went offline are causing ALL of our VMware hosts to show up as not responding.  He then tells me I have to either disconnect the LUNs from the host side, or hide the LUNs and reboot the hosts.  Well, I would LOVE to do that, but because the hosts are not responding, I can't access them to remove the LUNs, NOR can I vMotion the virtuals off the hosts!  So it's either restore the LUNs or hard reset all the hosts.  Meanwhile some of the virtuals go into a nonresponsive mode as well, so I can't even issue a shutdown command remotely.

So bottom line on this: DON'T let your LUNs go offline.  Sure, I can prevent all SAN failures, no problem.... give me a break.  This issue is a HUGE blunder in their programming department.  I can literally tap into someone's VMware environment, unplug a single SAN unit and watch the whole thing crash and burn.  This was NOT the case in VMware 3.5.
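
If you want a rough idea of how exposed your own environment is, here is a minimal Python sketch of the check I wish I had run beforehand.  It is not a VMware tool and it does not talk to vCenter or the SAN.  The host names, LUN names and the hosts_exposed function are all made up for illustration, and you would have to fill in the inventory from your own records.  All it does is list which hosts still have a given set of LUNs presented to them, so you can see what would be affected if those LUNs dropped.

```python
from typing import Dict, Set


def hosts_exposed(presented: Dict[str, Set[str]],
                  offline: Set[str]) -> Dict[str, Set[str]]:
    """For each host, return the offline LUNs it still has presented.

    Hosts with no offline LUNs presented are left out of the result.
    """
    exposed = {}
    for host, luns in presented.items():
        hit = luns & offline  # LUNs this host sees that are offline
        if hit:
            exposed[host] = hit
    return exposed


# Hypothetical inventory: which LUNs are presented to which host.
presented = {
    "esx01": {"prod-db", "archive-01", "reporting-01"},
    "esx02": {"prod-db", "archive-01"},
    "esx03": {"prod-db"},
}

# Hypothetical set of LUNs that have dropped (or that you could afford to lose).
offline = {"archive-01", "reporting-01"}

for host, luns in sorted(hosts_exposed(presented, offline).items()):
    print(host, sorted(luns))
```

In my case every host had the secondary LUNs presented, so a check like this would have flagged all of them.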

Day 4 Wednesday - Called HP, who called Dell, who then couriered us a hard drive to try and rebuild the downed SAN unit.  After a few hours of trying, no luck; the SAN was dead.  Guess what: since the SAN is now gone, so are the LUNs!  So I had to delete the LUNs from the cluster and then blindly reboot all the hosts via SSH.  While this did work, some of the LUNs still ended up stuck in an inactive state.  It also caused one of my databases to be corrupted when the virtual was hard reset.  Luckily everything else came up and our medical side was able to function for the day.

The original hard drive, the one that was supposed to arrive within 4 hours on Monday, finally showed up Wednesday by noon.  If the hard drive had been received on Monday or even Tuesday morning, the second drive failure may not have caused the whole catastrophe.  Who's to say it wouldn't have failed during the first drive rebuild, but at least the rebuild would have had a head start.  SAN unit still offline.

Day 5 Thursday - Was able to restore the accounting database with no data loss.  Whew.  The department as a whole lost two days of entry work though.  SAN unit still will not re-image.  Should get a new controller first thing tomorrow morning.  If that doesn't work, I think I will demand HP send me a replacement unit.  The amount of productivity lost due to their inability to honor a 4 hour window certainly warrants the cost.  Plus it's been down for three days; if we lose another unit, everything goes offline and I am back to restoring things.  I suppose I could turn on 3-way replication for a few of the more critical shares, but who would have thought the thing would be offline for this long.

Device failures and downtime such as noted above could be avoided by spending more money, but in our environment they are few and far between.  If my budget were larger, maybe this wouldn't have caused such a big problem.  Of course, if VMware would fix this and realize the stupidity of their requirement, maybe no one else would have to deal with this either.  I will also note that while I love the HP LeftHand units, I DO NOT love how HP and Dell have interacted on hard drive/hardware replacements for older LeftHand units that were built using Dell servers (before HP bought LeftHand).  This was not the first time they took more than 4 hours to get a drive to us; last time it took a week and I yelled at everyone I could get to talk to me on the phone or at conferences.  I was also assured that it would not happen again.... well, it did, and we lost data.  I guess I have a mission now for the HIMSS conference.

BTW: I didn't sleep from Tuesday morning till Wednesday at 10pm.  What a marathon.  Oh, and our lawyers are now looking at the possibility of going after Dell and HP for failing to fulfill contractual requirements.  I would prefer they just give us HP replacement SANs so we don't have to worry about navigating the HP/Dell relationship.  Oh, and I should mention that I have never had an issue with Dell not getting me anything within the four hour window when I had to deal with them directly.