Brents IT Blog

Random thoughts by an IT GOAT


When the cloud comes crashing down....

So I receive a frantic text message on Sunday night saying that a user could not log into our remote site.  Now I am with my mother and aunt, chatting the night away, waiting for them to leave for my aunt’s house.  I stare over at the message, perplexed, thinking that the DNS issue that occurred last week was happening again.  Last week, for some unknown reason, our main web domain was wiped off the face of all the Brighthouse DNS servers, so none of our users with Brighthouse internet could access our .com domain.  Luckily our .org domain was working, so I had a workaround for those who complained, and Brighthouse fixed the issue within a few hours.  Now, usually when they fix something, it’s fixed, so it was odd that it might be happening again on the same day a week later, but who knows, there could have been a larger issue at play.
So before I called Brighthouse, I checked my phone: the site was working.  Checked my netbook: the site was still there and accessible, but I couldn’t log in because I needed the Citrix plug-in.  So I ran over to my main PC and accessed the site from there.  Still up, no DNS issue.  Relief swept over me thinking it was user error or something on the user’s side, and then I tried to log in.  The login was a no go.  Perplexed by the message stating it could not authenticate, I began thinking there was an issue with the network.  We have three AD domain controllers in our cloud; what could have possibly happened to make them unreachable besides the network?  At this point I realize that an office visit is necessary and, since I only live about 5 minutes from the office, it’s best to hit this head on now before 500 people log in (or not) tomorrow morning starting at 6am.
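For what it’s worth, a quick way to confirm that kind of ISP DNS problem is to resolve the same name against the ISP’s resolver and a public one and compare the answers.  Here is a rough sketch using the dnspython package; the domain and resolver IPs are placeholders, not our real ones, and older dnspython versions call the lookup method query() instead of resolve().

```python
# Compare how different DNS servers answer for the same domain.
# Placeholder values only -- swap in your own domain and your ISP's resolver IP.
import dns.resolver  # pip install dnspython

DOMAIN = "example.com"
RESOLVERS = {
    "ISP resolver": "203.0.113.53",   # placeholder for the ISP's DNS server
    "Google public DNS": "8.8.8.8",
}

for label, server in RESOLVERS.items():
    r = dns.resolver.Resolver(configure=False)
    r.nameservers = [server]
    try:
        answers = r.resolve(DOMAIN, "A", lifetime=5)
        print(f"{label}: {', '.join(a.to_text() for a in answers)}")
    except Exception as exc:
        print(f"{label}: FAILED ({exc})")
```

If the public resolver answers and the ISP’s doesn’t, you have your smoking gun.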
I arrive at the office to discover the alarm going off.  Not good.  I clear the alarm and codes, call the alarm company, and verify the alarm cancellation, which, incidentally, the maintenance department had cleared a few hours earlier.  Now my interest is piqued: these two things must be related, but how?  So I make my way through the building, through many a locked door, to the hall where our server room is located.  At this point I hear what sounds like a jet engine and more beeping than a power plant going into meltdown.  Well, my heart drops, and my hopes of returning home quickly are dashed by the chaotic flashing of random orange and red lights.

My first thought was that the room had suffered a fire, but as I opened the door, the heat wave indicated something far worse: a cooling failure.  At this point, it’s time to cool the room as quickly as possible by any means necessary.  I employed three large fans (the main A/C for the building was still working), checked the secondary and tertiary A/C units, then pushed them into service (the tertiary was not running for some reason).  I then called the maintenance folks and told them to get the A/C guys over, stat.  Keep in mind it’s about 10:30pm EST; that’s like trying to find a lawyer after 5pm.  Luckily this is a populated area and they were on their way.
Now, while I wait, it’s time to survey the damage.  All of our IBM and HP servers had auto-shutdown, and that included three of our SAN units.  Our Dell servers were still online despite the 115 degree room temperature; maybe they have a higher tolerance, given the noise coming from their humming fans.  Anyhow, I begin to lose it: while our SAN cluster is redundant, it can only lose two of the six units, not three.  The three that seemed to still be online were Dell units LeftHand sold us shortly before HP bought them.  Worst case, I spend the next day rebuilding the entire cloud and restoring files; best case, I power them back up and everything works.  I quickly called HP to verify what would happen, and they provided some comfort, saying that the cluster would either come online or not work at all.  While I calmed down just a little bit, the news was not _that_ reassuring.  Luckily we had switched out our hard drives for USB sticks when we changed over to ESXi, so I was 99% sure our VMware host cluster would be fine.

As for the main A/C unit being offline, it turned out that a wire on the roof had somehow been fried during a surge.  Given that it’s generator-fed, I was a bit surprised by this, but was glad it would be a quick fix.  By the time they had it back online, our secondary and tertiary units had cooled the room down to 90 degrees, and with the help of the main unit we were back in the 70s within a few minutes.  Turns out our tertiary A/C unit had been turned off because it ignored the temperature gauge that was supposed to keep it off until the room hit 80 degrees and just stayed on.  To make matters worse, our secondary unit had sprung a leak in the ductwork and was cooling the attic nicely.  All three issues led to my current predicament.  At this point I told the maintenance guys that they needed to link one of the main A/C unit’s air vents into the server room as a fourth A/C backup.  Granted, the building A/C won’t help that much, but maybe it will buy us a little more time in the event of another failure. 
Now, for those of you at home following along, I am sure at this point you are screaming that we need a sensor in the room to report A/C failures quickly.  Yes?  Well, we did have a sensor in the room and it was working.  It had alerted the alarm company of the issue, and they contacted maintenance, who then said it was a false alarm and CANCELED the alarm.  Wasn’t that nice of them?  Of course, it’s not completely maintenance’s fault: the alarm company had a data failure the year before and LOST all the notes and the contact tree I had provided should such an event occur.  So when they called the maintenance folks, they never told them it was an A/C failure, because they had lost the note telling them what that zone was.  So, bottom line, as a backup to our alarm sensor I have purchased a second sensor that reports directly to I.T.  It also checks for fire, moisture, humidity, and voltage.  Now my phone will light up like a Christmas tree if something like this happens again, which it had better not, or at the very least never get to the point where servers shut down.
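To give an idea of what “reports directly to I.T.” looks like, here is a rough sketch of the kind of polling script you could pair with one of these sensors.  The sensor URL, threshold, mail server, and addresses are all made up for illustration; most network environment sensors expose their readings over HTTP or SNMP, so adjust to whatever yours provides.

```python
# Poll a (hypothetical) server-room sensor and email I.T. when it gets too hot.
import json
import smtplib
import time
from email.message import EmailMessage
from urllib.request import urlopen

SENSOR_URL = "http://sensor.example.local/status.json"  # hypothetical endpoint
ALERT_TEMP_F = 85   # scream well before servers start shutting themselves down
ALERT_TO = "it-oncall@example.org"

def send_alert(subject, body):
    msg = EmailMessage()
    msg["From"] = "serverroom@example.org"
    msg["To"] = ALERT_TO
    msg["Subject"] = subject
    msg.set_content(body)
    with smtplib.SMTP("mail.example.org") as smtp:
        smtp.send_message(msg)

while True:
    try:
        reading = json.load(urlopen(SENSOR_URL, timeout=10))
        temp = reading.get("temp_f", 0)
        if temp >= ALERT_TEMP_F:
            send_alert(f"Server room at {temp}F",
                       "A/C may be down -- check the room before the servers shut themselves off.")
    except Exception as exc:
        send_alert("Server room sensor unreachable", str(exc))
    time.sleep(60)  # poll every minute
```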

So, back to the servers and SANs.  I powered them all on and waited for what felt like 2 hours (actually 20 minutes).  The servers came back up, a few with orange lights, most without.  I then waited... a few more minutes, and then decided to check Virtual Center.  It was a no go; I could not start the service.  I also could not log into some machines, and it got me thinking: all of our DCs are in the cloud (which should never fail as long as we keep the generator going).  So I made a mental note to add another DC ASAP, but this time a physical box.  To VMware’s credit, they did warn me, and it was in the back of my mind to do; I just had not done it because we have limited resources in the manpower area.

Anyhow, I checked the SAN environment: no issues, just a few LUNs rebuilding as expected.  When those completed, I booted up the VMware hosts.  About 20 minutes after they booted, most of the VMs started coming online.  Luckily they did boot (I was worried about that); unfortunately, they were all on the same host.  Apparently DRS had moved all the servers off the other hosts as they powered down and onto our newest and most powerful host, which apparently had just enough heat tolerance to hold out until everything had been transferred to it.  This didn’t help, since several databases were in a stopped state due to lack of resources.  Well, I put Big Bertha into maintenance mode and the VMs scattered back to a more normal state.  Unfortunately, the heat had somehow damaged the memory in the big server, so I had to shut it back down and reseat the RAM modules.  That worked, and it was back online with its full RAM count about 10 minutes later.
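As a side note, the quickest way I know to spot that kind of DRS pile-up is to list how many VMs each host is carrying.  Below is a minimal sketch using the pyVmomi library; the vCenter address and credentials are placeholders, and this is just the generic inventory walk, not our exact tooling.

```python
# Count VMs per ESXi host via pyVmomi to spot a DRS pile-up on one box.
# Placeholder vCenter address and credentials -- lab use only (cert check disabled).
import ssl
from collections import defaultdict
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.local", user="admin",
                  pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    by_host = defaultdict(list)
    for vm in view.view:
        host = vm.runtime.host.name if vm.runtime.host else "(no host)"
        by_host[host].append(vm.name)
    for host, vms in sorted(by_host.items()):
        print(f"{host}: {len(vms)} VMs")
finally:
    Disconnect(si)
```

If one host shows up carrying nearly everything, put it into maintenance mode and let DRS spread the load back out, which is exactly what happened here.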

I hope my story here helps point out some weaknesses and workarounds for those who are considering a cloud of their own.  Remember, in I.T. it’s not about how or what, but when.

BTW: we have the main A/C feeding into the room now, the other units are fully functional, and we have a physical DC; now I am just waiting for my new sensor to arrive.