Brents IT Blog

Random thoughts by an IT GOAT


Best Practices... what not to do!

So I get a frantic call from one of the companies I assist on the side.  Yes, I help other companies for money (consulting?).  I try not to do it too often, because it tends to generate repeat business that could eventually take more time than my regular job.  It does, however, keep the skills I don't normally use in check with reality.  Anyhow, back to the story...

He states, "The mail server is down and it won't come back up."  He then follows this with the words, "It was stuck, so I rebooted it."  When I ask what happened, he also mentions that TWO of the disks showed orange status.  For those who don't know, orange status means the drives were in a failed state.  At this point, he just wants to boot it back up.  I ask him which drives were orange; he can't remember (????).  After pausing a moment, I tell him to pull out the one he thinks was bad and reboot.  He says he can't remove the drive because it was just hanging in the bay with a piece of paper holding it in place (WTF?).  After much grunting, he says the drive came out.

Now the server boots with an error message stating Active Directory can't start, then shuts down.  Safe mode doesn't work either, same message.  At this point I said, "I guess it's time I come out there."  Having never personally had to do an Active Directory restore on a live server, I thought this would be a good time to get some experience.  BTW: apparently the drive had been offline in the orange state for a few days and the backup drive had not been working, so it was crucial that the server be restored to a working state.  Luckily for them, they had another DC, so AD was not in trouble.  Best Practice:  Have a support contract for the equipment you have.  Second best, if you refuse to pay for support, keep spare parts on hand to replace the bad ones.  Best Practice:  Make sure your backups are running and that the backups are good!
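A quick way to catch a backup job that has quietly stopped is to check that the backup files exist, are not empty, and were written recently. The following is only a minimal sketch: the function name, the 24-hour threshold, and the example path are all made up for illustration, and it assumes your backup job writes to ordinary files you can see from the machine running the check.

```python
import os
import time


def stale_backups(paths, max_age_hours=24, now=None):
    """Return (path, reason) pairs for backups that are missing, empty, or too old."""
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    problems = []
    for path in paths:
        if not os.path.isfile(path):
            problems.append((path, "missing"))
        elif os.path.getsize(path) == 0:
            problems.append((path, "empty"))
        elif os.path.getmtime(path) < cutoff:
            problems.append((path, "stale"))
    return problems


if __name__ == "__main__":
    # Hypothetical path; point this at wherever your backup job actually writes.
    for path, reason in stale_backups([r"D:\Backups\mailserver.bkf"]):
        print(f"BACKUP PROBLEM: {path} is {reason}")
```

Something like this could run from a scheduled task and email the output. Keep in mind that a recent, non-empty file only tells you the job ran; the only real proof that a backup is good is an occasional test restore.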

So I get there at the pre-arranged time, server still standing strong against a working condition.  After determining which drives were which (this took some effort since it wasn't a standard config), we rebooted the server and got it booted into ADRM (Active Directory Restore Mode; officially Directory Services Restore Mode, or DSRM).  Now came the fun part.  I said, "Enter your ADRM password, you know, the one set when the server was promoted to a DC."  At this point a strange look came across his face and he stated, "Should be the admin password."  When that didn't work, I reminded him that the password was set during DCPROMO and could be something different.  I then said, "Where is that pad I told you to write them all down on?"  He said, "I transferred the contents of the pad to my email, you know, the email server."  I stared for a minute and said, "The one that we are currently working on?" to which he replied, "Yes!"  After another long stare, I said, "Maybe we should find that pad again."

Now at this point, I must remind everyone that you should SAVE your ADRM password in a password list somewhere and PRINT IT OUT, then store the printout in a safe location.  Best Practice:  The password list can be maintained easily on a locked/hidden share or a local disk/USB drive.  Make sure to also have a printout that you KEEP updated in a safe location, such as a safety deposit box, a safe at home, or another building.

So after much searching, about an hour, he says with great gusto, "FOUND IT!!!!"  I stared for a moment and said, "Now what are you going to do with that once we are done here?"  "Hide it here so no one but us knows where it is," he says.  "Fine," I say, "but just remember, if the building burns down, so does your password list."  His reaction suggested it didn't matter, so I moved on.

Yay, password works.  Logged into ADRM and compacted the AD database (in C:\Windows\NTDS, btw) using the ntdsutil tool.  No worky.  Browsed the net for a few minutes and found that eseutil will also work, which it does.  We then rechecked the file via ntdsutil.  No go.  "Oh yeah, delete the logs"... delete the logs (the database transaction logs sitting next to the database file).  Ran the compact again, then the checker... all is well.  Booted the server back up and it came up no problem.  DC syncs with the other DC and all is well again.  Oh wait, in the process of discovering the disks with orange lights, we had found the RAID 5 was also broken.  So I say, "Well, let me load up OpenManage and take a look."  OpenManage doesn't show the controller.  I run the OpenManage installer and come to find out the PERC drivers were not even correct, nor was the firmware on the RAID controller up to date.  So a few reboots later, all is good.  Best Practice:  Keep the drivers and BIOS as up to date as you can, and make sure your storage management software is also up to date and working.
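In the story the logs were simply deleted. A more cautious variant, and this is an assumption on my part rather than what was done that day, is to move them to a holding folder first so they can be put back if the repair goes sideways. The sketch below does only that; the function name and both folder paths are hypothetical, and it would only make sense while booted into restore mode with the directory service offline, since the files are locked otherwise.

```python
import shutil
from pathlib import Path


def move_logs_aside(db_dir, holding_dir):
    """Move every *.log file from db_dir into holding_dir; return the moved names."""
    db_dir, holding_dir = Path(db_dir), Path(holding_dir)
    holding_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for log in sorted(db_dir.glob("*.log")):
        # The database file itself (ntds.dit) does not match *.log and is left alone.
        shutil.move(str(log), str(holding_dir / log.name))
        moved.append(log.name)
    return moved


if __name__ == "__main__":
    # Hypothetical locations; adjust to your own server before running anything.
    names = move_logs_aside(r"C:\Windows\NTDS", r"C:\NTDS-logs-held")
    print("Moved:", names)
```

Once the compact and the integrity check come back clean and the server is replicating again, the holding folder can be removed.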

Since we didn't know which drive was bad in the mirror set, we loaded in a different drive and set it as the hot spare.  After a few moments the server attached the drive to the failed mirror and all was well in the rebuilding world.  Best Practice:  Have a Visio of your server drive setups, services, network connection assignments, and any other important information.

So yet again, another happy customer.  Note to customer: all of this could have been avoided if the disk had been replaced when it was first noted as failed.  Following best practices should not only be your goal when running an IT department or managing servers, it should be something you strive for perfection in.  Granted, it will not always save you, but it will save you time, money and your weekend(s) in the long run.