So, about 9pm Wednesday night, we took a power hit in our lab server room at work. This room has about 200 servers, and for some idiotic reason I’ve yet to figure out, isn’t on generator power. Each rack has a UPS though, so in theory, it should outlast a power blip. Unfortunately, the HVAC requires a manual restart. So, what’s that mean?
Closed room + 200 servers + no ventilation/cooling + 12 hrs (how long before it was brought to our attention) = 109 degrees in the COOLEST part of the room.
The door to the server room was actually warm to the touch, and pulling out the keyboard drawer made me go "ooh, that’s _hot_". This is something I’ve been through before, and I had a heads-up about it on the way to the office. It’s very similar to a mass casualty incident: you triage and move through as quickly as possible. And, well, you do usually have some casualties, as computers don’t like to work in that heat. Today’s count: 5 dead power supplies (that we know of; we will likely have some hardware issues stemming from this in the near future).
This is probably the 10th time I’ve gone through this since being there, so it’s kind of become "routine". A year ago though, when it was just J and me dealing with it, we would have a good majority back online by noon. We wouldn’t have to communicate much; we very quickly divided and conquered. Today? Well, things have changed and we have more resources. But more resources does not equal quality. I basically spent the entire day being server/hardware monkey and dealing with these "remote" resources who were trying earnestly to help…but who lack the troubleshooting expertise, or even a clue, on how to get through this. Just trying to get them to focus, instead of looking at and thinking about everything that had to be brought up, was an exercise in futility. They are good people, but it’s just not a good time to try to teach someone troubleshooting skills…at all.
There’s a very specific order that is necessary to bring everything back up and have it working…if you ignore this order, you’ll be fighting issues for a while. It’s an extremely complex/integrated system. Focus became the word of the day for these more junior "engineers", but it’s frustrating being the one telling them to focus. It’s also frustrating when they compliment you on the documentation you’ve written to cover this (like I said, we’ve been through this a few times), and yet, I still had to hold their hands and guide them through the document step by step.
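As a rough illustration of what an enforced bring-up order can look like, here is a minimal Python sketch. The service names and their dependencies are hypothetical, made up for the example and not the actual environment described here. It assumes Python 3.9 or later for the standard library's `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key lists what must already be up
# before that component is started. These names are illustrative only.
DEPENDS_ON = {
    "network": [],
    "storage": ["network"],
    "database": ["storage"],
    "app_servers": ["database", "storage"],
    "monitoring": ["network"],
}


def startup_order(depends_on):
    """Return a bring-up order in which every dependency comes first.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step, name in enumerate(startup_order(DEPENDS_ON), start=1):
        print(f"{step}. {name}")
```

Writing the order down as data rather than as prose means nobody has to remember it under pressure: the script only ever hands you something whose prerequisites are already on the list above it.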
5 PM came, and we still had very little back up. Something was just off with each environment, but never in the same way, so nothing pointed to one specific thing. I was asked to stay and handle it…and was offered more resources from our India team. I finally just declined the extra resources, started working on it myself, and had the environments back up at approximately 10 PM.
I did have to make one sanity call to an extremely knowledgeable engineer, as looking at stuff for 6 hrs and not getting different results was driving me up a wall. As we were discussing the issue and checking configuration files, I managed to stumble across a stale mount that had ultimately been the source of the issue (the config for an obscure and never-used component could not be loaded). Man, I felt like an idiot. Although, it was nice to have this engineer talk to me as a peer and actually sound impressed that I figured out that that was the issue after all.
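A check for that kind of thing could be scripted ahead of time. The sketch below is only one possible approach, and it assumes a Linux host where mounts are listed in `/proc/mounts` and a stale network mount reports `ESTALE` when touched. Neither detail comes from the incident itself, and the function names are invented for the example. Note that a badly hung mount can block `os.stat` indefinitely instead of returning an error, which this sketch does not handle.

```python
import errno
import os


def read_mount_points(mounts_file="/proc/mounts"):
    """Return the mount points listed in a Linux-style mounts file."""
    points = []
    with open(mounts_file) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) >= 2:
                points.append(fields[1])  # second field is the mount point
    return points


def classify(path):
    """Return 'ok', 'stale', or 'error: <reason>' for a single path."""
    try:
        os.stat(path)
    except OSError as exc:
        if exc.errno == errno.ESTALE:
            return "stale"
        return f"error: {exc.strerror}"
    return "ok"


if __name__ == "__main__":
    for mount_point in read_mount_points():
        status = classify(mount_point)
        if status != "ok":
            print(f"{mount_point}: {status}")
```

Run before restarting services, a check like this only prints the mounts that need attention, which is a lot faster than finding one by accident six hours in.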
Yeah, I know, long post…but I think I just needed to vent and get this out. The only good thing is…there is no way in hell I’m working tomorrow.
