Ugh. What a horrible weekend. The team and I have spent the entire weekend dealing with massive data corruption caused - of all things - by an electrical fire on the main electrical line coming into our colo here in San Francisco.
The colo fire led to a cascade of failures that has kept the Technorati service down for most of the weekend. It's also giving me a lot more respect for people who build and maintain services with 100% uptime, the trials and tribulations they go through, and the cost of being operationally excellent.
At about 9:30PM PST Friday, there was an electrical fire on the power main inside our colocation center, where our entire server infrastructure is housed. This caused our battery backup power supplies to kick in, but the independent power generator at the colo never kicked in - possibly because the problem was a fire inside the building rather than a general power blackout of the neighborhood. Well, the fire was only problem #1. We weren't expecting or planning for an outage of that kind. It caused a cascade of other problems that made the rest of the weekend a huge PITA. Problem #2 was that we didn't have a good enough emergency plan in place that would shut our systems down cleanly when power ran out like that. Unfortunately, that meant that when the batteries died, our server farm went down quite ungracefully - causing problem #3, which was data corruption due to the unclean shutdown.
The rest of the weekend has been spent recovering from these failures - we've had to run consistency checks and then rebuild the data sets that got corrupted, and we're doing that for over a hundred machines. Bad bad bad. At least we've been performing regular daily backups, and we're able to use those as starting points on our road to recovery. The current ETA for getting services back up and running is Monday morning, which will mean a weekend of unplanned downtime.
What we're doing about it
Clearly, this is unacceptable, but the damage is already done, so the big question on my mind is how to learn from this outage and make sure that it never happens again. One of the important things we've learned is that there's a reason why some colocation centers are called “Tier 1” (and priced that way) and others are not. Tier 1 means that everything is overprovisioned, and there's plenty of infrastructure backup already built into the place - electrical, network, fire suppression, environmental, security, etc. We have been planning a move to a new colocation center, and this incident underscores the need to move ASAP. Second, it exposed a hole in our emergency plan - we had built the plan around the threat of a short outage followed by a quick electrical recovery. We had planned for a shutdown of critical systems if battery power fell below a certain threshold (upsd for you techies out there), but we hadn't gotten it implemented because we were planning the move to the new colo. That, of course, led to the data corruption that is keeping the team up all weekend.
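For the techies, the threshold-triggered shutdown mentioned above can be sketched roughly like this in Python. This is only an illustration, not Technorati's actual setup: `UpsReading`, `should_shut_down`, `run_clean_shutdown`, the step names, and the 30% threshold are all hypothetical, and the code does not talk to upsd or any real UPS. It just shows the decision logic: act only when running on battery and the charge has fallen to the threshold, then run the stop steps in a fixed order.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class UpsReading:
    """One sample of UPS state (hypothetical structure, not a real UPS API)."""
    on_battery: bool       # True when mains power has been lost
    charge_percent: float  # remaining battery charge, 0 to 100


def should_shut_down(reading: UpsReading, threshold_percent: float = 30.0) -> bool:
    """Return True only when on battery AND charge is at or below the threshold."""
    return reading.on_battery and reading.charge_percent <= threshold_percent


def run_clean_shutdown(
    reading: UpsReading,
    stop_steps: List[Tuple[str, Callable[[], None]]],
    threshold_percent: float = 30.0,
) -> List[str]:
    """Run each stop step in order if the threshold is crossed.

    Returns the names of the steps that ran (empty if no shutdown was needed).
    """
    if not should_shut_down(reading, threshold_percent):
        return []
    completed = []
    for name, step in stop_steps:
        step()  # e.g. stop writers, flush databases, halt the host
        completed.append(name)
    return completed


if __name__ == "__main__":
    log: List[str] = []
    # Assumed ordering: stop incoming writes first, flush data, halt last.
    steps = [
        ("stop_writers", lambda: log.append("writers stopped")),
        ("flush_databases", lambda: log.append("databases flushed")),
        ("halt_host", lambda: log.append("host halted")),
    ]
    ran = run_clean_shutdown(UpsReading(on_battery=True, charge_percent=25.0), steps)
    print(ran, log)
```

In a real deployment the reading would come from the UPS monitoring software, and the threshold would have to leave enough battery runtime for the slowest step to finish on every machine.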
Once we get past this crisis and get the service back up and running, I guarantee that we'll be doing a post-mortem analysis to see where the failure points were and how we can avoid them in the future. I have learned a lot from this experience about the value (and implicit cost) of planning and building systems around unreliable components, and doing everything you can to eliminate risk - and also about planning for quick recovery when the unthinkable happens. I have a lot of respect for the folks at Google, Yahoo, eBay, and the like for their ability to build and maintain a solid world-class infrastructure, have it scale, and still innovate with new applications.
Planning for Murphy's Law
Count me as one of the humbled. To our users and customers: Ouch, this hurts, and we're working on making sure that outages like this never happen again. To the folks down in ops and engineering at companies around the globe keeping these systems running and useful, no matter what Murphy throws at you, you've got my appreciation.
I'll post more updates on service status as the day progresses.
UPDATE: As of 10:00 PST Monday, the infrastructure is back online, and we're live again, but we're still making sure that everything is back 100%. Searches should work, and we're monitoring our response time.
Posted by dsifry at September 26, 2004 02:19 PM