We’re having a Hackathon at our new offices in San Francisco on Wednesday, October 6. We’ll have lots of pizza and beer, plenty of outlets, and free wifi. The idea is to do some real web services hacking that night after talking about it all day at the looks-to-be-great Web 2.0 conference. If you’re a hacker who knows our API, or you’re interested in learning more about coding web services, this is a chance to hang out with our core developers as well as other leading web services developers, with the goal of fostering great new applications and tools built on those APIs. We’re also really interested in sparking further conversation and getting your feedback so that we can make Technorati more valuable for you. You don’t have to be a Web 2.0 attendee to come to the hackathon – but there are sure to be lots of attendees there.
OK, the details:
WHEN: Wednesday October 6, 2004, from 8PM – whenever!
WHERE: Technorati Offices (map) at 665 3rd Street, Suite 207, San Francisco, CA 94107 (between Brannan and Townsend Streets)
WE PROVIDE: Free pizza, beer, soft drinks; Fast WiFi, whiteboards, lots of room to hack as a group or individually, experts, help, and advice.
YOU PROVIDE: Creativity, energy, good humor, great ideas, willingness to teach and learn, readiness to hack!
IMPORTANT NOTE: Space is limited (our offices only hold so many people!), so please RSVP (email@example.com) as soon as possible to guarantee your spot! First come, first served.
Thanks to the cool guys at UltraBar, you can now have Technorati at your fingertips with the Technorati Toolbar! I’ve been using it for a while now, and it is fantastic – you can type in search queries and get back up-to-the-minute results from around the web, and you can click the Technorati talk-bubble icon from any page you’re browsing to see who is talking about that page and what they’re saying – which has made reading the news online a completely more satisfying experience. The toolbar works on all versions of Firefox past 0.9 (including the 1.0 PR that is currently available). Fantastic job, guys!
The Technorati service has now been restored following the weekend from hell. Many thanks to the Engineering and Ops teams who worked all weekend to get things back up and running quickly. Thanks as well for the kind response from folks out in the blogosphere, and from our customers and partners. We’re 100% committed to providing great service to all of you.
Onward and upward.
Ugh. What a horrible weekend. The team and I have spent the entire weekend dealing with massive data corruption caused – of all things – by an electrical fire on the main electrical line coming into our colo here in San Francisco.
The colo fire has led to a cascade of failures that has caused the Technorati service to be down for most of the weekend. It’s also giving me a lot more respect for people who build and maintain 100% uptime of services, the trials and tribulations they go through, and also the cost of being operationally excellent.
At about 9:30PM PST Friday, there was an electrical fire on the power main inside our colocation center, where our entire server infrastructure is housed. This caused our battery backup power supplies to kick in, but the independent power generator at the colo never kicked in – possibly because the problem was a fire inside the building rather than a general power blackout of the neighborhood. Well, the fire was only problem #1. We weren’t expecting or planning for an outage of that kind. It caused a cascade of other problems that made the rest of the weekend a huge PITA. Problem #2 was that we didn’t have a good enough emergency plan in place that would shut our systems down cleanly when power ran out like that. Unfortunately, that meant that when the batteries died, our server farm went down quite ungracefully – causing problem #3, which was data corruption due to the unclean shutdown.
The rest of the weekend has been spent recovering from these failures – we’ve had to run consistency checks and then rebuild the data sets that got corrupted, and we’re doing that for over a hundred machines. Bad bad bad. At least we’ve been performing regular daily backups, and we’re able to use those as the starting point on our road to recovery. The current ETA for getting services back up and running is Monday morning, which will mean a weekend of unplanned downtime.
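To give a feel for that recovery loop, here’s a minimal sketch of fanning a consistency check out across a fleet and restoring from the nightly backup only where the check fails. The host names, the check command, and the `restore-from-backup` helper are all illustrative placeholders, not our actual tooling.

```shell
#!/bin/sh
# Sketch: check each machine's data set and restore only where corrupt.

BACKUP_ROOT=/backups/daily   # assumed location of the daily backups

recover_fleet() {
    # $1 is a command invoked as "<cmd> <host>" that exits 0 when the
    # host's data set passes its consistency check (e.g. a wrapper that
    # runs myisamchk over ssh); the remaining arguments are the hosts.
    check=$1; shift
    for h in "$@"; do
        if "$check" "$h"; then
            echo "$h: clean"
        else
            echo "$h: corrupt, restoring from $BACKUP_ROOT"
            # ssh "$h" restore-from-backup "$BACKUP_ROOT"  # hypothetical helper
        fi
    done
}
```

In practice the check would be something like `check() { ssh "$1" "myisamchk --silent /data/*.MYI"; }` – any command that exits nonzero on corruption slots in.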
What we’re doing about it
Clearly, this is unacceptable, but the damage is already done, so the big question on my mind is how to learn from this outage and make sure that it never happens again. One of the important things we learned is that there’s a reason why some colocation centers are called “Tier 1” (and priced that way) and others are not. Tier 1 means that everything is overprovisioned, and plenty of infrastructure backup is already built into the place – electrical, network, fire suppression, environmental, security, etc. We had already been planning a move to a new colocation center, and this incident just underscores the need to move asap. Second, it exposed a hole in our emergency plan – we had built that plan around the threat of a short outage followed by a quick electrical recovery. We had planned for a shutdown of critical systems if battery power fell below a certain threshold (upsd for you techies out there), but we hadn’t gotten it implemented, given that we were planning the move to the new colo. That, of course, led to the data corruption that is keeping the team up all weekend.
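For the curious, the missing safeguard amounts to something like the following: poll the UPS for its battery charge and kick off a clean shutdown below a threshold. This sketch assumes the Network UPS Tools (upsd/upsc) are running with a UPS named “ups”; the names and the 25% threshold are illustrative, not our actual configuration.

```shell
#!/bin/sh
# Sketch: shut down cleanly before the UPS batteries run dry.

THRESHOLD=25   # illustrative: shut down at or below 25% charge

should_shutdown() {
    # $1: current battery charge as a percentage
    [ "$1" -le "$THRESHOLD" ]
}

# Fall back to 100% if upsc isn't reachable, so we never shut down blindly.
charge=$(upsc ups@localhost battery.charge 2>/dev/null || echo 100)

if should_shutdown "$charge"; then
    echo "battery at ${charge}%, starting clean shutdown"
    # /sbin/shutdown -h now   # left commented out in this sketch
fi
```

Run from cron or a monitoring loop, this buys the few minutes needed to flush and unmount cleanly instead of dropping the whole farm mid-write.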
Once we get past this crisis and get the service back up and running again, I guarantee that we’ll be doing a post-mortem analysis to see where the failure points were and how we can avoid them in the future. I have learned a lot from this experience about the value (and implicit cost) of planning and building systems around unreliable components, doing everything you can to eliminate risk, and planning for quick recovery when the unthinkable happens. I have a lot of respect for the folks at Google, Yahoo, eBay, and the like for their ability to build and maintain a solid world-class infrastructure, have it scale, and still innovate with new applications.
Planning for Murphy’s Law
Count me as one of the humbled. To our users and customers: Ouch this hurts, and we’re working on making sure that outages like this never happen again. To the folks down in ops and engineering at companies around the globe keeping these systems running and useful, no matter what Murphy throws at you, you’ve got my appreciation.
I’ll post as we have more updates on service status as the day progresses.
UPDATE: As of 10:00 PST Monday, the infrastructure is back online, and we’re live again, but we’re still making sure that everything is back 100%. Searches should work, and we’re monitoring our response time.
The folks at WiFi management software and services company Sputnik have just released a major software and services upgrade. Sputnik Control Center is the easy-to-use, easy-to-buy software that allows you to manage hundreds of WiFi access points as a single system: manage access control, create and deploy captive portals, track usage by AP and user, set up network policies, and much much more. Check it out: there’s a Sputnik Hotspot Kit for only $599 that includes two Sputnik AP 160s and two Sputnik Control Center licenses. It makes it really easy to become a WiFi access provider, or to install secure wireless across a company or campus.
SputnikNet enables you to run a managed wireless network without having to set up or run your own server. With SputnikNet, you get a hosted Sputnik Control Center set up just for you. Just plug Sputnik-Powered APs into broadband Internet, and manage your wireless network. You can manage as many access points and wireless networks as you like for only $19.95 per access point per month.
Congratulations, Sputnik folks! Full disclosure: I’m a founder and advisor for Sputnik, so don’t just take my word for it – go and see what others are saying. Daily Wireless has a good review, as does WiFi Networking News and WiFi Planet.
Of late I’ve been hearing more and more about “Getting Things Done” (aka GTD) and programs like Tinderbox. It seems that there’s some new management trend starting to spread. Does anyone actually use these systems? Are they useful for long-term productivity improvement? Before I investigate much further, I’m interested in hearing whether it’s just another management/productivity fad, or if there’s something real behind it. Both the book and the software seem to be generating a lot of conversation.
I was recently asked by a reporter for resources on real, concrete events where blogging had a significant effect on political events, and while a number came to mind (Trent Lott, Salam Pax, Ed Schrock, the fisking of Bush’s National Guard memos), I thought it would be great to have a timeline for all of us – so I started one. It is pretty bare-bones for now, with entries for the aforementioned Lott, Schrock, and Rathergate. Please drop by and add to or edit the page to fill out the timeline.