Well, Evan Williams announced it from stage at Live from the Blogosphere, and Dan Gillmor breaks the story in the Merc. Congratulations, Evan, Jason, and the whole Pyra team. Pyra is the company that runs Blogger, for those of you who didn't know the connection.
Technorati Anywhere!
I whipped up a little tool this evening that I thought y'all might like. It's called the Technorati Anywhere! bookmarklet. What it does is simple - the bookmarklet opens a new window or tab in your browser with a list of links (and short excerpts) from people who link to the page you're viewing. It's a new way of instantly checking sources and assessing the credibility of the page you're currently viewing. If you don't like your browser opening new windows or tabs, here's a version of Technorati Anywhere! that shows the results in the current window.
It works with recent versions of IE, Netscape, Mozilla, and Safari. The only requirement is that you have Javascript enabled.
To install it, do the following, depending on the browser you have:
For IE 5.5 and above users:
Right click on one of the links above, and then click on "Add to Favorites..." Since there's some Javascript in the bookmark, IE will tell you that "You're adding a favorite that may not be safe." Click OK, and you're done - it's in your favorites.
For Netscape 6 and above and Mozilla 1.0 and above users:
Drag the link into your bookmark bar. Alternatively, you can right click on the link and then click on "Add bookmark".
For Safari users:
Drag the link into your bookmark bar.
That's it! I hope you enjoy it. I like it better than the Technorati Sidebar I created a while back, and the instant gratification of being able to find out the Link Cosmos for any page on the web gives me a much richer experience. Send me feedback and leave comments below - I'd love to hear from you about its usefulness, bugs, new features, whatever.
Update: Emmanuel M. Decarie sent in the update to get everything working with Safari. Thanks, Emmanuel, that was quick!
Breaking the (power) law
Clay Shirky has gotten a lot of people talking about power laws and how they relate to the blogosphere. Dave Winer and others disagree, and there's a bunch of other interesting conversations going on as well. What's interesting is that both Clay and Dave are right, depending on how you look at things. Dave sees the world from the microeconomic point of view - it is really easy to create new communities with blogs, and there is no scarcity of links. Clay looks at things from the macroeconomic view, seeing overall patterns of thought leaders evolving as people look for editors and subject matter experts to help guide them through all the links. Clay is also right in his observation that blog linking will tend to follow a power law - that is, a small proportion of bloggers will get a huge number of incoming links across the blogosphere, as the Technorati Top 100 will attest. However, what Clay doesn't emphasize is that blogging communities, even though they have some lampposts, tend to form into small open communities. That's why my blogroll looks very different from Glenn Reynolds', or Doc Searls', or Joel Spolsky's. Even though we might all have a few bloggers in common, most of the links are different. In other words, the blogging space has a high degree of dimensionality.
I thought about the problem that this presented to a traditional link engine. When you rank bloggers simply by the number of people who link to them, you get a very static list of "a-list" bloggers, as shown by the Technorati Top 100. What I wanted to do was to break that power law, and give more exposure to the lesser known, but still interesting bloggers, especially on days when they stand out and do something interesting.
I think I've found a way to do that, and it all boils down to the fact that Clay described a power law.
The Technorati Top 100 ranks based on a linear relationship of the incoming links to a blog. A linear equation looks like this:
y = ax + b
As we all know, that leads to a boring top 100 page.
So, I started playing around with the ranking algorithm. Now, a power law looks something like this:
y = ax^2 + bx + c
Or, a more skewed graph looks like:
y = ax^3 + bx^2 + cx + d
Remember high school algebra? I'm sorry to make your brain hurt. The key point to remember is that equations that follow a power law start to get really big really fast as you increase x. You start to get a graph that looks like a parabola.
What I wanted to do was to give some of the lesser-known bloggers some visibility. The way I did this was to invert the power law when I did my rankings. I looked at two variables: The number of new inbound links to a blog, and the current number of total blogs already linking to it.
In order to reverse the power law, I used the following as the ranking algorithm for the Interesting Newcomers page:
n = Number of new inbound links
c = Current number of inbound blogs (as of the day before)
n^3 / (c + n)^2, where c > 30
And I'm using a quadratic equation for the "Interesting Recent Blogs" page:
n^2 / (c + n)^2, where c > 40
The results are very interesting.
What the ranking algorithms described above do is make it progressively harder to move up in ranking as the number of current inbound blogs increases. This effectively negates the power law that Clay describes, and gives us a way of measuring apples to apples.
Basically, the idea is that for a relatively obscure blogger who has, say, 40 people currently linking to his blog, getting 4 or 5 new blogs linking to him can have the same effect as an a-list blogger getting 40 or 50 new links.
Intuitively, we know that this is right - after all, it's very easy for Doc Searls to get 20 new links; he has such a large readership. But for a smaller blogger to get a bunch of new links, he must have posted something really interesting that day.
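As a sketch (my own illustration, not Technorati's actual code), the two formulas can be written out like this; the a-lister's numbers below are made up, chosen to show the scores landing in the same ballpark:

```python
def score_newcomer(n, c):
    """Cubic ranking for the Interesting Newcomers page.
    n = new inbound links, c = current inbound blogs (as of the day before)."""
    return n**3 / (c + n)**2 if c > 30 else 0.0

def score_recent(n, c):
    """Quadratic ranking for the Interesting Recent Blogs page."""
    return n**2 / (c + n)**2 if c > 40 else 0.0

# An obscure blogger with 40 inbound blogs picks up 5 new links...
small = score_newcomer(5, 40)      # ~0.062
# ...and scores in the same ballpark as an a-lister with ~1250 inbound
# blogs who picks up 50 new links (made-up numbers for illustration).
big = score_newcomer(50, 1250)     # ~0.074
```

Note how the thresholds kick in: below 30 (or 40) current inbound blogs, the score is simply zeroed out.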
One more point - why does c have to be greater than 30 or 40? Well, there are two reasons for this. First, the equation doesn't work well when the number of current incoming blogs is very small - someone who has only one person linking to him can jump right to the top of the ranking if he gets one or two new links, and that's not very interesting for us. Second, setting the bar at a certain level ensures that the blogger in question actually has an audience. The audience may be small, but at least some people are linking to him, which is a good way to knock out the cruft at the real tail of the power curve.
These equations probably aren't perfect - I haven't done any curve fitting or formal statistical analysis to make sure that they are correct; I'm just using my holistic "feels good" barometer. The power law may not be a quadratic or cubic relationship - it could be of a different power - but the quadratic and cubic relationships give a decent spread of both a-list and unknown interesting bloggers in the Technorati Interesting Recent Blogs and Interesting Newcomers lists. For the Interesting Newcomers list, I simply cut out all of the bloggers who already have an audience - so you won't see any a-list bloggers on that list, at least not once they become a-list. :-)
This is interesting research for me, but the most satisfying thing about it is that I've found a way to identify interesting new writers and add them to my blogroll - people I would never have found out about otherwise. I can also use the other Technorati tools, like the Link Cosmos, to find out who is linking to them - which gives me a quick feeling for who is in their community.
Let me know your thoughts - do the new rankings look reasonable to you? Are you finding new and interesting blogs? More of the same old same old, or just boring crap? I know I've already found one new blog I'd never heard of before - Exploding Cigar, currently number 4 on the Interesting Newcomers list. Very funny, great blog.
UPDATE: Jason Kottke has done the analysis, and comes up with the following formula:
y = 5989.8x^-0.8309
The important part of that equation is the power degree: -0.8309. To counteract that effect, we need to invert it (hope I'm getting my math right), which would make the power needed to counteract the power law approximately x^1.2038. That roughly matches up with the formula I spelled out for the Interesting blog list, which approximates an x^1.5 relationship for reasonable values of c. Note: This is something I just did on the back of a napkin with only 4 hours of sleep, and no coffee, so I may be way off, but if some kind mathematician can check the work and comment, I'd be much obliged.
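A quick numeric sanity check of that exponent (my own back-of-envelope, under the assumption that c is much larger than n): if the cubic score n^3 / (c + n)^2 is to stay constant, c has to grow roughly as n^1.5:

```python
def score(n, c):
    # The Interesting Newcomers ranking from above
    return n**3 / (c + n)**2

k = 10
base = score(5, 1000)                 # obscure blogger, c >> n
scaled = score(5 * k, 1000 * k**1.5)  # 10x the links needs ~31.6x the audience
# The two scores agree to within about 1%, consistent with an x^1.5
# relationship between audience size and the new links needed to keep pace.
```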
Linux gives us the power we need to crush those who oppose us.
I'm Steve, and I'm a super-villain.
More Technorati News
In my last post, I discussed a database overhaul that significantly improved the access times of the Technorati link tracking service. This has resulted in faster load times, and faster web spider indexing times, which means fresher information on the site.
But that's not all that I did in my recent weekend reengineering at Technorati. I also:
- Added in <guid> fields to the RSS feeds. These special RSS 2.0 tags allow you to identify each RSS entry with a unique identifier, and are perfect for the RSS feeds that Technorati produces. I added them so that RSS Aggregators can identify posts and links uniquely. When you get an RSS watchlist, it is filled with up-to-the-second information on who is linking to you. It also includes text in the feed noting when the link was created, and when the blog was created. This means that every time you check your RSS feed, the text inside each item in the feed changes. The <guid> field allows aggregators to keep track of these posts, and mark them as read, for example. If you've got a Link Cosmos as big as Dave's or Doc's, that helps seriously cut through the clutter.
- Fixed the blog indexing engine so that a blog that is reachable from two or more addresses will be identified as such. For example, take a look at the awesome bOingbOing blog. Some people link to it at www.boingboing.net, and some people link to it at boingboing.net. The links go to the same place, but the old Technorati code thought there were two blogs there. That's fixed now. It also means that the Technorati Top 100 and Interesting Recent Blogs lists are more accurate as well.
- Fixed the Link Cosmos display engine so that links that you create to your own blog don't show up on your Link Cosmos. I got lots of complaints from people on that bug, and I think it is most prevalent with people who use Radio with its Categories option to post multiple blog channels to different directories on the same site - it generates lots of self-referential links when doing blog updates.
- Added a Creative Commons license. You (and your browsing tools) will now see at the bottom of every Technorati page, the Creative Commons copyright license for the page. The license permits others to copy, distribute, display, and perform the work, with attribution and for noncommercial purposes only. In other words, you can't make a knock-off Technorati site by pulling all the content and replacing Technorati with your name, and you can't use the Technorati results for commercial purposes unless we work out a deal. Of course, that license doesn't apply to the RSS feeds that you get when you purchase a watchlist. You can use those for commercial purposes all you like.
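The <guid> behavior described in the first item can be sketched like this (the guid value below is hypothetical, not Technorati's actual format): the item text changes on every fetch, but the guid doesn't, so an aggregator can skip what it has already shown:

```python
seen = set()  # guids the aggregator has already shown the user

def new_items(feed):
    """Return only the items whose guid hasn't been seen before."""
    fresh = [item for item in feed if item["guid"] not in seen]
    seen.update(item["guid"] for item in fresh)
    return fresh

# Same link on two consecutive fetches: the description text changed,
# but the guid stayed stable.
fetch1 = [{"guid": "technorati-link-42", "desc": "linked 2 hours ago"}]
fetch2 = [{"guid": "technorati-link-42", "desc": "linked 3 hours ago"}]
first = new_items(fetch1)   # one new item
second = new_items(fetch2)  # nothing new - already seen
```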
Here's one idea I've been toying with: Would you be interested in viewing graphs of the number of incoming blogs/links to a site over time? It would be a great way to track interest and authority of a site as time passes. Would you be willing to help subsidize the work necessary to build it and store all the data? It's not something that I could work on right away (Sputnik work is my #1, #2, and #3 priorities right now), but I'd be interested in your thoughts. Leave comments below, and let me know.
Technorati Technical Update
It's been a while since I've been able to blog, but I've just had a minute or two to come up for air, and I wanted to mention the goings-on at Technorati. Things have been really great. It seems that lots of people find the service useful, and people are signing up for watchlists. Unfortunately, what also happened was that the database schema backing the Technorati site was poorly designed (bad Dave!), which made the website slow to a crawl over the last few weeks. What happened is that I had a really big table in the database called "links" that had the following structure:
CREATE TABLE links (
lid bigint NOT NULL AUTO_INCREMENT,
bid bigint NOT NULL,
linkedblog bigint NOT NULL,
href VARCHAR(255) NOT NULL DEFAULT '',
linktext VARCHAR(255) NOT NULL DEFAULT '',
title VARCHAR(255) NOT NULL DEFAULT '',
priortext VARCHAR(255) NOT NULL DEFAULT '',
aftertext VARCHAR(255) NOT NULL DEFAULT '',
created DATETIME,
updated DATETIME,
current CHAR(1) NOT NULL DEFAULT 'Y',
PRIMARY KEY (lid),
INDEX (href),
INDEX (bid),
INDEX (linkedblog),
INDEX (created)
);
There were two big problems with this table - first, each record was way too big. I thought I was getting the best of both worlds when I made all of my char() columns VARCHARs - only use the space you need, right? More on that later. The second problem was that I had over 6 million active links that Technorati was tracking, so traversing the database was s-l-o-w. And since I was using Linux on x86 and MySQL as the backend for the database, I was going to start bumping into the 2GB file size limit in short order.
Here's where some good old-fashioned database optimization techniques came into play. One of the best things I learned about database design was to make it easy to calculate record offsets. That means using fixed character field widths for tables that need fast lookups. So, what I did was split the links table into two tables, and made some changes to the new links table:
CREATE TABLE links (
lid bigint NOT NULL AUTO_INCREMENT,
bid bigint NOT NULL,
linkedblog bigint NOT NULL,
href CHAR(127) NOT NULL DEFAULT '',
created DATETIME,
updated DATETIME,
current CHAR(1) NOT NULL DEFAULT 'Y',
PRIMARY KEY (lid),
INDEX (href),
INDEX (bid),
INDEX (linkedblog),
INDEX (created)
);
CREATE TABLE linkcontext (
lid bigint NOT NULL,
title VARCHAR(255) NOT NULL DEFAULT '',
linktext VARCHAR(255) NOT NULL DEFAULT '',
priortext VARCHAR(255) NOT NULL DEFAULT '',
aftertext VARCHAR(255) NOT NULL DEFAULT '',
PRIMARY KEY (lid)
);
You'll notice what I did - I pulled out all of the information that wasn't in an index, and put that data into the linkcontext table. I made sure that there was a unique key (lid) that identified the data for a particular link, and also kept the space-saving format of the VARCHAR. Essentially, the index file for the linkcontext table is simply a set of pairs - the lid of the link, and the offset of the linkcontext record corresponding to the data. I did lose a bit of flexibility with this system - notice that the link href now has a maximum size of 127 characters, down from a theoretical maximum of 255 earlier. I decided to take this tack because fewer than 0.01 percent of the URLs in the database were longer than 127 characters, so I figured it was an acceptable loss.
Actually, I originally tried cutting the column down to 63 characters, but that unfortunately cut out a bunch of interesting sites, like Reuters, which uses long URLs to designate topics and types of headlines, for example. So, 127 was the right number.
The biggest changes came to the links table - Note that every column has a fixed width. What that means is that MySQL no longer has to do reindexing of the table whenever inserting or deleting records - all it needs to do is a quick multiplication of the lid offset with the total number of bytes per record.
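To illustrate the offset arithmetic (the byte widths here are my assumptions for illustration, not MySQL's exact on-disk format):

```python
# Assumed per-column widths for the fixed-width links table (illustrative):
ROW_BYTES = (
    8      # lid        BIGINT
    + 8    # bid        BIGINT
    + 8    # linkedblog BIGINT
    + 127  # href       CHAR(127)
    + 8    # created    DATETIME
    + 8    # updated    DATETIME
    + 1    # current    CHAR(1)
)          # 168 bytes per record

def record_offset(row_number):
    """Byte offset of a record in the data file: one multiply, no scan."""
    return row_number * ROW_BYTES
```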
But wait, what does all this mean when doing SELECT calls? Doesn't it mean that I have to SELECT information from two tables instead of one?
Yes, that's true, but we can take advantage of the power of SQL joins. Here's how I extracted data from the links table before:
SELECT lid, bid, href, linktext, priortext, aftertext, nearestpermalink, UNIX_TIMESTAMP(created) AS created, UNIX_TIMESTAMP(updated) AS updated FROM links WHERE href LIKE 'http://www.sifry.com/%' AND current='Y' ORDER BY created DESC
Slow stuff, because we had to jump around the indexes looking for the contents of the href column.
Here's what it became:
SELECT links.lid AS lid, bid, href, linktext, priortext, aftertext, nearestpermalink, UNIX_TIMESTAMP(created) AS created, UNIX_TIMESTAMP(updated) AS updated FROM links,linkcontext WHERE href LIKE 'http://www.sifry.com/%' AND current='Y' AND links.lid = linkcontext.lid ORDER BY created DESC
No need for two SELECT calls - or any change to the other business logic following the SQL. By putting the join condition at the very end of the WHERE clause, I've reduced the number of rows that match the WHERE to an absolute minimum before matching up the links with the related linkcontext data. The speedup? From minutes per query to seconds (or less).
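Here's a minimal, runnable sketch of the split-table join, using Python's sqlite3 in place of MySQL and trimming the column lists for brevity:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Trimmed-down versions of the two tables described above
db.execute("CREATE TABLE links (lid INTEGER PRIMARY KEY, href TEXT, current TEXT)")
db.execute("CREATE TABLE linkcontext (lid INTEGER PRIMARY KEY, linktext TEXT)")
db.execute("INSERT INTO links VALUES (1, 'http://www.sifry.com/alerts', 'Y')")
db.execute("INSERT INTO linkcontext VALUES (1, 'Sifry''s Alerts')")

# One query, two tables: filter links first, then pick up the context row
rows = db.execute(
    "SELECT links.lid, href, linktext FROM links, linkcontext "
    "WHERE href LIKE 'http://www.sifry.com/%' AND current='Y' "
    "AND links.lid = linkcontext.lid"
).fetchall()
```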
I made some other changes to the codebase, including a number of bug fixes and content changes, but I'll discuss those in another blog post...
Alan Alda for Science Advisor!
Alan Alda has answered the call, responding to the Edge Question, and his answer is remarkably good. Each respondent is asked to imagine becoming President Bush's Science Advisor and to answer the question, "What are the pressing scientific issues for the nation and the world, and what is your advice on how I can begin to deal with them?"
Excerpts from Alda's response:
The world is going to come to an end in about 5 billion years no matter what we do. So, in the long run, you're off the hook. It's true that things like Global Warming, plus the increasing loss of clean water and biodiversity, can hasten The End Of Everything As We Know It, but even so, it will all end eventually. Nobody gets blamed for continuing a disastrous policy, so there will be no harm to your reputation if you do nothing. People simply do not say, "Caesar did nothing to halt the Roman practice of putting lead in the air and water, probably resulting in the eventual weakening and fall of the empire." But they're absolutely fascinated with the way he could divide Gaul into thirds.
Recognizing this, I will not advise you to do anything related to the environment. I will simply ask permission to put a glass of water on your desk every day with little things swimming in it. Sooner or later, you'll slip and drink from it, and while you're in the hospital, we can talk about the billion or so people who have nothing else to drink.
"State of the Union" and Creative Commons licensing
Public Campaign (a beltway non-profit focusing on campaign finance reform) has created a really interesting poster called "State of the Union". It is an exposé of the buyout of America's political system through large corporate campaign donations, and the data they present is quite compelling.
The other interesting part of this story is that they released the poster under a Creative Commons license, which is a first (I believe) for a large political policy group. Their intentions are to see the electronic version of the poster disseminated widely - emailed, printed and posted on office doors. The Creative Commons licensing was intentionally chosen to let people know that it is OK to copy and print out the poster so long as it is done for non-commercial purposes (they are selling a 2'x3' version of the poster on glossy poster paper for $15) and also to promote the ideals of Creative Commons.
This is an experiment on their part, and I think it is something that should be supported. It (a) gets the Public Campaign message out on the net, and (b) gets policy wonks inside the beltway to sit up and take notice of the work that the great folks at Creative Commons have been doing. Maybe, just maybe, it'll help another political organization to release their work under a similar license.
Avalanches start with a single flake of snow...
Full disclosure: My brother Micah works at Public Campaign (msifry at publicampaign.org), and he and I had long discussions about using the Creative Commons license. I also host one of Public Campaign's sites and mailing lists on a server I own.
Blogging Hours
The past few days, I've been back at Johns Hopkins, my alma mater. I was there giving a talk to their intersession class on Entrepreneurship (I've put up a PDF version of my slides, btw).
Afterwards, I got to see some of my old professors, and it was great to talk with some of the best and brightest and catch up on their current work. They asked me what kinds of things they could do to (a) improve alumni relations, (b) improve or change the student experience, and (c) increase or enhance the reputation and knowledge sharing at the University.
I've been thinking about it, and I think I've got a suggestion:
Give every faculty member, graduate student, undergraduate, and employee at the university a blog.
If I had a million dollars to give to the university, I'd split it into $10,000 chunks and I'd make them available as grants to the 100 people that posted the most interesting, useful blogs during the school year. Make it a contest.
Imagine that - 100 members of the JHU community blogging daily. Some would talk about their current research, some would write about daily life, some would post poetry and writings, who knows. The conversation would be phenomenal. It would get national and local press. It would open a window to the entire world on the interests, knowledge, and thinking of 100 of the world's finest professors, students, and administrators in higher education today.
I think it would also start conversations. It would attract students to the school. As Doc likes to say, it would be arson. It would light fires of interest, collaboration, and involvement. Just spending 3 days down here, talking with some of the great people, I got intrigued by all the potential. I saw the stovepiped information pathways, the bureaucracy, and - to a person - everyone railed against it. Here's an idea: Give the university a choir of voices. Make it easy for people to talk, easy to post. Imagine the connections that would happen just by doing a Google search, researchers across the world that could find each other. Throw away that old-fashioned quarterly newsletter, or even better - supplement it with the best of the conversations that these blogs start.
From an infrastructure perspective, it would cost almost nothing. An extra server or two. Training? Writing a blog has become point and click.
Heck, Lessig blogs. Reynolds blogs. When I think about Stanford, guess who I think of? When I think of the University of Tennessee, again, guess who? Imagine an entire faculty doing what these guys do. Wow. Office hours are from 2:30 to 4. Blogging hours are from 4 to 4:30.
The first university that gets serious about using blogs will create a huge impact in profile, research quality, cooperation, and collaboration both inside and outside of the university. The first one to do it will show its cluefulness. The value to the rest of us would be huge as well. I would bet it would end up increasing alumni giving as well.
The key thing is to create incentives for people to communicate. Go ahead, put up a disclaimer on the blog pages, and get a blogging policy. But it is key to not punish people for publishing their thoughts. Don't get PC. Trust the students. Trust the faculty. They will rise to the occasion. After all, they're signing their names to the work.
I would read them, sure as hell.
God, that would be great.
The 10 cardinal columnist sins
I saw this link on Gawker today (keep it up, folks), and saw myself reflected all too well in its 10 admonitions. Something for all bloggers to read before they hit the "Post" button.