I'm proud to announce that Technorati and Newsweek are working together, including a deep integration of posts and links from bloggers (here's an example) into Newsweek's site. This includes the Newsweek Blog Roundup and summary widget on every Newsweek page (shown here on the right). This acts just like a "most viewed articles" or "most emailed articles" widget - only the determinations are made by watching the number of bloggers that are linking to Newsweek articles. It shows the top 10 Newsweek stories generating the most discussion on weblogs within the past 7 days. You can see it on the Newsweek homepage and on each of the article pages; simply scroll down a bit and look on the right-hand side.
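For the curious, the ranking behind a widget like this can be sketched in a few lines of Python. This is a rough illustration only, not the code that runs on Newsweek's site: the `top_stories` function, the shape of the `links` records, the example URLs, and the date are all made up for the example, and I'm assuming each blog counts once per story no matter how many times it links.

```python
from collections import Counter
from datetime import datetime, timedelta

def top_stories(links, now, days=7, limit=10):
    """Rank article URLs by how many distinct blogs linked to them recently.

    `links` is an iterable of (blog, article, linked_at) tuples; this
    record shape is a hypothetical one chosen for the sketch.
    """
    cutoff = now - timedelta(days=days)
    # Keep one (blog, article) pair per blog so repeat links count once.
    recent_pairs = {
        (blog, article)
        for blog, article, linked_at in links
        if cutoff <= linked_at <= now
    }
    counts = Counter(article for _, article in recent_pairs)
    return counts.most_common(limit)

# Made-up sample data; the date is arbitrary.
now = datetime(2005, 8, 15)
links = [
    ("blog-a.example", "newsweek.example/story-1", now - timedelta(days=1)),
    ("blog-a.example", "newsweek.example/story-1", now - timedelta(days=2)),
    ("blog-b.example", "newsweek.example/story-1", now - timedelta(days=3)),
    ("blog-b.example", "newsweek.example/story-2", now - timedelta(days=4)),
    ("blog-c.example", "newsweek.example/story-3", now - timedelta(days=30)),
]
ranking = top_stories(links, now)
print(ranking)
```

In the sample data, the second link from blog-a to story-1 is not counted twice, and the 30-day-old link to story-3 falls outside the window.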
In addition, Newsweek has launched a section covering the conversations in the blogosphere about Newsweek's columnists as well. For example, here's the data on Steven Levy, Anna Quindlen, Michael Isikoff and Mark Hosenball. You can also subscribe to the search via RSS feed by Technorati Watchlist (available at the top of each Blog Talk), and you can dive as deep as you like by getting all of the posts as well.
My kudos to the folks at Newsweek for their forward thinking in recognizing bloggers and including them inside their tent. This is just the beginning of many ways that mainstream media and bloggers can work together to provide a more complete picture of a story - facts, opinions and feedback all shown in one place, making for a better reader experience.
One more thing - if you think that this is a cool feature, and you want to see Newsweek and other media companies roll out systems like this, leave a comment, or send a trackback, or better yet, go to the Blog Roundup and, at the bottom of the page, rate the article. That'll send Newsweek a message that this is something you want to see more of.
First off, some terminology and an understanding of what we're measuring. The chart below illustrates a measure of influence or authority of a site or blog as measured by the number of people who are linking to it. Note that this is not a measure of page views or website "hits". Rather, Technorati looks at linking behavior as a proxy for attention and influence. In other words, the more people who link to a site or blog, the more influence it has on others.
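To make the distinction from "hits" concrete, here is a minimal sketch of counting inbound links by distinct source. It is an illustration under my own assumptions, not a description of Technorati's production system: the `authority` function is hypothetical, sites are grouped by hostname, self-links are ignored, and the URLs are invented.

```python
from collections import defaultdict
from urllib.parse import urlparse

def authority(links):
    """Map each linked-to site to the number of distinct sites linking to it.

    `links` is an iterable of (source_url, target_url) pairs. Grouping by
    hostname is an assumption of this sketch.
    """
    inbound = defaultdict(set)
    for source_url, target_url in links:
        source = urlparse(source_url).hostname
        target = urlparse(target_url).hostname
        if source and target and source != target:  # ignore self-links
            inbound[target].add(source)
    return {site: len(sources) for site, sources in inbound.items()}

links = [
    ("http://blog-a.example/post/1", "http://news.example/story"),
    ("http://blog-a.example/post/2", "http://news.example/other"),
    ("http://blog-b.example/entry", "http://news.example/story"),
    ("http://news.example/about", "http://news.example/story"),
    ("http://blog-b.example/entry", "http://blog-a.example/post/1"),
]
scores = authority(links)
print(scores)
```

Here news.example ends up with a score of 2, because two different blogs link to it, even though it receives four links in total.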
As the chart above shows, the most influential media sites on the web are still well-funded mainstream media sites, like The New York Times, The Washington Post, and CNN. However, a lot of bloggers are achieving a significant amount of attention and influence. Blogs like bOingbOing, Daily Kos, and Instapundit are highly influential, especially among technology and political thought leaders, and sites like Gizmodo and Engadget are seeing as much influence as mainstream media sites like the LA Times. A note on counting: Some organizations with multiple domains or highly syndicated strategies, like the Associated Press and Reuters, are underrepresented in this chart, given that their impact is not easily countable using our methods. An interesting statistic to note is the current placement of subscription sites like WSJ.com (the Wall Street Journal). While the WSJ has begun to offer some content outside of its subscriber-only site, the policy is clearly costing them some influence and attention in the blogosphere, as bloggers find it difficult to link to articles in the subscriber-only sections. Also interesting to note is that even though The New York Times and The Washington Post require free registration to view the articles, bloggers are still linking to the stories, and this behavior hasn't changed much in the past 6 months.
More to come later in the week, including all of the underlying data...
Today I will write about some of the darker sides of the blogosphere, including the increase in spam blogs, fake blogs, and comment and trackback spam. Along with the growth in the blogosphere (as reported in parts 1, 2 and 3 last week), Technorati has also been tracking an increase in the number of people who are trying to manipulate the blogosphere. First off, some definitions:
Spam blogs are blogs that are created in order to influence results on a search engine by filling the results with spam or fake postings. Sometimes it is done to influence page rank-type algorithms, which monitor the number of pages (in this case blog postings) that link to a page or a site. In the more general web sense, these are called "Link Farms". Sometimes it is to push higher rankings of those posts and blogs for certain keywords, also known as "keyword stuffing". There's been quite a bit already written about link farms and keyword stuffing; both are pretty well-known techniques used by some people to influence search ranking. They are also pretty easy to catch, and most search engines actively penalize or exclude these sites from their index. Here are some example spam blogs.
Fake Blogs are blogs that appear "blog-like" on the surface: They have numerous posts, usually around a particular area or subject, and at first glance look as if they were created by a person. However, these blogs are actually automated creatures created by programs, usually in order to get highly targeted AdSense advertising, or in some cases built to become a portal for affiliate systems like the Amazon Associates program. They are created in order to perpetrate click fraud or sometimes as a part of a "make money fast" scam on the internet, again by taking advantage of traffic brought to them by search engines and web rings. Here are some example fake blogs.
I should note that some fake blogs may very well contain interesting and relevant content, which opens a debate about how useful or valuable they are. This is why I don't include fake blogs in with Spam blogs (as defined above): it is arguable that these systems are actually providing readers some value.
Comment and Trackback Spam
Modern blogging systems allow for comments and trackbacks as ways of allowing readers or other bloggers to easily add their thoughts and comments to a post. Unfortunately, some spammers have been abusing these systems as well. Many hosting providers and tool makers have incorporated authentication mechanisms and captchas to make it more difficult to automate the tasks. They have also added moderation capabilities, and many vendors have turned these moderation systems on by default for new blogs. Early this year, a number of search engines including Technorati adopted the rel="nofollow" microformat. This latest set of salvos has worked quite well in many cases, but there are thunderclouds on the horizon, as research into defeating captcha systems has been effective, and my expectation is that this will continue to be an ongoing battleground in the future.
So what's being done about it?
The people who build spam and fake blogs think that they can get some kind of advantage - usually by getting additional search engine rankings or affiliate income by building these systems. In essence, they believe that there is an economics that spurs them on - and at Technorati, we've been working together with leading players to eliminate that economic incentive. We're working with the folks who run web advertising systems and at major affiliate programs to alert them of spammers as quickly as possible. We've been building real-time systems to identify spammers and fake blogs and sharing that information with other web search engines so that link farms and keyword stuffers see no increases in search rankings.
Now, that doesn't mean that some of these blogs won't slip through - it requires a lot of algorithms, deep thinking, and human intervention to build and monitor systems that deal with these problems. It is also an ongoing issue that needs time, care and attention as spammers come up with new and innovative ways to game search engines and affiliate networks. It would be disingenuous of me to proclaim that the folks at Technorati have got it all solved. We don't. But we've been putting a lot of time and effort into building those systems, and we're going to continue to innovate as well.
Technorati doesn't index comments or trackback content or links, and we also support the nofollow tag (you'll note I used it when linking to the example spam and fake blogs above) to give greater control to bloggers who want to point to spam or fake blogs without implicitly endorsing the site.
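Here is a small sketch of what honoring nofollow can look like when collecting links from a post. It uses only Python's standard `html.parser` module; the `FollowedLinkCollector` class and the sample markup are hypothetical, written for this illustration, and say nothing about how any particular search engine implements it.

```python
from html.parser import HTMLParser

class FollowedLinkCollector(HTMLParser):
    """Collect href values from <a> tags, setting aside rel="nofollow" links."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.skipped = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # rel holds space-separated tokens, so check for the token itself.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.skipped.append(href)
        else:
            self.followed.append(href)

post_html = (
    '<p>A <a href="http://good.example/">site I recommend</a> and '
    '<a href="http://spam.example/" rel="nofollow">an example spam blog</a>.</p>'
)
collector = FollowedLinkCollector()
collector.feed(post_html)
print(collector.followed)
print(collector.skipped)
```

Only the first link would count as an endorsement; the link marked nofollow is set aside.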
We've also been working on a number of social methods to help filter the blogosphere so that bloggers and readers can help separate the wheat from the chaff. Expect to see more from us on this in the coming months.
Web 2.0 Spam Squashing Summit
In February 2005, the first Web 2.0 Spam Squashing Summit was held in Silicon Valley. Key industry players such as AOL, Google, MSN, Six Apart and Yahoo were all in attendance at the standing room-only event, and it engendered a lot of industry cooperation and communication.
Working together with the same group of folks, the second Web Spam Squashing Summit will be held in the second half of September in Silicon Valley again. Final details are still being arranged, but representatives from Amazon, AOL, Ask Jeeves, Drupal, Google, MSN, Six Apart, Tucows, and WordPress have all confirmed their plans to attend the event.
More to come, including an open invitation to others in the industry, in the next few weeks. Watch this space.
Coming next: Blogs and the Mainstream Media.
Today's post is going to cover some new ground - Tags. This is new ground because Technorati started tracking and displaying blog post tags in January 2005.
A brief introduction to tags:
Tags are simply categories or topics. Most blog tools make it easy to categorize your posts, and working with the microformats community, Technorati implemented a simple way to track and aggregate blog posts, photos, and links that are all categorized, or "tagged" with the same name. Unlike rigid taxonomy schemes that many people dislike using, the ease of tagging for personal organization with social incentives leads to a rich and discoverable system, often called a folksonomy. Intelligence is provided by real people from the bottom-up to aid social discovery. And with the right tag search and navigation, folksonomy may outperform more structured approaches to classification, as Clay Shirky points out:
This is something the ‘well-designed metadata’ crowd has never understood — just because it’s better to have well-designed metadata along one axis does not mean that it is better along all axes, and the axis of cost, in particular, will trump any other advantage as it grows larger. And the cost of tagging large systems rigorously is crippling, so fantasies of using controlled metadata in environments like Flickr are really fantasies of users suddenly deciding to become disciples of information architecture.
For those of you interested in a deeper explanation, you can get more information on tags and Technorati's tagging implementation, including how it works, and browse the top 250 tags in roman languages and across all languages as well.
First, a look at the total number of blog posts with tags. The pickup rate has been nothing short of remarkable: over 25 Million blog posts with categories or tags, as shown in the chart below:
I can honestly say that no one at Technorati was expecting an adoption rate of that magnitude.
The chart below shows the number of tagged blog posts that we indexed each day from January through July of 2005:
Almost a third of each day's blog postings use tags or categories - just over 300,000 posts each day at the end of July. What is also interesting is that people are busily creating the "long tail" of tags on a daily basis. In other words, lots of people are creating new tags that are built for specific purposes, like for conferences or travelogues - some are using tags to help build communities around a topic. There are even spammers who tag (more on that tomorrow, grr). Some bloggers are using tags as a way to help organize information around an area or topic, as in the folksonomy example cited above, and event organizers are encouraging this by suggesting tags to them for use in their blog posts, photos, and on social bookmarking services like del.icio.us and furl. The chart below shows the number of brand new tags tracked each day. Note how it starts off with a big spike, as nearly 100,000 unique tags were tracked in the first week.
The numbers dipped somewhat as most common words were soon used as tags. However, growth in non-English languages, especially Asian languages such as Chinese and Japanese, has increased the average number of new tags seen each day to about 12,000 per day.
Of course, because the act of tagging is such a new thing, making predictions on where it will go in the future is anyone's guess. I believe that as long as the tagging system is set up to encourage accountability (e.g. link-based tags that are inside of a blog post) and discourage gaming, the folksonomy created will continue to prove useful in helping even non-bloggers view a more organized world.
Oh, and one more thing: Thanks to the computer visualization whizzes at the School of Art at Carnegie Mellon University, we came up with a video that shows the growth of tags in the blogosphere. You can see the most popular tags tracked each day as time goes from January (when things were still on a workbench) to late June 2005, when Technorati had tracked a total of about 20 Million tagged posts.
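Counting "brand new" tags per day comes down to remembering every tag seen so far. The sketch below shows the idea with made-up data; the `new_tags_per_day` function is hypothetical, and treating tags as the same regardless of case or surrounding whitespace is my assumption for the example, not a statement about how Technorati normalizes tags.

```python
def new_tags_per_day(daily_tags):
    """Count tags seen for the first time on each day.

    `daily_tags` is a list of (day_label, tags) pairs in chronological order.
    """
    seen = set()
    counts = []
    for day, tags in daily_tags:
        # Normalizing by case and whitespace is an assumption of this sketch.
        todays = {tag.strip().lower() for tag in tags}
        fresh = todays - seen
        counts.append((day, len(fresh)))
        seen |= fresh
    return counts

daily_tags = [
    ("day 1", ["blogs", "photos", "travel", "Blogs"]),
    ("day 2", ["blogs", "travel", "sotb2005"]),
    ("day 3", ["photos", "tagvideo", "wow"]),
]
new_counts = new_tags_per_day(daily_tags)
print(new_counts)
```

The first day is the biggest in this toy data for the same reason the real chart opens with a spike: at the start, every tag is new, and common words get used up quickly.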
This is the video that was shown at the AlwaysOn conference last month, and we've had numerous requests to put it up on the internet. Thanks to a very generous donation of storage and bandwidth, our friends at Ourmedia.org and The Internet Archive have put the video up on their servers. You can watch the 320x160 version or the full size video.
Please note, the full-size video is 61 Megabytes, and the smaller video is 12.2MB, so it may take a while to load on low bandwidth connections. The video is licensed under a Creative Commons Attribution-NonCommercial license, so go ahead and remix, mash, and have fun with it. We had a blast making it.
Small Version (12.2MB)
Large Version (20MB)
UPDATE: If anyone wants to set up a bittorrent for the video, please go ahead and let me know, and I'll post the torrent info here!
Tomorrow: More on Spam...
Technorati Tags: AlwaysOn, blogosphere, blogs, delicious, flickr, internetarchive, ourmedia, posts, scaling, search, search engine, sotb, sotb2005, statistics, stats, tags, tagvideo, technorati, technoratitag, video, weblog, weblogs, wow
Onwards and upwards! This is part 2 of the August 2005 State of the Blogosphere. Part 1 covers the overall growth of the blogosphere in terms of new blogs created. Today I'll discuss the number of posts made each day, also known as posting volume. Just to keep everyone updated on that set of statistics, here's what I wrote back in March, 2005:
To expand on my post yesterday on the overall growth of the number of weblogs, today I'm going to look at another important measure of the growth of the blogosphere, posting volume. A single post is a single entry to a weblog, whether it be a long essay or just a short entry, each is a post, and the posting volume is the aggregate number of posts per day. Just as it is important to note the increased growth in the number of weblogs out there, it is as or more important to see if blogging is a fad or if people are blogging at a sustained rate. The chart below shows that posting volume has been growing. (Compare with the chart from October 2004)
Here's that same chart updated with data through to the end of July 2005 (Compare with the chart from March 2005):
As you can see by the black trend line, posting volume has followed a strong upward trend. After a brief dip last winter, the average rate of postings has grown steadily such that at the end of July 2005, there were about 900,000 posts created each day. That's about 37,500 posts every hour, or 10.4 posts per second. It peaked at just over 1.1 Million posts per day after the Live 8 concerts and Justice Sandra Day O'Connor's announcement of her resignation from the US Supreme Court.
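If you want to check the hourly and per-second figures yourself, the conversion is plain arithmetic on the approximate daily volume quoted above:

```python
posts_per_day = 900_000  # approximate daily volume at the end of July 2005

posts_per_hour = posts_per_day / 24
posts_per_second = posts_per_day / (24 * 60 * 60)

print(f"{posts_per_hour:,.0f} posts per hour")      # 37,500 posts per hour
print(f"{posts_per_second:.1f} posts per second")   # 10.4 posts per second
```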
In fact, the posting volume has more than doubled in the 7 months from the beginning of January 2005 to the end of July 2005. Partly this is due to the tremendous popularity of simple hosted blog solutions like MSN Spaces, AOL Journals, Blogger, and LiveJournal, and we've seen a lot of people take up blogging because of the growth of tools like post-from-IM, a feature available for AOL and MSN users, where they can post from their instant messaging clients. There's also been a significant jump in tools making it easy to post to weblogs, including Flickr, TextAmerica, Buzznet, del.icio.us, and others, so posting can be as easy as tagging an interesting link or snapping a photo on your cameraphone.
I'd like to point out as well that Technorati's median time from post to index has now dropped to under 5 minutes. That means that half of all public blog posts are indexed by Technorati less than 5 minutes after they are created or modified, and are thus available in our search and tag results. This is also part of the recent performance and scaling work we've been doing.
I always find it interesting to look at the spikes in posting volume as well, and see what they can tell us by looking at the number of posts around the current events that caused a significant reaction in the blogosphere. I've listed a few of them on the chart above, including the US political conventions last summer, the Indian Ocean Tsunami, the Super Bowl, Live 8, and the London Bombings. Please note that the absolute number of posts is not indicative of the importance of the event - remember that there are a lot more bloggers today than there were 6 months ago. However, it is very interesting to look at the percentage deviation from the norm that each spike represents - the bigger the relative spike, the more jarring the event was to the overall blogosphere.
On the larger chart you can also see the effect that weekends have on posting volume, generally causing a drop of 5-10% from weekday volume. Not shown on this chart is information on intraday posting volume: We see the largest number of posts each day between the hours of 7AM and noon Pacific time, meaning between 10AM and 3PM Eastern time in the USA.
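One simple way to express "percentage deviation from the norm" is to compare a day's volume against the average of the days just before it. The sketch below does that with invented numbers; the `spike_deviation` function and the 28-day trailing window are my assumptions for illustration, not the method behind the chart.

```python
from statistics import mean

def spike_deviation(daily_posts, day_index, window=28):
    """Percentage by which one day's volume exceeds the trailing average."""
    start = max(0, day_index - window)
    baseline_days = daily_posts[start:day_index]
    if not baseline_days:
        raise ValueError("need at least one earlier day for a baseline")
    baseline = mean(baseline_days)
    return 100.0 * (daily_posts[day_index] - baseline) / baseline

# Made-up numbers: the same absolute jump of 200,000 posts in both cases.
early = [400_000] * 28 + [600_000]
later = [900_000] * 28 + [1_100_000]
early_spike = spike_deviation(early, 28)
later_spike = spike_deviation(later, 28)
print(f"{early_spike:.1f}% vs {later_spike:.1f}%")
```

Both made-up spikes add 200,000 posts, but the earlier one is a 50% jump over its baseline while the later one is only about 22%, which is why the relative number is the one worth comparing across time.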
More tomorrow - including the growth of tags.
Well, it is that time again! It has been almost 6 months since the last State of the Blogosphere, and so the team at Technorati and I have put together some high level information on what we've been tracking. Today I'll focus on the macro growth of the blogosphere, both in the number of bloggers out there, as well as in the growth of new blogs per day. You can compare the chart below to the charts from October 2004 and March 2005.
As of the end of July 2005, Technorati was tracking over 14.2 Million weblogs, and over 1.3 billion links. Interestingly, this is just about double the number of blogs that we were tracking 5 months ago: in March 2005 we were tracking 7.8 million blogs. That means that the blogosphere continues to double about every 5.5 months.
MSN Spaces, Blogger, LiveJournal, AOL Journals, as well as a number of international hosted services are growing quickly, and use of software like WordPress and Movable Type to provide blogs continues to grow significantly. There's a growing number of WordPress-based hosted services, including Laughing Squid, Dreamhost, and Blue Host, marking an interesting trend - that of ISPs and hosting providers using the GPL'ed software as a differentiating feature of their services. Moblogging sites like Textamerica and Buzznet have also been growing, as more people are blogging from their camera-enabled mobile phones. Growth has not only occurred in the US, but there has been a lot of blog growth in Japan, Korea, China, France, and Brazil, to name a few countries.
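For anyone who wants to play with the doubling figure, here is a back-of-the-envelope sketch. It assumes steady exponential growth between the two measurements and takes the gap between them to be 5 months; the `doubling_time` helper is hypothetical, and the result moves noticeably if you count the interval differently.

```python
import math

def doubling_time(start_count, end_count, months_elapsed):
    """Months needed to double, assuming steady exponential growth."""
    growth_ratio = end_count / start_count
    return months_elapsed * math.log(2) / math.log(growth_ratio)

# 7.8 million blogs in March 2005, 14.2 million at the end of July 2005.
estimate = doubling_time(7_800_000, 14_200_000, months_elapsed=5)
print(f"{estimate:.1f} months")
```

With these rounded inputs the estimate lands a little under six months. Counting a slightly shorter interval between the two measurements brings it closer to the 5.5-month figure.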
Here's a view of the number of new blogs created each day that Technorati is tracking, even after removing spam blogs (more on that later in the week) from our index:
You can see the charts from March 2005 and November 2004 to get an idea of how this is increasing, although all the data is included on the chart above. Technorati is now tracking about 80,000 new weblogs being created every day, which means a new weblog is created about every second. About 55% of all blogs are considered active - that is, 55% of all weblogs have had a posting in the last 3 months. In addition, 13% of all weblogs (currently 1.8 Million blogs) update at least weekly.
Interestingly, the activity statistics have remained remarkably consistent over time - in November 2004, we reported that 55% of all blogs were active, which is just about the same number as are active today. I think that this shows that even as the blogosphere is growing at a geometric pace, the "stickiness" of the tools and the willingness to write hasn't changed much at all.
Tomorrow I'll give an update on posting volume, which is a better statistic to track the growth of blogging. Lots of people who start new blogs are kicking tires and thus the numbers displayed above could be indicative of a fad in progress - but watching the posting volume shows how many people are actually blogging on a day-by-day basis. I think that is a much better indicator that people are making blogging a habit and a part of their daily lives. Later in the week I'll also describe the rise of tags, the increase in spam (or fake) blogs and SEO, and give an update on the relative influence of blogs compared to the mainstream media.