May 1, 2006

State of the Blogosphere, April 2006 Part 2: On Language and Tagging

Late last month, I gave a high-level overview of the growth of the blogosphere, covering the overall size of the data sets that Technorati tracks, the number of new blogs created each day, the number of posts per day, and the issue of splogs or spam blogs.

To recap, here are the highlights of Part 1:

  • Technorati now tracks over 35.3 37.3 Million blogs
  • The blogosphere is doubling in size every 6 months
  • It is now over 60 times bigger than it was 3 years ago
  • On average, a new weblog is created every second of every day
  • 19.4 million bloggers (55%) are still posting 3 months after their blogs are created
  • Technorati tracks about 1.2 Million new blog posts each day, about 50,000 per hour
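These figures hang together under some quick arithmetic. The sketch below is only a sanity check: it assumes perfectly steady doubling and posting spread evenly across the day, neither of which the data guarantees. The function names are mine, chosen for illustration.

```python
# Rough consistency check of the Part 1 highlights.
# Assumptions (not from Technorati): steady doubling, uniform posting rate.

def growth_factor(months: float, doubling_period_months: float = 6) -> float:
    """Size multiple after `months`, assuming the blogosphere doubles
    every `doubling_period_months`."""
    return 2 ** (months / doubling_period_months)


def per_hour(per_day: float) -> float:
    """Average hourly rate for a given daily total."""
    return per_day / 24


def per_day_at_one_per_second() -> int:
    """Number of new weblogs per day if one is created every second."""
    return 24 * 60 * 60


print(growth_factor(36))            # 3 years of doubling every 6 months
print(per_hour(1_200_000))          # posts per hour from 1.2 Million per day
print(per_day_at_one_per_second())  # new weblogs per day
```

Six doublings in three years gives a factor of 64, which is in line with "over 60 times bigger", and 1.2 Million posts per day works out to 50,000 per hour.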


Strong International Growth

Back in April 2005, Technorati started automatically tracking the primary language of each blog that we tracked. We did this so that we could easily allow people to filter out posts in languages other than their native language. This is available in a pull-down menu on every search results page. We also wanted to get some idea of where the worldwide growth of blogging was taking place, and what trends we could glean from the data.

There are three very important caveats in the data sets that I'm going to describe below. The first is that we are using automated language analysis software (based on languid), and it may have bugs, thus over- or undercounting a particular language or group of languages. We're going to keep improving this software, but we are pretty confident in its ability to work reliably, especially over the large data sets that Technorati tracks (over 35 million blogs at this time, and over 1.2 million posts each day). Second, we believe that we are grossly undercounting the Korean blogosphere, mostly because the largest Korean blog and hompy services (like Cyworld or Planet Weblog) are not being indexed by Technorati at this time. In addition, we believe that we're somewhat undercounting the French blogosphere, in particular because our indexing of skyblog is poor. We'd love to rectify this - if anyone at these (or other) blogging services is interested in being indexed, please drop me a line. Last, Japanese bloggers appear to write shorter posts more often. This could be a result of blogging from mobile phones, and may be skewing the results, given that we are tracking the total number of posts in this analysis.

Another key point to remember is that language breakdown does not necessarily imply a particular country or regional breakdown. For example, Spanish and English are spoken in a large number of countries around the globe - and this analysis doesn't attempt to determine which country a blogger is writing from - only the primary language of her post.
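To make the per-post counting concrete, here is a minimal sketch of how a month-by-month language share could be tallied. It is not Technorati's actual pipeline: the function name, the `(month, language)` input shape, and the language codes are all assumptions for illustration, and language detection itself is taken as already done.

```python
from collections import Counter, defaultdict


def monthly_language_share(posts):
    """Return {month: {language: share_of_posts}}.

    `posts` is an iterable of (month, language) pairs, one per post,
    e.g. ("2006-04", "ja"). Both fields are hypothetical labels; the
    primary language is assumed to have been detected upstream.
    """
    counts = defaultdict(Counter)
    for month, language in posts:
        counts[month][language] += 1

    shares = {}
    for month, by_language in counts.items():
        total = sum(by_language.values())
        shares[month] = {lang: n / total for lang, n in by_language.items()}
    return shares


# Made-up sample data, purely to show the shape of the result.
sample = [
    ("2006-03", "en"), ("2006-03", "ja"),
    ("2006-04", "ja"), ("2006-04", "ja"), ("2006-04", "en"), ("2006-04", "zh"),
]
print(monthly_language_share(sample))
```

Because the unit is the post and not the blogger, a community that writes many short posts will show a larger share than its number of authors alone would suggest, which is the Japanese caveat mentioned above.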

The following charts show the relative volume of blog posts based on the primary language of the post, on a month by month basis:

Slide0014

Here's a more detailed breakdown of the last 6 months of data:

Slide0009-1

Slide0011

Slide0013

Something that may come as a surprise (at least to the English-speaking world) is that English isn't the biggest language of the blogosphere. In fact, English is now the primary language of less than one third of all posts that Technorati tracks. Another interesting finding is that the Chinese blogosphere, which grew significantly in 2004 and 2005 (with the launches of MSN Spaces in Chinese and Bokee.com) and peaked at 25% of all posts in November 2005, seems to be slowing down somewhat this year.

One of the further topics for research is to investigate the language breakdown of posting activity based on blog hosting site or software type. My hypothesis is that various language communities have often grown on one service or another, often for viral or historical reasons, showing a disproportionate language breakdown for that service. For example, livejournal.com hosts a large number of Russian-language journals/blogs, and MSN Spaces hosts a disproportionately large number of Chinese-language blogs.

Tags and Categories

Tagging, the act of categorizing posts with simple words or phrases, continues to grow, and the number of posts with tags or categories has grown past the 100 Million mark since Technorati began tracking tags in January of 2005.

Nearly half (47%) of all blog posts have an author-generated category or set of tags associated with the post. For this analysis, Technorati excluded generic or default categories, like "General" or "Diary", which some services put into each post if the author doesn't specify a particular tag or category. We only counted posts that used a non-default tag or category.
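The exclusion rule can be sketched in a few lines. This is an illustration only: "General" and "Diary" are the two defaults named above, but the full list Technorati uses is not given here, so `DEFAULT_CATEGORIES` and both function names are assumptions.

```python
# Hypothetical set of service-supplied default categories. Only "General"
# and "Diary" come from the post; the real list is an assumption.
DEFAULT_CATEGORIES = {"general", "diary"}


def has_author_tag(tags):
    """True if at least one tag is non-empty and not a default category."""
    return any(
        tag.strip().lower() not in DEFAULT_CATEGORIES
        for tag in tags
        if tag.strip()
    )


def tagged_share(posts):
    """Fraction of posts carrying at least one non-default tag.

    `posts` is a list where each item is that post's list of tags.
    """
    if not posts:
        return 0.0
    tagged = sum(1 for tags in posts if has_author_tag(tags))
    return tagged / len(posts)


# Made-up example: only the second and fourth posts count as tagged.
example = [["General"], ["politics", "iraq"], [], ["Diary", "travel"]]
print(tagged_share(example))
```

A post tagged only "General" is treated the same as an untagged post, while a post that mixes a default with a real tag still counts.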

Slide0015-1

Many bloggers use this tagging capability to help get their content found by people who are searching for a particular topic, even if that topic isn't listed as a keyword in the post. Of course, one of the remaining open questions is whether or not that will lead to massive gaming of the system, but current trends suggest that large-scale gaming is not occurring. In fact, my belief is that because tags are built as hyperlinks inside the document, and thus visible to the reader, a strong social pressure to use appropriate tags (or at least to not use inappropriate tags) manifests itself, especially with bloggers who want to cultivate influence and readers.

Clarification: A number of people asked me to clarify the tagging statistics. 47% of the daily blog posts that Technorati tracks (about 560,000 posts out of the 1.2 Million postings per day) have one or more tags or categories associated with the post. Obviously that number fluctuates somewhat given the day and the number of postings tracked that day. Hope that clears things up!
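For anyone who wants to check the numbers, the calculation is simply the tagged percentage applied to the daily post volume. The variable names below are mine, and the inputs are the rounded figures quoted above, so the result is approximate.

```python
# Rounded inputs from the clarification above.
POSTS_PER_DAY = 1_200_000
TAGGED_PERCENT = 47

# Integer arithmetic avoids floating-point rounding noise.
tagged_posts_per_day = POSTS_PER_DAY * TAGGED_PERCENT // 100
print(tagged_posts_per_day)
```

That comes to 564,000, which is where the "about 560,000" figure comes from.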

In Summary:

  • The blogosphere is multilingual, and deeply international.
  • English, while being the language of the majority of early bloggers, has fallen to less than a third of all blog posts in April 2006.
  • Japanese and Chinese language blogging has grown significantly.
  • Chinese language blogging, while continuing to grow on an absolute basis, has begun to decline over the last 6 months as a percentage of the posts that Technorati tracks.
  • Japanese, Chinese, English, Spanish, Italian, Russian, French, Portuguese, Dutch, and German are the languages with the greatest number of posts tracked by Technorati.
  • The Korean language is underrepresented in this analysis.
  • Language breakdown does not necessarily imply a particular country or regional breakdown.
  • Technorati now tracks more than 100 Million author-created tags and categories on blog posts.
  • The rel-tag microformat has been adopted by a number of the large tool makers, making it easy for people to tag their posts. About 47% of all blog posts have non-default tags or categories associated with them.
