Late last month, I gave a high-level overview of the growth of the blogosphere, covering the overall size of the data sets that Technorati tracks, the number of new blogs created each day, the number of posts per day, and the issue of splogs or spam blogs.
To recap, here are the highlights of Part 1:
Strong International Growth
Back in April 2005, Technorati started automatically tracking the primary language of each blog that we tracked. We did this so that we could easily allow people to filter out posts in languages other than their native language. This is available in a pull-down menu on every search results page. We also wanted to get some idea of where the worldwide growth of blogging was taking place, and what trends we could glean from the data.
There are three very important caveats in the data sets that I'm going to describe below. The first is that we are using automated language analysis software (based on languid), and it may have bugs, thus over or undercounting a particular language or group of languages. We're going to be continually improving the capabilities of this software, but we are pretty confident in its ability to work reliably, especially over the large data sets that Technorati tracks (over 35 million blogs at this time, and over 1.2 million posts each day). Second, we believe that we are grossly undercounting the Korean blogosphere, mostly due to the fact that the largest Korean blog and hompy services (like Cyworld or Planet Weblog) are not being indexed by Technorati at this time. In addition, we believe that we're somewhat undercounting the French blogosphere, in particular because our indexing of skyblog is poor. We'd love to rectify this - if anyone at these (or other) blogging services is interested in being indexed, please drop me a line. Last, Japanese bloggers appear to write shorter posts more often. This could be a result of blogging from mobile phones, and may be skewing the results, given that we are tracking the total number of posts in this analysis.
Another key point to remember is that language breakdown does not necessarily imply a particular country or regional breakdown. For example, Spanish and English are spoken in a large number of countries around the globe, and this analysis doesn't attempt to determine which country a blogger is writing from, only the primary language of her post.
The following charts show the relative volume of blog posts based on the primary language of the post, on a month by month basis:
Here's a more detailed breakdown of the last 6 months of data:
Something that may come as a surprise (at least to the English-speaking world) is that English isn't the biggest language of the blogosphere. In fact, English isn't even the primary language of one third of all posts that Technorati tracks anymore. Another interesting finding is that the Chinese blogosphere, which grew significantly in 2004 and 2005 (following the launches of MSN Spaces in Chinese and Bokee.com, Chinese peaked at 25% of all posts in November 2005), seems to be slowing down somewhat this year.
One of the further topics for research is to investigate the language breakdown of posting activity based on blog hosting site or software type. My hypothesis is that various language communities have often grown on one service or another, often for viral or historical reasons, showing a disproportionate language breakdown for that service. For example, livejournal.com hosts a large number of Russian language journals/blogs, and MSN Spaces hosts a disproportionately large share of Chinese language blogs.
Tags and Categories
Tagging, the act of categorizing posts with simple words or phrases, continues to grow, and the number of posts with tags or categories has grown past the 100 Million mark since Technorati began tracking tags in January of 2005.
Nearly half (47%) of all blog posts have an author-generated category or set of tags associated with the post. For this analysis, Technorati excluded generic or default categories, like "General" or "Diary", which some services put into each post if the author doesn't specify a particular tag or category. We only counted posts that used a non-default tag or category.
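To make the counting rule concrete, here is a minimal Python sketch of how such a filter could work. It is an illustration only, not Technorati's actual code: the function names and the sample data are hypothetical, and the default list contains just the two categories named above ("General" and "Diary"), compared case-insensitively, which is an assumption.

```python
# Hypothetical default categories; the post names only "General" and "Diary".
DEFAULT_CATEGORIES = {"general", "diary"}


def has_author_tag(tags):
    """Return True if at least one tag is not a known default category."""
    return any(tag.strip().lower() not in DEFAULT_CATEGORIES for tag in tags)


def tagged_share(posts):
    """Fraction of posts carrying at least one non-default tag or category."""
    if not posts:
        return 0.0
    tagged = sum(1 for tags in posts if has_author_tag(tags))
    return tagged / len(posts)


# Made-up sample: each post is represented only by its list of tags.
sample_posts = [
    ["General"],            # default only: not counted
    ["blogging", "stats"],  # author-chosen tags: counted
    [],                     # no tags at all: not counted
    ["Diary", "travel"],    # default plus an author-chosen tag: counted
]

share = tagged_share(sample_posts)
print(f"{share:.0%} of sample posts have a non-default tag")
```

Counting a post that mixes a default category with an author-chosen tag is a design choice in this sketch; it follows from the rule that a post counts if it uses any non-default tag or category.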
Many bloggers use this tagging capability to help get their content found by people who are searching for a particular topic, even if that topic isn't listed as a keyword in the post. Of course, one of the remaining open questions is whether or not that will lead to massive gaming of the system, but current trends suggest that large-scale gaming is not occurring. In fact, my belief is that because tags are built as hyperlinks inside the document, and thus visible to the reader, a strong social pressure to use appropriate tags (or at least to not use inappropriate tags) manifests itself, especially with bloggers who want to cultivate influence and readers.
Clarification: I had a number of questions from people asking me to clarify the tagging statistics. 47% of daily blog posts that Technorati tracks (about 560,000 posts out of the 1.2 Million postings per day) have one or more tags or categories associated with the post. Obviously that number fluctuates somewhat given the day and the number of postings tracked that day. Hope that clears things up!
In Summary:
Technorati Tags: blogging, blogosphere, blogs, blogsearch, charts, international, language, microformat, microformats, posts, search, search engine, sotb, sotb2006, statistics, stats, study, tags, technorati, technoratitag, weblog, weblogs
Posted by dsifry at May 1, 2006 3:17 AM
Thanks again for your reports and what a surprise to discover that English is not the major Lingua Franca of the blogosphere. Cheers.
It'd be very interesting to see a more detailed breakdown of tag use. Are there 100M posts with tags, or 100M tags on all posts? The graph indicates one, the text the other. Also, it'd be fascinating to know how many distinct tags are in use. If the number of tags exceeds the number of words in the languages involved, we're recreating the "tower of babel". If this is true, then the sole hope for sense to be made of this or to emerge is for tools supporting folksonomies and the exploration of the blogosphere or bigger, the "annotated web," according to one's social spheres of trust.
We're building activeweave's stickis precisely to address this.
Posted by: Marc A. Meyer at May 1, 2006 9:07 AM
Fascinating. Many thanks.
Posted by: Bleepless at May 1, 2006 10:59 AM
I wonder if the Chinese slowdown in blogging is related to the increasingly overt censorship of the Internet by Chinese authorities. Blogging is best when it has candor and passion--you can't do that if you have to second guess yourself because you know you are being watched.
I would say that not being able to self-express could be bad for a society (and economy?).
Can your software distinguish human blogs and spam blogs?
Posted by: Jim Anderson at May 1, 2006 3:55 PM
very very interesting. except, it's called "blogtopia," and yes! i coined that phrase!
Posted by: skippy at May 1, 2006 8:21 PM
Great compilation of data. I'm interested in seeing what happens in the UK, since, as the Guardian posted (http://business.guardian.co.uk/comment/story/0,,1763760,00.html), there is an information discrepancy as to how much UK users are blogging and blog-reading. It's always hard, but it'd be great to have information geographically divided as to who is writing English-language blogs.
A big thank you! Amazing info. Didn't have a clue about the extent of Japanese/Chinese blogging. I appreciate your compilation of this data.
Posted by: Beth* A. at May 2, 2006 11:08 AM
I think this may be related to the difference between blogging and business. For business, a common worldwide language has been highly advantageous to participants in the global economy. I do remember articles in the Economist about English becoming the global language of business.
Then, too, there is the fact that the keywords of computer languages and pretty much all documentation of computer languages are in English. I've wondered if there might be a market for computer books written in Indian or Chinese, but none of the major software book vendors seem to wonder about such things, so that market must not exist (or it hasn't to date).
Blogging is an entirely different realm. You are not speaking to the entire world. Rather, you are speaking to a collection of individuals who share some similarity with you, shared experience, or a common interest. These individuals are in essence "plucked" out of the ocean of all Internet actors. Bloggers and subscribers find each other through tags and searches.
There is no real need for bloggers to adopt a common language. For business, I don't think this has changed, not yet anyway. But perhaps the networks and mini-communities that are being formed by blogging will become a breeding ground for a new kind of entrepreneurship that works with others who speak the same language (I mean "languages" in a broader aspect here, languages of various realms, including the "language" of common interests), engendering new economies that are global in reach, but unlike today's huge multinational corporations in scale.
These new network-centric businesses would operate within their own global, but bounded, niche. In such cases, the business transactions might indeed occur in whatever language was favored by the community.
Posted by: Kevin Farnham at May 2, 2006 2:28 PM
This is a very interesting set of stats. Another way of looking at the blogosphere would be the degree of global connectivity. Many of the Japanese, Chinese and small language blogs might be less likely to be linked to, translated or referenced by blogs of other languages than would Spanish, English and Arabic blogs since they would have a larger number of countries where people are bilingual in that language.
At any rate, once there is really powerful, really accurate translation software available and people can read and respond to the ideas and thoughts of people in just about any language the real fun will begin. I forecast 3 years of global flame wars followed by 2 years of global sulking ending in a state of general grumbling tolerance.
Posted by: Chuck the Lucky at May 2, 2006 2:33 PM
David,
Wouldn't an analysis of content length give a more accurate perspective, since cellphones are probably used for more entries with less content in some languages? Of course, weighting the number of characters for each language would give another variation. Just an idea or two... Thanks for the enlightenment as is, anyhow.
Posted by: HullaBaloo at May 2, 2006 3:27 PM
55% of people maintain their blogs after 3 months. Sadly, I find that statistic still low. Perhaps new bloggers get disheartened 3 months later when very few visitors have dropped by to leave a comment. I encourage everybody to leave more comments and spur on one another to maintain a healthy blogosphere.
Posted by: Techie at May 2, 2006 9:55 PM
I'm interested in your classification of tagging as it relates to default tags/categories.
How did you arrive at a list of default tags? You mention 2 default categories, but not the one WordPress uses, for instance. Which others did you use?
I'm also assuming that once a blogger has started to categorize his/her/its posts you start to count them, even if they may simply have changed the default category and might still be defaulting, but just to a category that's not the tool's out-of-the-box default. Hard to deal with that kind of defaulting, but it would be interesting to know what ratio of that 47% were being *actively* tagged.
Posted by: cori at May 3, 2006 4:03 AM
I have written an entry about the language issue in my blog with the title, "Not many blogs in South Asian languages". I have identified 5 reasons behind the lack of posts from South Asian languages although Bengali and Hindi are among the top 7 languages in the world.
"Google Adsense does not support any of the South Asian languages and that is why those who want to earn money by blogging blog in English."
I think this is the most important obstacle for bloggers in South Asia.
Wow! 1 percent for the most spoken language in Europe.
Posted by: marketing-blog.biz at May 3, 2006 1:47 PM
David,
thanks a lot for the stats.
I was wondering: Can you see from your stats how many weblogs are actually active? For example how many weblogs have been posted to within the last 1/2/3 months?
A lot of the "forgotten" weblogs probably were sort of test runs, right? So this figure doesn't tell about the level of activity...
Cheers!
Posted by: Peter Bihr at May 5, 2006 10:54 AM
Perfect Hype! Good Luck for further growth!
Posted by: Xenoar at May 5, 2006 3:37 PM