State of the Blogosphere, April 2006 Part 2: On Language and Tagging
Late last month, I gave a high-level overview of the growth of the blogosphere, covering the overall size of the data sets that Technorati tracks, the number of new blogs created each day, the number of posts per day, and the issue of splogs or spam blogs.
To recap, here are the highlights of Part 1:
- Technorati now tracks over 37.3 Million blogs
- The blogosphere is doubling in size every 6 months
- It is now over 60 times bigger than it was 3 years ago
- On average, a new weblog is created every second of every day
- 19.4 million bloggers (55%) are still posting 3 months after their blogs are created
- Technorati tracks about 1.2 Million new blog posts each day, about 50,000 per hour
Strong International Growth
Back in April 2005, Technorati started automatically tracking the primary language of each blog that we tracked. We did this so that we could easily allow people to filter out posts in languages other than their native language. This is available in a pull-down menu on every search results page. We also wanted to get some idea of where the worldwide growth of blogging was taking place, and what trends we could glean from the data.
There are three very important caveats in the data sets that I’m going to describe below. The first is that we are using automated language analysis software (based on languid), and it may have bugs, thus over- or undercounting a particular language or group of languages. We’re going to be continually improving the capabilities of this software, but we are pretty confident in its ability to work reliably, especially over the large data sets that Technorati tracks (over 35 million blogs at this time, and over 1.2 million posts each day). Second, we believe that we are grossly undercounting the Korean blogosphere, mostly because the largest Korean blog and hompy services (like Cyworld or Planet Weblog) are not being indexed by Technorati at this time. In addition, we believe that we’re somewhat undercounting the French blogosphere, in particular because our indexing of skyblog is poor. We’d love to rectify this – if anyone at these (or other) blogging services is interested in being indexed, please drop me a line. Last, Japanese bloggers appear to write shorter posts more often. This could be a result of blogging from mobile phones, and may be skewing the results, given that we are tracking the total number of posts in this analysis.
Another key point to remember is that language breakdown does not necessarily imply a particular country or regional breakdown. For example, Spanish and English are spoken in a large number of countries around the globe – and this analysis doesn’t attempt to determine which country a blogger is writing from – only the primary language of her post.
The following charts show the relative volume of blog posts based on the primary language of the post, on a month by month basis:
Here’s a more detailed breakdown of the last 6 months of data:
Something that may come as a surprise (at least to the English-speaking world) is that English isn’t the biggest language of the blogosphere. In fact, English isn’t even the primary language of one third of all posts that Technorati tracks anymore. Another interesting finding is that the Chinese blogosphere, which grew significantly in 2004 and 2005 (driven by the launches of MSN Spaces in Chinese and Bokee.com, with Chinese peaking at 25% of all posts in November 2005), seems to be slowing down somewhat this year.
One of the further topics for research is to investigate the language breakdown of posting activity based on blog hosting site or software type. My hypothesis is that various language communities have often grown on one service or another, often for viral or historical reasons, showing a disproportionate language breakdown for that service. For example, livejournal.com hosts a large number of Russian language journals/blogs, and MSN Spaces hosts an overrepresentation of Chinese language blogs.
Tags and Categories
Tagging, the act of categorizing posts with simple words or phrases, continues to grow, and the number of posts with tags or categories has grown past the 100 Million mark since Technorati began tracking tags in January of 2005.
Nearly half (47%) of all blog posts have an author-generated category or set of tags associated with the post. For this analysis, Technorati excluded generic or default categories, like “General” or “Diary”, which some services put into each post if the author doesn’t specify a particular tag or category. We only counted posts that used a non-default tag or category.
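To make the counting rule concrete, here is a minimal sketch of how a post could be classified as tagged once default categories are excluded. It is an illustration only, not Technorati's actual code: the names `DEFAULT_CATEGORIES`, `has_author_tag`, and `tagged_share` are hypothetical, and the default list holds only the two examples named above.

```python
# Hypothetical list of default categories; only "General" and "Diary"
# are named in the post, so a real list would likely be longer.
DEFAULT_CATEGORIES = {"general", "diary"}


def has_author_tag(tags):
    """Return True if at least one tag is not a generic default category."""
    return any(
        t.strip().lower() not in DEFAULT_CATEGORIES
        for t in tags
        if t.strip()
    )


def tagged_share(posts):
    """Fraction of posts carrying at least one non-default tag or category."""
    if not posts:
        return 0.0
    tagged = sum(1 for tags in posts if has_author_tag(tags))
    return tagged / len(posts)


# Each post is represented only by its list of tags/categories (made-up data).
sample_posts = [
    ["General"],            # default only: not counted
    ["blogging", "stats"],  # author-chosen tags: counted
    [],                     # no tags: not counted
    ["Diary", "travel"],    # one non-default tag is enough: counted
]
share = tagged_share(sample_posts)
print(f"{share:.0%} of sample posts are tagged")
```

A post counts as soon as it has one non-default tag, which matches the rule of counting only posts that used a non-default tag or category.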
Many bloggers use this tagging capability to help get their content found by people who are searching for a particular topic, even if that topic isn’t listed as a keyword in the post. Of course, one of the remaining open questions is whether or not that will lead to massive gaming of the system, but current trends suggest that large-scale gaming is not occurring. In fact, my belief is that because tags are built as hyperlinks inside the document, and thus visible to the reader, a strong social pressure to use appropriate tags (or at least to not use inappropriate tags) manifests itself, especially with bloggers who want to cultivate influence and readers.
Clarification: I had a number of questions from people to clarify the tagging statistics. 47% of daily blog posts that Technorati tracks (about 560,000 posts out of the 1.2 Million postings per day) have one or more tag or category associated with the post. Obviously that number fluctuates somewhat given the day and the number of postings tracked that day. Hope that clears things up!
In Summary:
- The blogosphere is multilingual, and deeply international
- English, while being the language of the majority of early bloggers, has fallen to less than a third of all blog posts in April 2006.
- Japanese and Chinese language blogging has grown significantly.
- Chinese language blogging, while continuing to grow on an absolute basis, has begun to decline as an overall percentage of the posts that Technorati tracks over the last 6 months
- Japanese, Chinese, English, Spanish, Italian, Russian, French, Portuguese, Dutch, and German are the languages with the greatest number of posts tracked by Technorati.
- The Korean language is underrepresented in this analysis
- Language breakdown does not necessarily imply a particular country or regional breakdown.
- Technorati now tracks more than 100 Million author-created tags and categories on blog posts.
- The rel-tag microformat has been adopted by a number of the large tool makers, making it easy for people to tag their posts. About 47% of all blog posts have non-default tags or categories associated with them.
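For readers unfamiliar with rel-tag: a tag is an ordinary hyperlink carrying `rel="tag"`, and the tag itself is the last path segment of the link's URL rather than the link text. The sketch below shows one way such tags could be read out of a post using Python's standard library. It is an assumption-laden illustration, not how Technorati's crawler works; `RelTagParser`, `extract_rel_tags`, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, unquote_plus


class RelTagParser(HTMLParser):
    """Collect tags from <a rel="tag" href="..."> links (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # rel may hold several space-separated values, e.g. rel="tag nofollow".
        rel_values = (attr.get("rel") or "").lower().split()
        href = attr.get("href")
        if "tag" in rel_values and href:
            # The tag is the last path segment of the URL, not the link text.
            path = urlparse(href).path.rstrip("/")
            last_segment = path.rsplit("/", 1)[-1]
            if last_segment:
                self.tags.append(unquote_plus(last_segment))


def extract_rel_tags(html):
    """Return the rel-tag tags found in an HTML fragment, in document order."""
    parser = RelTagParser()
    parser.feed(html)
    return parser.tags


# Made-up sample post body; the third link has no rel="tag" and is ignored.
sample_html = (
    '<p>Filed under '
    '<a href="http://technorati.com/tag/blogging" rel="tag">blogging</a> and '
    '<a href="http://technorati.com/tag/search+engine" rel="tag">search engine</a>. '
    '<a href="http://example.com/about">About</a></p>'
)
tags = extract_rel_tags(sample_html)
print(tags)
```

Because the tag is a visible link in the page, readers can see exactly which tags a post claims.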
Technorati Tags: blogging, blogosphere, blogs, blogsearch, charts, international, language, microformat, microformats, posts, search, search engine, sotb, sotb2006, statistics, stats, study, tags, technorati, technoratitag, weblog, weblogs
Thanks again for your reports and what a surprise to discover that English is not the major Lingua Franca of the blogosphere. Cheers.
I’d be very interested to see a more detailed breakdown of tag use. Are there 100M posts with tags, or 100M tags on all posts? The graph indicates one, the text the other. Also, it’d be fascinating to know how many distinct tags are in use. If the number of tags exceeds the number of words in the languages involved, we’re recreating the “tower of babel”. If this is true, then the sole hope for sense to be made of this or to emerge is for tools supporting folksonomies and the exploration of the blogosphere or bigger, the “annotated web,” according to one’s social spheres of trust.
We’re building activeweave’s stickis precisely to address this.
State of the blogosphere – Japanese top, Chinese on level with English
David Sifry, the boss of blog search engine Technorati, has released his monthly stack of info on the…
Song of the ‘Sphere: Tagging, Japanese
I really think so… or at least, we were alerted by David Sifry that:
English isn’t the biggest language of the blogosphere. In fact, English isn’t even the primary language of one third of all posts that Technorati tracks anymore.
In terms of posts…
Fascinating. Many thanks.
47% of all blog posts are tagged
David Sifry provides more Technorati stats, this time with a little more info on the tagging front….
I wonder if the Chinese slowdown in blogging is related to the increasingly overt censorship of the Internet by Chinese authorities. Blogging is best when it has candor and passion–you can’t do that if you have to second guess yourself because you know you are being watched.
I would say that not being able to self-express could be bad for a society (and economy?).
Can your software distinguish human blogs and spam blogs?
very very interesting. except, it’s called “blogtopia,” and yes! i coined that phrase!
Site News
My favorite foreign guy is in town visiting his girlfriend. He is a good guy and I am hoping his girlfriend stays nice so I can attend a wedding someday. The visit requires some drinking and barhopping, so expect blogging t …
One Internet or Many?
One theme in the book is that an evolving balkanization of the internet is often driven by consumer preference. A good example is the surprising decline in the use of the English language on the Web. A quote from Ch. 3 The Economist confidently stated i…
A big thank you! Amazing info. Didn’t have a clue about the extent of Japanese/Chinese blogging. I appreciate your compilation of this data.
I think this may be related to the difference between blogging and business. For business, a common worldwide language has been highly advantageous to participants in the global economy. I do remember articles in the Economist about English becoming the global language of business.
Then, too, there is the fact that the keywords of computer languages and pretty much all documentation of computer languages are in English. I’ve wondered if there might be a market for computer books written in Indian or Chinese, but none of the major software book vendors seem to wonder about such things, so that market must not exist (or it hasn’t to date).
Blogging is an entirely different realm. You are not speaking to the entire world. Rather, you are speaking to a collection of individuals who share some similarity with you, shared experience, or a common interest. These individuals are in essence “plucked” out of the ocean of all Internet actors. Bloggers and subscribers find each other through tags and searches.
There is no real need for bloggers to adopt a common language. For business, I don’t think this has changed, not yet anyway. But perhaps the networks and mini-communities that are being formed by blogging will become a breeding ground for a new kind of entrepreneurship that works with others who speak the same language (I mean “languages” in a broader aspect here, languages of various realms, including the “language” of common interests), engendering new economies that are global in reach, but unlike today’s huge multinational corporations in scale.
These new network-centric businesses would operate within their own global, but bounded, niche. In such cases, the business transactions might indeed occur in whatever language was favored by the community.
This is a very interesting set of stats. Another way of looking at the blogosphere would be the degree of global connectivity. Many of the Japanese, Chinese and small language blogs might be less likely to be linked to, translated or referenced by blogs of other languages than would Spanish, English and Arabic blogs since they would have a larger number of countries where people are bilingual in that language.
At any rate, once there is really powerful, really accurate translation software available and people can read and respond to the ideas and thoughts of people in just about any language the real fun will begin. I forecast 3 years of global flame wars followed by 2 years of global sulking ending in a state of general grumbling tolerance.
David,
Wouldn’t an analysis of content length give a more accurate perspective? Cellphones are probably used for more entries with less content in some languages. Of course, weighting the number of characters for each language would give another variation. Just an idea or two… Thanks for the enlightenment as is, anyhow.
Japanese’s share of the blog world
An entry saying that there are a great many Japanese-language blogs.
Sifry’s Alerts: State of the Blogosphere, April 2006 Part 2: On Language and Tagging: Japanese and Chinese language blogging has grown significantly.
One third of Technorati’s index…
blog in Japanese, so many.
From /.Japan- State of the Blogosphere, April 2006 Part 2: On Language and Tagging…
55% of people maintain their blogs after 3 months. Sadly, I find that statistic still low. Perhaps new bloggers get disheartened 3 months later when very few visitors have dropped by to leave a comment. I encourage everybody to leave more comments and spur on one another to maintain a healthy blogosphere.
I’m interested in your classification of tagging as it relates to default tags/categories.
How did you arrive at a list of default tags? You mention 2 default categories, but not the one WordPress uses, for instance. Which others did you use?
I’m also assuming that once a blogger has started to categorize his/her/its posts you start to count them, even if they may simply have changed the default category and might still be defaulting, but just to a category that’s not the tool’s out-of-the-box default. Hard to deal with that kind of defaulting, but it would be interesting to know what ratio of that 47% were being *actively* tagged.
Wow! 1 percent for the most spoken language in Europe.
David,
thanks a lot for the stats.
I was wondering: Can you see from your stats how many weblogs are actually active? For example how many weblogs have been posted to within the last 1/2/3 months?
A lot of the “forgotten” weblogs probably were sort of test runs, right? So this figure doesn’t tell about the level of activity…
Cheers!
Perfect Hype ! Good Luck for further growth!
Spanish is the fourth most used language in the blog world
According to the latest statistics published by David Sifry of Technorati, the four most used languages in the blog world are: Japanese (37%), English (31%), Chinese (15%) and Spanish (3%). Part of what is seen in the study of trends and…
Singapore throwing a chance away?
Smart Mobs has an article today about China’s under-15s, and their rise to account for 30% of China’s on-line activity – and China is itself the second-biggest internet population (the story is taken from a piece in Shanghai Daily. A Chinese-language blo
Language and Tagging in the Blogosphere; Dave’s Newest Analysis
Technorati founder has published the second of his state of the blogosphere posts 2. Dave has concentrated on geographical blogging activity and languages. The quick summary is English, while being the language of the majority of early bloggers, has fa…
Multilingual MySpace
Via TechCrunch comes the news that social networking site MySpace is to be launched in non-English…
Dell Customer Advocates in the Blogosphere
Before we established a presence in the blogosphere, we had been reading your thoughts on Dell. Some…
English language only 39% of international blogs
The latest Technorati State of the Blogosphere report shows that 39% of all blog postings are in english. Japanese is the second-most popular language, at 31% and China third with 12%. Here’s Technorati’s pie graph for June ’06: Dave Sifry…
Are there no blogs in Korea?
This post was written as a trackback to Channy’s Who are korean bloggers? and Sifry’s State of the Blogosphere, April 2006 Part 2: On Language and Tagging. Hmm.. I, who fall under #3, along with the many tistory family members and TatterTools…