April 20, 2003

Easy News Topics

Last week, Paolo Valdemarin and Matt Mower released their specification of Easy News Topics 1.0 (ENT), an RSS 2.0 module that adds topic and categorization information to an RSS feed.  I committed to getting back to them (and others) with a review and some commentary on the approach.

The good news: As a format, ENT is easy to understand, easy for application developers to implement, and pretty easy to parse.  Kudos to Matt and Paolo for coming up with a design that is simple but extensible. 

Now the bad news:  I'm worried about two issues.  First is the problem of self-categorization.  ENT presupposes that authors can successfully create microcontent with the following properties:
  1. The content can be placed in one or more categories
  2. The author is qualified to categorize the content correctly
  3. The author's categories have meaning to the reader
Second, self-categorization runs into a larger problem: categorization across feeds.  This is a problem of definitions - one person's rebel is another person's revolutionary.  Even with ENT's inclusion of clouds, which are (potentially) external topic maps that create self-consistent maps of the world, we still have the problem of intentional or unintentional misunderstanding and misreading of metadata like categories.  That leads me to think that self-categorization is extremely difficult to make work on a large scale.

A good example of this failure to scale is the history of web page metadata tags, especially the keywords tag.  At first, people acted in a trustworthy manner and put quality information in the META tags of their HTML documents.  This was ostensibly useful because the aggregators of that time, search engines, could use it to sort and categorize data more effectively, and to weight documents more accurately during search queries.  But soon, bad-faith actors entered the picture and attempted to influence search results by putting false or misleading metadata in their META tags.  As a result, META keyword information (self-categorization) has almost completely dropped out of the ranking algorithms in modern search engines.

However, search engines still categorize data correctly, to a large extent.  Newer aggregate algorithms like PageRank can effectively categorize larger documents, often by inbound links or by vocabulary similarity.  Two documents that share a similar vocabulary are often similar, and often belong in the same categories.  So perhaps the answer is to let the aggregate, collective human filters across the web (aka readers) help create truly accurate categories for microcontent.  Of course, that collaborative filtering takes time, and thus negates some of the power of blogs: their conversational qualities.  If you had to wait 12 hours to find out what 20 other readers thought of some piece of content on the web, would that be worth it to you?  I would take that tradeoff.  But what if that timeframe were one week?  Or one month?  You'd end up missing the conversation itself while you waited for the collaborative filtering process to tell you that you should participate in it.
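To make the vocabulary-similarity idea concrete, here is a minimal sketch, assuming the simplest possible approach: count the words in each document and compare the counts with cosine similarity.  The function names (vocabulary, cosine_similarity, similarity) are made up for illustration, and this is not a description of how any particular search engine works.

```python
import math
import re
from collections import Counter


def vocabulary(text):
    """Count lowercase word occurrences in a piece of text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def similarity(text_a, text_b):
    """Hypothetical helper: vocabulary similarity of two raw texts."""
    return cosine_similarity(vocabulary(text_a), vocabulary(text_b))


if __name__ == "__main__":
    post_a = "RSS feeds can carry topic metadata for every item in the feed"
    post_b = "Topic metadata in RSS feeds helps readers filter every feed"
    post_c = "My cat slept on the sofa all afternoon"
    print(similarity(post_a, post_b))  # two posts about RSS topics
    print(similarity(post_a, post_c))  # an unrelated post
```

Note that nothing here depends on what the author claims the post is about.  The score comes from the words actually used, which is why it is harder to game than a keywords tag, though certainly not impossible.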

So, I think we're in a bit of a knot.  In the end, this won't really be solved until we can place a certain value on a particular author's reputation - does blogger A tend to self-categorize in a way that I find useful, while blogger B does not?  I think there is no free lunch, and a file format standard like ENT won't remove the problems of bad-faith actors or of self-categorization.  But it is an easy-to-parse, easy-to-implement standard that will allow us to further explore these fundamentally social questions.
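One way to picture the reputation idea is as a per-reader tally of how often each author's categories turned out to be useful.  The sketch below is purely illustrative: the ReputationTracker class, its methods, and the smoothing choice are all assumptions of mine, not part of ENT or of any existing aggregator.

```python
from collections import defaultdict


class ReputationTracker:
    """Hypothetical per-reader record of how useful each author's categories were."""

    def __init__(self):
        # author -> [useful_count, total_count]
        self._tallies = defaultdict(lambda: [0, 0])

    def record(self, author, was_useful):
        """Note one categorized post and whether the reader found the label useful."""
        tally = self._tallies[author]
        tally[1] += 1
        if was_useful:
            tally[0] += 1

    def score(self, author):
        """Smoothed fraction of useful labels; authors never seen score 0.5."""
        useful, total = self._tallies.get(author, (0, 0))
        return (useful + 1) / (total + 2)


if __name__ == "__main__":
    tracker = ReputationTracker()
    for _ in range(3):
        tracker.record("blogger A", True)   # A's categories matched my interests
        tracker.record("blogger B", False)  # B's categories did not
    print(tracker.score("blogger A"))
    print(tracker.score("blogger B"))
    print(tracker.score("blogger C"))       # never seen before
```

The smoothing (adding 1 and 2) keeps a single good or bad post from deciding an author's score, and it gives unknown authors a neutral starting point.  It does nothing about the waiting problem above: a score like this still takes time to accumulate.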
Posted by dsifry at April 20, 2003 11:48 PM