Word Bursts and Trend Spotting
Math geekout time:
An interesting article in New Scientist talks about how tracking changes in word frequency can point to emerging trends. For those of you with a mathematical bent, this is a rough approximation of what LSI (Latent Semantic Indexing) does when applied to a set of documents over time. LSI allows you to "reduce the dimensionality" of the word frequency lists by taking advantage of the fact that some words and phrases are synonyms, or are otherwise related to each other.
The big problem with LSI over large data sets (like the web) is that the calculation it relies on, the singular value decomposition (SVD), gets more and more expensive to compute numerically as the document set grows.
The "word burst" idea gets around all of that because it just follows individual word or phrase frequency trends. It’s an interesting idea, something that would be cool to implement… But not for now. Right now, other tasks have higher priorities. Could be a fun weekend project, though…
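For the curious, a minimal sketch of the "follow individual word frequencies" idea might look like this in Python. It is only a simple ratio heuristic, not the method from the article: a word counts as "bursting" if its share of the latest time window is several times its average share in earlier windows. The function names, the thresholds (`ratio`, `min_count`, `smoothing`) and the sample text are all made up for illustration.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its share of all tokens in one time window."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def find_bursts(windows, ratio=3.0, min_count=2, smoothing=1e-3):
    """Return (word, score) pairs for words that spike in the last window.

    windows: list of token lists, oldest first; the last one is "now".
    ratio, min_count and smoothing are illustrative defaults, not tuned values.
    """
    if len(windows) < 2:
        return []  # nothing to compare against
    *history, current = windows
    past = [relative_frequencies(w) for w in history]
    now_counts = Counter(current)
    now = relative_frequencies(current)

    bursts = []
    for word, freq in now.items():
        if now_counts[word] < min_count:
            continue  # ignore words seen only once; too noisy
        # Average share of this word across the earlier windows.
        baseline = sum(p.get(word, 0.0) for p in past) / len(past)
        # Smoothing keeps brand-new words from dividing by zero.
        score = freq / (baseline + smoothing)
        if score >= ratio:
            bursts.append((word, score))
    return sorted(bursts, key=lambda pair: pair[1], reverse=True)


# Toy data: three "days" of text, oldest first.
windows = [
    "the market was quiet and the news was slow".split(),
    "the market was quiet again and the news was slow".split(),
    "the merger news hit the market and the merger talk spread".split(),
]
bursting_words = [word for word, _ in find_bursts(windows)]
```

Common words like "the" show up in every window at about the same rate, so their score stays near 1 and they drop out. A real version would need proper tokenizing, phrase handling and a lot more data per window.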

In my job at Reuters, I have spent time experimenting with Natural Language Processing as a way of making more sense out of unstructured news stories. For the most part, these attempts have been unsuccessful, because our professionally-focussed financial services customers don’t want to leave their news filtering up to nondeterministic software.
That said, I do think the blog community would be much more receptive to, and early adopters of, such an approach to categorizing blogs and filtering them on criteria like sentiment or “buzz”.
I’ve seen tools like Netmood (http://www.pbump.com/netmood/) that let people rate a blog in realtime, thus enabling others to filter a large list of subscribed blogs in a news aggregator based on whether the community finds them interesting at the moment. This kind of 2-stage filtering would also benefit from “word burst” filtering…. And, I do agree it would be a cool thing to add to Technorati!!