Recently, Tristan Louis wrote two very interesting posts comparing Google's, Yahoo's, and Technorati's blog search results, counting references to the Technorati Top 100 bloggers as well as to the long tail of 11.5 million bloggers out there. If you haven't read the posts, please go over and have a look at both of them. There's a lot of data and analysis in there, and I think Tristan draws some thought-provoking conclusions from it.
However, I believe that Tristan's analysis raises a question that hasn't been asked yet: How accurate are the numbers that search engines report about the size of their result sets?
We put a lot of faith in the numbers that search engines report when trying to guess how popular something is. Google reports today that there are "about 624,000" results for "long tail". Yahoo reports "about 779,000" results. People quote these numbers as accurate statistics, and Tristan is using these numbers to do some comparative analysis of the coverage of Google's, Yahoo's, and Technorati's indexes. However, I'm having difficulty ascertaining the accuracy of these numbers. I've listed some examples below, and a simple how-to so that you can check for yourself with your favorite searches.
My questions about Tristan's conclusions are not with his analysis, but with the underlying data that he starts with.
For example, when you search for all the results for "Tristan Louis" on Google, it reports "about 575,000". However, you can only navigate through 703 results of the entire set. Perhaps this limit exists to keep their indexes small and in RAM (which means they can fit more indexes onto a single machine). Perhaps, from a user (and business) perspective, their testing shows that almost no one except researchers goes past the first 5 pages of results.
But if you can only view 703 results of about 575,000, where are the other 574,297 results? That's only about 0.12% of the search results that the estimate claims. Where's the missing 99.88% of the search results?
Yahoo search says that there are 890,000 results for Tristan Louis.
However, I can only see 1,000 results. That's only about 0.11% of the results that the estimate claims, nearly the same ratio of viewable to estimated results as Google. Where are the other 889,000 results?
I don't know whether Tristan's analyses are correct, or if they simply reflect the low viewable-to-estimated results ratios of Google's and Yahoo's search results. I would love to hear more from Yahoo and Google explaining the methodology behind their estimated results, and how users can access the full result sets for completeness and, frankly, for objective verification.
To be fair, these same questions must be asked of Technorati's results. Searching Technorati for "Tristan Louis" currently shows 566 posts. Now, that's far fewer than Google's or Yahoo's estimated results, but not far from their viewable results. Technorati's results are sorted by time by default, so when you traverse the result set to the last page (results 560-566), you see the 566th result, which is the earliest post in the timeline (250 days ago, as of the time of this post) that Technorati indexed and that matched the search term. Thus 100% of the reported results (at least in this example) are viewable, giving a viewable-to-reported results ratio of 1.
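For anyone who wants to redo the arithmetic with their own searches, here is a minimal sketch in Python. The counts are the ones quoted in this post; the `Coverage` class, its field names, and the `summarize` helper are hypothetical names made up for illustration, not anything the search engines provide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coverage:
    """One search engine's numbers for a single query (hypothetical helper)."""
    engine: str
    reported: int  # result count the engine claims for the query
    viewable: int  # number of the last result you can actually page to

    @property
    def ratio(self) -> float:
        # Viewable-to-reported ratio: 1.0 means every claimed result is reachable.
        return self.viewable / self.reported

    @property
    def unreachable(self) -> int:
        # Claimed results that cannot be reached by paging through the set.
        return self.reported - self.viewable


# Counts for "Tristan Louis" as quoted in this post (June 2005).
SAMPLES = [
    Coverage("Google", reported=575_000, viewable=703),
    Coverage("Yahoo", reported=890_000, viewable=1_000),
    Coverage("Technorati", reported=566, viewable=566),
]


def summarize(samples):
    """Return one human-readable line per engine."""
    return [
        f"{s.engine}: {s.viewable:,} of {s.reported:,} viewable "
        f"({s.ratio:.2%}), {s.unreachable:,} unreachable"
        for s in samples
    ]


if __name__ == "__main__":
    for line in summarize(SAMPLES):
        print(line)
```

To try it with another search, swap in the reported count and the number of the last result you could reach by hand.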
Here are the steps in the experiment. You can try them yourself to repeat and verify the results above, and to see what viewable-to-reported ratios you come up with on each search engine:
For Google:
For Yahoo, here are the steps:
For Technorati, here are the steps:
I hope that this initiates some discussion about these issues. I'm frankly interested in making sure that researchers like Tristan are accurately comparing apples to apples, and I'm all for additional transparency and verifiability in the results that all search engines provide. Am I missing something here? Can someone from Google or Yahoo help me to understand why their reported results are sometimes 1000 times larger than their viewable results? I look forward to being educated.
Technorati Tags: blogosphere, blogs, google, event horizon, search, search engine, statistics, stats, technorati, yahoo
Posted by dsifry at June 22, 2005 03:03 AM