Ad verba per numeros


Sunday, September 8, 2013, 12:48 PM
In my previous post I argued that there is no single parent of Web search engines but, instead, a number of pioneers that worked unaware of each other and eventually produced virtually identical achitectures.

Nevertheless, the researchers I surveyed in that post pioneered the first age of search engines; that is, the now (literally or de facto) defunct Altavista, Lycos or Excite. Google, in contrast, is an example of the second age of search engines to which Bing, Yandex or Baidu also belong.

It's not my aim to provide a lengthy explanation of the differences between Google (and family) and prior search engines since, first of all, Google is today very different from Google in 1998. Instead, what I'll try to do is to briefly explain why the idea by Brin & Page was both brilliant and it has had an enormous impact.

To start with, it's extremely difficult to explain in words how broken Web searching was back in 1997. If you are old enough (i.e. you were a user of Altavista or Lycos or Excite) you surely remember it. If you are not that old (i.e. Google has been your search engine since you can remember) I'll provide you a couple of examples (taken from the original Brin & Page papers).

First example: most major search engines of the time were unable to find themselves. That is, the first result you obtained when issuing the query "excite" in Excite, or "altavista" in Altavista was quite unlikely to be www.excite.com or www.altavista.com.

Second example: when submitting the query "bill clinton" (you know, the husband of Hillary Clinton, the POTUS at the time) a major search engine produced as one top result a webpage like this one.

So, we had search engines incapable to find themselves and that, at the same time, ignored that maybe www.whitehouse.gov was a better result for "bill clinton" than a crappy page mocking on him.

What was the problem with those search engines?

The short explanation is that they were trying to apply Information Retrieval to the Web. That had plenty of sense since, at the time, IR was very mature. In fact, the discipline was almost 40 years old and, therefore, all that know-how should apply nicely to the Web. Unfortunately, it simply did not work.

Hence, Web search engines devoted tons of energy to refine their respective "secret sauces" (I mean, ranking algorithms). Those "sauces" involved exploiting features from webpages ranging from size, to usage of headings, appearance of keywords in links, headings and other parts of the page, etc. Needless to say, secret sauces improved little (if anything at all).

So, again, what was the problem with mid-1990s search engines?

I hate to repeat myself but it was applying IR to the Web because, unfortunately, the Web is not a text collection. As Marchiori said, the Web is not made of text documents but of hypertext documents and, therefore, search engines should focus not on the textual features but in the "hyper" features.

Indeed, Marchiori suggested that webpages should be ranked on the basis the links they received and the rankings of the pages linking to them while mostly obviating the textual content.

Such an idea sounds crazy but it's simply mindblowing. Unfortunately, no matter the greatness of this idea Marchiori is not the parent of the modern search engine since he stated that such an approach was unfeasible in practice.

If you are a PhD student (or even a grown-up researcher) you should learn something from that: don't, make, bold, claims. A huge problem with bold claims is that sooner or later someone is going to make a fool on you.

In this particular case it was rather soon. Marchiori suggested in 1997 that search engines should focus on links while avoiding text while, at the same time, saying that such a thing was unatainable. In 1998 Kleinberg and Brin & Page described two different algorithms that working solely from links were able to rank webpages much better than state-of-the-art search engines.

The paper by Kleinberg describing HITS algorithms slightly predates the work by Brin & Page and, in fact, they cite it. However, there are a number of differences between HITS and PageRank.

First of all, Kleinberg did not intend to build a search engine but, instead, to produce a better ranking of a somewhat small set of pages obtained from a given search engine.

Secondly, Kleinberg algorithm relied on two characterizing features of webpages represented by two different (albeit mutually reinforcing) scores. Pages could be "authorities" (i.e. pages with highly relevant contents given a topic) or they could be "hubs" (i.e. pages with plenty of links to "authorities" in a given topic). Simply stated, HITS is an iterative algorithm that assign to pages in a graph (not the whole Web graph) both a hub and an authority score. The eventual ranking would be made on the basis of the authority score and, hence, a selection of highly relevant pages would be provided to the user.

In contrast, PageRank assumes that web pages are described by just one score which, quite annoyingly, is also named PageRank. The PageRank of a webpage dependes on the number of incoming links but also on the PageRank of web pages issuing those links. Besides, a webpage's PageRank is equally distributed among the outgoing links. This algorithm is, as HITS, iterative and it has got a number of nice properties:

  1. It's quite fast (i.e. the number of iterations is relatively small even for large graphs).
  2. Second, the initial scores are mostly irrelevant and the eventual ranking is virtually the same without regard to those initial values.
  3. Third, the global PageRank in the graph does not change accross iterations but, instead, it is distributed among webpages.
  4. A webpage PageRank can be considered a nice proxy for the chance of a user reaching that webpage by randomly visiting links. In other words, pages with larger PageRank values are "central" in the Web.
Certainly, as of today PageRank is just one component of Google's ranking methods but back in 1998 it was the core of Google's search engine and it was a huge improvement when compared against state-of-the-art products. However, according to legend, none of major search engines of the time were interested in Brin & Page method, neither Yahoo! that at the time was not in the search business. Therefore, Brin & Page were "forced" to start their own bussiness, the rest is history.

I would not like to close this post, however, without referring to some of the people that made Information Retrieval a mature area before the advent of Web-IR: Luhn, who is the solely inventor of information retrieval systems and automatic summarization, both achievements occuring in IBM during the 1950s; Maron & Kuhns, who invented the list of relevant documents in the 1960s; and, of course, the demigods and demigoddess of IR Gerard Salton, Stephen Robertson and Karen Sparck-Jones.

As Newton nicely put it,

If I have seen further it is by
standing on ye sholders of Giants.

As usual you can find me at Twitter: PFCdgayo



Back Next