Menu languages: (Articles in Finish can be found here)
When Google went public, the founders Sergey Brin and Lawrence Page suddenly joined the ranks of the world's wealthiest men.
Anyone wishing to imitate their success would first have to get hold of a gigantic computer and then create a catalog of all the world's websites. There are currently ...
This is an article from the book "Five-minute mathematics" by Ehrhard Behrends which was published in 2008 by the American Mathematical Society (AMS). It is reproduced here with the kind permission of the AMS.
When Google went public, the founders Sergey Brin and Lawrence Page suddenly joined the ranks of the world's wealthiest men. Anyone wishing to imitate their success would first have to get hold of a gigantic computer and then create a catalog of all the world's websites. There are currently about twenty billion web pages. (To get an idea of the size of this number, note that the distance along the Earth's surface from the North Pole to the South Pole is about 20 billion millimeters. For each page, one has to create an index of all terms of interest. That is surely a time-consuming task, but for a team of talented programmers, it is no insuperable challenge. Of course, all the really hard work is left to the computer!
Suppose that all these tasks have been satisfactorily completed. Unfortunately for you, you are still in no position to offer a competitive search engine. The reason is the enormous size of the Internet. Given a particular query, such as all sites containing both ``USA'' and ``hurricane,'' it is not difficult to produce all the web pages with the two search terms. The problem is how to present the results, since typically a search returns from hundreds of thousands to millions of hits. No one has the time or patience to sift through all those pages. One would like to have the most important pages presented first.
Those who have done much ``googling'' know that Google has done a surprisingly good job of solving this problem, for generally one finds what one is looking for among the first dozen or so pages. The secret is in the correct definition of ``important.'' Google's great idea was to consider a web page important if many important web pages refer to it. If one thinks of a web page as a point, with two points connected by an arrow if the page at the arrow's tail has a link to the page at the arrow's head, then the Internet can be pictured as a network of about two billion points connected by many more billions of arrows.
Pages at which many arrows terminate are ``important,'' particularly if the arrows emanate from ``important'' web pages. If we number the web pages 1,2,... and assign a weight Wi to page i as a measure of its importance, there should be a relationship among these numbers. For example, if page 2 is linked to from page 5, and page 5 is linked to by three pages altogether, then page 2 ``inherits'' one-third of the importance of page 5.
Perhaps page 7 also links to page 2 and has ten outgoing links altogether. Then page 2 inherits one-tenth of W7. Suppose that is all: no other pages link to page 2. That leads to the equation
W2=W5/3+W7/10.
Most web sites are connected in complex ways, and altogether, we end up with twenty billion equations in the twenty billion unknowns W1, W2. ....
The math you learned in school is not going to be of much help here. For most readers, two equations in two unknowns is the most they ever had to deal with. But even for professional mathematicians this number is too large to be handled by the standard methods, even given that some optimization problems with several hundred thousand or even several million variables can be solved.
Another path leads to the goal: one takes a ``random walk.'' Imagine an Internet addict, let us call her Isabella, who starts at, say, the web page ams.org. Isabella then searches the page at random for a link to another page and clicks on it. Arriving at that page, she sees another set of links, and randomly clicks on one. Thus a random walk through the worldwide web has begun, in which out of logical necessity, the ``more important'' pages are visited more frequently than the ``less important'' ones. It is a remarkable fact that the relative frequencies of the visits satisfy the system of equations described above. In short, the importance of a page can be determined by measuring the percentage of time that a surfer is to be found there. However, doing these calculations doesn't seem any easier than solving the initial problem. And if an exact solution is required, that is indeed the case. However, an approximate solution (in which, say, you would be satisfied with five decimal places of accuracy) can be had in several hours. Now the search engine is fully functional, since once you have the levels of importance, the rest falls into place: simply search for all pages with ``USA'' and ``hurricane'' and return them in order of importance.
That is a first approximation to Google's methodology. The details and refinements are complex and as secret as the formula for Coca Cola. As an example of a necessary refinement, we might mention that our random surfer Isabella would have a problem if she arrived at a site without any further links. To avoid such a scenario, at each step in the search, the links on a page are ignored with a certain probabilityp and one goes instead to a randomly selected web page somewhere in the Internet. (That may not be easy for Isabella, but with a directory of all web pages it is no problem.) It is rumored that Google uses the probability p=0.15, which seems to be a good value, based on experience.
Google is also constantly working to minimize the effects of ``Google bombing,'' which is a strategy of puffing up the importance of one's web site by artificially increasing the number of links to one's own pages. Google's competitors have also not been asleep at the switch. There is constant work underway at developing new approaches and calculational techniques for evaluating the ``importance'' of websites. For different users and different combinations of search terms, ``important'' can mean something quite different. But adding in such complexity then leads to the problem of whether - as with Google - the results of a search can be returned within a fraction of a second.