Matt Cutts, the former head of Google's Search Spam department, has recently shared his raw notes on an old search engine ranking document Google produced all about PageRank. The documents originate from the time before he joined Google, so makes fascinating reading.
Here is the tweet announcing the documents Matt Cutts found while looking through some old stuff:
Oh man. Looking through old stuff and found Google docs/papers I studied/annotated in '99 from before I even joined. pic.twitter.com/pI97pIdxB6— Matt Cutts (@mattcutts) January 3, 2016
As a word of warning much has changed in Search Engine Optimization techniques since 1999, so you shouldn't rely on any information contained in them.
For ease of reading we have reproduced a copy of the four different texts\pages below, although you will need to look at the original images in the tweet for Matt Cutts' annotations:
The first document gives a summary of the linking structure of the web, along with the terminology used (at least historically) by Google.
2.2 Link Structure of the Web
While estimates vary, the current graph of the crawlable Web has roughly 150 million nodes (pages) and 1.7 billion edges (links). Every page has some number of forward links (outedges) and backlinks (inedges) (see Figure1). We can never know whether we have found all the backlinks of a particular page but if we have downloaded it, we know all of its forward links at that time.
Web pages vary widely regarding the number of backlinks they have. For example, the Netscape home page has 62,804 backlinks in our current database compared to most pages which have just a few backlinks. Highly linked pages are more "important" than pages with few links. Simple citation counting has been used to speculate on the future winners of the Nobel Prize (San95). PageRank provides a more sophisticated method for doing citation counting.
The Reason that PageRank is interesting is that there are many cases where simple citation counting does not correspond to our common sense notion of importance. For example, if a web page has a link from the Yahoo home page, it may be just one link, but it is a very important one. This page should be ranked higher than many pages with more links but from obscure places. PageRank is an attempt to see how good an approximation to "importance" can be obtained just from the link structure.
2.3 Propagation of Ranking Through Links
Based on the discussion above, we give the following intuitive description of PageRank: a page has a high rank if the sum of the ranks of its backlinks is high. This covers both the case when a page has many backlinks and when a page has a few highly ranked backlinks.
2.4 Definition of PageRank
Let u be a web page. Then Let F u be the set of pages u points to and B u be the set of pages that point to u . Let N u =[F u ] be the number of links from u and let c be a factor used for normalization (so that the total rank of all web pages is constant).
We begin by defining a simple ranking, R which is a slightly simplified version of PageRank:
This continues from Document 1 but misses a few pages. We start our copy of the document at the first heading for ease of reading:
2.6 Computing PageRank
The computation of PageRank is fairly straightforward if we ignore the issues of scale. Let S be almost any vector over Web pages (for example E). Then PageRank may be computed as follows:
Note that the d factor increases the rate of convergence and maintains ||R||1. An alternative normalization is to multiply R by the appropriate factor. The use of d may have a small impact on the influence of E.
2.7 Dangling Links
One issue with this model is dangling links. Dangling links are simply links that point to any page with no outgoing links. They affect the model because it is not clear where their weight should be distributed, and there is a large number of them. Often these dangling links are simply pages that we have not downloaded yet since it is hard to sample the entire web (in our 42 million pages currently downloaded, we have 51 million URLs not downloaded yet, and hence dangling). Because dangling links do not affect the ranking of any other page directly, we simply remove them from the system until all the PageRanks are calculated. After all the PageRanks are calculated, they can be added back in, without affecting things significantly. Notice the normalization of the other links on the same page as a link which was removed will change slightly, but this should not have a large effect.
As part of the Stanford WebBase project (PB98), we have built a complete crawling and indexing system with a current repository of 24 million web pages. Any web crawler needs to keep a database of URLs so it can discover all the URLs on the web. To implement PageRank, the web crawler simply needs to build an index of links as it crawls. While a simple task, it is non-trivial because of the huge volumes involved. For example, to index our current 24 million page database in about five days, we need to process about 50 web pages per second. Since there about 11 links on an average page (depending on what you count as a link) we need to process 550 links per second. Also our database of 24 million pages references over 75 million unique URLs which each link must be compared against.
This seems to continue the document, again after missing several pages. Although the page starts half way through a topic, I have included it as it contains some interesting information about content authors.
Meta-Information of this form within a page, it would be problematic since a page author could not be trusted with this kind of evaluation. Many web page authors would simply claim that their pages were all the best and most used on the web.
It is important to note that the goal of finding a site that contains a great deal of information about "Wolverines" is a very different task than finding the common case "wolverine" site. There is an interesting system [Mar97] that attempts to find sites that discuss a topic in detail by propagating the textual matching score through the link structure of the web. It then tries to return the page on the most central path. This results in good results for queries like "flower"; the system will return good navigation pages from sites that deal with the topic of flowers in detail. Contrast that with the common case approach which might simply return a commonly used commercial site that had little information except how to buy flowers. It is our opinion that both of these tasks are important, and a general purpose web search engine should return results which fulfill the needs of both of these tasks automatically. In this paper, we are concentrating only on the common case approach.
5.5 Subcomponents of Common Case
It is instructive to consider what kind of common case scenarios PageRank can help represent. Besides a page which has a high usage, like the Wolverine Access cite, PageRank can also represent a collaborative notion of authority or trust. For example, a user might prefer a news story simply because it is linked directly from the New York Times home page. Of course, such a story will receive quite a high PageRank simply because it is mentioned by a very important page. This seems to capture a kind of collaborative true, since if a page was mentioned by a trustworthy or authoritative source, it is more likely to be trustworthy or authoritative. Similarly, quality or importance seems to fit within this kind of circular definition.
6 Personalized PageRank
An important component of the PageRank calculation is E - a vector over the Web pages which is used as a source of rank to make up for the rank sinks such as cycles with no outedges (see Section 2.4). However, aside from solving the problem of rank sinks, E turns out to be a powerful parameter to adjust the page ranks. Intuitively the E vector corresponds tot he distribution of web pages that a random surfer periodically jumps to. As we see below, it can be used to give broad general views of the Web or views which are focussed and personalized to a particular individual.
We have performed most experiments with an E vector that is uniform over all web pages with ||EZZ1 - 0.15. This corresponds to a random surfer periodically jumping to a random web page. This is a very democratic choice for E since all web pages are valued simply because they exist. Although this technique has been quite successful, there is an important problem with it. Some Web pages with many related links receive an overly high ranking. Examples of these include copyright warnings, disclaimers, and highly interlinked mailing list archives.
Another extreme is to have E consist entirely of a single web page. We tested two such E's - the Netscape home page, and the home page of a famous computer scientist, John McCarthy. For the Netscape home page we attempt to generate page ranks from the perspective of a novice user who has Netscape set as the default home page. In the case of John McCarthy's home page we want to calculate page ranks from the perspective of an individual who has given us considerable contextual information based on the links on his home page.
In both cases, the mailing list problem mentioned above did not occur. And, in both cases, the respective home page got the highest PageRank and was followed by its immediate links.
This last document is a copy of an email sent to the Global Brain Discussion mailing list, and gives some interesting insights about how Google recognized that their Search Results could be spammed. We all know that since 1999, we have had some major link spamming by automated tools to take advantage of this very problem. Of course, those kind of spamming methods are no longer successful, especially in the age of Google Penguin.
Some more thoughts about how to "spam" link analyzing algorithms:
To: Global Brain Discussion
From: Francis Heyligen
It is also possible to abuse these link methods;
The really easy way to improve one's standing on a link-counting engine is to do a multi-engine search for all sites referencing yours, and then submit these link lists to all other sites. Then the count of sites referencing yours will increase in each engine.
This is still legitimate though.
I have no problem with that. If these other sites accept a link to your site, this means they find some value in it, and so you are not abusing the system.
But then you can create thousands of these link lists, put them on different servers just in case, make them reference each other and your site and you can take over the Web.
I don't think that would work well with the PageRank and HITS methods, insofar that I understand them. First, as you already suggested, these methods will not count links within the same server, so you would need to put references to your server on lots of other servers, which demands a lot of effort and money. Second, even if you would create a "ring" of servers referring to each other, this should not have much effect on your link ranking according to these methods