Study Guides (380,000)
US (220,000)
UCLA (3,000)
MATH (300)
All (20)
Final

MATH 33AH Lecture Notes - Lecture 25: Terre Haute, Indiana, Pagerank, Linear AlgebraExam


Department
Mathematics
Course Code
MATH 33AH
Professor
All
Study Guide
Final

This preview shows pages 1-3. to view the full 11 pages of the document.
THE $25,000,000,000EIGENVECTOR
THE LINEAR ALGEBRA BEHIND GOOGLE
KURT BRYANAND TANYA LEISE
Abstract. Google’s success derives in large part from its PageRank algorithm, which ranks the importance
of webpages according to an eigenvector of a weighted link matrix. Analysis of the PageRank formula provides a
wonderful applied topic for a linear algebra course. Instructors may assign this article as a project to more advanced
students, or spend one or two lectures presenting the material with assigned homework from the exercises. This
material also complements the discussion of Markov chains in matrix algebra. Maple and Mathematica files supporting
this material can be found at www.rose-hulman.edu/bryan.
Key words. linear algebra, PageRank, eigenvector, stochastic matrix
AMS subject classifications. 15-01, 15A18, 15A51
1. Introduction. When Google went online in the late 1990’s, one thing that set it apart
from other search engines was that its search result listings always seemed deliver the “good stuff
up front. With other search engines you often had to wade through screen after screen of links
to irrelevant web pages that just happened to match the search text. Part of the magic behind
Google is its PageRank algorithm, which quantitatively rates the importance of each page on the
web, allowing Google to rank the pages and thereby present to the user the more important (and
typically most relevant and helpful) pages first.
Understanding how to calculate PageRank is essential for anyone designing a web page that they
want people to access frequently, since getting listed first in a Google search leads to many people
looking at your page. Indeed, due to Google’s prominence as a search engine, its ranking system
has had a deep influence on the development and structure of the internet, and on what kinds of
information and services get accessed most frequently. Our goal in this paper is to explain one of
the core ideas behind how Google calculates web page rankings. This turns out to be a delightful
application of standard linear algebra.
Search engines such as Google have to do three basic things:
1. Crawl the web and locate all web pages with public access.
2. Index the data from step 1, so that it can be searched efficiently for relevant keywords or
phrases.
3. Rate the importance of each page in the database, so that when a user does a search and
the subset of pages in the database with the desired information has been found, the more
important pages can be presented first.
This paper will focus on step 3. In an interconnected web of pages, how can one meaningfully define
and quantify the “importance” of any given page?
The rated importance of web pages is not the only factor in how links are presented, but it is a
significant one. There are also successful ranking algorithms other than PageRank. The interested
reader will find a wealth of information about ranking algorithms and search engines, and we list
just a few references for getting started (see the extensive bibliography in [9], for example, for a
more complete list). For a brief overview of how Google handles the entire process see [6], and for
an in-depth treatment of PageRank see [3] and a companion article [9]. Another article with good
concrete examples is [5]. For more background on PageRank and explanations of essential principles
of web design to maximize a website’s PageRank, go to the websites [4, 11, 14]. To find out more
about search engine principles in general and other ranking algorithms, see [2] and [8]. Finally, for
an account of some newer approaches to searching the web, see [12] and [13].
2. Developing a formula to rank pages.
THE APPROXIMATE MARKET VALUE OF GOOGLE WHEN THE COMPANY WENT PUBLIC IN 2004.
Department of Mathematics, Rose-Hulman Institute of Technology, Terre Haute, IN 47803; email:
kurt.bryan@rose-hulman.edu; phone: 812) 877-8485; fax: (812)877-8883.
Mathematics and Computer Science Department, Amherst College, Amherst, MA 01002; email:
tleise@amherst.edu; phone: (413)542-5411; fax: (413)542-2550.
1

Only pages 1-3 are available for preview. Some parts have been intentionally blurred.

2K. BRYAN AND T. LEISE
42
3
1
4
2
3
1
Fig. 2.1.An example of a web with only four pages. An arrow from page A to page B indicates a link from page
A to page B.
2.1. The basic idea. In what follows we will use the phrase “importance score” or just “score”
for any quantitative rating of a web page’s importance. The importance score for any web page will
always be a non-negative real number. A core idea in assigning a score to any given web page is
that the page’s score is derived from the links made to that page from other web pages. The links
to a given page are called the backlinks for that page. The web thus becomes a democracy where
pages vote for the importance of other pages by linking to them.
Suppose the web of interest contains npages, each page indexed by an integer k, 1 kn. A
typical example is illustrated in Figure 2.1, in which an arrow from page A to page B indicates a
link from page A to page B. Such a web is an example of a directed graph.1We’ll use xkto denote
the importance score of page kin the web. The xkis non-negative and xj> xkindicates that page j
is more important than page k(so xj= 0 indicates page jhas the least possible importance score).
A very simple approach is to take xkas the number of backlinks for page k. In the example in
Figure 2.1, we have x1= 2, x2= 1, x3= 3, and x4= 2, so that page 3 would be the most important,
pages 1 and 4 tie for second, and page 2 is least important. A link to page kbecomes a vote for
page k’s importance.
This approach ignores an important feature one would expect a ranking algorithm to have,
namely, that a link to page kfrom an important page should boost page k’s importance score more
than a link from an unimportant page. For example, a link to your homepage directly from Yahoo
ought to boost your page’s score much more than a link from, say, www.kurtbryan.com (no relation
to the author). In the web of Figure 2.1, pages 1 and 4 both have two backlinks: each links to
the other, but page 1’s second backlink is from the seemingly important page 3, while page 4’s
second backlink is from the relatively unimportant page 1. As such, perhaps we should rate page
1’s importance higher than that of page 4.
As a first attempt at incorporating this idea let’s compute the score of page jas the sum of the
scores of all pages linking to page j. For example, consider the web of Figure 2.1. The score of page
1 would be determined by the relation x1=x3+x4. Since x3and x4will depend on x1this scheme
seems strangely self-referential, but it is the approach we will use, with one more modification. Just
as in elections, we don’t want a single individual to gain influence merely by casting multiple votes.
In the same vein, we seek a scheme in which a web page doesn’t gain extra influence simply by
linking to lots of other pages. If page jcontains njlinks, one of which links to page k, then we will
boost page k’s score by xj/nj, rather than by xj. In this scheme each web page gets a total of one
vote, weighted by that web page’s score, that is evenly divided up among all of its outgoing links. To
quantify this for a web of npages, let Lk⊂ {1,2, . . . , n}denote the set of pages with a link to page
k, that is, Lkis the set of page k’s backlinks. For each kwe require
xk=X
jLk
xj
nj
,(2.1)
where njis the number of outgoing links from page j(which must be positive since if jLkthen
1A graph consists of a set of vertices (in this context, the web pages) and a set of edges. Each edge joins a pair
of vertices. The graph is undirected if the edges have no direction. The graph is directed if each edge (in the web
context, the links) has a direction, that is, a starting and ending vertex.

Only pages 1-3 are available for preview. Some parts have been intentionally blurred.

THE $25,000,000,000 EIGENVECTOR 3
5
42
3
1
4
2
3
1
Fig. 2.2.A web of five pages, consisting of two disconnected “subwebs” W1(pages 1 and 2) and W2(pages 3,
4, 5).
page jlinks to at least page k!). We will assume that a link from a page to itself will not be counted.
In this “democracy of the web” you don’t get to vote for yourself!
Let’s apply this approach to the four-page web of Figure 2.1. For page 1 we have x1=x3/1 +
x4/2, since pages 3 and 4 are backlinks for page 1 and page 3 contains only one link, while page 4
contains two links (splitting its vote in half). Similarly, x2=x1/3, x3=x1/3 + x2/2 + x4/2, and
x4=x1/3 + x2/2. These linear equations can be written Ax =x, where x= [x1x2x3x4]Tand
A=
0 0 1 1
2
1
3000
1
3
1
201
2
1
3
1
20 0
.(2.2)
This transforms the web ranking problem into the “standard” problem of finding an eigenvector
for a square matrix! (Recall that the eigenvalues λand eigenvectors xof a matrix Asatisfy the
equation Ax =λx,x6=0by definition.) We thus seek an eigenvector xwith eigenvalue 1 for the
matrix A. We will refer to Aas the “link matrix” for the given web.
It turns out that the link matrix Ain equation (2.2) does indeed have eigenvectors with eigen-
value 1, namely, all multiples of the vector [12 4 9 6]T(recall that any non-zero multiple of an
eigenvector is again an eigenvector). Let’s agree to scale these “importance score eigenvectors”
so that the components sum to 1. In this case we obtain x1=12
31 0.387, x2=4
31 0.129,
x3=9
31 0.290, and x4=6
31 0.194. Note that this ranking differs from that generated by simply
counting backlinks. It might seem surprising that page 3, linked to by all other pages, is not the
most important. To understand this, note that page 3 links only to page 1 and so casts its entire
vote for page 1. This, with the vote of page 2, results in page 1 getting the highest importance score.
More generally, the matrix Afor any web must have 1 as an eigenvalue if the web in question
has no dangling nodes (pages with no outgoing links). To see this, first note that for a general web
of npages formula (2.1) gives rise to a matrix Awith Aij = 1/njif page jlinks to page i,Aij = 0
otherwise. The jth column of Athen contains njnon-zero entries, each equal to 1/nj, and the
column thus sums to 1. This motivates the following definition, used in the study of Markov chains:
Definition 2.1. A square matrix is called a column-stochastic matrix if all of its entries
are nonnegative and the entries in each column sum to one.
The matrix Afor a web with no dangling nodes is column-stochastic. We now prove
Proposition 1. Every column-stochastic matrix has 1as an eigenvalue. Proof. Let Abe an
n×ncolumn-stochastic matrix and let edenote an ndimensional column vector with all entries equal
to 1. Recall that Aand its transpose AThave the same eigenvalues. Since Ais column-stochastic
it is easy to see that ATe=e, so that 1 is an eigenvalue for ATand hence for A.
In what follows we use V1(A) to denote the eigenspace for eigenvalue 1 of a column-stochastic
matrix A.
2.2. Shortcomings. Several difficulties arise with using formula (2.1) to rank websites. In this
section we discuss two issues: webs with non-unique rankings and webs with dangling nodes.
2.2.1. Non-Unique Rankings. For our rankings it is desirable that the dimension of V1(A)
equal one, so that there is a unique eigenvector xwith Pixi= 1 that we can use for importance
scores. This is true in the web of Figure 2.1 and more generally is always true for the special case of
You're Reading a Preview

Unlock to view full version