# MATH 33AH Lecture Notes - Lecture 25: Terre Haute, Indiana, Pagerank, Linear AlgebraExam

by OC2716258

Department

MathematicsCourse Code

MATH 33AHProfessor

AllStudy Guide

FinalThis

**preview**shows pages 1-3. to view the full**11 pages of the document.**THE $25,000,000,000āEIGENVECTOR

THE LINEAR ALGEBRA BEHIND GOOGLE

KURT BRYANā AND TANYA LEISEā”

Abstract. Googleās success derives in large part from its PageRank algorithm, which ranks the importance

of webpages according to an eigenvector of a weighted link matrix. Analysis of the PageRank formula provides a

wonderful applied topic for a linear algebra course. Instructors may assign this article as a project to more advanced

students, or spend one or two lectures presenting the material with assigned homework from the exercises. This

material also complements the discussion of Markov chains in matrix algebra. Maple and Mathematica ļ¬les supporting

this material can be found at www.rose-hulman.edu/ā¼bryan.

Key words. linear algebra, PageRank, eigenvector, stochastic matrix

AMS subject classiļ¬cations. 15-01, 15A18, 15A51

1. Introduction. When Google went online in the late 1990ās, one thing that set it apart

from other search engines was that its search result listings always seemed deliver the āgood stuļ¬ā

up front. With other search engines you often had to wade through screen after screen of links

to irrelevant web pages that just happened to match the search text. Part of the magic behind

Google is its PageRank algorithm, which quantitatively rates the importance of each page on the

web, allowing Google to rank the pages and thereby present to the user the more important (and

typically most relevant and helpful) pages ļ¬rst.

Understanding how to calculate PageRank is essential for anyone designing a web page that they

want people to access frequently, since getting listed ļ¬rst in a Google search leads to many people

looking at your page. Indeed, due to Googleās prominence as a search engine, its ranking system

has had a deep inļ¬uence on the development and structure of the internet, and on what kinds of

information and services get accessed most frequently. Our goal in this paper is to explain one of

the core ideas behind how Google calculates web page rankings. This turns out to be a delightful

application of standard linear algebra.

Search engines such as Google have to do three basic things:

1. Crawl the web and locate all web pages with public access.

2. Index the data from step 1, so that it can be searched eļ¬ciently for relevant keywords or

phrases.

3. Rate the importance of each page in the database, so that when a user does a search and

the subset of pages in the database with the desired information has been found, the more

important pages can be presented ļ¬rst.

This paper will focus on step 3. In an interconnected web of pages, how can one meaningfully deļ¬ne

and quantify the āimportanceā of any given page?

The rated importance of web pages is not the only factor in how links are presented, but it is a

signiļ¬cant one. There are also successful ranking algorithms other than PageRank. The interested

reader will ļ¬nd a wealth of information about ranking algorithms and search engines, and we list

just a few references for getting started (see the extensive bibliography in [9], for example, for a

more complete list). For a brief overview of how Google handles the entire process see [6], and for

an in-depth treatment of PageRank see [3] and a companion article [9]. Another article with good

concrete examples is [5]. For more background on PageRank and explanations of essential principles

of web design to maximize a websiteās PageRank, go to the websites [4, 11, 14]. To ļ¬nd out more

about search engine principles in general and other ranking algorithms, see [2] and [8]. Finally, for

an account of some newer approaches to searching the web, see [12] and [13].

2. Developing a formula to rank pages.

āTHE APPROXIMATE MARKET VALUE OF GOOGLE WHEN THE COMPANY WENT PUBLIC IN 2004.

ā Department of Mathematics, Rose-Hulman Institute of Technology, Terre Haute, IN 47803; email:

kurt.bryan@rose-hulman.edu; phone: 812) 877-8485; fax: (812)877-8883.

ā”Mathematics and Computer Science Department, Amherst College, Amherst, MA 01002; email:

tleise@amherst.edu; phone: (413)542-5411; fax: (413)542-2550.

1

Only pages 1-3 are available for preview. Some parts have been intentionally blurred.

2K. BRYAN AND T. LEISE

42

3

1

4

2

3

1

Fig. 2.1.An example of a web with only four pages. An arrow from page A to page B indicates a link from page

A to page B.

2.1. The basic idea. In what follows we will use the phrase āimportance scoreā or just āscoreā

for any quantitative rating of a web pageās importance. The importance score for any web page will

always be a non-negative real number. A core idea in assigning a score to any given web page is

that the pageās score is derived from the links made to that page from other web pages. The links

to a given page are called the backlinks for that page. The web thus becomes a democracy where

pages vote for the importance of other pages by linking to them.

Suppose the web of interest contains npages, each page indexed by an integer k, 1 ā¤kā¤n. A

typical example is illustrated in Figure 2.1, in which an arrow from page A to page B indicates a

link from page A to page B. Such a web is an example of a directed graph.1Weāll use xkto denote

the importance score of page kin the web. The xkis non-negative and xj> xkindicates that page j

is more important than page k(so xj= 0 indicates page jhas the least possible importance score).

A very simple approach is to take xkas the number of backlinks for page k. In the example in

Figure 2.1, we have x1= 2, x2= 1, x3= 3, and x4= 2, so that page 3 would be the most important,

pages 1 and 4 tie for second, and page 2 is least important. A link to page kbecomes a vote for

page kās importance.

This approach ignores an important feature one would expect a ranking algorithm to have,

namely, that a link to page kfrom an important page should boost page kās importance score more

than a link from an unimportant page. For example, a link to your homepage directly from Yahoo

ought to boost your pageās score much more than a link from, say, www.kurtbryan.com (no relation

to the author). In the web of Figure 2.1, pages 1 and 4 both have two backlinks: each links to

the other, but page 1ās second backlink is from the seemingly important page 3, while page 4ās

second backlink is from the relatively unimportant page 1. As such, perhaps we should rate page

1ās importance higher than that of page 4.

As a ļ¬rst attempt at incorporating this idea letās compute the score of page jas the sum of the

scores of all pages linking to page j. For example, consider the web of Figure 2.1. The score of page

1 would be determined by the relation x1=x3+x4. Since x3and x4will depend on x1this scheme

seems strangely self-referential, but it is the approach we will use, with one more modiļ¬cation. Just

as in elections, we donāt want a single individual to gain inļ¬uence merely by casting multiple votes.

In the same vein, we seek a scheme in which a web page doesnāt gain extra inļ¬uence simply by

linking to lots of other pages. If page jcontains njlinks, one of which links to page k, then we will

boost page kās score by xj/nj, rather than by xj. In this scheme each web page gets a total of one

vote, weighted by that web pageās score, that is evenly divided up among all of its outgoing links. To

quantify this for a web of npages, let Lkā {1,2, . . . , n}denote the set of pages with a link to page

k, that is, Lkis the set of page kās backlinks. For each kwe require

xk=X

jāLk

xj

nj

,(2.1)

where njis the number of outgoing links from page j(which must be positive since if jāLkthen

1A graph consists of a set of vertices (in this context, the web pages) and a set of edges. Each edge joins a pair

of vertices. The graph is undirected if the edges have no direction. The graph is directed if each edge (in the web

context, the links) has a direction, that is, a starting and ending vertex.

Only pages 1-3 are available for preview. Some parts have been intentionally blurred.

THE $25,000,000,000 EIGENVECTOR 3

5

42

3

1

5

4

2

3

1

Fig. 2.2.A web of ļ¬ve pages, consisting of two disconnected āsubwebsā W1(pages 1 and 2) and W2(pages 3,

4, 5).

page jlinks to at least page k!). We will assume that a link from a page to itself will not be counted.

In this ādemocracy of the webā you donāt get to vote for yourself!

Letās apply this approach to the four-page web of Figure 2.1. For page 1 we have x1=x3/1 +

x4/2, since pages 3 and 4 are backlinks for page 1 and page 3 contains only one link, while page 4

contains two links (splitting its vote in half). Similarly, x2=x1/3, x3=x1/3 + x2/2 + x4/2, and

x4=x1/3 + x2/2. These linear equations can be written Ax =x, where x= [x1x2x3x4]Tand

A=ļ£®

ļ£Æ

ļ£Æ

ļ£°

0 0 1 1

2

1

3000

1

3

1

201

2

1

3

1

20 0

ļ£¹

ļ£ŗ

ļ£ŗ

ļ£»

.(2.2)

This transforms the web ranking problem into the āstandardā problem of ļ¬nding an eigenvector

for a square matrix! (Recall that the eigenvalues Ī»and eigenvectors xof a matrix Asatisfy the

equation Ax =Ī»x,x6=0by deļ¬nition.) We thus seek an eigenvector xwith eigenvalue 1 for the

matrix A. We will refer to Aas the ālink matrixā for the given web.

It turns out that the link matrix Ain equation (2.2) does indeed have eigenvectors with eigen-

value 1, namely, all multiples of the vector [12 4 9 6]T(recall that any non-zero multiple of an

eigenvector is again an eigenvector). Letās agree to scale these āimportance score eigenvectorsā

so that the components sum to 1. In this case we obtain x1=12

31 ā0.387, x2=4

31 ā0.129,

x3=9

31 ā0.290, and x4=6

31 ā0.194. Note that this ranking diļ¬ers from that generated by simply

counting backlinks. It might seem surprising that page 3, linked to by all other pages, is not the

most important. To understand this, note that page 3 links only to page 1 and so casts its entire

vote for page 1. This, with the vote of page 2, results in page 1 getting the highest importance score.

More generally, the matrix Afor any web must have 1 as an eigenvalue if the web in question

has no dangling nodes (pages with no outgoing links). To see this, ļ¬rst note that for a general web

of npages formula (2.1) gives rise to a matrix Awith Aij = 1/njif page jlinks to page i,Aij = 0

otherwise. The jth column of Athen contains njnon-zero entries, each equal to 1/nj, and the

column thus sums to 1. This motivates the following deļ¬nition, used in the study of Markov chains:

Definition 2.1. A square matrix is called a column-stochastic matrix if all of its entries

are nonnegative and the entries in each column sum to one.

The matrix Afor a web with no dangling nodes is column-stochastic. We now prove

Proposition 1. Every column-stochastic matrix has 1as an eigenvalue. Proof. Let Abe an

nĆncolumn-stochastic matrix and let edenote an ndimensional column vector with all entries equal

to 1. Recall that Aand its transpose AThave the same eigenvalues. Since Ais column-stochastic

it is easy to see that ATe=e, so that 1 is an eigenvalue for ATand hence for A.ī

In what follows we use V1(A) to denote the eigenspace for eigenvalue 1 of a column-stochastic

matrix A.

2.2. Shortcomings. Several diļ¬culties arise with using formula (2.1) to rank websites. In this

section we discuss two issues: webs with non-unique rankings and webs with dangling nodes.

2.2.1. Non-Unique Rankings. For our rankings it is desirable that the dimension of V1(A)

equal one, so that there is a unique eigenvector xwith Pixi= 1 that we can use for importance

scores. This is true in the web of Figure 2.1 and more generally is always true for the special case of

###### You're Reading a Preview

Unlock to view full version