Module 10 notes
10.0 Crawling the Web
So far, we have assumed that an index for all the pages on the Web exists, and we have relied on that index
for answering simple and compound searches. In this module we examine how that index is created in the first place.
To create an index for the Web, we need to visit all existing webpages to gather the words and generate
appropriate postings lists. But how can we find all the Web’s pages before there is an index? It turns out that
the hyperlinks that point from one page to another form a web-like structure that we can crawl along,
gathering pages as we go.
But there are also problems: some webpages are not reachable by this form of crawling, the contents of
many webpages will have changed by the time we have finished our crawl, and some nefarious webpage
creators will try to interfere with our ability to crawl.
10.1 A Spider's Path
Crawl, v. trans.: “Of a program, esp. one associated with a search engine: to follow automatically links on
(the World Wide Web or a particular web site) in order to retrieve documents, typically for the purpose of
indexing.” [Oxford English Dictionary, Draft Additions, June 2009]
10.1.1 Indexing a Webpage
To create an index for the Web, we need to fetch each webpage, one after the other, and collect all the
terms used on that page. We saw in Module 8 that we can build postings lists for a given page: after taking
into account stop words and stemming, we can record the word offset for each term that appears on the
page, sort the terms into alphabetical order, and record all the offsets for each term in corresponding
postings lists. Recall that these postings lists include the count of the number of postings (one per page),
where each posting includes the page’s URI, the count of the number of occurrences of the term on that
page, and the word offsets where the term appears:
gong → 3: www.Pg5; 1: 15 | www.Pg9; 2: 61, 79 | www.Pg11; 2: 11, 58
If someone were to hand us webpages one after the other, we could form the postings for each page and
merge them into one collection of postings lists covering the whole collection of pages. This is the main
indexing task for each page, as detailed in the following two steps and sketched in code after them:
1. Read each word on the page, one after the other, and keep track of the count (i.e., the word offset). For each word:
a. If it is a stop word, do not create a postings list for it (just continue with the next word).
b. Apply stemming to the word if needed, producing a term.
c. If this is the first time the term appears on the page, create a new posting with the
webpage’s URI, a count of 1, and the term’s offset.
d. Otherwise find the posting for the term, increase the count of the number of occurrences
by 1, and append the current offset to the list of offsets for that term.
2. Then, for each term and posting collected on the page:
a. If a postings list has already been started for this term (i.e., the term appears on at least
one page that has already been indexed), append the posting to the list and increment the
count of the number of postings by 1.
b. Otherwise, create a new postings list for the term, initializing the page count to 1 and
inserting the posting for this page as its only entry.
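The two steps above can be turned into a short Python sketch. This is only an illustration under assumed simplifications: the tiny stop-word list, the toy stemmer, and the list-based posting layout are stand-ins, not the module's exact specification.

    import re

    STOP_WORDS = {"a", "an", "and", "of", "the"}  # assumed tiny stop list

    def stem(word):
        # Toy stemmer for illustration; a real indexer would use
        # something like the Porter stemmer.
        return word[:-1] if word.endswith("s") else word

    def index_page(uri, text, index):
        # `index` maps term -> postings list; each posting is modelled
        # here as [uri, occurrence_count, offsets], and the count of
        # postings is simply the length of the list.
        page_postings = {}
        # Step 1: read each word, keeping track of its offset on the
        # page (punctuation removed, case folding applied).
        for offset, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            if word in STOP_WORDS:                      # step 1a
                continue
            term = stem(word)                           # step 1b
            if term not in page_postings:               # step 1c
                page_postings[term] = [uri, 1, [offset]]
            else:                                       # step 1d
                page_postings[term][1] += 1
                page_postings[term][2].append(offset)
        # Step 2: merge this page's postings into the whole index,
        # keeping terms in alphabetical order.
        for term in sorted(page_postings):
            index.setdefault(term, []).append(page_postings[term])  # 2a, 2b

    index = {}
    index_page("www.mod10.ex1", "The gong rang, and the gong rang again.", index)
    print(index["gong"])  # [['www.mod10.ex1', 2, [1, 5]]]

Note that setdefault covers both cases of step 2: it appends to an existing postings list (step 2a) or starts a new one (step 2b).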
Practice Exercises
1. Assume that the page with URI www.mod10.ex1 has the following content:
(page content not shown)
2. Show the postings lists that would exist after indexing this page if this were the first page to
be indexed. Assume no stop words and that no stemming is applied, but that punctuation is removed
and case folding is applied.
3. Assume now a second page with URI www.mod10.ex2 has the following content:
(page content not shown)
4. Show the postings lists that would exist after indexing this page as well as the page indexed
for Exercise 1. Again assume no stop words and no stemming, but punctuation is removed and case
folding is used.
There is a second important task that we need to accomplish while we are processing the page: we need to
extract all the hyperlinks to other pages so that we can follow them one by one to be able to read and index
those pages if we haven’t already.
The crawler therefore reads through the content of the webpage to find all the anchor (<a>) tags. If the
quoted value for the href attribute in such a tag starts with http://, then the value is a URI for a page. If
there is no protocol specified (such as in a tag like <a href="page2.html">), then the value specifies a
file that can be found in the same directory as the current page, and we can combine the value with the
current page’s URI to form the URI for the referenced page.
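Here is a minimal sketch of this link extraction using Python's standard html.parser module and urljoin; the page URI and the snippet of HTML fed to it are made-up examples.

    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        # Collects the target URIs of all anchor (<a>) tags on a page.
        def __init__(self, page_uri):
            super().__init__()
            self.page_uri = page_uri
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        # urljoin leaves values that start with http://
                        # untouched and resolves protocol-less values
                        # against the current page's URI.
                        self.links.append(urljoin(self.page_uri, value))

    extractor = LinkExtractor("http://www.mod10.ex1/index.html")
    extractor.feed('See <a href="page2.html">page 2</a> and '
                   '<a href="http://www.uwaterloo.ca/">Waterloo</a>.')
    print(extractor.links)
    # ['http://www.mod10.ex1/page2.html', 'http://www.uwaterloo.ca/']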
So, by reading through the whole page currently being indexed, the crawler can produce a list of URIs for all
the pages referenced from the current page. This list can then be processed, one URI after the other, to
index each page in turn if it hasn’t been indexed already. We examine this in a little more detail in the next
section.
10.1.2 Keeping Track of Webpages to Index
Consider now the following network of pages, where the arrows represent hyperlinks from one page to another:
(network diagram not shown)
Assume we start by indexing P13. Within that page, we find hyperlinks to pages P4, P22, and P17. So we can
go on to index P4. From there we find a link to page P8, but we also still need to process P22 and P17 from
before. How can we keep track of our “to do list”?
The simplest way is to keep a queue of pages waiting to be processed (much like a queue waiting for a bus or
for a free teller at the bank). We take a page from the front of the queue, index it, and put on the back of
the queue any pages for which we find hyperlinks. We start with just P13 in the queue:
P13
We remove P13 from the queue, process it (finding references to pages P4, P22, and P17), and therefore
append those three pages to the end of the queue, yielding
P4 P22 P17
We now remove P4 and process it, appending P8 to the queue.
P22 P17 P8
We now remove P22 and process it, but find no new pages to add to the queue.
Next we remove page P17 and process it, finding references to pages P2, P28, and P43.
P8 P2 P28 P43
When we go on to remove and process P8, we come across a reference to P17. But wait, we’ve already
indexed P17! If we just blindly keep on putting pages on the end of the queue, we’ll be duplicating our
effort and get into unending cycles. In order to prevent this, we need to keep track of all the pages we’ve
encountered so far. An unending cycle would occur in our example because from P17, we’ll find P2, from
which we’ll find P8, from which we’ll get back to P17 again, from which we’ll get to P2 again, and this will
repeat without end.
So as well as keeping track of the queue of pages waiting to be indexed, we also need to keep a list of all
pages already indexed. Symbolically, we’ll keep this list below our queue, and we’ll keep it in the order in
which we process the pages. Here is the situation when we’re processing page P8:
P2 P28 P43
P13, P4, P22, P17, P8
We encounter a reference to page P17, but it has already been processed, so we do not put it on the end of
our queue. There is no other reference from P8, so we go on by removing P2 from the front of the queue,
add it to the list and process it. P2 refers only to P8, which has already been processed, so we go on to
remove P28 from the queue, add it to the list, and process it. Here we find a reference to page P37, which
has not already been processed, so we add it to the end of our queue, producing:
P43 P37
Continuing in such a manner, when processing P43 we find no hyperlinks, and then processing P37 we again
find no hyperlinks. At this point the queue is empty, which means that we have crawled all our pages.
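The walkthrough above can be replayed in a few lines of Python. The links dictionary below encodes the hyperlink structure described in the text (P13 links to P4, P22, and P17, and so on); this is an assumption, since the original diagram is not shown.

    from collections import deque

    links = {
        "P13": ["P4", "P22", "P17"],
        "P4":  ["P8"],
        "P22": [],
        "P17": ["P2", "P28", "P43"],
        "P8":  ["P17"],
        "P2":  ["P8"],
        "P28": ["P37"],
        "P43": [],
        "P37": [],
    }

    queue = deque(["P13"])  # the "to do list" of pages waiting to be indexed
    indexed = []            # pages already indexed, in processing order
    while queue:
        page = queue.popleft()
        indexed.append(page)
        for target in links[page]:
            # Skip pages already indexed or already waiting in the queue.
            if target not in indexed and target not in queue:
                queue.append(target)
        print("queue:", list(queue), "indexed:", indexed)

Running this prints the same sequence of queue states traced above, ending with an empty queue and all nine pages in the indexed list.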
Practice Exercises
3. Consider the following network of pages, where arrows again symbolize hyperlinks:
(network diagram not shown)
Starting with page P51, show the order in which pages will be visited. (If several hyperlinks appear
on one page, they can be added to the queue in any order you wish, although picking a consistent
order may help you not miss any links.)
10.1.3 Choosing Starting Webpages
One piece is still missing: how do we start our crawl? Somebody has to give us one or more starting
places to begin our queue. For example, we could decide to start from www.uwaterloo.ca. But what if not
every page on the Web is reachable from there; for example, what if no page from Waterloo references
www.mercedes.de, and no page referenced from Waterloo references that page either, no matter how many
pages’ hyperlinks you follow? In that case, we could restart our crawl from the Mercedes page, if someone
tells us that such a page exists.
In practice, crawlers start from pages that are known to be good seeds (such as www.yahoo.com) and also
start from pages submitted to them by users around the world. Google might also extract pages from email
messages sent to or from Gmail accounts, instant messages sent via GTalk, and pages visited by browsers
with the Google toolbar installed.
10.1.4 Result: Breadth-First Traversal
The connectivity of the Web can be modeled as a mathematical object known as a graph (more precisely,
a directed graph). Very simply, a graph is a set of objects, called nodes, and a set of directed edges, each
edge connecting a pair of nodes. For example, in the diagram for Practice Exercise 3 above, nodes are
represented by rectangles and directed edges are drawn as arrows; the diagram is an illustration of a graph
that has 12 nodes and 17 edges.
We can model each webpage as a node in a graph and each hyperlink as a directed edge that starts from the
node corresponding to the page that contains an href attribute and ends at the node corresponding to the
page whose URI matches the href’s value.
Given any graph and a set of seeds at which to start, the graph can be traversed using the algorithm
developed above. In summary (a code sketch follows the steps below):
1. Put all the given seeds into the queue.
2. Prepare to keep a list of “visited” nodes (initially empty).
3. As long as the queue is not empty:
a. Remove the first node from the queue.
b. Append that node to the list of “visited” nodes.
c. For each edge starting at that node:
i. If the node at the end of the edge already appears on the list of "visited" nodes or it
is already in the queue, then do nothing more with that edge
ii. Otherwise, append the node at the end of the edge to the end of the queue.
(In our example above, we showed the list of visited nodes beneath the queue of nodes that were currently
waiting to be visited.)
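As a direct transcription, here is the summary algorithm in Python. The adjacency-dictionary representation of the graph (each node mapped to the nodes its edges point to) and the small four-node example at the bottom are assumptions for illustration.

    from collections import deque

    def breadth_first_traversal(seeds, graph):
        queue = deque(seeds)        # step 1: put all seeds into the queue
        visited = []                # step 2: list of visited nodes
        while queue:                # step 3
            node = queue.popleft()  # step 3a
            visited.append(node)    # step 3b
            for target in graph.get(node, []):            # step 3c
                if target in visited or target in queue:  # step 3c-i
                    continue
                queue.append(target)                      # step 3c-ii
        return visited

    example = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["A"]}
    print(breadth_first_traversal(["A"], example))  # ['A', 'B', 'C', 'D']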
Practice Exercises
1. Confirm your understanding by watching an alternative explanation on YouTube, starting at
time=4:28 minutes. (That is, skip over the first half of the video, which discusses depth-first
traversal, an alternative approach that is not used for crawling and for which you are not
responsible.)
10.2 The Shape of the Web
Unfortunately, which nodes are encountered in a breadth-first traversal of a graph depends somewhat on
which nodes are used as starting seeds.
Practice Exercises
1. Consider again the same network of webpages used for Exercise 3 in Section 10.1.2 (also shown
below). Which pages are traversed when you start a breadth-first traversal with P12 and P22 as seeds?
2. For the same network of webpages, which pages are traversed when you start a breadth-first
traversal with P47?
Ignoring the order in which pages are traversed, you will notice that the sets of pages visited when starting
with seeds P12 and P22 or starting with seed P47 are identical. In fact, starting with any page or subset of
pages marked in orange below will visit the same set of pages via a breadth-first traversal; only the order
will be different.
(network diagram not shown)