Module 10 notes
CS 100 (Computer Science), Dan Brown
10.0 Crawling the Web

So far, we have assumed that an index for all the pages on the Web exists, and we have relied on that index for answering simple and compound searches. In this module we examine how the index is created and maintained. To create an index for the Web, we need to visit all existing webpages to gather the words and generate appropriate postings lists. But how can we find all the Web's pages before there is an index? It turns out that the hyperlinks that point from one page to another form a web-like structure that we can crawl along, gathering pages as we go. But there are also problems: some webpages are not reachable by this form of crawling, the contents of many webpages will have changed by the time we have finished our crawl, and some nefarious webpage creators will try to interfere with our ability to crawl.

10.1 A Spider's Path

Crawl, v. trans.: "Of a program, esp. one associated with a search engine: to follow automatically links on (the World Wide Web or a particular web site) in order to retrieve documents, typically for the purpose of indexing." [Oxford English Dictionary, Draft Additions, June 2009]

10.1.1 Indexing a Webpage

To create an index for the Web, we need to fetch each webpage, one after the other, and collect all the terms used on that page. We saw in Module 8 that we can build postings lists for a given page: after taking into account stop words and stemming, we can record the word offset for each term that appears on the page, sort the terms into alphabetical order, and record all the offsets for each term in corresponding postings lists.
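The per-page recipe just described can be sketched in code. This is a minimal illustration, not the course's official implementation: the stop-word list and the stemmer below are crude stand-ins (a real system would use a full stop-word list and a proper stemmer), and the two functions correspond to the two indexing steps discussed in this section.

```python
STOP_WORDS = {"the", "a", "of", "and"}   # stand-in; real lists are much longer

def stem(word):
    """Stand-in stemmer: just strips a plural 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def index_page(uri, text):
    """Step 1: return {term: (uri, count, [offsets])} for one page.
    Case folding and punctuation removal are applied; stop words are
    skipped but still consume a word offset."""
    postings = {}
    for offset, word in enumerate(text.lower().split(), start=1):
        word = word.strip(".,?!")          # remove punctuation
        if word in STOP_WORDS:
            continue                       # no posting for stop words
        term = stem(word)
        if term not in postings:
            postings[term] = [uri, 0, []]  # first occurrence on this page
        postings[term][1] += 1             # one more occurrence of the term
        postings[term][2].append(offset)   # record the word offset
    return {t: tuple(p) for t, p in postings.items()}

def merge_into(index, page_postings):
    """Step 2: fold one page's postings into the global postings lists.
    Each index entry is [number of pages, list of postings]."""
    for term, posting in page_postings.items():
        if term in index:
            index[term][0] += 1            # one more page for this term
            index[term][1].append(posting)
        else:
            index[term] = [1, [posting]]   # start a new postings list
```

For example, indexing the sentence from Practice Exercise 1 below with no stop words removed would give the term "said" a posting with count 3 and offsets 2, 5, and 8; with the stand-in stop-word list above the result is the same, since none of those words are stop words.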
Recall that these postings lists include the count of the number of postings (one per page), where each posting includes the page's URI, the count of the number of occurrences of the term on that page, and the word offsets where the term appears:

    gong → 3: (www.Pg5; 1: 15), (www.Pg9; 2: 61, 79), (www.Pg11; 2: 11, 58)

If someone were to hand us webpages one after the other, we could form the postings for each page and merge them into one collection of postings lists covering the whole collection of pages. This is the main indexing task for each page, as detailed in the following two steps:

1. Read each word on the page, one after the other, and keep track of the count (i.e., the word offset).
   a. If it is a stop word, do not create a posting for it (just continue with the next word).
   b. Apply stemming to the word if needed, producing a term.
   c. If this is the first time the term appears on the page, create a new posting with the webpage's URI, a count of 1, and the term's offset.
   d. Otherwise, find the posting for the term, increase the count of the number of occurrences by 1, and append the current offset to the list of offsets for that term.
2. Then, for each term and posting collected on the page:
   a. If a postings list has already been started for this term (i.e., the term appears on at least one page that has already been indexed), append the posting to the list and increment the count of the number of postings by 1.
   b. Otherwise, create a new postings list for the term, initializing the page count to 1 and inserting the posting for this page as its only entry.

Practice Exercises

1. Assume that the page with URI www.mod10.ex1 has the following content:

       He said that she said that he said that it was true.

   Show the postings lists that would exist after indexing this page if this were the first page to be indexed. Assume no stop words and that no stemming is applied, but that punctuation is removed and case folding is applied.

2. Assume now that a second page with URI www.mod10.ex2 has the following content:

       How could she have said that he said it was true? It was false.

   Show the postings lists that would exist after indexing this page as well as the page indexed for Exercise 1. Again assume no stop words and no stemming, but punctuation is removed and case folding is used.

There is a second important task that we need to accomplish while we are processing the page: we need to extract all the hyperlinks to other pages so that we can follow them, one by one, to read and index those pages if we haven't already. The crawler therefore reads through the content of the webpage to find all the anchor (<a>) tags. If the quoted value for the href attribute in such a tag starts with http://, then the value is a URI for a page. If there is no protocol specified, then the value specifies a file that can be found in the same directory as the current page, and we can combine the value with the current page's URI to form the URI for the referenced page. So, by reading through the whole page currently being indexed, the crawler can produce a list of URIs for all the pages referenced from the current page. This list can then be processed, one URI after the other, to index each page in turn if it hasn't been indexed already. We examine this in a little more detail in the next section.

10.1.2 Keeping Track of Webpages to Index

Consider now the following network of pages, where the arrows represent hyperlinks from one page to another. Assume we start by indexing P13. Within that page, we find hyperlinks to pages P4, P22, and P17. So we can go on to index P4. From there we find a link to page P8, but we also still need to process P22 and P17 from before. How can we keep track of our "to do" list? The simplest way is to keep a queue of pages waiting to be processed (much like a queue waiting for a bus or for a free teller at the bank).
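Both pieces of machinery described so far, extracting hyperlink URIs from a page and keeping a queue of pages to visit, can be sketched with Python's standard library. The page content and URIs below are made up for illustration; note how urljoin leaves absolute http:// values alone but resolves a bare filename against the current page's URI, exactly as described above.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects the URIs referenced by <a href="..."> tags on one page."""
    def __init__(self, base_uri):
        super().__init__()
        self.base_uri = base_uri
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Absolute http:// values pass through unchanged;
                    # relative ones are resolved against the page's URI.
                    self.links.append(urljoin(self.base_uri, value))

# A made-up page for illustration:
page = 'See <a href="http://www.pg9/">P9</a> and <a href="notes.html">notes</a>.'
collector = LinkCollector("http://www.pg5/index.html")
collector.feed(page)
# collector.links is now ["http://www.pg9/", "http://www.pg5/notes.html"]

queue = deque(collector.links)   # pages waiting to be fetched and indexed
```

A deque gives us exactly the queue discipline we need: popleft() removes from the front and append() adds to the back.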
We take a page from the front of the queue, index it, and put on the back of the queue any pages for which we find hyperlinks. We start with just P13 in the queue:

    P13

We remove P13 from the queue, process it (finding references to pages P4, P22, and P17), and therefore append those three pages to the end of the queue, yielding:

    P4  P22  P17

We now remove P4 and process it, appending P8 to the queue:

    P22  P17  P8

We now remove P22 and process it, but find no new pages to add to the queue:

    P17  P8

Next we remove page P17 and process it, finding references to pages P2, P28, and P43:

    P8  P2  P28  P43

When we go on to remove and process P8, we come across a reference to P17. But wait, we've already indexed P17! If we just blindly keep putting pages on the end of the queue, we'll be duplicating our effort and get into unending cycles. In order to prevent this, we need to keep track of all the pages we've encountered so far. An unending cycle would occur in our example because from P17 we'll find P2, from which we'll find P8, from which we'll get back to P17 again, from which we'll get to P2 again, and this will continue indefinitely.

So as well as keeping track of the queue of pages waiting to be indexed, we also need to keep a list of all pages already indexed. Symbolically, we'll keep this list below our queue, in the order in which we process the pages. Here is the situation when we're processing page P8:

    Queue:   P2  P28  P43
    Visited: P13, P4, P22, P17, P8

We encounter a reference to page P17, but it has already been processed, so we do not put it on the end of our queue. There is no other reference from P8, so we go on: we remove P2 from the front of the queue, add it to the list, and process it. P2 refers only to P8, which has already been processed, so we go on to remove P28 from the queue, add it to the list, and process it. Here we find a reference to page P37, which has not already been processed, so we add it to the end of our queue, producing:

    Queue:   P43  P37
    Visited: P13, P4, P22, P17, P8, P2, P28

Continuing in this manner, when processing P43 we find no hyperlinks, and when processing P37 we again find no hyperlinks. At this point the queue is empty, which means that we have crawled all our pages:

    Visited: P13, P4, P22, P17, P8, P2, P28, P43, P37

Practice Exercises

3. Consider the following network of pages, where arrows again symbolize hyperlinks. Starting with page P51, show the order in which pages will be visited. (If several hyperlinks appear on one page, they can be added to the queue in any order you wish, although picking a consistent order may help you not miss any links.)

10.1.3 Choosing Starting Webpages

One piece we are missing is how to start our crawling. Somebody has to give us one or more starting places to begin our queue. For example, we could decide to start from the University of Waterloo's homepage. But what if not every page on the Web is reachable from there: for example, what if no page from Waterloo references some page about Mercedes, and no page referenced from Waterloo references that page either, no matter how many pages' hyperlinks you follow? In that case, we could restart our crawl from the Mercedes page, if someone tells us that such a page exists. In practice, crawlers start from pages that are known to be good seeds, and also from pages submitted to them by users around the world. Google might also extract pages from email messages sent to or from Gmail accounts, instant messages sent via GTalk, and pages visited by browsers with the Google toolbar installed.

10.1.4 Result: Breadth-First Traversal

The connectivity of the Web can be modeled as a mathematical object known as a graph (more precisely, a directed graph). Very simply, a graph is a set of objects, called nodes, and a set of directed edges, each edge connecting a pair of nodes.
For example, in the diagram for Practice Exercise 3 above, nodes are represented by rectangles and directed edges are drawn as arrows; the diagram is an illustration of a graph that has 12 nodes and 17 edges. We can model each webpage as a node in a graph and each hyperlink as a directed edge that starts from the node corresponding to the page that contains an href attribute and ends at the node corresponding to the page whose URI matches the href's value. Given any graph and a set of seeds at which to start, the graph can be traversed using the algorithm developed above. In summary:

1. Put all the given seeds into the queue.
2. Prepare to keep a list of "visited" nodes (initially empty).
3. As long as the queue is not empty:
   a. Remove the first node from the queue.
   b. Append that node to the list of "visited" nodes.
   c. For each edge starting at that node:
      i. If the node at the end of the edge already appears on the list of "visited" nodes or is already in the queue, then do nothing more with that edge.
      ii. Otherwise, append the node at the end of the edge to the end of the queue.

(In our example above, we showed the list of visited nodes beneath the queue of nodes that were currently waiting to be visited.)

Practice Exercises

4. Confirm your understanding by watching an alternative explanation on YouTube, starting at the 4:28 mark. (That is, skip over the first half of the video, which discusses depth-first traversal, an alternative approach that is not used for crawling and for which you are not responsible.)

10.2 The Shape of the Web

Unfortunately, which nodes are encountered in a breadth-first traversal of a graph depends somewhat on which nodes are used as starting seeds.

Practice Exercises

5. Consider again the same network of webpages used for Exercise 3 in Section 10.1.2 (also shown below). Which pages are traversed when you start a breadth-first traversal with P12 and P22 as seeds?

6. For the same network of webpages, which pages are traversed when you start a breadth-first traversal with P47?

Ignoring the order in which pages are traversed, you will notice that the sets of pages visited when starting with seeds P12 and P22 or starting with seed P47 are identical. In fact, starting with any page or subset of pages marked in orange below will visit the same set of pages via a breadth-first traversal; only the order will be different.
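The breadth-first traversal summarized in Section 10.1.4 translates almost line for line into code. The sketch below uses the hyperlink network from Section 10.1.2 (the exercise diagrams are not reproduced in these notes), represented as a dictionary mapping each page to the pages it links to; each comment names the step of the algorithm it implements.

```python
from collections import deque

def breadth_first(graph, seeds):
    """Breadth-first traversal; returns nodes in the order visited."""
    queue = deque(seeds)        # step 1: put the seeds into the queue
    visited = []                # step 2: list of visited nodes, initially empty
    while queue:                # step 3: as long as the queue is not empty
        node = queue.popleft()  # 3a: remove the first node from the queue
        visited.append(node)    # 3b: append it to the visited list
        for neighbour in graph.get(node, []):           # 3c: each outgoing edge
            if neighbour not in visited and neighbour not in queue:  # 3c-i
                queue.append(neighbour)                 # 3c-ii
    return visited

# The hyperlink network from Section 10.1.2:
graph = {"P13": ["P4", "P22", "P17"], "P4": ["P8"], "P22": [],
         "P17": ["P2", "P28", "P43"], "P8": ["P17"], "P2": ["P8"],
         "P28": ["P37"], "P43": [], "P37": []}

order = breadth_first(graph, ["P13"])
# order is P13, P4, P22, P17, P8, P2, P28, P43, P37, matching the walkthrough
```

This sketch also illustrates the point about seeds: in this network, starting from P8 and starting from P2 visit the same set of pages, just in a different order, because each page is reachable from the other.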