Class Notes (839,471)
Canada (511,354)
CS 100 (117)
Dan Brown (12)
Lecture

Module 8 notes.docx

6 Pages
88 Views

Department
Computer Science
Course Code
CS 100
Professor
Dan Brown

This preview shows pages 1 and half of page 2. Sign up to view the full 6 pages of the document.
Description
Module 8 notes 8.0 Searching and Indexing Today there are more than one trillion webpages[1], containing an incredible amount of information on every topic imaginable. Furthermore, the number of webpages is growing by billions of pages each day. Like the libraries of old, it is impossible to find what you are looking for by browsing around from page to page. Even if a computer could do the browsing for you and could load 1000 pages per second, it would take over 31 years to check every page one after the other. As you have experienced, however, Google and other search engines can return answers to your requests almost instantaneously. How do they do it? Before answering that question, let’s look a bit closer at the types of searches, or queries, that Google supports. (Similar facilities are provided by many other engines, but we’ll look at Google to make our discussion more concrete.) 8.1 Structure of a Text Index How do you quickly find which pages of a textbook describe computing, or twelve-tone composition, or green newts? You might try the table of contents to find relevant sections and then browse from there. A better way is to refer to the index located at the back of the book, which lists some words or phrases from the book and the pages on which they appear. Similarly, search engines rely on an index to find which webpages mention which words. In this section you'll learn how a text index is organized, and in the next you’ll learn how such an index is used to answer simple queries. Later modules will examine how the same index is used for more complex queries, and how it helps to present the list of matching webpages such that those most likely to be of interest are near the top of the list. 8.1.1 Inverted Files A webpage can be viewed as a sequence of words mixed in with a sequence of tags. For the time being, we'll ignore the tags and concentrate on the words that make up a webpage's content. The following example includes four pages containing one sentence each. This structure can be “inverted” so that each word references the pages on which it is found, instead of each page referencing the words found on that page. Waterloo → Pg 1 Pg 2 study → Pg 1 Pg 3 I → Pg 3 Pg 4 the → Pg 1 Pg 3 Pg 4 in → Pg 2 Pg 3 Pg 4 Bruce → Pg 3 Pg 4 Peninsula → Pg 4 … Now, given a particular search word, we can immediately find the list of pages on which that word appears, which is known as its postings list. Since the number of different words used on the Web is very large, the inverted file needs to be organized to make it easy to find the postings list for any particular word. We will assume the simplest organization, which is to keep all the words in a sorted list, using alphabetical order, just like the index in the back of a book. There are also more sophisticated structures that could be used to keep the collection of words, but that is beyond the scope of this course. 8.2 Matching a One-word Query Now that you know what an index looks like, let’s review what happens when you type a one-word query into a search engine’s search box. 1. First the search engine has to convert the query to the same form as its index terms. To start, the search term is converted to lower case. If it is a stop word, the engine will have to report that it cannot find a match (or, because such terms are so common that they appear on almost any page, the engine could select any small set of pages and then merely scan them to verify that they do indeed include the query term). If the term is not a stop word, then the engine can apply whatever stemming it used (if any) when creating the index. 2. Next the engine has to find the converted query term within its dictionary of index terms. If the dictionary is an alphabetically ordered list of index terms, then it looks up the term much as you would look up a word in a conventional dictionary. If no matching entry is found, the engine reports that no webpages match your query. 3. If a match to your converted query term is found, then the engine retrieves the associated postings list. Using the count of the number of postings, it can immediately report the number of webpages that match your query. It can also display the URIs for 10 of those pages (or however many you wish to see). For each URI it displays, it can also show a suitable snippet from the corresponding page. If you want to see additional matches, it can show the next 10, and then the next 10, until you’ve seen enough. How satisfied are you likely to be with the results? In an upcoming module, we’ll examine how the engine chooses which pages to show first in response to your query. For now, we’re interested in examining which webpages can be found somewhere in the collection of results. One problem we haven’t yet considered in this context results from synonymy, the existence of multiple words having the same (or almost the same) meaning. Look again at our original example with four webpages. If you were to look for pages matching the query term found, you would have
More Less
Unlock Document

Only pages 1 and half of page 2 are available for preview. Some parts have been intentionally blurred.

Unlock Document
You're Reading a Preview

Unlock to view full version

Unlock Document

Log In


OR

Join OneClass

Access over 10 million pages of study
documents for 1.3 million courses.

Sign up

Join to view


OR

By registering, I agree to the Terms and Privacy Policies
Already have an account?
Just a few more details

So we can recommend you notes for your school.

Reset Password

Please enter below the email address you registered with and we will send you a link to reset your password.

Add your courses

Get notes from the top students in your class.


Submit