Module 8 notes
8.0 Searching and Indexing
Today there are more than one trillion webpages, containing an incredible amount of information on
every topic imaginable. Furthermore, the number of webpages is growing by billions of pages each day. Like
the libraries of old, it is impossible to find what you are looking for by browsing around from page to page.
Even if a computer could do the browsing for you and could load 1000 pages per second, it would take over
31 years to check every page one after the other.
As you have experienced, however, Google and other search engines can return answers to your requests
almost instantaneously. How do they do it?
Before answering that question, let’s look a bit closer at the types of searches, or queries, that Google
supports. (Similar facilities are provided by many other engines, but we’ll look at Google to make our
discussion more concrete.)
8.1 Structure of a Text Index
How do you quickly find which pages of a textbook describe computing, or twelve-tone composition, or
green newts? You might try the table of contents to find relevant sections and then browse from there. A
better way is to refer to the index located at the back of the book, which lists some words or phrases from
the book and the pages on which they appear.
Similarly, search engines rely on an index to find which webpages mention which words. In this section you'll
learn how a text index is organized, and in the next you’ll learn how such an index is used to answer simple
queries. Later modules will examine how the same index is used for more complex queries, and how it helps
to present the list of matching webpages such that those most likely to be of interest are near the top of the
8.1.1 Inverted Files
A webpage can be viewed as a sequence of words mixed in with a sequence of tags. For the time being, we'll
ignore the tags and concentrate on the words that make up a webpage's content. The following example
includes four pages containing one sentence each. This structure can be “inverted” so that each word references the pages on which it is found, instead of
each page referencing the words found on that page.
Waterloo → Pg 1 Pg 2
study → Pg 1 Pg 3
I → Pg 3 Pg 4
the → Pg 1 Pg 3 Pg 4
in → Pg 2 Pg 3 Pg 4
Bruce → Pg 3 Pg 4
Peninsula → Pg 4
… Now, given a particular search word, we can immediately find the list of pages on which that word appears,
which is known as its postings list. Since the number of different words used on the Web is very large, the
inverted file needs to be organized to make it easy to find the postings list for any particular word. We will
assume the simplest organization, which is to keep all the words in a sorted list, using alphabetical order,
just like the index in the back of a book. There are also more sophisticated structures that could be used to
keep the collection of words, but that is beyond the scope of this course.
8.2 Matching a One-word Query
Now that you know what an index looks like, let’s review what happens when you type a one-word query
into a search engine’s search box.
1. First the search engine has to convert the query to the same form as its index terms. To start, the
search term is converted to lower case. If it is a stop word, the engine will have to report that it
cannot find a match (or, because such terms are so common that they appear on almost any page,
the engine could select any small set of pages and then merely scan them to verify that they do
indeed include the query term). If the term is not a stop word, then the engine can apply whatever
stemming it used (if any) when creating the index.
2. Next the engine has to find the converted query term within its dictionary of index terms. If the
dictionary is an alphabetically ordered list of index terms, then it looks up the term much as you
would look up a word in a conventional dictionary. If no matching entry is found, the engine reports
that no webpages match your query.
3. If a match to your converted query term is found, then the engine retrieves the associated postings
list. Using the count of the number of postings, it can immediately report the number of webpages
that match your query. It can also display the URIs for 10 of those pages (or however many you wish
to see). For each URI it displays, it can also show a suitable snippet from the corresponding page. If
you want to see additional matches, it can show the next 10, and then the next 10, until you’ve
How satisfied are you likely to be with the results?
In an upcoming module, we’ll examine how the engine chooses which pages to show first in response to your
query. For now, we’re interested in examining which webpages can be found somewhere in the collection of
One problem we haven’t yet considered in this context results from synonymy, the existence of multiple
words having the same (or almost the same) meaning. Look again at our original example with four
webpages. If you were to look for pages matching the query term found, you would have