Module 11 notes
Module 11 Introduction
When searching for a good recipe that uses squash and ginger, there are many pages on the Web that might
satisfy you. You might try the query recipe squash ginger to identify those pages, and many pages that
include those words could be of interest. However, there are also many pages that use the words but in a
different context. For example, there’s a webpage containing various answers to the question “What things
do you remember from your schooldays?” and among the answers are the phrases “recipe for disaster,”
“used to squash all 8 of us,” and “a boy with ginger hair.” On the other hand, there may be other cookbook
pages that don’t happen to mention the word “recipe,” even though it’s obvious that what’s written is a recipe.
How can a search engine separate the pages you want from those that just happen to use many of the same
words? As well as using an internal algorithm to ignore pages that are deemed to be completely irrelevant to
a user’s query, search engines have an internal algorithm to order, or rank, pages that might be relevant.
The effectiveness of these algorithms is measured using metrics such as precision and recall.
The goals of this fourth unit examining search engines are:
o to explain how search engines order the set of pages to be returned for a user’s query;
o to expose the principles behind evaluating a search engine’s effectiveness in ranking webpages;
o to explain how additional keywords in a search affect the quality of the results;
o to expose two approaches to iterative searching.
More specifically, by the end of the module, you will be able to:
o calculate precision and recall for a returned set of pages, each marked as relevant or not relevant;
o calculate precision at 1, precision at 10, and so forth for a returned sequence of pages, each
marked as relevant or not relevant;
o explain the fundamental ideas behind PageRank;
o contrast relevance feedback and query refinement.
Again, the content presented here is supplemented by material from Web Dragons. In particular,
you must read Chapter 4, pp. 110-143, concentrating on pp. 113-118.
11.0 Ranking
Among the billions of pages on the Web, which are the ones that should be returned to you when you pose a
query? If hundreds or thousands of webpages contain most of the words from your query, which ones should
be returned first and which should be further down in the list of results?
11.1 Measuring Retrieval Effectiveness
If you want to find the height of the CN tower, you might try to search the Web for a page that contains that
information. Perhaps you’ll find an article from Wikipedia, Guinness World Records, a Toronto tourism site,
or a local newspaper. You will not be satisfied with an article that gives the height of Shanghai’s Oriental
Pearl Tower and merely mentions that the CN Tower is not as big, without giving its actual height. The first
four articles are said to be relevant to your query, whereas the other one is said to be irrelevant, even
though it includes some matching terms.
Clearly most of the billions of webpages are irrelevant to any query you pose. The job of a search engine is
to identify relevant pages. If there are only a few relevant pages, they can all be shown to you. If many
webpages appear to be relevant to your search, then the engine must choose which ones to present first.
Unfortunately there is no way for the engine to determine whether or not a page is actually relevant for
your needs. You have some purpose in mind, you type in a few words that characterize the sorts of pages
you want to see, you examine the responses from the search engine, and you find that some of the
webpages contain the material you had in mind while others do not. It is ultimately you, the end user, who
decides which pages are relevant and which are not. In fact, the relevance of a page can differ from user to
user when entering the same query, since our backgrounds and information goals may be different.
Relevance is a particularly problematic concept when the user wishes to answer a negative question.
Imagine, for example, that you wanted to verify that no special event was scheduled at UW for May 15. You
might choose to search for Waterloo event “May 15”. If some page lists an event for that date, it is clearly
relevant to your query. However, if it turns out that there is no event scheduled for that day and no page
includes the statement “nothing is scheduled for May 15” or something similar, then which
pages are relevant for you to conclude that there is nothing scheduled for that day? Probably a page that
lists an event for May 14 is not relevant. However, is a page that lists several events in May, none of which
occur on the 15th, relevant or not? Luckily most search requests are intended to be positive queries rather
than verifying that something does not exist. In this course, we will concern ourselves with only positive
queries.
In order to make relevance judgements somewhat less subjective, we can imagine a jury of users who discuss
among themselves the intent of a search and which webpages among the billions out there are relevant to the search. In determining how well a search engine performs against a given search request, we assume the
unrealizable ideal in which every webpage is judged by our hypothetical jury to be either relevant or
irrelevant. (This ideal judgement is sometimes called the gold standard.) We then compare the pages
returned by the engine against this ideal, with four possible outcomes as follows:
o A page is relevant, and it is returned by the search engine.
o A page is relevant, but it is not returned by the search engine.
o A page is irrelevant, but it is returned by the search engine.
o A page is irrelevant, and it is not returned by the search engine.
A perfect search engine would return all the pages that are relevant and only pages that are relevant; that
is, every page would fall into either the first or the fourth category. In practice, however, this cannot be
achieved, and some pages fall into the second and third categories. For a given search and a particular
engine, we can count how many pages fall into each category, summarizing the results in a contingency
table as follows:
                relevant    irrelevant
returned        8930        1121
not returned    749         46,943,271,401

Sample contingency table

(All numbers are made up for this example.) Now imagine that we could do this calculation for every
possible query and take an average; the resulting contingency table might have similar numbers in each
category, but we will label them A, B, C, and D.

                relevant    irrelevant
returned        A           B
not returned    C           D

Useful search engines will have low values for B and C.
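The four counts are easy to derive once the relevance judgements and the returned set are known. Here is a small sketch (the function name and the tiny example data are ours, chosen for illustration):

```python
def contingency(relevant: set, returned: set, total_pages: int):
    """Count the four contingency-table cells A, B, C, D."""
    a = len(relevant & returned)   # relevant and returned
    b = len(returned - relevant)   # irrelevant but returned
    c = len(relevant - returned)   # relevant but not returned
    d = total_pages - a - b - c    # irrelevant and not returned
    return a, b, c, d

# A made-up example: pages 1-6, of which 1-3 are relevant,
# and the engine returns pages 2, 3, and 4.
print(contingency({1, 2, 3}, {2, 3, 4}, 6))  # (2, 1, 1, 2)
```

Note that on the real Web, only D requires knowing the total number of pages; A, B, and C are determined by the judgements and the returned set alone.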
1. Assume that 9 out of a total of 20 pages are relevant for a given query, as follows:
URI Judgement URI Judgement
www/p1 irrelevant www/p11 irrelevant
www/p2 relevant www/p12 irrelevant
www/p3 irrelevant www/p13 relevant
www/p4 irrelevant www/p14 relevant
www/p5 relevant www/p15 irrelevant
www/p6 relevant www/p16 irrelevant
www/p7 irrelevant www/p17 relevant
www/p8 relevant www/p18 irrelevant
www/p9 relevant www/p19 irrelevant
www/p10 relevant www/p20 irrelevant
2. Calculate the contingency table for an engine that returns pages www/p3, www/p4, www/p6,
www/p9, www/p10, www/p12, www/p14, www/p15, www/p18, and www/p19 for that query.
3. Calculate the contingency table for an engine that returns only page www/p9 for that same query.
4. Calculate the contingency table for an engine that returns all 20 pages for that same query.

11.1.2 Precision
One way to tune a search engine is to concentrate on returning relevant pages only, i.e., trying to make the
value B in the contingency table shown above as small as possible. This goal is paramount when we don’t
wish to burden the user with sifting through irrelevant pages. For example, if you need to pay some cost for
looking at the list of returned pages (such as a per-character text messaging charge), if your time to search
through returned pages is severely limited, or if the page list has to be displayed on a small screen or read
out by a text-to-voice converter, you’ll want to have the engine return relevant pages only. In order to
evaluate how well an engine is performing in achieving this goal, we calculate the precision of its response
to queries, as follows:
P = A / ( A + B )
That is, we consider all the pages returned in response to a query (there are A+B such pages) and determine
the fraction of those pages that are relevant to the query. The higher the precision, the fewer irrelevant
pages are returned.
Consider, for example, the sample contingency table shown in the previous section. The precision for that
query is calculated as follows:
P = 8930 / ( 8930 + 1121 )
P = 8930 / 10051
P = .89
Notice that the precision must be a value between 0 and 1 (inclusive), and the closer it is to 1, the more
likely it is that pages shown to us by the engine are relevant to our needs.
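The worked example above can be checked in a couple of lines (a sketch; the function name is our own):

```python
def precision(a: int, b: int) -> float:
    """P = A / (A + B): the fraction of returned pages that are relevant."""
    return a / (a + b)

# The sample contingency table from Section 11.1.1:
print(round(precision(8930, 1121), 2))  # 0.89, matching the calculation above
```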
5. Calculate the precision for each of the contingency tables you produced in the previous exercises.
11.1.3 Recall
An alternative way to tune a search engine is to concentrate on returning as many relevant pages as
possible, i.e., trying to make the value C in the contingency table in Section 11.1.1 as small as possible.
This goal is paramount when the user does not want to miss out on seeing any pages that could be relevant.
For example, you might not want to miss any pages that describe the activities of your favourite actor, you
might want to see all the reviews of a restaurant or a recent concert, or you might be trying to get an accurate sense of the reception for a product you just released on the market. In order to evaluate how well
an engine is performing in achieving this goal, we calculate the recall of its response to queries, as follows:
R = A / ( A + C )
That is, we consider all the pages that are relevant to a query (there are A+C such pages) and determine the
fraction of those pages that are returned by the search engine. The higher the recall, the fewer relevant
pages are overlooked.
Consider again the sample contingency table shown in Section 11.1.1. The recall for that query is calculated
as follows:
R = 8930 / ( 8930 + 749 )
R = 8930 / 9679
R = .92
Notice that, like precision, the recall must also be a value between 0 and 1 (inclusive), and the closer it is to
1, the more likely it is that a relevant page will not be missed among the results returned by the engine.
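As with precision, the formula translates directly into code (again a sketch with our own function name):

```python
def recall(a: int, c: int) -> float:
    """R = A / (A + C): the fraction of relevant pages that are returned."""
    return a / (a + c)

# The sample contingency table from Section 11.1.1:
print(round(recall(8930, 749), 2))  # 0.92, matching the calculation above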
6. Calculate the recall for each of the contingency tables you produced in the previous exercises.
Observe from your answers to Practice Exercises 4 and 5 that as more pages are returned to the user, the
recall might improve (when some of the additional pages returned are relevant to the query), but the
precision might deteriorate (when a larger fraction of the additional pages returned are irrelevant to the
query).
11.1.4 Precision at 10
The computation of precision and recall may be valuable for situations in which the total number of relevant
pages is known and it is fairly small. However, this is usually not the case on the Web, where
hundreds of thousands of pages are often relevant to a user’s search request. Instead, in response to a
search, an engine typically returns 10 pages, and if the user wants to see more, the next 10 pages are
returned. In this situation, precision is important, but recall is of little value in measuring a search engine’s
performance, since the user does not wish to see all the relevant webpages.
Therefore, to measure the effectiveness of a search engine for the Web, we usually do not consider the total
set of pages that could be returned, but instead consider the ordered sequence of pages that are returned.
How many of the first 10 pages are relevant? If you ask for the next 10 pages, how many of those will be
relevant?
As a result, the performance metric that is adopted for comparing the effectiveness of Web search engines
is precision at n: what fraction of the first n webpages returned are relevant? Typically the value used for n is
10, in which case the measure is known as precision at 10.
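The metric is simple to compute from a ranked list of results. A minimal sketch, using a made-up ranking and judgement set of our own:

```python
def precision_at(n: int, ranked: list, relevant: set) -> float:
    """Fraction of the first n returned pages that are judged relevant."""
    top = ranked[:n]
    return sum(1 for page in top if page in relevant) / len(top)

# A made-up five-page ranking and its relevance judgements:
ranked_pages = ["www/p9", "www/p2", "www/p11", "www/p8", "www/p1"]
judged_relevant = {"www/p2", "www/p8", "www/p9"}
print(precision_at(5, ranked_pages, judged_relevant))  # 0.6
```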
7. Consider again the table of relevance judgements in Practice Exercise 1. Assume that a search
engine returns 16 pages in the following order:
1:             2: www/p2     3: www/p8     4:
5: www/p9      6: www/p4     7: www/p5     8:
9: www/p3      10: www/p20   11: www/p14   12: www/p13
13: www/p17    14: www/p6    15: www/p19   16: www/p1
8. Calculate the precision at 10.
9. For that same ordered sequence of pages, calculate the precision at 5 and the precision at 12.
Precision at 10, 20, 30, etc. are valuable when a search engine returns 10 pages at a time to a user’s
requests. However, another special case that is quite useful is to determine how good a search engine is at
identifying a single page that is likely to contain the answer to a user’s question. That is, we might be
particularly interested in whether or not the first (and only the first) page returned has the answer to our
query. To this end, we need to calculate precision at 1.
11.2 Improving Retrieval Effectiveness
11.2.1 Evidence of Relevance
A search engine does not actually know the relevance of a page with respect to a user’s query. In fact,
because relevance is a subjective measure, there is no way for an engine to calculate relevance. The only
approach that a search engine can take is to guess what the user had in mind when he or she entered a
query and to evaluate each page to guess whether or not that page could be relevant. The engine
then ranks the pages by putting the ones it deems to be more relevant ahead of ones it deems to be less
relevant. The hope is that this will improve measures such as precision at 10 and, by extension, the results
will be more useful to the user.
To rank pages, the search engine tries to find evidence that a page is likely relevant. A major piece of
evidence is that the page includes the terms that appear in the query. If Boolean connectors are used, the
engine checks whether the terms on the page match the existence or non-existence of the terms as dictated
by the placement of AND, OR, and AND NOT in the query. Matching pages can be found using the indexing
and search techniques we studied in the previous two modules, and these matching pages can be ranked
ahead of pages that do not include the appropriate terms.
However, consider again the query recipe squash ginger. Just having the correct terms does not necessarily
imply that a page is relevant. Therefore, more work needs to be done to refine the ranking of pages.
One technique is to measure how far apart the query terms appear from each other on a page. The
presumption here is that pages that include the terms recipe, squash, and ginger in close proximity to each
other are more likely to describe recipes that include squash and ginger as ingredients than are pages where
these terms are widely separated. This technique is particularly useful for queries such as computer science
education where pages that include that exact phrase are likely to be particularly relevant, even if the user
might also be satisfied with pages that do not use those three words in a row or contain instead phrases such
as “education of computer scientists.” Thus we also want to rank pages higher if the search terms appear in
the same order on the page, and especially if they all appear together as a phrase on the page.
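As a toy illustration of this idea (a scoring scheme of our own devising, not any engine's actual algorithm), a page can be scored by whether the query terms appear as an exact phrase, in query order, or merely somewhere on the page:

```python
def contains_phrase(words: list, terms: list) -> bool:
    """True if the terms appear consecutively, in order, within words."""
    n = len(terms)
    return any(words[i:i + n] == terms for i in range(len(words) - n + 1))

def proximity_score(page_text: str, terms: list) -> int:
    words = page_text.lower().split()
    if any(term not in words for term in terms):
        return 0        # some query term is missing entirely
    if contains_phrase(words, terms):
        return 3        # exact phrase: strongest evidence
    positions = [words.index(term) for term in terms]
    if positions == sorted(positions):
        return 2        # all terms present, in query order
    return 1            # all terms present, but scattered

# Terms present but out of order, so the weakest positive score:
print(proximity_score("squash and ginger soup recipe",
                      ["recipe", "squash", "ginger"]))  # 1
```

A real engine would also weigh the distances between the terms, not just their order, but the ranking principle is the same: the tighter the match, the higher the score.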
A second technique is to rank pages that include the query terms in titles and headings ahead of pages that
include the terms elsewhere in the text. Here the presumption is that terms in titles and headings are likely
to describe what the page is about, and that users will be more satisfied with pages that focus on the topics
requested than those that merely mention them in passing. Following this idea, a page that has a title “My
favourite recipes” and a heading “Squash and ginger soup” is likely a good match for the query recipe
squash ginger and therefore should be ranked higher than other pages having those terms elsewhere.
Similarly, the terms might be especially indicative of a page’s content if they appear on that page in italics,
bold, or all uppercase.
A third technique is to count how often terms appear on a page, under the assumption that the more times
the term appears, the more likely that term reflects the topic covered by the page. For example, a page that has 20 instances of the term “recipe” is quite likely to be a page of recipes, and a page that also often
mentions “squash” might be a page that describes several recipes for squash in particular. These pages are
more likely to be relevant than others that mention each term once or twice, and they should therefore be
ranked closer to the top of the list. Remember, however, that this is just a guess being made by the search
engine; it does not know what the pages are really about, nor what you really meant to find when you
entered a query with those three search terms!
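A toy version of this counting technique (for illustration only; real engines weight term frequencies far more carefully):

```python
from collections import Counter

def term_frequency_score(page_text: str, terms: list) -> int:
    """Total number of occurrences of the query terms on the page."""
    counts = Counter(page_text.lower().split())
    return sum(counts[term] for term in terms)

# A made-up page with 20 occurrences of "recipe", 2 of "squash", 1 of "ginger":
page = "recipe " * 20 + "squash squash ginger"
print(term_frequency_score(page, ["recipe", "squash", "ginger"]))  # 23
```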
The terms that appear on a webpage, and the placement of those terms on the page, are not the only
sources of evidence for relevance. By taking into account the nature of the Web, some other techniques can
also be used.
For example, every page also has a URI, and these often provide hints about the topics described on the
pages. Which page would you likely prefer to see in response to the query Air Canada: the one at
www.aircanada.com or a newspaper article describing the size of Air Canada’s fleet of airplanes?
A piece of evidence that turns out to be especially important is the anchor text for hyperlinks that point to a
page. Suppose you came across a page that included the following text as part of its content, where the phrase “a great recipe for squash and ginger soup” is a hyperlink to another page:
While browsing the net last week, I stumbled across a great recipe
for squash and ginger soup. You should try it.
Even though this page uses all the search terms from the query recipe squash ginger, and they appear close
together and in the correct order, it may not itself be a relevant page for the query. However, the link’s
target page at www.cookbook.com/r456.html is almost certainly relevant, regardless of how the terms
appear on that page. If the authors of many other webpages also claim that that page is the one to see for
“squash recipes” or “cooking with ginger” or other expressions using the search terms, then the likelihood of
it being relevant to the query increases even more.
In fact, it turns out that how other people describe your page in the anchor texts of their hyperlinks to your
page is often a much better characterization of your page than is the specific collection of terms you have
chosen to use on your page. This realization, more than any other, has improved the effectiveness of search
engines’ ranking of pages in response to searches. It is also one of the major contributors to Google’s ability
to provide a button labeled “I’m Feeling Lucky” that takes you directly to the page that it ranks as the best
match instead of showing a list of the top 10 matches. This feature is possible because examining the
hyperlinks to each page, rather than concentrating on the contents of each page in isolation, tremendously
improves precision at 1.
10. Open the page www.google.ca. Type the query University of Waterloo and click on the button
labeled “I’m Feeling Lucky.” Try this with recipe squash ginger and with several other searches and notice that overall Google has excellent performance when measured in terms of precision at 1. Of
course no search engine can have perfect precision at 1, since it can’t read your mind. (For
example, when you entered recipe squash ginger, Google could not determine whether you meant
to find soup recipes or purée recipes.)
Using the information in hyperlinks is beneficial in ranking pages even if the anchor texts within those links
are ignored. Just the fact that many pages link to a specific page, say www.p102, indicates that lots of
people believe www.p102 to be an important page. Thus, the existence of hyperlinks to a page is evidence
of that page’s prestige. Furthermore, if many of the pages containing those links are themselves prestigious,
that raises the target page’s prestige even more.
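Counting in-links is straightforward once the link graph is known. A sketch over a made-up set of (source, target) hyperlinks:

```python
from collections import Counter

links = [  # (source page, target page); a made-up link graph
    ("www/p1", "www/p102"),
    ("www/p2", "www/p102"),
    ("www/p3", "www/p102"),
    ("www/p1", "www/p7"),
]
in_link_counts = Counter(target for _, target in links)
print(in_link_counts["www/p102"])  # 3: three pages vouch for www/p102
```

This simple count treats every linking page as equally trustworthy; the refinement described next, where links from prestigious pages count for more, is exactly what PageRank adds.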
If the techniques listed in Section 11.2.1 result in many pages being ranked approximately equal to each
other, we could try to improve our precision at 10 (or at any other cut-off value) by ordering these pages
according to their prestige. With this approach, it is more likely that users will be satisfied with the pages
ranked highly by a search engine. For example, if a page from the Imprint or from MathNews seems to have
the same rank as one from the Globe and Mail when using the techniques from the previous section, more
users are likely to consider the Globe and Mail article as relevant because of its higher prestige.
11.2.2.1 A Simple Model for the Random Surfer
When Sergey Brin and Larry Page initially developed the Google search engine, they created a measure of
prestige called PageRank. The idea is to estimate the importance of each page by evaluating how many
other important pages reference it, while simultaneously taking into account how discriminating those other
pages are in selecting this page and not some others. Computing the PageRank for billions of webpages
involves sophisticated programming techniques, but you can understand the basic ideas without the complex
mathematics underlying the calculations.
Imagine a user who naïvely browses from one page to another, from there to a third page, and so forth,
without end. We will assume that if there are 10 hyperlinks on a page, then this user arbitrarily chooses one
of them to follow, with each of the 10 links equally likely to be chosen. If there are 50 links on a page, then
there is a 1 in 50 chance of choosing any particular link.
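This random-surfer idea can be sketched as a simple iteration over a made-up three-page link graph: each page repeatedly passes its current score, split equally, to the pages it links to. (Real PageRank also adds a damping factor to model the surfer occasionally jumping to a random page; that refinement is omitted here.)

```python
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # a made-up link graph

# Start each page with an equal share of the total score.
rank = {page: 1 / len(links) for page in links}

for _ in range(100):  # iterate until the scores settle
    new_rank = {page: 0.0 for page in links}
    for page, targets in links.items():
        share = rank[page] / len(targets)  # each outgoing link equally likely
        for target in targets:
            new_rank[target] += share
    rank = new_rank

print({page: round(score, 2) for page, score in rank.items()})
```

For this graph the scores settle near A = 0.4, B = 0.2, and C = 0.4: page C is the most linked-to, but B receives only half of A's score because A splits its endorsement between two links.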