University of Waterloo
Computer Science
CS 100
Dan Brown

Module 11 notes

Module 11 Introduction

When searching for a good recipe that uses squash and ginger, there are many pages on the Web that might satisfy you. You might try the query recipe squash ginger to identify those pages, and many pages that include those words could be of interest. However, there are also many pages that use the words but in a different context. For example, there’s a webpage containing various answers to the question “What things do you remember from your schooldays?” and among the answers are the phrases “recipe for disaster,” “used to squash all 8 of us,” and “a boy with ginger hair.” On the other hand, there may be other cookbook pages that don’t happen to mention the word “recipe,” even though it’s obvious that what’s written is a recipe.

How can a search engine separate the pages you want from those that just happen to use many of the same words? As well as using an internal algorithm to ignore pages that are deemed to be completely irrelevant to a user’s query, search engines have an internal algorithm to order, or rank, pages that might be relevant. The effectiveness of these algorithms is measured using metrics such as precision and recall.

The goals of this fourth unit examining search engines are:

o to explain how search engines order the set of pages to be returned for a user’s query;
o to expose the principles behind evaluating a search engine’s effectiveness in ranking webpages;
o to explain how additional keywords in a search affect the quality of the results;
o to expose two approaches to iterative searching.

More specifically, by the end of the module, you will be able to:

o calculate precision and recall for a returned set of pages, each marked as relevant or not relevant;
o calculate precision at 1, precision at 10, and so forth for a returned sequence of pages, each marked as relevant or not relevant;
o explain the fundamental ideas behind PageRank;
o contrast relevance feedback and query refinement.
Again, the content presented here is supplemented by material from Web Dragons. In particular, you must read Chapter 4, pp. 110-143, concentrating on pp. 113-118.

11.0 Ranking

Among the billions of pages on the Web, which are the ones that should be returned to you when you pose a query? If hundreds or thousands of webpages contain most of the words from your query, which ones should be returned first and which should be further down in the list of results?

11.1 Measuring Retrieval Effectiveness

11.1.1 Relevance

If you want to find the height of the CN Tower, you might try to search the Web for a page that contains that information. Perhaps you’ll find an article from Wikipedia, Guinness World Records, a Toronto tourism site, or a local newspaper. You will not be satisfied with an article that gives the height of Shanghai’s Oriental Pearl Tower and merely mentions that the CN Tower is not as big, without giving its actual height. The first four articles are said to be relevant to your query, whereas the other one is said to be irrelevant, even though it includes some matching terms.

Clearly most of the billions of webpages are irrelevant to any query you pose. The job of a search engine is to identify relevant pages. If there are only a few relevant pages, they can all be shown to you. If many webpages appear to be relevant to your search, then the engine must choose which ones to present first.

Unfortunately, there is no way for the engine to determine whether or not a page is actually relevant for your needs. You have some purpose in mind, you type in a few words that characterize the sorts of pages you want to see, you examine the responses from the search engine, and you find that some of the webpages contain the material you had in mind while others do not. It is ultimately you, the end user, who decides which pages are relevant and which are not.
In fact, the relevance of a page can differ from user to user when entering the same query, since our backgrounds and information goals may be different.

Relevance is a particularly problematic concept when the user wishes to answer a negative question. Imagine, for example, that you wanted to verify that no special event was scheduled at UW for May 15. You might choose to search for Waterloo event “May 15”. If some page lists an event for that date, it is clearly relevant to your query. However, if it turns out that there is no event scheduled for that day and no page includes the statement “nothing is scheduled for May 15” or something similar, then which pages are relevant for you to conclude that there is nothing scheduled for that day? Probably a page that lists an event for May 14 is not relevant. However, is a page that lists several events in May, none of which occur on the 15th, relevant or not? Luckily most search requests are intended to be positive queries rather than verifying that something does not exist. In this course, we will concern ourselves with only positive queries.

In order to make relevance judgement somewhat less subjective, we can imagine a jury of users who discuss among themselves the intent of a search and which webpages among the billions out there are relevant to the search. In determining how well a search engine performs against a given search request, we assume the unrealizable ideal in which every webpage is judged by our hypothetical jury to be either relevant or irrelevant. (This ideal judgement is sometimes called the gold standard.) We then compare the pages returned by the engine against this ideal, with four possible outcomes as follows:

o A page is relevant, and it is returned by the search engine.
o A page is relevant, but it is not returned by the search engine.
o A page is irrelevant, but it is returned by the search engine.
o A page is irrelevant, and it is not returned by the search engine.
A perfect search engine would return all the pages that are relevant and only pages that are relevant; that is, every page would fall into either the first or the fourth category. In practice, however, this cannot be achieved, and some pages fall into the second and third categories. For a given search and a particular engine, we can count how many pages fall into each category, summarizing the results in a contingency table as follows:

                  relevant      irrelevant
  returned            8930            1121
  not returned         749  46,943,271,401

Sample contingency table (All numbers are made up for this example.)

Now imagine that we could do this calculation for every possible query and take an average; the resulting contingency table might have similar numbers in each category, but we will label them A, B, C, and D.

                  relevant      irrelevant
  returned               A               B
  not returned           C               D

Useful search engines will have low values for B and C.

Practice Exercises

1. Assume that 9 out of a total of 20 pages are relevant for a given query, as follows:

   URI        Judgement       URI        Judgement
   www/p1     irrelevant      www/p11    irrelevant
   www/p2     relevant        www/p12    irrelevant
   www/p3     irrelevant      www/p13    relevant
   www/p4     irrelevant      www/p14    relevant
   www/p5     relevant        www/p15    irrelevant
   www/p6     relevant        www/p16    irrelevant
   www/p7     irrelevant      www/p17    relevant
   www/p8     relevant        www/p18    irrelevant
   www/p9     relevant        www/p19    irrelevant
   www/p10    relevant        www/p20    irrelevant

2. Calculate the contingency table for an engine that returns pages www/p3, www/p4, www/p6, www/p9, www/p10, www/p12, www/p14, www/p15, www/p18, and www/p19 for that query.

3. Calculate the contingency table for an engine that returns only page www/p9 for that same query.

4. Calculate the contingency table for an engine that returns all 20 pages for that same query.

11.1.2 Precision

One way to tune a search engine is to concentrate on returning relevant pages only, i.e., trying to make the value B in the contingency table shown above as small as possible.
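The four contingency-table cells can be counted mechanically once the gold-standard judgements and the engine's returned set are known. A minimal Python sketch with a made-up collection (the page names and judgements here are invented for illustration, so the practice-exercise answers are not given away):

```python
# Gold-standard judgements for a tiny hypothetical collection.
relevant = {"www/a", "www/b", "www/c"}             # judged relevant by the jury
irrelevant = {"www/d", "www/e", "www/f", "www/g"}  # judged irrelevant

# Pages a hypothetical engine returned for some query.
returned = {"www/a", "www/b", "www/d"}

# The four cells of the contingency table, via set operations.
A = len(relevant & returned)      # relevant and returned
B = len(irrelevant & returned)    # irrelevant but returned
C = len(relevant - returned)      # relevant but not returned
D = len(irrelevant - returned)    # irrelevant and not returned

print(A, B, C, D)  # → 2 1 1 3
```

A perfect engine would make B and C both zero; here one irrelevant page slipped in and one relevant page was missed.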
This goal is paramount when we don’t wish to burden the user with sifting through irrelevant pages. For example, if you need to pay some cost for looking at the list of returned pages (such as a per-character text messaging charge), if your time to search through returned pages is severely limited, or if the page list has to be displayed on a small screen or read out by a text-to-voice converter, you’ll want to have the engine return relevant pages only.

In order to evaluate how well an engine is performing in achieving this goal, we calculate the precision of its response to queries, as follows:

P = A / ( A + B )

That is, we consider all the pages returned in response to a query (there are A+B such pages) and determine the fraction of those pages that are relevant to the query. The higher the precision, the fewer irrelevant pages are returned.

Consider, for example, the sample contingency table shown in the previous section. The precision for that query is calculated as follows:

P = 8930 / ( 8930 + 1121 )
P = 8930 / 10051
P = .89

Notice that the precision must be a value between 0 and 1 (inclusive), and the closer it is to 1, the more likely it is that the pages the engine shows us are relevant to our needs.

Practice Exercises

1. Calculate the precision for each of the contingency tables you produced in the previous exercises.

11.1.3 Recall

An alternative way to tune a search engine is to concentrate on returning as many relevant pages as possible, i.e., trying to make the value C in the contingency table in Section 11.1.1 as small as possible. This goal is paramount when the user does not want to miss out on seeing any pages that could be relevant. For example, you might not want to miss any pages that describe the activities of your favourite actor, you might want to see all the reviews of a restaurant or a recent concert, or you might be trying to get an accurate sense of the reception for a product you just released on the market.
In order to evaluate how well an engine is performing in achieving this goal, we calculate the recall of its response to queries, as follows:

R = A / ( A + C )

That is, we consider all the pages that are relevant to a query (there are A+C such pages) and determine the fraction of those pages that are returned by the search engine. The higher the recall, the fewer relevant pages are overlooked.

Consider again the sample contingency table shown in Section 11.1.1. The recall for that query is calculated as follows:

R = 8930 / ( 8930 + 749 )
R = 8930 / 9679
R = .92

Notice that, like precision, the recall must also be a value between 0 and 1 (inclusive), and the closer it is to 1, the more likely it is that a relevant page will not be missed among the results returned by the engine.

Practice Exercises

1. Calculate the recall for each of the contingency tables you produced in the previous exercises.

Observe from your answers to the previous practice exercises that as more pages are returned to the user, the recall might improve (when some of the additional pages returned are relevant to the query), but the precision might deteriorate (when a larger fraction of the additional pages returned are irrelevant to the query).

11.1.4 Precision at 10

The computation of precision and recall may be valuable for situations in which the total number of relevant pages is known and it is fairly small. However, this situation is not usually feasible for the Web, where hundreds of thousands of pages are often relevant to a user’s search request. Instead, in response to a search, an engine typically returns 10 pages, and if the user wants to see more, the next 10 pages are returned. In this situation, precision is important, but recall is of little value in measuring a search engine’s performance, since the user does not wish to see all the relevant webpages.
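The two formulas above translate directly into code. A minimal sketch, checked against the numbers from the sample contingency table in Section 11.1.1:

```python
def precision(A, B):
    """Fraction of returned pages that are relevant: P = A / (A + B)."""
    return A / (A + B)

def recall(A, C):
    """Fraction of relevant pages that were returned: R = A / (A + C)."""
    return A / (A + C)

# Counts from the sample contingency table in Section 11.1.1.
A, B, C = 8930, 1121, 749

print(round(precision(A, B), 2))  # → 0.89
print(round(recall(A, C), 2))     # → 0.92
```

Both functions always return a value between 0 and 1, since the numerator A can never exceed either denominator.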
Therefore, to measure the effectiveness of a search engine for the Web, we usually do not consider the total set of pages that could be returned, but instead consider the ordered sequence of pages that are returned. How many of the first 10 pages are relevant? If you ask for the next 10 pages, how many of those will be relevant? As a result, the performance metric that is adopted for comparing the effectiveness of Web search engines is precision at n: what fraction of the first n webpages returned are relevant? Typically the value used for n is 10, in which case the measure is known as precision at 10.

Practice Exercises

1. Consider again the table of relevance judgements in Practice Exercise 1. Assume that a search engine returns 16 pages in the following order:

   1: www/p2       9: www/p3
   2: www/p8      10: www/p20
   3: www/p12     11: www/p14
   4: www/p16     12: www/p13
   5: www/p9      13: www/p17
   6: www/p4      14: www/p6
   7: www/p5      15: www/p19
   8: www/p10     16: www/p1

2. Calculate the precision at 10.

3. For that same ordered sequence of pages, calculate the precision at 5 and the precision at 12.

Precision at 10, 20, 30, etc. are valuable when a search engine returns 10 pages at a time to a user’s requests. However, another special case that is quite useful is to determine how well a search engine is at identifying a single page that is likely to contain the answer to a user’s question. That is, we might be particularly interested in whether or not the first (and only the first) page returned has the answer to our query. To this end, we need to calculate precision at 1.

11.2 Improving Retrieval Effectiveness

11.2.1 Evidence of Relevance

A search engine does not actually know the relevance of a page with respect to a user’s query. In fact, because relevance is a subjective measure, there is no way for an engine to calculate relevance.
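Before moving on, the precision-at-n measure from Section 11.1.4 can also be sketched in code. The ranking and judgements below are hypothetical (so as not to give away the practice-exercise answers):

```python
def precision_at(n, ranking, relevant):
    """Fraction of the first n pages in the ranked list that are relevant."""
    top = ranking[:n]
    return sum(1 for page in top if page in relevant) / n

# Hypothetical ranked results and gold-standard relevant set.
ranking = ["www/q3", "www/q7", "www/q1", "www/q9", "www/q2"]
relevant = {"www/q1", "www/q3"}

print(precision_at(1, ranking, relevant))  # → 1.0 (the first page is relevant)
print(precision_at(5, ranking, relevant))  # → 0.4 (2 of the first 5 are relevant)
```

Note that, unlike plain precision, this measure depends on the order of the results, which is exactly why it suits ranked Web search.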
The only approach that a search engine can take is to guess what the user had in mind when he or she entered a query and to evaluate each page to guess whether or not that page could be relevant. The engine then ranks the pages by putting the ones it deems to be more relevant ahead of ones it deems to be less relevant. The hope is that this will improve measures such as precision at 10 and, by extension, that the results will be more useful to the user.

To rank pages, the search engine tries to find evidence that a page is likely relevant. A major piece of evidence is that the page includes the terms that appear in the query. If Boolean connectors are used, the engine checks whether the terms on the page match the existence or non-existence of the terms as dictated by the placement of AND, OR, and AND NOT in the query. Matching pages can be found using the indexing and search techniques we studied in the previous two modules, and these matching pages can be ranked ahead of pages that do not include the appropriate terms.

However, consider again the query recipe squash ginger. Just having the correct terms does not necessarily imply that a page is relevant. Therefore, more work needs to be done to refine the ranking of pages.

One technique is to measure how far apart the query terms appear from each other on a page. The presumption here is that pages that include the terms recipe, squash, and ginger in close proximity to each other are more likely to describe recipes that include squash and ginger as ingredients than are pages where these terms are widely separated.
This technique is particularly useful for queries such as computer science education where pages that include that exact phrase are likely to be particularly relevant, even if the user might also be satisfied with pages that do not use those three words in a row or contain instead phrases such as “education of computer scientists.” Thus we also want to rank pages higher if the search terms appear in the same order on the page, and especially if they all appear together as a phrase on the page.

A second technique is to rank pages that include the query terms in titles and headings ahead of pages that include the terms elsewhere in the text. Here the presumption is that terms in titles and headings are likely to describe what the page is about, and that users will be more satisfied with pages that focus on the topics requested than with those that merely mention them in passing. Following this idea, a page that has a title “My favourite recipes” and a heading “Squash and ginger soup” is likely a good match for the query recipe squash ginger and therefore should be ranked higher than other pages having those terms elsewhere. Similarly, the terms might be especially indicative of a page’s content if they appear on that page in italics, bold, or all uppercase.

A third technique is to count how often terms appear on a page, under the assumption that the more times a term appears, the more likely that term reflects the topic covered by the page. For example, a page that has 20 instances of the term “recipe” is quite likely to be a page of recipes, and a page that also often mentions “squash” might be a page that describes several recipes for squash in particular. These pages are more likely to be relevant than others that mention each term once or twice, and they should therefore be ranked closer to the top of the list.
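To make the three techniques concrete, here is a toy scoring function that combines term frequency, a title bonus, and a proximity bonus. The weights and the scoring scheme are invented purely for illustration; real engines combine many more signals, with proprietary weightings:

```python
def toy_score(query_terms, title, body_words):
    """Illustrative ranking score: term frequency, plus bonuses for
    query terms in the title and for all terms appearing close together."""
    score = 0.0
    for term in query_terms:
        score += body_words.count(term)      # technique 3: term frequency
        if term in title.lower().split():
            score += 5.0                     # technique 2: title/heading bonus
    # Technique 1: proximity bonus if all query terms occur together
    # within some 5-word window of the body text.
    for i in range(len(body_words)):
        window = set(body_words[i:i + 5])
        if all(term in window for term in query_terms):
            score += 10.0
            break
    return score

query = ["squash", "ginger"]
page1 = toy_score(query, "Squash and ginger soup",
                  "this squash and ginger soup recipe is easy".split())
page2 = toy_score(query, "My schooldays",
                  "used to squash all of us a boy with ginger hair".split())
print(page1 > page2)  # → True: the soup page outranks the schooldays page
```

The cookbook-style page wins on all three signals, which matches the intuition from the recipe squash ginger example.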
Remember, however, that this is just a guess being made by the search engine; it does not know what the pages are really about, nor what you really meant to find when you entered a query with those three search terms!

The terms that appear on a webpage, and the placement of those terms on the page, are not the only sources of evidence for relevance. By taking into account the nature of the Web, some other techniques can also be used. For example, every page also has a URI, and these often provide hints about the topics described on the pages. Which page would you likely prefer to see in response to the query Air Canada: the airline’s own homepage or a newspaper article describing the size of Air Canada’s fleet of airplanes?

A piece of evidence that turns out to be especially important is the anchor text for hyperlinks that point to a page. Suppose you came across a page that included the following text as part of its content:

   While browsing the net last week, I stumbled across a great recipe for squash and ginger soup. You should try it.

Even though this page uses all the search terms from the query recipe squash ginger, and they appear close together and in the correct order, it may not itself be a relevant page for the query. However, the page targeted by the hyperlink is almost certainly relevant, regardless of how the terms appear on that page. If the authors of many other webpages also claim that that page is the one to see for “squash recipes” or “cooking with ginger” or other expressions using the search terms, then the likelihood of it being relevant to the query increases even more.

In fact, it turns out that how other people describe your page in the anchor texts of their hyperlinks to your page is often a much better characterization of your page than is the specific collection of terms you have chosen to use on your page. This realization, more than any other, has improved the effectiveness of search engines’ ranking of pages in response to searches.
It is also one of the major contributors to Google’s ability to provide a button labeled “I’m Feeling Lucky” that takes you directly to the page that it ranks as the best match instead of showing a list of the top 10 matches. This feature is possible because examining the hyperlinks to each page, rather than concentrating on the contents of each page in isolation, tremendously improves precision at 1.

Practice Exercises

1. Open the Google search page. Type the query University of Waterloo and click on the button labeled “I’m Feeling Lucky.” Try this with recipe squash ginger and with several other searches and notice that overall Google has excellent performance when measured in terms of precision at 1. Of course no search engine can have perfect precision at 1, since it can’t read your mind. (For example, when you entered recipe squash ginger, Google could not determine whether you meant to find soup recipes or purée recipes.)

11.2.2 PageRank

Using the information in hyperlinks is beneficial in ranking pages even if the anchor texts within those links are ignored. Just the fact that many pages link to a specific page, say www.p102, indicates that lots of people believe www.p102 to be an important page. Thus, the existence of hyperlinks to a page is evidence of that page’s prestige. Furthermore, if many of the pages containing those links are themselves prestigious, that raises the target page’s prestige even more.

If the techniques listed in Section 11.2.1 result in many pages being ranked approximately equal to each other, we could try to improve our precision at 10 (or at any other cut-off value) by ordering these pages according to their prestige. With this approach, it is more likely that users will be satisfied with the pages ranked highly by a search engine.
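The simplest version of this prestige signal is just to count incoming links. A minimal Python sketch over a made-up link graph (the page names and links are invented for illustration; PageRank, discussed in this section, refines the idea by also weighting who is doing the linking):

```python
# A made-up link graph: each page maps to the pages it links to.
links = {
    "www.p100": ["www.p102", "www.p103"],
    "www.p101": ["www.p102"],
    "www.p103": ["www.p102"],
    "www.p102": ["www.p100"],
}

# Crude prestige: count the incoming links for each page.
in_links = {}
for source, targets in links.items():
    for target in targets:
        in_links[target] = in_links.get(target, 0) + 1

print(in_links["www.p102"])  # → 3: the most linked-to, hence most "prestigious"
```

Counting alone treats a link from an obscure page the same as a link from a prestigious one; removing that weakness is precisely what PageRank is designed to do.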
For example, if a page from the Imprint or from MathNews seems to have the same rank as one from the Globe and Mail when using the techniques from the previous section, more users are likely to consider the Globe and Mail article as relevant because of its higher prestige.

A Simple Model for the Random Surfer

When Sergey Brin and Larry Page initially developed the Google search engine, they created a measure of prestige called PageRank. The idea is to estimate the importance of each page by evaluating how many other important pages reference it, while simultaneously taking into account how discriminating those other pages are in selecting this page and not some others. Computing the PageRank for billions of webpages involves sophisticated programming techniques, but you can understand the basic ideas without the complex mathematics underlying the calculations.

Imagine a user who naïvely browses from one page to another, from there to a third page, and so forth, without end. We will assume that if there are 10 hyperlinks on a page, then this user arbitrarily chooses one of them to follow, with each of the 10 links equally likely to be chosen. If there are 50 links on a page, then there is a 1 in 50 chance of choosing any particular link.