Class Notes (837,691)
Canada (510,396)
CS 100 (117)
Dan Brown (12)

Module 9 notes.docx

22 Pages
Unlock Document

Computer Science
CS 100
Dan Brown

Module 9 notes 9.0 Advanced Querying When you are searching for some information, you express your needs as a query, which could be as simple as writing down one or two words. In the last module we saw that the search engine uses postings lists to store the locations of index terms within pages. The task of the search engine when servicing your request is to match your query against the collection of index terms and related postings lists to find the pages that match your needs. In this module we’ll examine more advanced ways to express those needs in order to produce better search results. 9.1 Boolean Queries As we saw at the end of the last module, searching for individual words often does not provide enough power to express what you’re really seeking. Sometimes webpage authors choose from a variety of words to describe one concept, and other times a single word has a wide variety of meanings, each of which would return too many irrelevant pages for your purposes. The standard solution is to use several words, and many systems allow you to indicate explicitly how the search engine should relate those to the index terms extracted from webpages. 9.1.1 Boolean Operators A postings list represents the set of webpages that includes a particular term. When you are searching for certain webpages, you use those same terms to describe the set of pages that you wish to retrieve. In effect, you are trying to describe your desired set of pages as a combination of the sets known to the search engine. In 1854, George Boole devised a simple system for expressing combinations of sets, and this so- called Boolean algebra is the most common system used today. Twenty-five years later, John Venn developed a pictorial way to interpret Boolean operators, and these have become known as Venn diagrams. For example, we can represent the fact that Ashley, Jean, Lee, and Nicky have all taken CS 100 by drawing a circle for students who have taken CS 100and including those four students inside the circle: Similarly, we can represent the fact that Alex, Jean, Jo, Nicky, and Pat are in the Faculty of Arts: We can also show both these facts simultaneously as follows: Notice that Nicky and Jean are inside both circles, since they are in both CS 100 and the Faculty of Arts, but Ashley and Lee are in the CS 100 circle only and Alex, Jo, and Pat are in the Arts circle only. A similar diagram could be used to represent this information about a set of 7 different students, or about 35 students or 35,000 students, except that we would not have the space to write each student’s name. Therefore when using Venn diagrams, we omit the names of individuals, and understand the meanings of the regions of the circles as if there were names present: the leftmost region includes all students taking CS 100 and not in the Faculty of Arts, the middle region includes all students in CS 100 and the Faculty of Arts, and the rightmost region represents students in the Faculty of Arts that are not in CS 100. Furthermore, instead of writing the names of the members of the sets inside the circles, the names for the sets themselves are often written there instead, as in the following illustration: Boolean AND Let’s assume that we want to find webpages that mention both Ontario and Quebec. The search term Ontario identifies one set of pages, and the search term Quebec identifies another set of pages. The pages that we want are those that appear in both sets, which can be expressed by the query Ontario AND Quebec. Similarly webpages about Canadian universities can be requested with the expression Canada AND university. (We will follow the practice used by Google and other search engines to use uppercase for writing Boolean operators.) The operator AND specifies that the webpages of interest are those that appear on both lists. This can be expressed pictorially using a Venn diagram as follows: Ontario AND Quebec The circle on the left represents all the webpages that include the term Ontario, and the one on the right represents all those that include the termQuebec. The red segment, which falls inside both circles, represents all the webpages that include both terms. We can use the AND operator to connect more than two terms. For example, to find webpages describing games that use money and dice you could pose the query game AND money AND dice. This query can also be represented by a Venn diagram: game AND money game AND money AND dice In both these figures, the upper left circle represents all webpages that mention game, the upper right circle represents all pages that mention money, and the lower circle represents all pages that mention dice. The red segment in the diagram on the left therefore represents all webpages that include both the terms game and money (compare it to the previous diagram, ignoring the third circle). This diagram can be produced by shading in everything inside the circle for game and then erasing the shading that is outside the circle for money. The diagram on the right represents the complete query, formed by also erasing the shading that falls outside the circle for dice. In the result, the red segment is the part that falls inside all three circles and therefore represents webpages that include all three terms. Boolean OR If we wish to find all webpages about people in the theatre, we might notice that most pages use the term actor, but some pages use the word actressinstead. Now, rather than finding all webpages that mention both terms, we need to find all those that mention either one of the terms. In Boolean algebra we specify this query as actor OR actress. In response, if a webpage includes the term actor or that page includes the word actress (or it includes both words), it should be included in our results. Again we can represent the query using a Venn diagram: actor OR actress When we use the Boolean OR, we want to find all pages that fall inside either circle. Notice that if a webpage happens to use both terms actor and actress, it will also be included in the results. This property is sometimes stated explicitly as being an inclusive or, as opposed to an exclusive or (XOR), which specifies one side or the other but not both. Practice Exercises 1. Draw the Venn diagram for animal OR vegetable OR mineral. Boolean queries may contain both AND and OR operators, but it is important to apply these in the correct order. In the same way that it is important to know whether to multiply or add first when working out an arithmetic formula (recall BEDMAS from high school math), you need to be aware of theprecedence of operators when working out a Boolean formula. In the absence of parenthesis, AND is always applied before OR, but we can (and sometimes must) use parentheses to signal which operators to apply first. 1. Draw a Venn diagram for three terms, and label the circles sunny, warm, and pleasant. We will develop a Venn diagram that represents the query (sunny AND warm) OR pleasant, where the parentheses are used to indicate that the AND is to be applied to the first two terms before the OR is applied to that result and the third term. o In pencil, lightly shade in the first circle to represent all pages that match the one-word query sunny. o Next, to represent the pages that match sunny AND warm, we need to represent the operator AND. Erase all shading outside the circle labeled warm; that is, we will leave whatever was shaded before that also falls inside the circle for warm. (The result should look like the Venn diagram for game AND money above.) o Now we need to process the next operator, namely the OR. To do this we need to also include everything inside the circle for the term pleasant. Therefore, shade in the rest of the circle representing pleasant. (sunny AND warm) OR pleasant 1. Following the same form of steps, draw the Venn diagram for sunny AND (warm OR pleasant): this time start with the part inside parentheses by shading in the circle for warm, then shading in the rest of the circle for pleasant, and finally erasing everything that is not inside the circle for sunny. Note that the final diagram is not the same as the one for Exercise 2, which shows that the two queries (which differ only in where the parentheses are placed) are not identical in meaning. What are the characteristics of the pages that would be included in the answer to the query for Exercise 2 that would not be included in the answer to the query here? Are there any pages that would be included in the answer to this query but not included in the answer to the query for Exercise 2? 2. Using the step-by-step technique described in the previous question, draw the Venn diagram for (rainy OR stormy) AND cold. 3. Draw a Venn diagram to represent webpages that mention actor or actress but do not mention both terms, that is, actor XOR actress, where XOR represents exclusive or. 4. Draw Venn diagrams to represent the following two queries: (coins OR stamps) AND collect (coins AND collect) OR (stamps AND collect) What can you conclude about these two Boolean queries? Be careful when using Boolean AND; common English usage does not always agree with the Boolean interpretation. For example, if I am going to Europe, I might ask you to lend me books on France and Italy. However, I am really asking for books on either country: a Boolean request for these books would be France OR Italy. When posing a Boolean query, think about the Venn diagrams to be sure you are using the correct operators. Boolean AND NOT To explicitly eliminate some sets of webpages from your results, you can specify that a term does not appear on the page. For example, if you wish to see pages about planets other than the Earth, you could use the query planet AND NOT Earth, which corresponds to the Venn diagram shown below on the right: planet and Earth planet and not Earth The pages that match this query are all those that include the term planet except for those that also include the term Earth. Notice the relationship between the query planet AND NOT Earth and the query planet AND Earth: all pages that include the term planet match either one of the queries, but not both. In more formal approaches, a separate operator NOT is defined instead of specifying the operator AND NOT. However, in practice NOT is almost always used in the context AND NOT, which is the only form we will use in this course. Placing a - in front of a query term is interpreted by many search engines as the operator AND NOT; thus France –Italy is requesting pages that mention France and do not mention Italy. Practice Exercises 1. Draw the Venn diagram for (sunny OR warm) AND NOT pleasant. 2. Draw the Venn diagram for sunny OR (warm AND NOT pleasant). Note that this answer is different from the previous one — the placement of the parentheses again makes a difference! 3. Compare the queries (jaguar AND NOT car) AND NOT cat to the query jaguar AND NOT (car AND NOT cat) by drawing Venn diagrams for both. Are these queries equivalent to each other (that is, are the sets of pages that match them necessarily the same)? If not, what pages would be included in the one but not the other? Which interpretation does Google use for the query jaguar –car –cat? 4. Express what is represented by the following Venn diagram as a Boolean query: Start by identifying one of the terms that is partly included in the answer, say fruit. Next determine what needs to be added or removed: we need to add some of grain. Therefore, so far we have fruit OR grain. Draw the Venn diagram for fruit OR grain to see whether we are done yet. Continue to determine whether there are more parts that need to be added or removed: we need to remove the parts that are inside the circle for food. So that gives us (fruit OR grain) AND NOT food. Note that the parts can be added or removed in many different orders: for example, (fruit AND NOT food) OR (grain AND NOT food) is another query that gives the same result. 5. Express what is represented by the following Venn diagram as a Boolean query: (There are several correct answers; any one is acceptable.) 6. There is a kids’ game known as rock-paper-scissors. What query would you issue to find pages that discuss any of those objects? What query would you issue to find pages that discuss all of the objects (on a single page)? How about the last two objects but not the first one? 7. Queries are not restricted to having at most three terms; they can get as large as you wish. (Venn diagrams for more than three terms, however, are too complex to consider.) What query would you issue to find pages that discuss any of the five Great Lakes (Huron, Ontario, Michigan, Erie, Superior)? 8. What query would you issue to find pages that describe banks but do not mention either finances or rivers? A Note On Precedence When evaluating a query with mutilple operators, it's important to understand which one has precedence, i.e., which one goes first. This was alluded to briefly in the discussion of Boolean OR, and compared to the use of BEDMAS for evaluating arithmetic expressions. It turns out that the order of operation is immaterial when there are two AND operators in a row (such as movies AND actors AND Frenchman) or two OR operators in a row (such as plays OR actors OR Greek): for these two cases, identical results are obtained whichever operator is evaluated first. However, if the operators are mixed (ANDs with ORs) or the operator AND NOT is used, then it makes a difference which is evaluated first. Practice Exercises 1. Draw the VENN diagrams for the two queries  o (plants AND NOT flowers) AND NOT European, and o plants AND NOT (flowers AND NOT Eurpoean). Note that they are not identical. This means that the results of evaluating these queries will not always be the same. Therefore, when given the query plants AND NOT flowers AND NOT European, we must use a convention to determine which AND NOT operator to evaluate first. To avoid any confusion, in this course we will always include parentheses to show which operator to evaluate first unless the order is immaterial: always evaluate the operator inside the parentheses first. 9.1.2 Merging Postings Lists In this section we’ll examine how a search engine processes a Boolean query. As you saw in the last module, each term corresponds to a postings list. Here we describe how the Boolean operators can be applied to those lists to produce the correct results. Note that when dealing with Boolean operators, the counts per page and offsets in postings lists are ignored. A real search engine might use this extra information to help determine the importance of the results, i.e. they would influence the ranking algorithm, which we’ll explore more in Module 11. Boolean AND Let’s look again at the query Ontario AND Quebec. If we assume for now that each posting has only the unique identifier for the page (its pageid, such as its URI), the first few entries on the postings list for the term Ontario might include the following entries: Pg3 Pg6 Pg13 Pg16 Pg19 Pg29 Pg31 Pg32 Similarly, the start of the postings list for the term Quebec might include the following entries: Pg5 Pg6 Pg11 Pg12 Pg16 Pg27 Pg32 Pg33 Pg37 For simplicity, let us assume that the postings shown are the only postings on those lists (that is, the first list includes eight pages and the second list includes nine). Let us also assume that the postings are sorted in increasing order by pageid. Which pages match the query? Think of the Venn diagram — we want the pages that are ―inside both circles‖ — meaning we want to find the pages whose pageids appear on both postings lists. For our example, this means the pages with pageids Pg6, Pg16, and Pg32. The operation we need to carry out to produce the result of Term1AND Term i2 to intersect the postings lists for Ter1 and Term2. Here is one method you can use to can find their intersection: Intersection algorithm For each element in the first list, look up whether it also exists in the second list. 1. If it does, then include the element in the result. 2. Otherwise, omit that element from the result. A precise set of instructions to perform a particular task, such as the steps above to find the intersection of two lists, is called an algorithm. We could write an algorithm for finding our way to the nearest bookstore (usually called directions), knitting a baby blanket (usually called knitting instructions), or preparing a dish of curried lamb (usually called a recipe). Let’s apply the intersection algorithm to the example above. We will indicate the algorithm’s location in each of the postings lists by showing the corresponding pageid in red. We start by considering the first element from the first list: Pg3 Pg6 Pg13 Pg16 Pg19 Pg29 Pg31 Pg32 Pg5 Pg6 Pg11 Pg12 Pg16 Pg27 Pg32 Pg33 Pg37 Since Pg3 cannot be found in the second list, we omit it from the result and move on to the next element from that list: Pg3 Pg6 Pg13 Pg16 Pg19 Pg29 Pg31 Pg32 Pg5 Pg6 Pg11 Pg12 Pg16 Pg27 Pg32 Pg33 Pg37 This element is found on the second list, so the first pageid in the intersection is Pg6, which represents the first page that includes both of the index terms Ontario and Quebec. We now go on to the next element: Pg3 Pg6 Pg13 Pg16 Pg19 Pg29 Pg31 Pg32 Pg5 Pg6 Pg11 Pg12 Pg16 Pg27 Pg32 Pg33 Pg37 Again it is not found on the second list, and so Pg13 is omitted from the result, and we proceed to the next element: Pg3 Pg6 Pg13 Pg16 Pg19 Pg29 Pg31 Pg32 Pg5 Pg6 Pg11 Pg12 Pg16 Pg27 Pg32 Pg33 Pg37 Because Pg16 is also on the second list, it becomes the second pageid in the intersection. We continue in this way, finding that Pg32 is the third and last pageid in the intersection. The result of intersecting two lists can be easily made into a third list; for the above example, the resulting list can be represented as: Pg6 Pg16 Pg32 This algorithm can be illustrated using a spreadsheet: If the values of the two input postings lists are given in columns B and C, as shown above, the intersection can be computed in column E by looking up each element from column B in the table of values given in column C. Practice Exercises 1. What formula would you put in cell E3 and fill down to calculate the intersection of the two postings lists? (You will need to use either the function ISNA() or the function ISERROR() to test whether or not VLOOKUP() fails to find a match.) 2. Find the intersection of the following two lists: Pg7 Pg9 Pg15 Pg17 Pg21 Pg3 Pg4 Pg9 Pg10 Pg17 Pg18 Notice that the number of elements in an intersection can be no more than the number of elements in the smaller of the two input postings lists. The result of intersecting two lists is a list that looks just like any other postings list, except it doesn’t have counts per page and offset values. (Unfortunately there is no simple way to have the spreadsheet automatically squeeze the empty cells out of the result list, so the illustration above shows the result list in column I rewritten by hand.) Therefore, it is easy to compute the results of Ontario AND Quebec AND Alberta: merely take the resulting list from Ontario AND Quebec and intersect it with the postings list for Alberta. If the list for Alberta contains the following entries: Pg5 Pg7 Pg8 Pg12 Pg16 Pg21 Pg22 Pg30 Pg32 then the result of intersecting it with the intersection of the lists for Ontario AND Quebec (as above) would be Pg16 Pg32 which represents a postings list for those pages that include all three index terms. In describing each algorithm in this module, we present one method to match the given type of query, where the method is chosen, in part, because it can be illustrated using spreadsheets. In practice, commercial search engines use a variety of approaches to shorten the time to perform matches, and so the algorithms
More Less

Related notes for CS 100

Log In


Join OneClass

Access over 10 million pages of study
documents for 1.3 million courses.

Sign up

Join to view


By registering, I agree to the Terms and Privacy Policies
Already have an account?
Just a few more details

So we can recommend you notes for your school.

Reset Password

Please enter below the email address you registered with and we will send you a link to reset your password.

Add your courses

Get notes from the top students in your class.