false

Class Notes
(837,691)

Canada
(510,396)

University of Waterloo
(18,601)

Computer Science
(798)

CS 100
(117)

Dan Brown
(12)

Lecture

Unlock Document

Computer Science

CS 100

Dan Brown

Winter

Description

Module 9 notes
9.0 Advanced Querying
When you are searching for some information, you express your needs as a query, which could be as simple
as writing down one or two words. In the last module we saw that the search engine uses postings lists to
store the locations of index terms within pages. The task of the search engine when servicing your request is
to match your query against the collection of index terms and related postings lists to find the pages that
match your needs. In this module we’ll examine more advanced ways to express those needs in order to
produce better search results.
9.1 Boolean Queries
As we saw at the end of the last module, searching for individual words often does not provide enough power
to express what you’re really seeking. Sometimes webpage authors choose from a variety of words to
describe one concept, and other times a single word has a wide variety of meanings, each of which would
return too many irrelevant pages for your purposes. The standard solution is to use several words, and many
systems allow you to indicate explicitly how the search engine should relate those to the index terms
extracted from webpages.
9.1.1 Boolean Operators
A postings list represents the set of webpages that includes a particular term. When you are searching for
certain webpages, you use those same terms to describe the set of pages that you wish to retrieve. In effect,
you are trying to describe your desired set of pages as a combination of the sets known to the search engine.
In 1854, George Boole devised a simple system for expressing combinations of sets, and this so-
called Boolean algebra is the most common system used today. Twenty-five years later, John
Venn developed a pictorial way to interpret Boolean operators, and these have become known as Venn
diagrams. For example, we can represent the fact that Ashley, Jean, Lee, and Nicky have all taken CS 100 by
drawing a circle for students who have taken CS 100and including those four students inside the circle: Similarly, we can represent the fact that Alex, Jean, Jo, Nicky, and Pat are in the Faculty of Arts:
We can also show both these facts simultaneously as follows:
Notice that Nicky and Jean are inside both circles, since they are in both CS 100 and the Faculty of Arts, but
Ashley and Lee are in the CS 100 circle only and Alex, Jo, and Pat are in the Arts circle only.
A similar diagram could be used to represent this information about a set of 7 different students, or about 35
students or 35,000 students, except that we would not have the space to write each student’s name.
Therefore when using Venn diagrams, we omit the names of individuals, and understand the meanings of the
regions of the circles as if there were names present: the leftmost region includes all students taking CS 100
and not in the Faculty of Arts, the middle region includes all students in CS 100 and the Faculty of Arts, and
the rightmost region represents students in the Faculty of Arts that are not in CS 100. Furthermore, instead
of writing the names of the members of the sets inside the circles, the names for the sets themselves are
often written there instead, as in the following illustration: Boolean AND
Let’s assume that we want to find webpages that mention both Ontario and Quebec. The search
term Ontario identifies one set of pages, and the search term Quebec identifies another set of pages. The
pages that we want are those that appear in both sets, which can be expressed by the query Ontario AND
Quebec. Similarly webpages about Canadian universities can be requested with the expression Canada AND
university. (We will follow the practice used by Google and other search engines to use uppercase for writing
Boolean operators.)
The operator AND specifies that the webpages of interest are those that appear on both lists. This can be
expressed pictorially using a Venn diagram as follows:
Ontario AND Quebec
The circle on the left represents all the webpages that include the term Ontario, and the one on the right
represents all those that include the termQuebec. The red segment, which falls inside both circles,
represents all the webpages that include both terms.
We can use the AND operator to connect more than two terms. For example, to find webpages describing
games that use money and dice you could pose the query game AND money AND dice. This query can also be
represented by a Venn diagram: game AND money game AND money AND dice
In both these figures, the upper left circle represents all webpages that mention game, the upper right
circle represents all pages that mention money, and the lower circle represents all pages that mention dice.
The red segment in the diagram on the left therefore represents all webpages that include both the
terms game and money (compare it to the previous diagram, ignoring the third circle). This diagram can be
produced by shading in everything inside the circle for game and then erasing the shading that is outside the
circle for money. The diagram on the right represents the complete query, formed by also erasing the
shading that falls outside the circle for dice. In the result, the red segment is the part that falls inside all
three circles and therefore represents webpages that include all three terms.
Boolean OR
If we wish to find all webpages about people in the theatre, we might notice that most pages use the
term actor, but some pages use the word actressinstead. Now, rather than finding all webpages that
mention both terms, we need to find all those that mention either one of the terms. In Boolean algebra we
specify this query as actor OR actress. In response, if a webpage includes the term actor or that page
includes the word actress (or it includes both words), it should be included in our results.
Again we can represent the query using a Venn diagram:
actor OR actress When we use the Boolean OR, we want to find all pages that fall inside either circle. Notice that if a
webpage happens to use both terms actor and actress, it will also be included in the results. This property is
sometimes stated explicitly as being an inclusive or, as opposed to an exclusive or (XOR), which specifies
one side or the other but not both.
Practice Exercises
1. Draw the Venn diagram for animal OR vegetable OR mineral.
Boolean queries may contain both AND and OR operators, but it is important to apply these in the correct
order. In the same way that it is important to know whether to multiply or add first when working out an
arithmetic formula (recall BEDMAS from high school math), you need to be aware of theprecedence of
operators when working out a Boolean formula. In the absence of parenthesis, AND is always applied before
OR, but we can (and sometimes must) use parentheses to signal which operators to apply first.
1. Draw a Venn diagram for three terms, and label the circles sunny, warm, and pleasant. We will
develop a Venn diagram that represents the query (sunny AND warm) OR pleasant, where the
parentheses are used to indicate that the AND is to be applied to the first two terms before the OR
is applied to that result and the third term.
o In pencil, lightly shade in the first circle to represent all pages that match the one-word query
sunny.
o Next, to represent the pages that match sunny AND warm, we need to represent the operator AND.
Erase all shading outside the circle labeled warm; that is, we will leave whatever was shaded before
that also falls inside the circle for warm. (The result should look like the Venn diagram for game
AND money above.)
o Now we need to process the next operator, namely the OR. To do this we need to also include
everything inside the circle for the term pleasant. Therefore, shade in the rest of the circle
representing pleasant.
(sunny AND warm) OR pleasant 1. Following the same form of steps, draw the Venn diagram for sunny AND (warm OR pleasant): this
time start with the part inside parentheses by shading in the circle for warm, then shading in the
rest of the circle for pleasant, and finally erasing everything that is not inside the circle for sunny.
Note that the final diagram is not the same as the one for Exercise 2, which shows that the two
queries (which differ only in where the parentheses are placed) are not identical in meaning. What
are the characteristics of the pages that would be included in the answer to the query for Exercise 2
that would not be included in the answer to the query here? Are there any pages that would be
included in the answer to this query but not included in the answer to the query for Exercise 2?
2. Using the step-by-step technique described in the previous question, draw the Venn diagram for
(rainy OR stormy) AND cold.
3. Draw a Venn diagram to represent webpages that mention actor or actress but do not mention both
terms, that is, actor XOR actress, where XOR represents exclusive or.
4. Draw Venn diagrams to represent the following two queries:
(coins OR stamps) AND collect
(coins AND collect) OR (stamps AND collect)
What can you conclude about these two Boolean queries?
Be careful when using Boolean AND; common English usage does not always agree with the Boolean
interpretation. For example, if I am going to Europe, I might ask you to lend me books on France and Italy.
However, I am really asking for books on either country: a Boolean request for these books would be France
OR Italy. When posing a Boolean query, think about the Venn diagrams to be sure you are using the correct
operators.
Boolean AND NOT
To explicitly eliminate some sets of webpages from your results, you can specify that a term does not appear
on the page. For example, if you wish to see pages about planets other than the Earth, you could use the
query planet AND NOT Earth, which corresponds to the Venn diagram shown below on the right:
planet and Earth
planet and not Earth The pages that match this query are all those that include the term planet except for those that also include
the term Earth. Notice the relationship between the query planet AND NOT Earth and the query planet AND
Earth: all pages that include the term planet match either one of the queries, but not both.
In more formal approaches, a separate operator NOT is defined instead of specifying the operator AND NOT.
However, in practice NOT is almost always used in the context AND NOT, which is the only form we will use
in this course. Placing a - in front of a query term is interpreted by many search engines as the operator
AND NOT; thus France –Italy is requesting pages that mention France and do not mention Italy.
Practice Exercises
1. Draw the Venn diagram for (sunny OR warm) AND NOT pleasant.
2. Draw the Venn diagram for sunny OR (warm AND NOT pleasant). Note that this answer is different
from the previous one — the placement of the parentheses again makes a difference!
3. Compare the queries (jaguar AND NOT car) AND NOT cat to the query jaguar AND NOT (car AND NOT
cat) by drawing Venn diagrams for both. Are these queries equivalent to each other (that is, are the
sets of pages that match them necessarily the same)? If not, what pages would be included in the
one but not the other? Which interpretation does Google use for the query jaguar –car –cat?
4. Express what is represented by the following Venn diagram as a Boolean query:
Start by identifying one of the terms that is partly included in the answer, say fruit. Next determine
what needs to be added or removed: we need to add some of grain. Therefore, so far we have fruit
OR grain. Draw the Venn diagram for fruit OR grain to see whether we are done yet. Continue to
determine whether there are more parts that need to be added or removed: we need to remove the
parts that are inside the circle for food. So that gives us (fruit OR grain) AND NOT food. Note that
the parts can be added or removed in many different orders: for example, (fruit AND NOT food) OR
(grain AND NOT food) is another query that gives the same result.
5. Express what is represented by the following Venn diagram as a Boolean query: (There are several correct answers; any one is acceptable.)
6. There is a kids’ game known as rock-paper-scissors. What query would you issue to find pages that
discuss any of those objects? What query would you issue to find pages that discuss all of the objects
(on a single page)? How about the last two objects but not the first one?
7. Queries are not restricted to having at most three terms; they can get as large as you wish. (Venn
diagrams for more than three terms, however, are too complex to consider.) What query would you
issue to find pages that discuss any of the five Great Lakes (Huron, Ontario, Michigan, Erie,
Superior)?
8. What query would you issue to find pages that describe banks but do not mention either finances or
rivers?
A Note On Precedence
When evaluating a query with mutilple operators, it's important to understand which one has precedence,
i.e., which one goes first. This was alluded to briefly in the discussion of Boolean OR, and compared to the
use of BEDMAS for evaluating arithmetic expressions. It turns out that the order of operation is immaterial
when there are two AND operators in a row (such as movies AND actors AND Frenchman) or two OR
operators in a row (such as plays OR actors OR Greek): for these two cases, identical results are obtained
whichever operator is evaluated first. However, if the operators are mixed (ANDs with ORs) or the operator
AND NOT is used, then it makes a difference which is evaluated first.
Practice Exercises
1. Draw the VENN diagrams for the two queries
o (plants AND NOT flowers) AND NOT European, and
o plants AND NOT (flowers AND NOT Eurpoean). Note that they are not identical. This means that the results of evaluating these queries will not
always be the same. Therefore, when given the query plants AND NOT flowers AND NOT
European, we must use a convention to determine which AND NOT operator to evaluate first.
To avoid any confusion, in this course we will always include parentheses to show which operator to evaluate
first unless the order is immaterial: always evaluate the operator inside the parentheses first.
9.1.2 Merging Postings Lists
In this section we’ll examine how a search engine processes a Boolean query. As you saw in the last module,
each term corresponds to a postings list. Here we describe how the Boolean operators can be applied to
those lists to produce the correct results.
Note that when dealing with Boolean operators, the counts per page and offsets in postings lists are
ignored. A real search engine might use this extra information to help determine the importance of the
results, i.e. they would influence the ranking algorithm, which we’ll explore more in Module 11.
Boolean AND
Let’s look again at the query Ontario AND Quebec. If we assume for now that each posting has only the
unique identifier for the page (its pageid, such as its URI), the first few entries on the postings list for the
term Ontario might include the following entries:
Pg3 Pg6 Pg13 Pg16 Pg19 Pg29 Pg31 Pg32
Similarly, the start of the postings list for the term Quebec might include the following entries:
Pg5 Pg6 Pg11 Pg12 Pg16 Pg27 Pg32 Pg33 Pg37
For simplicity, let us assume that the postings shown are the only postings on those lists (that is, the first list
includes eight pages and the second list includes nine). Let us also assume that the postings are sorted in
increasing order by pageid. Which pages match the query? Think of the Venn diagram — we want the pages
that are ―inside both circles‖ — meaning we want to find the pages whose pageids appear on both postings
lists. For our example, this means the pages with pageids Pg6, Pg16, and Pg32.
The operation we need to carry out to produce the result of Term1AND Term i2 to intersect the postings
lists for Ter1 and Term2. Here is one method you can use to can find their intersection:
Intersection algorithm For each element in the first list, look up whether it also exists in the second list.
1. If it does, then include the element in the result.
2. Otherwise, omit that element from the result.
A precise set of instructions to perform a particular task, such as the steps above to find the intersection of
two lists, is called an algorithm. We could write an algorithm for finding our way to the nearest bookstore
(usually called directions), knitting a baby blanket (usually called knitting instructions), or preparing a dish
of curried lamb (usually called a recipe).
Let’s apply the intersection algorithm to the example above. We will indicate the algorithm’s location in
each of the postings lists by showing the corresponding pageid in red. We start by considering the first
element from the first list:
Pg3 Pg6 Pg13 Pg16 Pg19 Pg29 Pg31 Pg32
Pg5 Pg6 Pg11 Pg12 Pg16 Pg27 Pg32 Pg33 Pg37
Since Pg3 cannot be found in the second list, we omit it from the result and move on to the next element
from that list:
Pg3 Pg6 Pg13 Pg16 Pg19 Pg29 Pg31 Pg32
Pg5 Pg6 Pg11 Pg12 Pg16 Pg27 Pg32 Pg33 Pg37
This element is found on the second list, so the first pageid in the intersection is Pg6, which represents the
first page that includes both of the index terms Ontario and Quebec. We now go on to the next element:
Pg3 Pg6 Pg13 Pg16 Pg19 Pg29 Pg31 Pg32
Pg5 Pg6 Pg11 Pg12 Pg16 Pg27 Pg32 Pg33 Pg37
Again it is not found on the second list, and so Pg13 is omitted from the result, and we proceed to the next
element: Pg3 Pg6 Pg13 Pg16 Pg19 Pg29 Pg31 Pg32
Pg5 Pg6 Pg11 Pg12 Pg16 Pg27 Pg32 Pg33 Pg37
Because Pg16 is also on the second list, it becomes the second pageid in the intersection. We continue in
this way, finding that Pg32 is the third and last pageid in the intersection.
The result of intersecting two lists can be easily made into a third list; for the above example, the resulting
list can be represented as:
Pg6 Pg16 Pg32
This algorithm can be illustrated using a spreadsheet:
If the values of the two input postings lists are given in columns B and C, as shown above, the intersection
can be computed in column E by looking up each element from column B in the table of values given in
column C.
Practice Exercises
1. What formula would you put in cell E3 and fill down to calculate the intersection of the two postings
lists? (You will need to use either the function ISNA() or the function ISERROR() to test whether or
not VLOOKUP() fails to find a match.)
2. Find the intersection of the following two lists:
Pg7 Pg9 Pg15 Pg17 Pg21 Pg3 Pg4 Pg9 Pg10 Pg17 Pg18
Notice that the number of elements in an intersection can be no more than the number of elements in the
smaller of the two input postings lists.
The result of intersecting two lists is a list that looks just like any other postings list, except it doesn’t have
counts per page and offset values. (Unfortunately there is no simple way to have the spreadsheet
automatically squeeze the empty cells out of the result list, so the illustration above shows the result list in
column I rewritten by hand.) Therefore, it is easy to compute the results of Ontario AND Quebec AND
Alberta: merely take the resulting list from Ontario AND Quebec and intersect it with the postings list for
Alberta. If the list for Alberta contains the following entries:
Pg5 Pg7 Pg8 Pg12 Pg16 Pg21 Pg22 Pg30 Pg32
then the result of intersecting it with the intersection of the lists for Ontario AND Quebec (as above) would
be
Pg16 Pg32
which represents a postings list for those pages that include all three index terms.
In describing each algorithm in this module, we present one method to match the given type of query, where
the method is chosen, in part, because it can be illustrated using spreadsheets. In practice, commercial
search engines use a variety of approaches to shorten the time to perform matches, and so the algorithms

More
Less
Related notes for CS 100

Join OneClass

Access over 10 million pages of study

documents for 1.3 million courses.

Sign up

Join to view

Continue

Continue
OR

By registering, I agree to the
Terms
and
Privacy Policies

Already have an account?
Log in

Just a few more details

So we can recommend you notes for your school.

Reset Password

Please enter below the email address you registered with and we will send you a link to reset your password.