CS100 Lecture Notes - Terra Incognita, Bow Tie, Robots Exclusion Standard
Document Summary
So far, we have assumed that an index of all the pages on the web exists, and we have relied on that index to answer simple and compound searches. In this module we examine how that index is created and maintained. To create an index for the web, we need to fetch each webpage, one after the other, and collect all the terms used on that page. It turns out that the hyperlinks pointing from one page to another form a web-like structure that we can crawl, gathering pages as we go. If someone handed us webpages one after another, we could form the postings for each page and merge them into one collection of postings lists covering the whole collection of pages.
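The process described above can be sketched in a few lines of Python. This is a minimal illustration, not a real crawler: it assumes a small in-memory "web" (the hypothetical PAGES dictionary stands in for fetching pages over the network), follows hyperlinks breadth-first, and builds an inverted index mapping each term to the sorted list of pages that contain it.

```python
from collections import deque

# Hypothetical in-memory "web": each page has its text and outgoing links.
# In a real crawler, fetching a URL over the network replaces this lookup.
PAGES = {
    "a.html": {"text": "web index crawl", "links": ["b.html"]},
    "b.html": {"text": "crawl the web",   "links": ["a.html", "c.html"]},
    "c.html": {"text": "index postings",  "links": []},
}

def crawl_and_index(seed):
    """Follow hyperlinks breadth-first from the seed page, collecting
    the terms on each page into one set of postings lists."""
    index = {}          # term -> set of pages containing it
    seen = {seed}       # pages already discovered, so we fetch each one once
    frontier = deque([seed])
    while frontier:
        url = frontier.popleft()
        page = PAGES[url]
        # Form the postings for this page and merge them into the index.
        for term in page["text"].split():
            index.setdefault(term, set()).add(url)
        # Crawl along the hyperlinks to reach pages not yet seen.
        for link in page["links"]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    # Sort each postings list for the final index.
    return {term: sorted(pages) for term, pages in index.items()}

idx = crawl_and_index("a.html")
print(idx["web"])    # pages whose text contains the term "web"
```

Starting from "a.html", the crawl reaches all three pages, and the resulting index answers a simple search for "web" with the postings list ["a.html", "b.html"].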