CS100 Lecture Notes - Terra Incognita, Bow Tie, Robots Exclusion Standard
Document Summary
So far, we have assumed that an index of all the pages on the web exists, and we have relied on that index to answer simple and compound searches. In this module we examine how that index is created and maintained. To create an index for the web, we need to fetch each webpage, one after the other, and collect all the terms used on that page. It turns out that the hyperlinks pointing from one page to another form a web-like structure that we can crawl, gathering pages as we go. If someone handed us webpages one after another, we could form the postings for each page and merge them into one collection of postings lists covering the whole collection of pages.
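The process described above can be sketched in a few lines of Python. This is a minimal illustration, not a real crawler: it assumes a small in-memory "web" (the hypothetical PAGES dictionary stands in for fetching pages over the network), follows hyperlinks breadth-first, and builds an inverted index mapping each term to the sorted list of pages that contain it.

```python
from collections import deque

# Hypothetical in-memory "web": each page has its text and outgoing links.
# In a real crawler, fetching a URL over the network replaces this lookup.
PAGES = {
    "a.html": {"text": "web index crawl", "links": ["b.html"]},
    "b.html": {"text": "crawl the web",   "links": ["a.html", "c.html"]},
    "c.html": {"text": "index postings",  "links": []},
}

def crawl_and_index(seed):
    """Follow hyperlinks breadth-first from the seed page, collecting
    the terms on each page into one set of postings lists."""
    index = {}          # term -> set of pages containing it
    seen = {seed}       # pages already discovered, so we fetch each one once
    frontier = deque([seed])
    while frontier:
        url = frontier.popleft()
        page = PAGES[url]
        # Form the postings for this page and merge them into the index.
        for term in page["text"].split():
            index.setdefault(term, set()).add(url)
        # Crawl along the hyperlinks to reach pages not yet seen.
        for link in page["links"]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    # Sort each postings list for the final index.
    return {term: sorted(pages) for term, pages in index.items()}

idx = crawl_and_index("a.html")
print(idx["web"])    # pages whose text contains the term "web"
```

Starting from "a.html", the crawl reaches all three pages, and the resulting index answers a simple search for "web" with the postings list ["a.html", "b.html"].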