CS 410 Lecture Notes - Lecture 4: Vertical Search, Labeled Data, CiteSeerX

Document Summary

Lecture 4.1: Web search introduction and web crawler.

Challenges of web search: how to serve many user queries quickly; low-quality information and spam, which call for spam detection and robust ranking; and the dynamics of the web, where new pages are constantly created and some pages may be updated very quickly.

Opportunities: many additional heuristics (e.g., links) can be leveraged to improve search accuracy, which motivates link analysis and multi-feature ranking. These are techniques that leverage extra information to improve the search engine.

Component I: crawler/spider/robot. Building a toy crawler is easy: start with a set of seed pages in a priority queue; parse fetched pages for hyperlinks and add them to the queue; follow the hyperlinks in the queue.

A real crawler is much more complicated. It has to deal with crawling courtesy (server load balance, robot exclusion, etc.), handling file types (images, PDF files, etc.), URL extensions (CGI scripts, internal references, etc.), and discovering hidden URLs (e.g., by truncating a long URL).

Major crawling strategies: breadth-first is common (it balances server load); parallel crawling is natural; a variation is focused crawling.
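The toy crawler steps in the summary (seed pages in a priority queue, parse fetched pages for hyperlinks, follow the queued links) can be sketched roughly as follows. This is only an illustration, not the lecture's own code. The names `crawl`, `LinkExtractor`, `fetch`, `max_pages`, and the `example.test` URLs are all made up for the example, and pages are served from an in-memory dictionary so the sketch runs without network access. Using link depth as the queue priority is one assumed way to obtain breadth-first order. Everything the notes list for a real crawler (courtesy, robot exclusion, file types, and so on) is deliberately left out.

```python
import heapq
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first toy crawl; returns URLs in the order they were fetched.

    `fetch` is a hypothetical callable: URL -> HTML string, or None on failure.
    """
    # Priority queue keyed on link depth, so shallower pages come out first.
    # The running counter breaks ties in insertion order.
    queue = []
    counter = 0
    seen = set()
    for url in seeds:
        if url not in seen:
            seen.add(url)
            heapq.heappush(queue, (0, counter, url))
            counter += 1

    visited = []
    while queue and len(visited) < max_pages:
        depth, _, url = heapq.heappop(queue)
        html = fetch(url)
        if html is None:
            continue  # fetch failed; skip this URL
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" internal references.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                heapq.heappush(queue, (depth + 1, counter, absolute))
                counter += 1
    return visited


# A tiny in-memory "web" (made-up URLs) standing in for real HTTP fetches.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/c">C</a> <a href="/#top">top</a>',
    "http://example.test/b": '<a href="/a">A again</a>',
    "http://example.test/c": "no links here",
}

order = crawl(["http://example.test/"], PAGES.get)
print(order)
```

Passing `fetch` in as a parameter keeps the queue logic separate from the network code. A real fetcher would be the natural place to add the politeness delays and robot-exclusion checks that the notes mention.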
