Computer Science 1033A/B Lecture Notes - Search Engine Optimization, Pagerank, Web Crawler
9 Feb 2013

COMP SCI – WEEK 7
Warm up questions
•how many clicks should it take a user to get to any page from the home page?
◦3
•what is the maximum time a web page with graphics should take to download for a user with a
56K modem?
◦25-30 seconds
•which of the following is a tag?
◦<title>
◦(title)
◦[title]
•which of the following is a bad tag?
◦<b>
◦<u>
◦<hr>
◦<ul>
HTML tables and Dreamweaver
•tables are made up of rows and columns
•the point where a row meets a column is called a cell
•table widths can be expressed as a percentage (%)
◦% of the browser window, not the entire screen
•table widths can also be expressed in pixels
◦screen resolution affects the way a page is displayed
◦in general, most people do NOT have their resolution set as low as 800 by 600, but some
still do. Thus, if we make our page 800 pixels wide, EVERYONE should be able to view it.
If we make our page 1200 pixels wide, some people (the ones whose resolution is still
800 by 600) will have to scroll horizontally, which is a big no!
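The difference between percentage and pixel widths can be sketched in a few lines of Python. This is only an illustration: the function names and the width strings ("80%", "1200") are made up for this example and are not part of Dreamweaver or HTML.

```python
def rendered_width(spec: str, window_px: int) -> int:
    """Return the width in pixels for a spec such as '80%' or '800'."""
    if spec.endswith("%"):
        # A percentage is resolved against the browser window, not the screen.
        return window_px * int(spec[:-1]) // 100
    # A pixel width is fixed, whatever the size of the window.
    return int(spec)


def needs_horizontal_scroll(spec: str, window_px: int) -> bool:
    """True when the page is wider than the window it is shown in."""
    return rendered_width(spec, window_px) > window_px


# A visitor whose browser window is 800 pixels wide:
print(needs_horizontal_scroll("800", 800))   # page fits exactly
print(needs_horizontal_scroll("1200", 800))  # page is too wide
print(needs_horizontal_scroll("100%", 800))  # percentage widths always fit
```

A percentage width adapts to every window, while a fixed pixel width only fits windows at least that wide.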
Hints for using tables to organize your overall appearance
•set borders to 0
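•use merge and split on cells to help lay out the areas of the page
•know there is a difference between cell padding and cell spacing
◦cell padding is the space between a cell's content and the cell's border
◦cell spacing is the space between one cell and the next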
Publishing your website
•move your website from the machine you built the site on to the web server
•Q: what program do we use to perform the move?
◦FTP, win, FileZilla, Fugu
•NOTE: some web servers do not allow certain FTP software to connect to them, for security
reasons
Marketing your website
•include the web address for your new site:
◦as part of your signature on outgoing email
◦on all printed material—letterhead, business cards, labels, catalogues, posters, media
advertisements
•try to get your website into the “first ten listed” when your customers use search engines to
look for your site
Finding information in the WWW
•two basic types of searches you can perform on the WWW
◦directories
◦search engines
Directories
•not automated
•real people decide on how to organize it
◦pick categories and hierarchies
•drill down into a category, then into another subcategory, then into another subcategory, and so
on, till you get to a website
•Webmaster submits his/her site to the directory and then humans decide whether or not it is
“worthy” to be in the directory and if so, which categories to put it in
•Q: which site is best known for having a directory type of search?
◦Yahoo
◦Yahoo charges $299 for you to submit your site (doesn't even guarantee you will get in)
▪this charge just speeds up the process of having an editor judge it, it does NOT
guarantee the editor will include it in the directory
•directories don't usually bring as much traffic as search engines, HOWEVER it is quality
traffic: a person drilling down through the categories is usually looking for the products or
services that you offer, otherwise they would be looking in a different category
◦real people are looking over your overall site from a human point of view
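The drill-down idea can be pictured as a tree of categories with websites at the leaves. The sketch below is a made-up miniature directory (the category names and example.com addresses are invented for illustration, not taken from any real directory).

```python
# Hypothetical directory: categories contain subcategories, and the
# innermost level holds the list of websites the editors accepted.
directory = {
    "Shopping": {
        "Books": {
            "Used Books": ["usedbooks.example.com"],
            "Textbooks": ["textbooks.example.com"],
        },
    },
    "Recreation": {
        "Travel": ["travel.example.com"],
    },
}


def drill_down(tree, path):
    """Follow a list of category names, one level at a time."""
    node = tree
    for category in path:
        node = node[category]  # step into the chosen (sub)category
    return node


print(drill_down(directory, ["Shopping", "Books", "Used Books"]))
```

Each step narrows the choice, which is why a visitor who reaches a site this way is likely to want what that category offers.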
Search engines
•Questions
◦what is the most popular search engine?
▪Google
◦what new search engine is getting a good share of the search market?
▪Bing
◦what % of the market does the most popular search engine have?
▪65%
•how does a search engine work?
◦2 parts
▪part 1: finding all the data on the web and building a database. Similar to librarians
getting new books, cataloguing them and putting them in the library
▪part 2: given keywords from a searcher (person looking for a topic), returning the
“BEST/MOST APPROPRIATE” pages. Similar to a person walking into a library, going
to the card catalogue and looking for a book
Part 1: building the search database
•web crawlers or web spiders crawl the internet constantly, going from web page to web page
via links, looking at all the words on the page, building an index (database)
•the index contains an alphabetical list of the words the crawler finds, where within the page
each word was found, and the link (URL) of the page where it found the word
•words are called keywords
•index is stored in a really big database
•the title is the most important place to find a word
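A minimal sketch of Part 1, assuming a tiny made-up "web" held in memory instead of real pages. The PAGES table, the file names, and the function name crawl_and_index are all hypothetical. A real crawler would download pages over the network and use a far larger database.

```python
import re
from collections import deque

# Hypothetical miniature web: URL -> (title, body text, outgoing links)
PAGES = {
    "a.html": ("Cheap flights", "book cheap flights online", ["b.html", "c.html"]),
    "b.html": ("Hotels", "cheap hotels and flights", ["a.html"]),
    "c.html": ("Car rental", "rent a car", ["b.html", "d.html"]),
    "d.html": ("Contact", "contact us", []),
}


def crawl_and_index(start):
    """Follow links from `start` and record every word that is found."""
    index = {}            # keyword -> list of (url, where, position)
    seen = {start}        # pages already queued, so none is visited twice
    queue = deque([start])
    while queue:
        url = queue.popleft()
        title, body, links = PAGES[url]
        # Record each word, whether it was in the title or the body,
        # and its position within that part of the page.
        for where, text in (("title", title), ("body", body)):
            words = re.findall(r"[a-z0-9]+", text.lower())
            for position, word in enumerate(words):
                index.setdefault(word, []).append((url, where, position))
        # Move from page to page via links, like a spider.
        for link in links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return dict(sorted(index.items()))  # alphabetical list of keywords


index = crawl_and_index("a.html")
print(index["cheap"])
```

Storing where each keyword was found (title or body) is what lets a search engine later treat a title match as more important.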
Part 2: how does the search engine decide which pages to return to the search?
•uses index to decide which pages have the given keywords
•every engine uses slightly different algorithms to decide the order of displaying the returned
pages
•Google uses the “Page Rank” algorithm as ONE of the factors to decide what order to present
the pages it found for you
◦the higher the ranking, the closer it is to the top of the list
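Part 2 can be sketched the same way: look the keywords up in the index, then order the matching pages by their ranking. Everything here is an assumption for illustration: the INDEX and WEIGHT tables are invented, and requiring a page to contain all of the keywords is a simplification, not a description of what any real engine does.

```python
# Hypothetical index: keyword -> set of pages containing it
INDEX = {
    "cheap": {"a.html", "b.html"},
    "flights": {"a.html", "b.html"},
    "hotels": {"b.html"},
}

# Hypothetical ranking weight for each page (higher = listed first)
WEIGHT = {"a.html": 0.6, "b.html": 0.3}


def search(query):
    """Return pages that contain every keyword, best-ranked first."""
    words = query.lower().split()
    if not words:
        return []
    # Use the index to find pages that have all the given keywords.
    matches = set.intersection(*(INDEX.get(word, set()) for word in words))
    # The higher the weight, the closer the page is to the top of the list.
    return sorted(matches, key=lambda url: WEIGHT.get(url, 0.0), reverse=True)


print(search("cheap flights"))
print(search("cheap hotels"))
```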
What is Google's Page Rank algorithm?
•the algorithm gives each web page returned from the keyword search a weight between 0 and 1
•the higher the weight given to the page, the more likely it is that this page will be displayed first
to you
How Page Rank works
•first, to simplify this, assume there are only 4 pages on the internet: A, B, C, and D
•each page is given a weight of 0.25 (1 divided by 4)
•scenario 1:
◦pages B, C, and D all have a link to page A (thus page A must be very useful because
everyone is pointing at it)
◦then pages B, C, and D each give their 0.25 rank to A, so A gets a ranking of 0.75
•scenario 2:
◦page B links to A and C (0.25 divided between 2 pages)
◦page C just links to A (all of 0.25 goes to A)
◦page D links to A, B, and C (0.25 divided between 3 pages)
◦the weight of A is now:
▪0.25/2 (B's ranking) + 0.25 (C's ranking) + 0.25/3 (D's ranking)
▪0.125 + 0.25 + 0.083 = 0.458
•THUS, pages with lots of links pointing at them must be important, so they get the highest
weight/ranking
•summary:
◦Page Rank evaluates 2 key factors
▪how many links are there to a web page
▪what is the quality of the linking sites (although a highly ranked page with lots of links
on it may pass less weight to you because its weight is spread thinly across those links)
◦Page Rank does not take into account the content of the page (thus frequent content updates
don't necessarily improve Page Rank)
◦Page Rank ranks web pages, NOT web sites
◦each inbound link counts towards the overall total, except for links from banned sites, which
don't count
◦each Page Rank level is progressively harder to reach; it is thought to be calculated on a
logarithmic scale
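The two scenarios in the worked example above can be checked with a short Python sketch. It performs only the single simplified pass described in these notes: every page starts with an equal weight and splits it evenly among the pages it links to. The function name one_round is made up for this example. The full Page Rank algorithm repeats this step many times and also applies a damping factor, both of which are left out here.

```python
def one_round(links):
    """links maps each page to the list of pages it links to."""
    start = 1 / len(links)                 # e.g. 1 divided by 4 = 0.25
    weight = {page: 0.0 for page in links}
    for page, targets in links.items():
        for target in targets:
            # A page's starting weight is divided between its outgoing links.
            weight[target] += start / len(targets)
    return weight


# Scenario 1: B, C, and D all link to A.
scenario_1 = {"A": [], "B": ["A"], "C": ["A"], "D": ["A"]}

# Scenario 2: B links to A and C; C links to A; D links to A, B, and C.
scenario_2 = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}

print(round(one_round(scenario_1)["A"], 3))  # 0.75
print(round(one_round(scenario_2)["A"], 3))  # 0.458
```

In scenario 2, page D passes only 0.083 to each page it links to, which shows how a page with many outgoing links spreads its weight thinly.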