Computer Science 1033A/B Lecture Notes - Search Engine Optimization, Pagerank, Web Crawler

COMP SCI – WEEK 7
Warm up questions
how many clicks should it take a user to get to any page from the home page?
3
what is the maximum time a web page with graphics should take to download for a user with a
56K modem?
25-30 seconds
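The 25-30 second answer can be turned into a rough page-size budget. This is only a sketch: the name `download_seconds` is made up for this example, and it assumes the modem really delivers its nominal 56,000 bits per second, which real connections rarely do.

```python
def download_seconds(page_bytes, bits_per_second=56_000):
    """Ideal transfer time: total bits divided by the line speed."""
    return page_bytes * 8 / bits_per_second

# Under the nominal-speed assumption, a 30 second limit leaves room for
# about 30 * 56_000 / 8 bytes of HTML plus graphics.
budget_bytes = 30 * 56_000 // 8
print(budget_bytes, download_seconds(budget_bytes))
```

In practice the usable budget is smaller, because the real line speed is lower than the nominal one.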
which of the following is a tag?
<title>
(title)
[title]
which of the following is a bad tag?
<b>
<u>
<hr>
<ul>
HTML tables and Dreamweaver
tables are made up of rows and columns
the point where a row meets a column is called a cell
table widths expressed as a percentage (%)
the % is of the browser window, not the entire screen
table widths expressed in pixels
resolution affects the way a page is displayed
in general, most people do NOT have their resolution set below 800 by 600. Thus, if we make
our page 800 pixels wide, EVERYONE should be able to view it. If we make our page 1200
pixels wide, some people (the ones whose resolution is still 800 by 600) will have to scroll
horizontally, which is a big no!
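The difference between percentage widths and pixel widths can be checked with a little arithmetic. A minimal sketch, assuming the browser window is as wide as the screen (the function names are invented for this example):

```python
def table_width_px(percent, browser_width_px):
    """A percentage width is measured against the browser window."""
    return browser_width_px * percent / 100

def needs_horizontal_scroll(page_width_px, screen_width_px):
    """A fixed pixel width scrolls sideways once it exceeds the screen."""
    return page_width_px > screen_width_px

print(table_width_px(50, 800))              # half of an 800 pixel window
print(needs_horizontal_scroll(800, 800))    # 800 pixel page on an 800 by 600 screen
print(needs_horizontal_scroll(1200, 800))   # 1200 pixel page on the same screen
```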
Hints for using tables to organize your overall appearance
set borders to 0
use merge + split to help figure out areas
know there is a difference between cell padding and spacing
Publishing your website
move your website from the machine you built the site on to the web server
Q: what program do we use to perform the move?
FTP, win, FileZilla, Fugu
NOTE: not all web servers allow certain FTP software to connect to them because of security
reasons
Marketing your website
include the web address for your new site:
as part of your signature on outgoing email
on all printed material—letterhead, business cards, labels, catalogues, posters, media
advertisements
try to get your website into the “first ten listed” when your customers use search
engines to search for your site
Finding information in the WWW
two basic types of searches you can perform on the WWW
directories
search engines
Directories
not automated
real people decide on how to organize it
pick categories and hierarchies
drill down into a category, then into another subcategory, then into another subcategory, and so
on, till you get to a website
the webmaster submits his/her site to the directory, and then humans decide whether or not it is
“worthy” to be in the directory and, if so, which categories to put it in
Q: which site is best known for having a directory type of search?
Yahoo
Yahoo charges $299 for you to submit your site (doesn't even guarantee you will get in)
this charge just speeds up the process of having an editor judge it, it does NOT
guarantee the editor will include it in the directory
directories don't usually bring as much traffic as a search engine, HOWEVER it is quality
traffic, since a person drilling down through the categories is usually looking for the
products or services that you offer, otherwise they would be looking in a different category
real people are looking over your overall site from a human point of view
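The drill-down structure of a directory can be pictured as nested categories that end in a list of sites. A minimal sketch, with made-up categories and `.example` addresses (none of this is Yahoo's actual directory):

```python
# Hypothetical hand-built directory: category -> subcategory -> ... -> sites.
directory = {
    "Shopping": {
        "Sports": {
            "Cycling": ["https://bikes.example"],
        },
    },
    "Education": {
        "Universities": ["https://university.example"],
    },
}

def drill_down(tree, path):
    """Walk one category at a time until the list of websites is reached."""
    node = tree
    for category in path:
        node = node[category]
    return node

print(drill_down(directory, ["Shopping", "Sports", "Cycling"]))
```

A visitor who reaches the "Cycling" list has already chosen three categories, which is why directory traffic tends to be well targeted.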
Search engines
Questions
what is the most popular search engine?
Google
what new search engine is getting a good share of the search market?
BING
what % of the market does the most popular search engine have?
65%
how does a search engine work?
2 parts
part 1: finding all the data on the web and building a database. Similar to librarians
getting new books, cataloguing them and putting them in the library
part 2: given keywords from a searcher (person looking for a topic), returning the
“BEST/MOST APPROPRIATE” pages. Similar to a person walking into a library, going
to the card catalogue and looking for a book
Part 1: building the search database
web crawlers or web spiders crawl the internet constantly, going from web page to web page
via links, looking at all the words on the page, building an index (database)
the index contains an alphabetical list of the words it finds, where within the page each word
was, and the links (URLs of the pages) where it found the words
words are called keywords
index is stored in a really big database
the title is the most important place to find a word
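The crawl-and-index idea in Part 1 can be sketched on a tiny in-memory "web". Everything here is an assumption made for illustration: the `PAGES` data, the `.example` addresses and the `crawl` function. A real crawler would download pages over the network instead.

```python
from collections import deque

# Hypothetical web: URL -> (title, body text, outgoing links).
PAGES = {
    "a.example": ("Cheap bikes", "we sell bikes and helmets", ["b.example", "c.example"]),
    "b.example": ("Helmets", "helmets for bikes", ["a.example"]),
    "c.example": ("About us", "family shop", []),
}

def crawl(start_url, pages):
    """Go from page to page via links and index every word that is seen."""
    index = {}                      # keyword -> list of (url, place, position)
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        title, body, links = pages[url]
        for place, text in (("title", title), ("body", body)):
            for position, word in enumerate(text.lower().split()):
                index.setdefault(word, []).append((url, place, position))
        for link in links:          # follow links to pages not visited yet
            if link in pages and link not in seen:
                seen.add(link)
                queue.append(link)
    return dict(sorted(index.items()))   # alphabetical list of keywords

index = crawl("a.example", PAGES)
print(index["bikes"])
```

Recording whether a word was found in the title or the body lets a later ranking step treat title matches as more important.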
Part 2: how does the search engine decide which pages to return to the searcher?
uses index to decide which pages have the given keywords
every engine uses slightly different algorithms to decide the order of displaying the returned
pages
Google uses the “Page Rank” algorithm as ONE of the factors to decide what order to present
the pages it found for you
the higher the ranking, the closer it is to the top of the list
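Part 2 can be sketched as a lookup in the index followed by a sort on each page's weight. The `INDEX` and `WEIGHTS` data and the `search` function are hypothetical, and requiring a page to contain every keyword is an assumption of this sketch. Real engines combine many more factors.

```python
# Hypothetical index: keyword -> set of page URLs that contain it.
INDEX = {
    "bikes": {"a.example", "b.example"},
    "helmets": {"a.example", "b.example", "c.example"},
}
# Hypothetical weights between 0 and 1 (for Google, Page Rank is ONE input).
WEIGHTS = {"a.example": 0.5, "b.example": 0.25, "c.example": 0.25}

def search(keywords, index, weights):
    """Return pages containing all keywords, highest weight first."""
    matches = None
    for word in keywords:
        pages = index.get(word.lower(), set())
        matches = pages if matches is None else matches & pages
    # The URL breaks ties so that the order is always the same.
    return sorted(matches or [], key=lambda url: (-weights.get(url, 0.0), url))

print(search(["bikes", "helmets"], INDEX, WEIGHTS))
```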
What is Google's Page Rank algorithm?
Algorithm gives each web page returned from the keyword search a weight between 0 and 1
the higher the weight given to the page, the more likely it is that this page will be displayed first
to you
How Page Rank works
first, assume we only have 4 pages: page A, B, C, D on the internet to simplify this
each page is given a weight of 0.25 (1 divided by 4)
scenario 1:
pages B, C, and D all have a link to page A (thus page A must be very useful because
everyone is pointing at it)
then pages B, C, and D each give their 0.25 rank to A, so A gets a ranking of 0.75
scenario 2:
page B links to A and C (0.25 divided between 2 pages)
page C just links to A (all of 0.25 goes to A)
page D links to A, B, and C (0.25 divide between 3 pages)
the weight of A is now:
0.25/2 (B's ranking) + 0.25 (C's ranking) + 0.25/3 (D's ranking)
= 0.125 + 0.25 + 0.083 = 0.458
THUS, pages with lots of links pointing at them must be important, so they get the highest
weight/ranking
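Both scenarios can be reproduced with a few lines of code. This sketch covers only the single hand-off step described in these notes. The function name `rank_passed_to` is invented, and page A is assumed to have no outgoing links because the notes do not mention any. The full Page Rank algorithm repeats this step many times and adds a damping factor, which is left out here.

```python
def rank_passed_to(target, links):
    """Add up the shares that every linking page passes to `target`."""
    start = 1 / len(links)              # each page starts with an equal weight
    total = 0.0
    for page, outgoing in links.items():
        if target in outgoing:
            total += start / len(outgoing)   # weight is divided among its links
    return total

# page -> pages it links to
scenario_1 = {"A": [], "B": ["A"], "C": ["A"], "D": ["A"]}
scenario_2 = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}

print(round(rank_passed_to("A", scenario_1), 3))   # 0.75
print(round(rank_passed_to("A", scenario_2), 3))   # 0.458
```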
summary:
Page Rank evaluates 2 key factors
how many links are there to a web page
what is the quality of the linking sites (although a high ranked page with lots of links on
it may pass you less rank, because its rank is spread thinly across all those links)
Page Rank does not take into account the content of the page (thus frequent content updates
don't improve Page Rank necessarily)
Page Rank ranks web pages NOT web sites
each inbound link counts toward the overall total, except links from banned sites, which don't count
each Page Rank level is progressively harder to reach; it is thought to be calculated on a
logarithmic scale