Class Notes (836,173)
Canada (509,677)
Biology (2,256)
BIOL 365 (1)

BIOL 365 Summary (1-16).pdf

43 Pages
Unlock Document

BIOL 365
Brendan Mc Conkey

Lecture  1  Introduction  to  Bioinformatics     What  is  Bioinformatics?   • “Bioinformatics”  was  initially  coined  in  the  1980’s  for  analysis  of  biological  sequence   data   o It  is  the  combination  of  biology  and  computer  science   o Primary  challenge:    to  analyze  the  wealth  of  sequence  data  in  order  to  understand  this   information  in  terms  of  protein  structure,  function  and  evolution   o Continuing  to  expand-­‐  now  could  include  structure  databases,  interaction  maps,  and   the  ‘-­‐omics’  family  (genomics,  proteomics,  metabolomics,  etc.)   o Compilation  of  molecular  data  in  databases  with  global  access   • Important  components  of  Bioinformatics   o Analysis  and  interpretation   o Algorithm  development   o Tool  development   o Information  management     Definitions  of  Bioinformatics:   • National  Institutes  of  Health  (U.S.)  definition:   o “  Research,  development,  or  application  of  computational  tools  and  approaches  for   expanding  the  use  of  biological,  medical,  behavioral,  or  health  data,  including  those  to   acquire,  store,  organize,  analyze,  or  visualize  such  data”   • Computational  Biology  (according  to  N.I.H.)  is:   o “  The  development  and  application  of  data-­‐analytical  and  theoretical  methods,   mathematical  modeling  and  computational  simulation  techniques  to  the  study  of   biological,  behavioral,  and  social  systems”     Evolution  of  bioinformatics:   • Protein  sequencing   o 1955:    first  complete  protein  sequence  (insulin,  Ryle  et  al.  1955.    Bioch.  J.  60:    541-­‐546)   o 1965:    ~20  proteins  sequenced   o 1980:    ~1500  full  sequences  of  proteins   o The  NCBI  nr  database  as  of  May  2009:  8,738,517  proteins   • Nucleotide  sequencing   o Structure  of  DNA  (Watson  &  Crick,  1953)   o 1960’s  and  1970’s  –  only  small  RNAs  were  sequenced   o Development  of  Cloning  and  then  later  PCR  greatly  increased     sequencing  of  DNA   o Simpler  techniques  in  sequencing  and  automation  greatly  increased  ability  to  collect   nucleotide  data   • Development  of  these  techniques  laid  foundation  for  the  “sequence  revolution”  of  the  1980’s   and  1990’s   • Development  of  more  efficient  computer  technology   • The  combination  of  the  above  resulted  in  the  birth  of  the  field  of  Bioinformatics     Bioinformatics  links  (H1N1)   • Data  collection,  storage,  management:   o Databases   o NCBI,  others   • Data  analysis  1  (genome/nucleotide)   o Sequence  alignments   o Gene  (protein)  finding   o Phylogenetics   • Data  analysis  2  (proteins,  expression)   o Sequence/structure  alignments   o Genome  comparisons   o More  phylogenetics   • Applications  in  broader  context   o In  conjunction  with  biochemistry,  immunology,  epidemiology,  etc   • Bioinformatics  tools  to  apply:   o Nucleotide  sequence  alignments   o Protein  sequence  alignments   o Multiple  sequence  alignments   o Clustering  (phylogeny)   o Models  of  evolution     Database  resources   • Deposition  of  newly  sequenced  genomes  or  nucleotide  data   • Retrieval  and  analysis  of  existing  genome,  gene,  and  protein  data   • Major  databases:  NCBI,  EBI,  DDBJ,  Expasy,  PDB,  etc   • Influenza  specific  databases:  NCBI  Influenza  Virus  Resource,  National  Institute  of  Allergy  and   Infectious  diseases     Phylogeny  of  2009  H1N1   • A  major  branch  of  bioinformatics  is  Clustering  and  Phylogenetics   • Clustering   o Grouping  by  similarity   o Can  assume  changes  are  additive   o Can  assume  changes  are  not  additive   • Phylogenetics     o Clustering  nucleotide  or  protein  sequences  by  similarity   o More  detailed  than  simple  clustering   o Changes  are  additive,     o Typically  follows  an  evolutionary  model   o Can  use  different  models  and  assumptions   • Output  of  both  is  a  grouping  a  data  points,  often  as  a  tree     Other  bioinformatics  tools  can  contribute  to  other  analyses:   • Geographic  distributions,  mechanisms  of  virulence,  links  to  immune  response,  etc.       Lecture  2  Databases  and  Exploring  NCBI     Databases   • Starting  point  for  most  bioinformatics  research   • Storage,  sharing  and  describing  of  data  as  well  as  preliminary  analysis  through  DB-­‐associated   tools   • Simple  to  very  complicated,  local  to  international   • Databases  are  very  large  and  continue  to  grow   • Managing  much  of  this  information  is  the  responsibility  of  large  consortiums  (NCBI,  EBI,   DDBJ)   • This  is  now  a  huge  industry  in  science  (and  a  large  employer!)   • Databases  have  many  different  structures;  we  will  deal  with:   o Flat  file  databases   § Data  (e.g.,  sequences)  are  stored  as  a  text  file  or  a  collection  of  text  files   • Flat,  as  in  a  sheet  of  paper   § Easy  to  input,  distribute,  search  and  retrieve  data   o Relational  databases   § Most  common  type  of  biological  database   § Data  stored  within  a  number  of  tables  linked  together  by  a  shared  field,  the  key   • A  key  must  be  unique  to  each  record   § Handles  huge  amounts  of  data   • Reducing  data  in  memory   • Faster  search  and  retrieval     Primary  vs.  secondary  (derived)  data   • Primary  data  =  data  derived  from  experimental  results   o Sequence  data  (genomic,  ESTs,  etc.)   o Gene  expression  data  (microarray  data,  ESTs)   o Structural  data  (X-­‐ray,  NMR)   • Derived  (secondary)  data  =  results  utilizing  primary  data   o Conserved  sequence  motifs   o Conserved  protein/domain  families   • Functional  annotations   o Signal  peptides   o Binding/catalytic  sites   o Many  of  these  annotations  are  automated     Data  quality  and  information  content   • Redundant  vs.  non-­‐redundant   o Efficiency  (many  of  the  algorithms  require  processing  of  a  large  amount  of  data,  the   less  duplication  the  faster  the  process)   o Databases  are  screened  to  reduce  redundancy   • Databases  are  under  automatic  and  manual  quality  control   o e.g.,  UniProtKB  consists  of:   § SWISS-­‐Prot  (highly  annotated  manual  component)   § TrEMBL  (automatically  annotated)   Reliability  of  Databases   • Record  may  not  always  be  what  it  pretends  to  be   • Computer  error   o Incorrect  annotations   o Missed  relationships  (insufficient  information  extraction)   • Human  error   o Contributions  of  many  different  people  all  with  different  standards  of  accuracy   o Vector  sequence  left  in  sequence   o PCR  chimeras   o Taxonomic  misidentification   o Trivial  data  entries…Example:    Accession  A00674:    cactaa,  patented  six  nucleotides,   unknown  source/organism,  by  random  chance  this  sequence  is  probably  found  in   numerous  sequences,  Entry  now  deleted   • Be  critical,  be  careful     Types  of  sequence  data   • A  given  target  gene/protein  typically  has  multiple  records:     DNA,  RNA  (as  cDNA),  and  protein,  and  can  be  from     numerous  organisms  and/or  tissues   • For  many  searches,  it  helps  to  clearly  define  what  you  are  looking  for   • Genomic  and  gene  data  (DNA)   o The  target  sequence  can  be   o Part  of  a  chromosome   o Part  of  a  large  DNA  fragment  (BAC,  YAC,  or  cosmid)   o A  gene,  including  introns,  exons,  and  untranslated  regions   o A  Sequence  Tagged  Site  (STS).  STSs  are  small  fragments  of  DNA,  often  ~500  bp,  with   known  location  within  a  genome.  Used  to  link  genetic  and  physical  maps.   • cDNAs  and  ESTs   o The  target  gene/protein  can  be   o A  complete  cDNA  copy  of  an  mRNA  transcript   o An  EST  -­‐  Expressed  Sequence  Tag  -­‐  ESTs  are  partial  copies  of  RNA  transcripts,  often   300-­‐800  bp   o Both  mRNAs  and  ESTs  can  be  tissue  specific,  identifying  a  gene/protein  that  is   expressed  at  a  certain  time  within  a  certain  tissue   • Protein   o A  sequenced  protein  (rare)   o Sequence  translated  from  DNA-­‐>mRNA  (common)   o Sequence  corresponding  to  a  known  protein  structure  (PDB  data)   • RefSeq  -­‐  Best  Representative  Reference  Sequence   o Goal:  to  unify  sequence  records  and  provide  the  most  reliable  sequence  data   o One  RefSeq  per  gene  per  organism   o Somewhat  redundant:  ~37,900  human  RefSeq  records.   • UniGene  -­‐  Unique  Gene  Project  (RNA  transcripts)     o Goal:  to  create  one  entry  for  a  given  gene  and  collect  all  ESTs  associated  with  that   gene.     o Organism  specific   o Created  by  clustering  similar  ESTs   o Can  be  used  to  determine  expression  patterns   o ~124,000  human  Unigene  entries     Features  within  Flatfile   • Perform  a  biological  function   • Affect  or  are  the  result  of  the  expression  of  a  biological  function     • Interact  with  other  molecules   • Affect  replication  of  a  sequence   • Affect  or  are  the  result  of  recombination  of  different  sequences   • Are  a  recognizable  repeated  unit   • Have  secondary  or  tertiary  structure   • Exhibit  variation   • Have  been  revised  or  corrected         Lecture  3  Database  and  Additional  Information  Available  Through  NCBI     PubMed  Information  Retrieval   • PubMed  provides  a  gateway  to  biomedical  literature     o i.e.  great  for  medicine,  genetics,  biochemistry,  etc.   o Less  useful  for  computer  algorithms,  astrophysics,  ...   • PubMed  includes  some  out-­‐of-­‐scope  articles,  primarily  from  multidisciplinary  journals   (Science,  Nature,  PNAS,  ...)   • Often  provides  direct  links  to  journal  articles   o Some  are  directly  accessible,  if  not,  try  going  through  the  UW  library  online  journal   lists   • Can  limit  search  by  Publication  type  (e.g.  review)   • Or  by  publication  date   • Or  by  medical  subject  area  (MeSH),  etc.     Controlled  Vocabularies  and  Ontologies   • CV  -­‐  defines  data  types  (amino  acid,  mutation,  etc)   • Ontologies  define  relations:   (red=‘is  a’,  green=‘is  attribute’,  blue=  ‘part  of’)       MeSH  –  Medical  subject  heading   • Controlled  vocabulary  indexing  of  PubMed  articles     • Useful  for  refining  searches  and  using  correct  search  terms   • MeSH  is  curated  and  continually  updated   • Provides  general  headings  and  subheadings     o Subheadings  provide  additional  search  terms   o Can  be  used  to  define  a  targeted  search   • Displays  a  summary  definition  of  the  term   • Displays  a  tree  of  related  items  in  the  subject  headings     Taxbrowser  –  taxonomy  search   • Use  to  search  for  lineage  and  scientific  names  of  organisms   • Links  to   o Taxonomy  resources   o Genetic  codes  (alternate  codons  by  species  and  organelle)   o Sequence  data  for  extinct  organisms   o Recent  classification  changes   • Provides  full  lineage,  Kingdom,  Phylum,  …,  sub-­‐species     • Individual  records  provide  links  to  Entrez  records   o Nucleotide  sequences   o Protein  sequences   o PopSet  (population  study  data  sets)   o etc.       OMIM  –  Online  Mendelian  Inheritance  in  Man   • Recently  separated  from  NCBI,  now  at   • Catalog  of  human  genes  and  genetic  disorders   • Links  to  PubMed,  Entrez,  sequence  data,  ...   • Over  20,000  entries  so  far   • Not  complete  genome  coverage  (lots  to  go  yet)   • Can  search  records,  also  gene  maps     Automation,  curation  and  quality   • Redundancy,  non-­‐redundant  databases   • Automated  data  checking   • Initial  analysis  is  often  automated   • Manually  curated  databases  are  highest  quality   • All  databases  have  some  level  of  error   • Error  sources  can  be  experimental,  computational,  inferential,  or  contextual     Database  considerations  –  Summary             Lecture  4  Introductions  to  Sequence  Alignments     Importance  of  Sequence  Alignments   • Sequence  alignment  is  the  most  fundamental  tool  of  bioinformatics   • Used  to  determine  if  two  sequences  are  evolutionarily  related  (homologous)   o Also  which  regions   • Used  to  identify  common  domains  or  motifs  between  different  proteins   • Can  uncover  patterns  of  conservation  and  variability     • Useful  in  genome  annotation     What  is  sequence  alignment?   • Sequence  alignment  is  the  identification  of  residue-­‐residue  or  nucleotide-­‐nucleotide   matches,  preserving  their  order     • Or  stated  differently:  is  a  pairwise  match  between  characters  in  each  sequence   • Need  to  define  criteria  so  an  algorithm  can  choose  the  best  alignment:    SCORE   • GAPS  are  introduced  to  maximize  overall  score   • Can  be  used  to  decide  if  sequences  are  homologous  -­‐  if  the  alignment  reflects  an  evolutionary   relationship  between  two  or  more  sequences  that  share  a  common  ancestor     v Sequence  similarity  implies  structural  similarity   v Structural  similarity  can  indicate  related  function     Homology   • Sequences  are  from  a  common  ancestral  sequence,  e.g.  the  sequences  are  related   • Orthologs  -­‐  related  sequences  in  different  species  that  have  diverged  through  speciation   o Whale  and  horse  myoglobin   • Paralogs  -­‐  sequences  within  a  genome  that  share  similarity  due  to  duplicated  ancestral  gene   o Human  alpha  and  beta  chains  of  hemoglobin   • Xenologs  -­‐  relationship  due  to  lateral  gene  transfer   o e.g.  some  types  of  antibiotic  resistance  in  bacteria     Homology,  identity  and  similarity   • Homology  is  a  ‘yes  or  no’  question,  sequences  are  homologous  or  they  aren’t.  The  degree  of   sequence  identity  (or  similarity)  between  sequences  can  be  used  to  answer  this   • Percent  Identity  is  the  percent  of  exact  matches  within  a  pair  of  aligned  sequences   • Similarity  includes  functionally  similar  amino  acids  as  approximate  matches   o e.g.  Valine  and  Isoleucine  (hydrophobic),  Lysine  and  Glutamine  (charged,  hydrophilic)     Pair-­‐wise  sequence  alignment   • “The  process  of  lining  up  two  sequences  to  achieve  maximal  levels  of  identity  (and   conservation,  in  the  case  of  amino  acid  sequences)  for  the  purpose  of  assessing  the  degree  of   similarity  and  the  possibility  of  homology.”  NCBI  glossary   • Aligned  by  placing  identical  or  similar  characters  in  the  same  column   • Optimal  alignment:    non-­‐identical  characters  and  gaps  are  placed  to  bring  as  many  similar   characters  together   v It  is  possible  for  sequences  to  appear  highly  similar  even  if  they  are  not  homologous,  e.g.  as  a   result  of  convergent  evolution.  Typically  these  are  shorter  peptide  sequences,  repeated   sequences,  or  low-­‐complexity  regions.     Changes  in  homologous  sequences   • Changes  that  occur  through  divergence   o Substitutions  (A  to  T,  T  to  C  etc.  or  amino  acids)   o Insertions  (one  or  more  nucleotides/amino  acids)   o Deletions  (one  or  more  nucleotides/amino  acids)     Should  amino  acids  or  nucleotides  be  aligned?   • Depends  on  the  goal.   o Protein  alignments  are  often  more  informative   o 20  amino  acids  vs.  4  nucleotides  (less  noise)   o Many  amino  acids  have  related  biophysical  properties   o Codon  degeneracy:  changes  in  the  third  position  often  do  not  alter  the  amino  acid  that   is  specified   o Protein  sequences  offer  a  longer  “look-­‐back”  time   o DNA  sequences  can  be  translated  into  protein,  and  then  used  in  pairwise  alignments   • In  some  cases,  DNA  alignments  are  more  appropriate   o To  confirm  the  identity  of  a  cDNA   o To  study  noncoding  regions  of  DNA   o To  study  DNA  polymorphisms   • DNA  sequences  change  at  a  faster  rate  than  protein  sequence   • DNA  may  be  more  informative  for  closely  related  species  comparisons,  or  comparisons   within  a  species   • When  in  doubt,  compare  protein  sequence  data  first     Global  vs.  local   • Global:  attempt  to  align  entire  sequence   • Local:  stretches  of  sequence  with  highest  density  of  matches  are  aligned   • There  is  also  ‘semi-­‐global’:  like  a  global  alignment,  but  doesn’t  count  gaps  at  ends  of   sequences     Multi-­‐domain  proteins   • Possible  that  only  a  subset  of  domains  share  homology   • In  this  case,  a  local  alignment  should  be  used,  or  subsets  of  the  complete  protein  sequences   selected   • Global  alignments  can  be  useful  for  aligning  known  domains  of  low  similarity,  e.g.  to  a  known   structure             Dot  Plots   • Simplest  method  for  evaluating  similarity  between  two  sequences:   visual  inspection   o One  sequence  is  assigned  to  horizontal  axis  of  a  plot  space   the  other  to  the  vertical  axis   o Dots  are  placed  in  each  position  where  both  sequence   elements  are  identical   o Regions  of  identity  are  indicated  by  diagonal  lines   • Searches  for  possible  alignment  of  characters  between  two   sequences   • Reveals  presence  of  insertions  &  deletions,  repeats,  self-­‐ complementary  regions  in  RNA   • Many  programs  don’t  provide  an  alignment     Sliding  windows   • Detection  of  matching  regions  improved  by  filtering  using  a  sliding  window  and  specifying   stringency   o DNA:    window  of  15,  stringency  of  10   o Protein:    window  of  2  or  3,  stringency  of  2   • Larger  window  size  for  DNA  sequences  due  to  greater  possibility  of  random  matches   • Objective  is  to  draw  attention  to  regions  of  significant  similarity   • Trial  and  error  with  new  data  sets   Lecture  5  Methods  of  Pairwise  Sequence  Alignment     Summary   • Sequence  alignment  is  the  identification  of  residue-­‐residue  or  nucleotide-­‐nucleotide   matches,  preserving  their  order     • A  pairwise  match  between  characters  of  each  sequence   • Need  to  define  criteria  so  algorithm  can  choose  the  best  alignment:    SCORE   • A  true  alignment  of  nucleotide  or  amino  acid  sequences  is  one  that  reflects  evolutionary   relationship  between  two  or  more  sequences  that  share  a  common  ancestor  (homology)   • An  alignment  represents  an  HYPOTHESIS     Scores  (nucleotides)   • In  a  gap  free  alignment,  score  is  determined  by  the  amount  of  aligned  pairs  of  identical   residues  and  aligned  non-­‐identical  residues   o Match  score  =  identical   o Mismatch  score  =  non-­‐identical     Gaps   • Gaps  are  included  to  obtain  the  best  possible  alignment  between  two  sequences   • Insertion  and  deletion  events  complicate  sequence  alignment     • If  scoring  an  alignment  that  contains  gaps,  a  “gap  penalty”  must  be  included   • Penalties  are  given  for  each  gap  introduced  (deducted  from  alignment  score)   • Rationale  is  that  insertion  and  mutation  events  are  rare,  however  they  may  affect  nearby   residues  and  nucleotides       Gap  Penalties   •   • There  may  be  a  number  of  equally  optimal  alignments  between  two  sequences   • As  insertions  and  deletions  are  rare,  a  better  alignment  may  be  one  with  fewer  but  longer   gaps   • It  is  more  likely  that  the  difference  in  length  between  2  sequences  is  due  to  one  indel  than   several  (e.g.  above  sequence)   • Thus,  we  can  bias  the  alignment  scoring  function  to  take  this  into  account  and  gap  penalty  is   divided  into  two  parts   o Gap  origination  penalty  (G)   o Length  (or  extension)  penalty,  L   o Gap  penalty  =  G  +  nL           Causes  of  gaps   • A  single  mutation  can  create  a  gap  (very  common).     • Unequal  crossover  in  meiosis  can  lead  to  insertion  or  deletion  of  strings  of  bases.   • DNA  slippage  in  the  replication  procedure  can  result  in  the  repetition  of  a  string.     • Retrovirus  insertions.     • Translocations  of  DNA  between  chromosomes.       Dynamic  programming  algorithms  for  sequence  alignment   • Exhaustive  alignment  methods  are  not  feasible   – E.g.  two  sequences,  length  100  and  95,  equals  ~55  million  possible  alignments   • Dynamic  programming  algorithms  can  provide  an  optimal  alignment  between  two  sequences   (mathematically  proven)   • Compares  every  pair  of  characters  in  two  sequences  and  generates  an  alignment   • Matches  between  identical  or  related  characters  are  maximized  in  the  alignment   • Global  alignments:  Needleman-­‐Wunsch  algorithm  (1970)   • Local  alignments:    Smith-­‐Waterman  algorithm  (1981)   • Work  by  breaking  big  problems  into  smaller  ones   • Two  possible  alignments  for  a  pair  of  sequences:     •   • The  first  four  positions  are  the  same  in  both.   • A  dynamic  programming  algorithm  would  calculate  a  score  for  the  first  four  positions  (for   example),  and  use  that  score  as  part  of  the  score  for  the  larger  alignments.   • Saves  recalculating  the  entire  score  for  each  alignment     A  simple  scoring  function       Dynamic  programming:  The  Needleman  and  Wunsch  Algorithm   • How  to  calculate  the  scores  in  the  table:   1. Initialize  the  first  row  and  column   with  multiples  of  the  gap  penalty   2. For  each  box,  possible  scores  are   generated  from  box  above,  box  to     the  left,  and  box  diagonally  up     and  left.  The  best  result  is  kept.   3. Here,  possibilities  are     X+1  (match),  X+0  (mismatch),     Y-­‐1  (gap),  or  Z-­‐1  (gap).     Semi-­‐global  alignments   • Like  a  global  alignment,  but  ignores  terminal  gaps:    gaps  at  either  end  of  the  sequences   • Terminal  gaps  often  the  result  of  incomplete  data  acquisition,  in  which  case  there  is  little   biological  relevance   • Needleman-­‐Wunsch  can  be  adapted  to  deal  with  this  easily   o Initialize  first  row  and  first  column  to  zeros  instead  of  multiples  of  the  gap  penalty  (a   gap  at  the  beginning)   o Start  from  the  maximum  value  in  the  last  row  or  column(a  gap  at  the  end  of  the   sequence)   • Based  on  Needleman-­‐Wunsch  algorithm   • Differences:     o Gap  penalty  is  zero  in  first  row  and  column   o Start  at  max  score  last  row  and  column             Lecture  6  Local  Sequence  Alignments  and  Scoring     Local  alignments   • Say  you  are  comparing  two  large  regions  of  DNA  (e.g.  two  chromosomes),  or  DNA  with   several  genes  or  coding  regions   • or,  two  multi-­‐domain  proteins  that  share  only  one  common  domain   • Global  alignments  won’t  work  well,  as  they  depend  on  order  of  genes  or  domains  as  well  as   local  sequence   • Need  a  method  to  recognize  smaller  regions  of  similar  sequence   • Can  recognize  a  small  sub-­‐sequence  within  a  coding  region,  or  several  regions  in  different   order  within  complete  sequences   • To  do  this,  a  LOCAL  alignment  algorithm  is  needed   • Local  alignments  are  often  more  informative  than  global  alignments   • LOCAL  alignments  focus  on  regions  with  significant  similarity     The  Smith-­‐Waterman  algorithm   • A  local  alignment  algorithm,  adapted  from  the  Needleman  &  Wunsch  procedure   • Will  recognize  regions  of  local  similarity  within  larger  sequences   • Can  also  recognize  global  similarity,  if  similarity  exists  (e.g.  regions  are  similar  and  in  the   same  sequence  order)   • Uses  2D  array,  similar  to  Needleman  &  Wunsch     • Produces  optimal  alignments,  like  N&W   • Requires  a  harsher  penalty  for  mismatches,   e.g.  match  =  1,  mismatch  =  -­‐1,  gap  =  -­‐2   • Uses  the  same  basic  rules  as  Needleman  &  Wunsch   o Array  entry  (i,j)  is  maximum  of:     [  (i-­‐1,  j-­‐1)  +  (match  or  mismatch),        (i-­‐1,  j)  +  gap  penalty,      (i,  j-­‐1)  +  gap  penalty  ]   • New  rules:   o Zero  is  the  minimum  value  for  any  entry  in  table   o When  tracing  back  through  matrix  to  produce  the  alignment,  start  at  the  maximum   value  in  the  table  (not  lower  right  corner)     o Stop  traceback  when  a  zero  is  reached     General  approach  to  pairwise  alignment   • Choose  two  sequences   •  Select  an  algorithm  that  generates  a  score   •  Allow  gaps  (insertions,  deletions)   •  Total  score  reflects  degree  of  similarity   •  Alignments  can  be  global  or  local   •  Estimate  probability  that  the  alignment  occurred  by  chance  (e.g.  using  E-­‐values)     Scoring  functions   • Scoring  functions  can  be  derived  for  nucleotides  and  amino  acids  (particularly  amino  acids)   • This  can  be  viewed  purely  statistically:   o The  scoring  function  can  be  based  on  observed  frequencies  of  exchange   • The  underlying  biochemical  mechanisms  can  be  considered  as  well  (in  part  to  explain   observed  frequencies)     Possible  changes  in  nucleotides     Example  of  BLASTN  matrix  score     Representing  amino  acid  sequences     Possible  changes  in  amino  acid  sequences   • Observed  changes  in  amino  acid  sequence  are  due  to  changes  in  the  corresponding  nucleotide   coding  sequence   • Changes  in  nucleotide  sequence  can     -­‐  result  in  a  change  of  amino  acid     -­‐  have  no  effect  on  AA  (synonymous  substitution)     -­‐  result  in  chain  termination  or  elongation   • Amino  acids  are  often  under  functional  constraint     o The  amino  acid  (or  a  similar  amino  acid)  is  required  for  the  function  or  structure  of   the  protein   • Amino  acids  under  functional  constraint  will  have  a  lower  observed  rate  of  change  over  time       Changes  in  amino  acid  sequences   • The  observed  probability  of  an  amino  acid  exchange  is  a  function  of  a  number  of  factors,   including:   o Functional  constraint   o Frequency  of  amino  acid   o Codon  frequency  and  codon  bias   o Mutability  of  amino  acid   § Relative  functional  importance   § Relative  chemical  similarity     • A  scoring  function  can  be  calculated  from  observed  amino  acid  exchange  rates  in   homologous  proteins,  effectively  integrating  the  above  factors       Normalized  frequencies  of  amino  acids       The  relative  mutability  of  amino  acids   ASN   134      MET   94         GLY   49       SER   120      GLN   93       TYR   41     ASP   106      VAL   74       PHE   41   GLU   102      HIS   66       LEU   40     ALA   100      ARG   65         CYS   20   THR   97      LYS   56         TRP   18   ILE   96        PRO   56         Substitution  Matrices   • A  substitution  matrix  contains  values  based  on  the  probability  that  amino  acid  i  mutates  into   amino  acid  j  for  all  pairs  of  amino  acids.     • Substitution  matrices  are  constructed  by  assembling  a  large  and  diverse  sample  of  verified   pairwise  alignments  (or  multiple  sequence  alignments)  of  amino  acids.   • Substitution  matrices  should  reflect  the  true  probabilities  of  mutations  occurring  through  a   period  of  evolution.  The  two  major  types  of  substitution  matrices  are  PAM  and  BLOSUM.     Protein  Scoring  Matrices   • Protein  scoring  matrices  are  derived  from  examining  actual  substitution  rates  in  nature   • One  common  matrix  is  PAM:    Point  Accepted  Mutation.  This  was  the  first  widely  used  amino   acid  scoring  matrix   • BLOSUM:  matrix  derived  from  observing  rates  among  similar  proteins   o BLOSUM-­‐62**   o BLOSUM-­‐80             Pairwise  alignment  of  protein  sequences   • The  same  optimal  alignment  methods  can  be  used:   o Needleman-­‐Wunsch  (global)   o Smith-­‐Waterman  (local)   • Typically  match/mismatch  scoring  is  replaced  by  amino  acid  specific  scores,  i.e.  (L,L)  =  6,   (L,V)  =  2,  (L,D)  =  -­‐4.       • Scores  are  based  on  observed  frequencies  of  substitution   • Gap  penalties  are  usually  based  on  the  origination/extension  model  (affine  gap  penalties),   though  linear  gap  penalties  may  also  be  used.           Lecture  7  Scoring  matrices     Derivation  of  scoring  matrices   • Scoring  functions  are  based  on  probabilistic  approach,  usually  in  the  form  of  a  comparison   between  alternate  models.   • Two  standard  models  for  comparison  are  i)  a  random  model  and  ii)  a  non-­‐random   (evolutionary)  model   • For  any  given  position  in  a  pairwise  alignment,  probabilities  could  be  estimated  for  each   model   • Models  can  be  compared  by  the  ratio  of  probabilities,  or  the  odds  ratio   • Many  scoring  systems  use  some  variation  of  a  log-­‐odds  ratio     • Define  q a,b    the  probability  that  an  alignment  of  amino  acids  a  and  b  occur  in  an   homologous  sequence  pair   • Define  p ,a  pb  as  the  overall  frequencies  of  occurrence  of  amino  acids  a  and  b   • Probability  of  a  random  alignment  is  the  product  of  the  amino  acid  frequencies,  p p a b   • The  odds  ratio  for  an  alignment  of  a  and  b  in  a  sequence  is   Odds  ratio  =q a,  b a b     PAM  matrices   • The  first  widely  used  scoring  system  for  aligning  proteins,  created  by  Dayhoff  et  al.  (1968,   1978).  PAM  matrices  are  based  on  global  alignments  of  closely  related  proteins  (>85%   sequence  identity)   • An  accepted  point  mutation  is  the  replacement  of  one  amino  acid  by  another,  and  accepted  by   natural  selection   • Dayhoff  et  al.  constructed  a  mutation  probability  matrix  M,  where  element  Mij  represents  the   probability  of  amino  acid  j  being  replaced  by  amino  acid  i  over  a  given  interval.   • For  the  PAM1  matrix,  this  interval  is  equivalent  to  onechange  per  100  amino  acids.             Extrapolation  in  PAM  matrices   • The  original  Dayhoff  matrices  used  similar  sequences  (>85%  identity)  to  create  high  quality   alignments.   • Probabilities  were  derived  from  these  alignments,  and  then  adjusted  to  give  an  average   probability  of  change  of  1  in  100.  This  produced  a  PAM1  matrix.     • Matrices  for  higher  degrees  of  divergence  were  generated  by  extrapolation  from  the  PAM1   mutation  probability  matrix:   250 • e.g.  PAM250  =  (PAM1)       • These  matrices  are  then  converted  to  log-­‐odds  scores.     BLOSUM  matrices   • BLOSUM  matrices  are  based  on  local  alignments  (developed  after  the  PAM  matrices  and  with   more  data)   • Derived  from  BLOCKS  database,  containing  a  large  number  of  ungapped  multiple  sequence   alignments   • Developed  by  Henikoff  and  Henikoff,  1991,  1992.  BLOSUM  =  Blocks  Substitution  Matrices     • Differences  with  PAM  matrices:   o Contain  quite  dissimilar  sequences  (still  homologous)   o Sequences  are  compared  directly  (no  evolutionary  model)   o Intent  is  to  identify  conserved  regions  in  sequences     Derivation  of  BLOSUM  matrices   • One  potential  issue  is  that  similar  sequences  in  the  training  set  will  bias  results.  To  address   this,  sequences  were  weighted  using  similarity  thresholds   • Similar  sequences  greater  than  a  given  threshold  (80%,  62%  etc.)  were  clustered  into  a  single   sequence.   • e.g.  for  BLOSUM62,  no  two  (averaged)  sequences  used  have  more  than  62%  identity.   • Given  a  filtered  database  of  homologous  sequence  alignments,  the  database  can  be  reduced  to   a  set  of  amino  acid  pair  frequencies ij.f   • Observed  probability  of  observing  a  given  pair:   • This  is  the  top  half  of  the  log-­‐odds  equation.   • What  is  the  probability  of  a  random  match?   • Probability  of  residue  type  i  in  an  i,j  pair  is     • The  expected  probability  of  a  random  match  is  therefore     e  =  pp  for  i  =  j   ii i j e ij  = i j     j i i j for  i  ≠  j   • In  the QN  and NN  ample,  p Q =  0.029  +  (0.114  /  2)  =  0.086,   • Or  equivalently  p Q =  (3  +  12/2)*105  =  9/105  =  0.086.   • Here,  e QN  =  2i j  =  2*0.086*0.114  =  0.0196.   • Log  odds  ratio  =  l2g ij  q ij /  e   • Score  in  BLOSUM  matrix  =  S  (i,j)  =  2 ijlog ij  q  /  e   • Where  2  is  a  scaling  factor  and  values  are  rounded  to  the   nearest  integer.   Lecture  8  More  on  Gaps,  and  an  introduction  to  BLAST     What  about  gap  penalties   • For  good  or  bad,  gap  penalties  typically  are  not  derived  as  rigorously  as  scoring  functions   • A  heuristic  approach  is  used,  i.e.  default  gap  penalties  are  chosen  because  they  appear  to   produce  good  results     • We  have  used  linear  gap  penalties  in  alignment  algorithms:   • Affine  gap  penalties,  which  have  separate  origination  (G )  andO  extension  (G )  penaEties,  are   used  in  most  algorithms   g(n gap )  = O-­‐ gap G E    n or   g(n gap )  = O-­‐ gap -­‐1)GE.     Integration  of  affine  gaps  into  dynamic  programming   • One  option  is  to  calculate  scores  for  each  path  in     local  and  global  algorithms   • Brute  force  option  results  in  time  complexity  O(mn )  2 • It  is  however  possible  to  derive  a  second  recursive  term  ij    the  matrix  can  be  calculated  in   time  O(mn)       Database  searching   • Sequence  alone  is  not  informative       -­‐  want  to  deduce  function     • BLAST  is  a  set  of  algorithms  to  compare  a  query  sequence  against  all  sequences  in  database   • Similarity  is  measured  by  aligning  two  sequences   • Each  comparison  is  given  a  score  reflecting  degree  of  similarity  (higher  the  greater  the   similarity)   • The  chance  (expectation  value)  that  the  match  could  have  occurred  as  a  random  hit  is   estimated     • Remember  similarity  does  not  always  equal  function!   • Sequentially  align  a  query  sequence  to  each  subject  sequence  in  the  database   • Results  are  reported  as  a  ranked  list  with  scores  and  statistics   • Heuristic  methods  are  required  due  to  the  huge  size  of  many  databases:    Word  Based   Methods  (k-­‐tuples  or  k-­‐mers)   • Searching  an  entire  database  using  optimal  methods  can  take  a  considerable  amount  of  time   • Smith-­‐Waterman  and  Needleman-­‐Wunsch  require  a  minimum  of  O(mn)  time,     • FASTA  and  BLAST:  faster,  heuristic  methods  of  database  searching   • Cost  is  that  they  are  no  longer  guaranteed  to  find  the  best  result;  they  examine  only  part  of   the  search  space.  (They  work  very  well  though)     BLAST   • BLAST  (Basic  Local  Alignment  Search  Tool)  allows  rapid  sequence  comparison  of  a  query   sequence  against  a  database.   • The  BLAST  2.0  algorithm  is  fast,  accurate,  and  web-­‐accessible.       Why  use  BLAST?   • BLAST  searching  is  fundamental  to  understanding   • The  relatedness  of  a  query  sequence  to  other     • Known  proteins  or  DNA  sequences.   • Applications  include   o  Identifying  orthologs  and  paralogs   o  Discovering  new  genes  or  proteins   o  Discovering  variants  of  genes  or  proteins   o  Investigating  expressed  sequence  tags  (ESTs)   o  Exploring  protein  structure  and  function     BLAST  continued   • Basic  Local  Alignment  Search  Tool   • Developed  to  perform  a  sequence  similarity  search  by  an  algorithm  faster  than  Smith-­‐ Waterman   • Word  Based  Method  that  finds  ungapped,  locally  optimal  sequences  alignments  (words  are   typically  2-­‐3  amino  acids)   • Permits  inexact  word  matches     • Heuristic  procedure   • Can  specify  word  length:    minimum  required  to  achieve  a  significant  word  score   • 2  or  3  for  proteins   • 7,  11,  15  for  nucleotide  seq.     Four  components  to  a  BLAST  search   1. Choose  the  sequence  (query)   2. Select  the  BLAST  program   3. Choose  the  database  to  search   4. Choose  optional  parameters   Then  click  “BLAST”     Comments  on  BLAST  searching   • Protein  alignments  are  more  useful  for  inferring  structure  and  function  from  sequence   • Look  at  the  scores   o E  >  1  is  likely  to  indicate  a  chance  alignment   o E  <  0.05  or  0.1  often  used  to  represent  biological  significance   • Low  complexity  regions  can  be  similar  without  being  related   o Usually  blocked  out  by  BLAST   • Significant  similarity  most  often  means  homology,  but  homology  does  always  result  in   sequence  similarity   • Identity  =  %  identical  residues   • Similarity  =  %  similar  residues  (e.g.  I    V)   • Homology  is  still  a  yes  or  no  question         Lecture  9  Basic  Local  Alignment  Search  Tool  (BLAST)     How  the  original  BLAST  algorithm  works:  3  phases   Phase  1:     compile  a  list  of  word  pairs  (w=3)  above  threshold  T   Example:     for  a  human  RBP  query   …FSGTWYA…     A  list  of  words  from  the  given  sequence:   FSG  SGT  GTW  TWY  WYA  ...     Phase  2:     Scan  the  database  for  entries  that  match  the  compiled  list.     Phase  3:     When  you  manage  to  find  a  hit  (i.e.  a  match  between  a  “word”  and  a  database   entry),  extend  the  hit  in  either  direction.  Keep  track  of  the  score  (use  a  scoring   matrix),  and  stop  when  the  score  drops  a  given  amount  below  the  maximum   score  attained.  Keep  locally  optimal  segments  or  “high  scoring  pairs”     How  to  interpret  a  BLAST  search:  expect  value   • It  is  important  to  assess  the  statistical  significance  of  search  results.   • For  global  alignments,  the  statistics  are  poorly  characterized.   • For  local  alignments  (including  BLAST  search  results),  the  scores  follow  an  extreme  value   distribution  (EVD)  rather  than  a  normal  distribution.   • The  expect  value  E  is  the  number  of  alignments  with  scores  greater  than  or  equal  to  score  S   that  are  expected  to  occur  by  chance  in  a  database  search.   • An  E  value  is  related  to  a  probability  value  P,   Where     P  =  1  -­‐  e           (1)   For  values  of  E  and  P  <  0.05,    E  ≈  P   • Given  the  cumulative  distribution  function  of  the  extreme  value  distribution,  the  probability   of  a  random  score  S  being  below  a  given  threshold  score  x  is       P(S  <  x)  =  exp(-­‐e -­(x-­µ) )       (2a)   Or,  S  greater  than  or  equal  to  a  given  threshold  x  is     P(S  ≥  x)  =  1  -­‐  exp(-­‐eλ(x-­‐µ))       (2b)   Where  the  parameters  λ  and  µ  describe  the  shape  of  the  curve  and  are  dependent  on  scoring   function  and  the  size  of  the  database.     •
More Less

Related notes for BIOL 365

Log In


Join OneClass

Access over 10 million pages of study
documents for 1.3 million courses.

Sign up

Join to view


By registering, I agree to the Terms and Privacy Policies
Already have an account?
Just a few more details

So we can recommend you notes for your school.

Reset Password

Please enter below the email address you registered with and we will send you a link to reset your password.

Add your courses

Get notes from the top students in your class.