Monday, September 29, 2008
Lecture 7: What is gene & where do I find it? (Part II)
2) Assays of gene function
Readings: Alberts textbook, Ch 8, pp. 563-572
Now to determine the function
For genome projects, gene prediction is often difficult. Which of the
following factors is/are NOT a concern?
a) Alternative splicing
b) The degeneracy of the genetic code
c) Possible presence of introns
d) Variability in consensus sequences for transcription initiation
factors
e) None of the above
- Alternative splicing complicates gene prediction b/c the final
spliced product can vary in different tissues.
- Degeneracy of the genetic code is the answer b/c we know exactly
how the genetic code is degenerate, so we can predict from the
genetic code exactly what a coding sequence will code for, which is
what’s important in gene sequences. Degenerate refers to the fact
that the code isn’t used to its greatest capacity: there are 64
codons but only 20 AAs, so most AAs have more than one codon, and
some codons function as stop codons.
- Possible presence of introns can disrupt the reading frame,
resulting in frame shifts that may introduce premature stop codons;
introns are a huge problem for gene prediction.
- Variability in consensus sequences for transcription initiation: if
there were no variation in the consensus sequences for transcription,
we could use them as a 100% reliable marker for predicting genes, but
in fact there is always a certain amount of variability.
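The degeneracy point can be checked with a short Python sketch. It builds the standard genetic code from the usual compact 64-letter layout (bases in T, C, A, G order) and counts codons per amino acid; the names `CODON_TABLE` and `translate` and the two example sequences are illustrative, not from the lecture.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, with codons enumerated in
# T, C, A, G order at positions 1, 2 and 3. '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many codons specify each amino acid (or stop)?
codons_per_residue = Counter(CODON_TABLE.values())


def translate(dna):
    """Translate a coding sequence codon by codon (frame 0, no introns)."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


# Two made-up sequences that differ only at synonymous (third) positions.
seq_1 = "ATGCTTTCTTGA"
seq_2 = "ATGCTGTCATAA"

print(len(CODON_TABLE), "codons")
print(len(codons_per_residue) - 1, "amino acids,", codons_per_residue["*"], "stop codons")
print(translate(seq_1), translate(seq_2))
```

Going from DNA to protein is a plain table lookup, so degeneracy causes no ambiguity in that direction: both toy sequences give the same peptide.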
- We’re halfway through the bioinformatics investigation.
Phylogenetic analysis (looking for relatedness in a family tree)
- Bioinformatics can tell us a certain amount, but the experimental
investigations in the lab are complementary to it.
- DNA detective – provides clues to what the mystery sequence
might be – none are definitive – can be tested in the lab to confirm
- If the sequence matches one in the database perfectly, that implies
it isn’t just a predicted gene that came out of the analysis of genome
projects but one that came out of experimental investigations, with
conclusions backed by all that experimental evidence.
- All of the subsequent analyses are really just clues to finding
what the sequence really is.
- Mystery sequence: we BLASTed it against the NCBI database but didn’t
find an exact match; we did find things that are very similar (high
score/low E value). Take some of the other sequences that came
up in the BLAST search & see if our gene is similar to known genes
of other organisms. It could be a gene extensively studied
experimentally in model organisms (mouse) but not in the organism
we’re interested in.
- Clues come from sequence alignment & calculating sequence
similarities. We already did the multiple sequence alignment & found
the mystery sequence was similar to some of the sequences we pulled
out, but is it similar enough to be described as a previously
identified gene? To make that clear, we have to investigate further
with phylogenetic analysis: is it a gene that has already
been isolated, but from other organisms, or is it a member of an
entirely new gene family but related to other gene families that
- We used BLAST searches to find sequences in the databases similar
to the mystery sequence, took the most similar ones & used ClustalW
to align them with the mystery sequence. To assess how similar the
mystery sequence is to the others, we calculated percent
difference/identity. This gives some measure of the relationship b/w
the mystery sequence & the other sequences in the alignment, but not
a very complete description of that relationship.
To define the relationship, we need to do phylogenetic analysis.
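The percent identity calculation can be sketched in Python for two rows taken from an alignment. The convention used here is an assumption, since the lecture doesn't specify one: columns where both rows have a gap are skipped, and a gap against a residue counts as a difference. The two sequences are toy examples, not real opsin data.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percent of compared alignment columns with identical residues.

    Assumed convention: skip columns that are gaps in both rows; a gap
    opposite a residue counts as a mismatch.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


# Toy aligned rows (hypothetical, for illustration only).
mystery = "MKT-AILV"
other = "MKTGAI-V"

identity = percent_identity(mystery, other)
print(f"identity: {identity:.1f}%  difference: {100.0 - identity:.1f}%")
```

A single number like this says how similar two sequences are but not how they are related, which is why the notes move on to phylogenetic analysis.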
- Phylogenetic analysis uses a computer algorithm to build an
evolutionary tree from the input data.
2) “Multiple sequence alignment” – basis on which it starts to
estimate evolutionary relationships – (from Clustal W) it is needed
for any Phylogenetic analysis program.
3) “Phylogenetic Tree” – describing those evolutionary
relationships – to describe the relation between the different
4) “Minimizes” – assumes evol’n progressed from fewest # of
- What are some of the assumptions in the analysis?
1. Evol’n proceeds by shortest, most parsimonious path.
2. Evol’n proceeds via bifurcating path – that’s what gives rise to
3. Rate of change of sequences on phylogeny is very slow.
4. Sites in protein (nucleotide sequence) evolved independently
from each other so each 1 can be viewed as an independent
estimate of the evolutionary relationships among the species.
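Assumptions 1, 2 and 4 can be illustrated with a small Python sketch that counts the minimum number of changes a given bifurcating tree needs to explain an alignment. It uses Fitch's counting method, which is a swapped-in technique; the lecture doesn't say which method the program uses. The taxon names, sequences and trees are made up.

```python
def site_changes(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    A tree is either a taxon name (leaf) or a pair (left, right),
    i.e. every internal node bifurcates.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = site_changes(left, states)
    right_set, right_cost = site_changes(right, states)
    shared = left_set & right_set
    if shared:
        # The two lineages can agree on a state: no extra change needed.
        return shared, left_cost + right_cost
    # No shared state: at least one change happened on this split.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum changes over all sites, treating sites independently."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += site_changes(tree, column)[1]
    return total


# Hypothetical aligned sequences for four taxa.
ALIGNMENT = {"A": "AAG", "B": "AAG", "C": "ACG", "D": "ACT"}
TREE_1 = (("A", "B"), ("C", "D"))
TREE_2 = (("A", "C"), ("B", "D"))

for name, tree in (("tree 1", TREE_1), ("tree 2", TREE_2)):
    print(name, "needs", parsimony_score(tree, ALIGNMENT), "changes")
```

Under parsimony, the tree needing the fewest changes is preferred; here that is the one grouping A with B and C with D.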
(Parsimony means being very careful, not wasteful. In this context
it means minimizing the # of evolutionary steps needed.)
NEXUS (multiple sequence alignment along in particular type of
- NEXUS – Somewhat like a computer script, there are commands
within body of the NEXUS file.
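To make the "commands within the file" idea concrete, here is a minimal Python sketch that writes an alignment as a NEXUS DATA block. It covers only the basic `dimensions`, `format` and `matrix` commands; the function name `to_nexus`, the taxon names and the sequences are hypothetical, and taxon names are assumed to contain no spaces.

```python
def to_nexus(alignment, datatype="protein"):
    """Format a dict of {taxon name: aligned sequence} as a NEXUS DATA block."""
    names = list(alignment)
    nchar = len(alignment[names[0]])
    if any(len(seq) != nchar for seq in alignment.values()):
        raise ValueError("all aligned sequences must have the same length")
    width = max(len(name) for name in names)
    lines = [
        "#NEXUS",
        "",
        "begin data;",
        f"  dimensions ntax={len(names)} nchar={nchar};",
        f"  format datatype={datatype} missing=? gap=-;",
        "  matrix",
    ]
    # One row per taxon: padded name, then the aligned sequence.
    lines += [f"  {name.ljust(width)}  {alignment[name]}" for name in names]
    lines += ["  ;", "end;"]
    return "\n".join(lines) + "\n"


# Toy alignment (placeholder sequences, not real opsin data).
toy_alignment = {
    "mystery": "MKT-AILV",
    "human_violet": "MKTGAI-V",
    "chicken_violet": "MRTGAILV",
}

nexus_text = to_nexus(toy_alignment)
print(nexus_text)
```

Each statement ending in a semicolon is a command the analysis program reads, which is why the notes compare the file to a script.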
- The first two bioinformatics tools (BLAST, ClustalW) were widely
available web-based tools. Phylogenetic analysis programs are not as
user-friendly: there is no website you can go to, so you actually
have to download the program & run it locally.
- Some of these methods are much more computationally intensive;
that, not just how widely used they are, is why they are run locally.
- From this multiple sequence alignment, this computer algorithm
is going to estimate a phylogenetic tree.
- Gene is over 300 AAs long - the picture is only a segment.
- Interpretation of tree: the closest sequence to the mystery
sequence is the human violet sequence, and the next closest
sequence to both of these is chicken violet opsin.
- Can conclude that this mystery sequence is most likely a violet
opsin.
- This analysis highlights one of the serious potential pitfalls of
doing this type of analysis incautiously – with so few sequences
up there, it’s actually difficult to say anything conclusive about
what the mystery sequence is – may have incomplete sampling of
- Say we are very cautious: these are our preliminary results, but
we decide to gather more sequences & redo the analysis. Would
we get something different? We gather many more sequences by
going further down the BLAST search results, pull those sequences
out, align them again with Clustal & then perform phylogenetic
analysis on the extended multiple sequence alignment.
- Looks very different from previous tree even though sequences
from previous tree are still in this phylogeny here – with more data
& better sampling, results become more accurate.
- In this phylogeny, the mystery sequence is actually contained within
the UV/violet opsin clade (either ultraviolet or violet opsins). This
makes it quite clear that if we had been happy with the previous, much
smaller analysis, we might have concluded that the mystery sequence
was just a violet opsin, which would not necessarily have been correct.
Now we know it could be a violet opsin or an ultraviolet opsin or
- This phylogeny has branch length info on it – branch length info
is actually proportional to evolutionary distance b/w the
- We have some clues to what the mystery sequence might be.
Recap: Bioinformatics tools
1) Finding similar sequences in the database
2) Aligning sequences to the mystery sequence
3) Assessing how similar the mystery sequence is to
4) Determining the relationship of the mystery
sequence to others
5) Assessing prote
- Easiest way of assessing protein function is experimentally but
we still have bioinformatics to work on it with.
- Even if you had a lot of that protein, what do you do with it?
- What do you look for first before making lots of the protein, you
look at the protein and see if it gives you any clues to what’s going
on.