Lecture 6: What is a gene and where can I find it?
1) Definition of a gene
Readings: Alberts textbook, Ch 8, pp. 550-552
What is the fold degeneracy of a primer constructed for the following AA sequence? NQWWMSY
Fold degeneracy of primer reflects degeneracy of codon table for those particular AA residues – Ex: asparagine (N)
can be coded for by 2 different codons in codon table – they differ by a transition at the 3rd codon position.
Glutamine (Q) – 2 different codons. Tryptophan (W) – not at all degenerate; only 1 codon codes for it. Methionine
(M) – 1. Serine (S) – highly degenerate; 6 different codons code for it. Tyrosine (Y) – 2. Fold degeneracy
is a combinatorial thing – how many possible combos of nucleotides can give rise to this AA sequence? – Multiply
the codon counts for N, Q, W, W, M, S, Y: 2*2*1*1*1*6*2 = 48.
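The multiplication above can be checked with a short sketch. It assumes the standard genetic code (written here with DNA bases in T, C, A, G order); the names `CODON_TABLE`, `CODONS_PER_AA` and `fold_degeneracy` are made up for illustration.

```python
from collections import Counter
from itertools import product
from math import prod

# Standard genetic code, codons enumerated in T, C, A, G order ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# How many codons code for each amino acid (e.g. S -> 6, W -> 1).
CODONS_PER_AA = Counter(CODON_TABLE.values())


def fold_degeneracy(peptide):
    """Multiply the number of possible codons at each residue."""
    return prod(CODONS_PER_AA[aa] for aa in peptide)


print(fold_degeneracy("NQWWMSY"))  # 48
```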
Questions of the Day
1) What is gene?
2) What sequence is this anyway?
3) What is its function?
Molecular Definition of Gene
Entire nucleic acid sequence (usually DNA) that is necessary for synthesis of protein or RNA. In other words,
genes are segments of DNA that are transcribed by RNA polymerase into RNA. Remember there are 2 types of genes:
1. Protein-coding – traditional view of what gene is – gene would code for protein.
2. RNA-coding – sometimes RNA itself is final product that has function inside cell & doesn’t need to be
subsequently translated into protein in order to accomplish that function – tRNAs, ribosomal RNAs.
- Blue are stop codons
- Only 1 with open reading frame is #2 – no stops – if you get stops, not
going to get full protein (protein synthesis will terminate).
- Red regions: no stop codons.
- 1 with largest reading frame is #1.
- If you look at any sequence of fragment of DNA – double stranded with
orientation 5’ to 3’ – if we were to try to translate this piece of DNA but
we didn’t know which way we should be going & if we didn’t know
exactly which reading frame we should be in, there are 6 possible ways to
try & translate this sequence.
- Reading frame #1: take 1st 3 nucleotides, TTA – translates into leucine; 2nd 3, TTT: phenylalanine; TAT is
tyrosine; & this frame contains stop codon. If you were to shift it by 1 & start reading frame with T – TAT – tyrosine,
TTT is phenylalanine & so on – that’s how you get the different reading frames.
- 1 clue as to whether or not you might be looking at protein-coding gene sequence would be that proteins
are composed of generally long stretches of AA sequences so we would not expect stop codon in middle of
protein. Look at all these open reading frames & rule out ones with stop codons in them.
- Blue lines are stop codons – are fairly common if you’re not in protein-coding reading frame. According to
this kind of thinking, can just look at 6 different reading frames & choose 1 with longest open reading
frame – sequence b/w 2 stop codons. Potential protein coding gene – starts with methionine & ends with stop codon.
- If you have genes that have introns, can’t expect long open reading frame b/c it could be interrupted at any
point by intron that could actually change reading frame – might actually hit stop codon b/c it’s an intron.
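The six-reading-frame idea above can be sketched as follows. This is only an illustration under stated assumptions: standard genetic code, uppercase A/C/G/T input, no introns (so it would mislead on intron-containing genes, as noted above), and an ORF taken to run from the first methionine to the next stop codon. All function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the opposite strand, written 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate codon by codon, ignoring 1 or 2 leftover bases at the end."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frames(dna):
    """Translate 3 offsets on the given strand (+) and 3 on the other strand (-)."""
    frames = {}
    for label, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{label}{offset + 1}"] = translate(strand[offset:])
    return frames


def longest_orf(dna):
    """Return (frame, protein) for the longest M...stop stretch in any frame."""
    best = ("", "")
    for name, protein in six_frames(dna).items():
        # Drop the last piece: it is not terminated by a stop codon.
        for segment in protein.split("*")[:-1]:
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best[1]):
                best = (name, segment[start:])
    return best


frames = six_frames("TTATTTTAT")
print(frames["+1"])  # LFY  (TTA TTT TAT)
print(frames["+2"])  # YF   (TAT TTT, shifted by 1)
print(longest_orf("CCATGAAATTTTAGCC"))  # ('+3', 'MKF')
```

Splitting each translated frame at `*` mirrors the "rule out frames with stop codons" reasoning: frames that are not protein-coding break into many short pieces.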
It’s dark & raining outside, you get phone call from desperate grad student in middle of night, who faxes you
unknown sequence before disappearing forever. How do you figure out what this sequence is?
Finding Info on Mystery Sequence
1. Are there sequences in database which are similar? BLAST search
2. Can it be aligned to family of sequences? Clustal sequence alignment
3. How similar is it to this family? Sequence similarity
4. Is it related to this family on a tree? Phylogenetic analysis
5. Which protein domains does it contain? SMART analysis
BLAST search (does it match perfectly a sequence in the database?)
1) National Center for Biotechnology Information (NCBI)
- Access genome databases with query sequence.
What is BLAST Search?
1) Algorithm that uses short stretches of sequence similarity to find related genes in database – if you queried with
very long sequence & that whole sequence had to be aligned to every single gene in genome database, that would be
incredibly slow process.
2) Algorithm is fast, efficient.
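A toy sketch of the "short stretches of similarity" idea: index every short word (length `k`) in the database once, then look up the query's words instead of aligning the whole query against every sequence. This is not the actual BLAST implementation (which also extends and scores the seeds); the function names, the tiny database and the choice `k = 4` are assumptions for illustration.

```python
from collections import defaultdict


def build_word_index(database, k):
    """Map every length-k word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def seed_hits(query, index, k):
    """Return, per database sequence, the (query pos, database pos) of shared words."""
    hits = defaultdict(list)
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], []):
            hits[name].append((qpos, dpos))
    return dict(hits)


# Hypothetical two-sequence "database".
database = {"geneA": "ACGTACGTTTGACC", "geneB": "GGGGGGGGGGGG"}
index = build_word_index(database, k=4)
print(seed_hits("TTTGACC", index, k=4))
# {'geneA': [(0, 7), (1, 8), (2, 9), (3, 10)]}
```

The index is built once and each query word is a dictionary lookup, which is why only sequences sharing a short exact stretch with the query need to be examined further.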
- Red is what you inputted. Black are all different sequence bases that
line up with your sequence
- Want low E-value & high score.
- Show you distribution of hits – where they align in query
sequence. Show you descriptions of each of those hits – give you
accession #s, species the sequence came from, scores & E-values that
give you indication of how well these sequences in database might
match your sequence.
- These statistics are probabilistic – they estimate how likely a
match this good would be to turn up by chance against random sequence.
- In general, the higher the score, the better the match & the lower the
E-value, the better the match.
- Also give you region of query sequence that matched sequence in database.
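To see why a higher score goes with a lower E-value, here is a small sketch using the relation E = m * n * 2^(-S'), where S' is the bit score and m, n are the query and database lengths. Real BLAST uses "effective" lengths with edge corrections; plugging in raw lengths, and the function name `expect_value`, are simplifying assumptions.

```python
def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least bit_score."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Each extra bit of score halves the E-value; a bigger database raises it.
print(expect_value(10, 1, 1024))          # 1.0
print(expect_value(40, 300, 10**9) > expect_value(50, 300, 10**9))  # True
```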
Questions from last time:
1) What is the sequence of the template strand read in the 5’ → 3’ direction? Shorter fragments are further
down the gel because they traveled farther, and you read from the bottom upwards, so:
5’ ACCGATT 3’
3’ TGGCTAA 5’
2) The original template is:
5’ ACTTACGTAC 3’
3’ TGAATGCATG 5’
Primer = 5’ ACTT
What does the sequence gel look like?
A C G T
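One way to work out question 2 is sketched below. It assumes the primer 5’ ACTT anneals to the bottom strand and is extended along it, so the newly made strand reads 5’ ACTTACGTAC 3’, and that each lane contains the fragments ending in that lane's dideoxy base. Function names are hypothetical.

```python
def sanger_lanes(new_strand, primer):
    """For each lane (A, C, G, T), list the lengths of fragments ending in that base."""
    lanes = {base: [] for base in "ACGT"}
    # Fragments are the primer plus 1, 2, ... added bases.
    for length in range(len(primer) + 1, len(new_strand) + 1):
        lanes[new_strand[length - 1]].append(length)
    return lanes


def read_gel(lanes):
    """Read bands from the bottom (shortest fragment) upwards."""
    bands = sorted((length, base) for base, lengths in lanes.items() for length in lengths)
    return "".join(base for _, base in bands)


lanes = sanger_lanes("ACTTACGTAC", "ACTT")
print(lanes)            # {'A': [5, 9], 'C': [6, 10], 'G': [7], 'T': [8]}
print(read_gel(lanes))  # ACGTAC
```

Under these assumptions the lowest band sits in the A lane (5 bases long) and the gel reads ACGTAC from the bottom up, i.e. the bases added after the primer.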
Slide 1
1) Only short stretch of similarity needed for a BLAST hit
2) Aligning whole query against every sequence in database is too computationally expensive (will crash computer)
3) Higher score or lower E-value is better
- The higher the score, the closer the match. The E-value tells
you whether the match could be by chance: the lower it is, the more
likely that the match is not just by chance.
- Probabilistic statistics to see how likely your sequence
matches those in the database
- Will give you region of query sequence that matched in database
Slide 2
- There is no exact (perfect) match in database from the
Now we search for how closely it matches some known family
of genes, so we pull the top (closest) BLAST results.
- Sequence alignment and similarity is next: find similar
sequences & measure how similar the query is to the
sequences in the database (matching fragments, not an
exact match).
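A minimal sketch of one common similarity measure, percent identity, computed on two sequences that have already been aligned (e.g. by Clustal). Counting only columns where neither sequence has a gap is one convention among several and is an assumption here, as is the function name.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of gap-free alignment columns where the two residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


# 5 gap-free columns, 4 of them identical.
print(percent_identity("ACGT-A", "ACCTGA"))  # 80.0
```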