LECTURE 14: GATTACA: Genomics and Next-Generation Sequencing
This lecture presents new methods of DNA sequencing, which replace the original Dideoxy
Sequencing method (Sanger Sequencing), the method originally used to sequence the
human genome.
• Original definition/goal: Determining the DNA sequence of the genome
o to identify the location of genes (process of genome annotation).
• Accuracy level: 70-80% (not terribly good, but we’re getting better at it)
• Modern days: new initiatives (referred to as the –omics era)
o Genomics (still present)
o Functional Genomics: Find the functions of the genes (globally, with a holistic
approach rather than by evaluating the function of one gene at a time).
o Proteomics: Find the functions of the protein that is encoded by each gene.
Functional Genomics and Proteomics: more to come in Roy’s section.
o Evolutionary Genomics: Achieved through complete sequencing of multiple
genomes and comparing them to deduce their evolutionary history.
o Transcriptomics: Understanding the population of transcripts.
o Phenomics: Analysis of the entire phenotype [observable characteristics of an
organism], and its evolution [how it changes through developmental time].
o Spliceomics: Observing splice elements/variants.
o Others: glycomics, metabolomics, lipidomics, predictomics
o Eventually “omicsomics”…?
Popular theme: Looking at -omics in a more holistic way.
Organism genomes that were sequenced:
• One of the first genomes to be sequenced: Epstein-Barr virus (about 172 kbp), in 1984.
o Considerable effort back then considering the primitive technology available.
• Since then, thousands of viral, organellar, prokaryotic and eukaryotic
genomes have been sequenced.
• Focus on disease-causing viruses and bacteria
o Influenza Virus and Yersinia pestis, for example.
• Eukaryotic organisms: the genomes of certain model organisms were sequenced
first and are more frequently used in labs.
o Yeast: One of the first, b/c its genome is very small (about 12 Mbp).
o Roundworm (Caenorhabditis elegans): Common model system, b/c easy to
sequence, and we know its exact cellular composition [we can fate map every cell in
the organism, we know exactly how many there are, etc.].
o Fruit fly (Drosophila melanogaster): Often used in molecular genomics.
o Mice: Among the last to be sequenced, because their genome is relatively large.
o Plant (Arabidopsis thaliana): Common weed – often found in the cracks of
sidewalks and other places – this fast growing plant is easy to sequence, because
it has a small genome that is easy to manipulate. Model organism despite its lack
of agricultural benefit.
• Now: There are over 2000 complete/ongoing eukaryotic Genome Projects [sequencing
o Fugu fish: Model system for vertebrates. Easy to
sequence, because it has a very streamlined genome with
very few transposable elements to get in the way of
understanding what the genes do.
o Sea Squirt: Model for vertebrate development (similar to the
zebrafish). The larva has a notochord (a backbone-like rod), which is lost
in the adult phase.
o Mosquito: Can carry malaria (and other diseases)
o Rice: One of the early genomes to be sequenced, because it is an
important part of the human diet (main source of calories for most
people on the planet).
• Thanks to modern, more efficient technologies, many more genomes have been
sequenced.
o Others: silk worm, armadillo, elephant, the insect that carries Chagas disease,
banana, chicken, several fungi, tomato, radish, lettuce, spider mite, etc.
Sequencing method: Shotgun Sequencing
Main obstacle to DNA sequencing: we can’t sequence a long chromosome directly (DNA
Polymerase can only read sequences of at most ~1000 base pairs = limitation).
These very large molecules/chromosomes, which are extracted very carefully, therefore have
to be broken down using the following technique.
Solution: Fragmentation of chromosomes
First step of Shotgun Sequencing - Fragmentation:
• We use many exact copies of the chromosome (For example, there are 5 in the image
below, on the left)
• Break down the chromosome into small fragments of different lengths (done by
mechanical shearing or enzymatic breakdown of the DNA)
• The reference dots on the image indicate that the copies of the chromosomes are identical
to one another. We can see that each segment containing these dots differs in length after
fragmentation of the chromosome.
• The idea is to capture the sequence in every one of those bits and reassemble all the
information back together. This is what the next step is about.
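The fragmentation step can be sketched as a small simulation. This is only an illustration under stated assumptions: the function name `shotgun_fragments`, the toy sequence, and the fragment length range are hypothetical and not from the lecture, and real shearing produces far longer fragments. The start positions are recorded only so the simulation can be checked; a real experiment does not know them.

```python
import random

def shotgun_fragments(chromosome, copies=5, min_len=4, max_len=8, seed=0):
    """Cut several identical copies of a chromosome at random positions.

    Returns a list of (start, fragment) pairs. Each copy is cut
    independently, so the break points differ from copy to copy.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    fragments = []
    for _ in range(copies):
        pos = 0
        while pos < len(chromosome):
            length = rng.randint(min_len, max_len)  # random fragment length
            # The last fragment of a copy may be shorter than min_len.
            fragments.append((pos, chromosome[pos:pos + length]))
            pos += length
    return fragments

# Hypothetical toy "chromosome"; five copies, as in the lecture's example.
chromosome = "GATTACAGATTACCAGGATTTACAGGA"
frags = shotgun_fragments(chromosome)
for start, frag in frags[:5]:
    print(start, frag)
```

Because every copy is cut at different positions, fragments from different copies overlap one another, which is what the assembly step relies on.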
Second step of Shotgun Sequencing - Assembly [VERY COMPLEX]
• Explanation of the above diagram: The reference sequence is the Original Strand. We
typically don’t know it; it is what we are trying to reconstruct. The blue region indicates
the overlap between the two fragment sequences.
• DNA Polymerase helps determine the sequence of each small fragment. This information
is then combined to form a full chromosomal DNA sequence.
• To form the original segment, we need fragments from different copies of the
chromosome that contain common/overlapping regions. The common regions for
consecutive fragments are aligned, creating multiple tiling paths, and eventually the small
segments are recombined into a longer DNA fragment. (See image on the right)
• Complicated, because we have to faithfully recombine millions/billions of similar
segments of DNA (more pieces than the world’s largest jigsaw puzzle – 24,000 pieces,
1.5 x 4.2 meters).
o Contiguous sequences (contigs): Reconstructed sequence of segments with
overlapping regions. Shown in red on the diagram below. Various contigs are
separated by gaps. [Wiki: Sets of overlapping clones that form a contiguous
stretch of DNA are called contigs].
o Gaps: Regions that separate the contigs and contain no sequence information (so
we can basically bridge those contigs together).
o Tiling path: Combination of contigs and gaps. [Wiki: minimum number of clones
that form a contig that covers the entire chromosome comprise the tiling path that
is used for sequencing].
o Scaffold: Result of the assembly of the tiling path (original DNA sequence).
Region that represents a series of contigs in their appropriate order. There are still
gaps (so it’s not a complete sequence). [Wiki: consists of overlapping contigs
separated by gaps of known length].
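The idea of aligning overlapping fragments into contigs can be illustrated with a toy greedy merge. This is a minimal sketch, not the algorithm used by real assemblers: it assumes error-free reads from a single strand with no repeats, and the names `overlap`, `drop_contained`, `greedy_assemble` and the example reads are hypothetical. Reads that share no overlap end up in separate contigs, which corresponds to a gap.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def drop_contained(seqs):
    """Remove duplicates and sequences fully contained in another one."""
    unique = list(dict.fromkeys(seqs))
    return [s for s in unique if not any(s != t and s in t for t in unique)]

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two sequences with the longest overlap."""
    seqs = drop_contained(reads)
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            # No overlaps left: what remains are contigs separated by gaps.
            return sorted(seqs)
        merged = seqs[best_i] + seqs[best_j][best_k:]
        rest = [s for n, s in enumerate(seqs) if n not in (best_i, best_j)]
        seqs = drop_contained(rest + [merged])

# Toy reads from two regions of a hypothetical chromosome.
reads = ["ATGCGT", "CGTACC", "ACCTGA", "GGAATTCC", "TTCCAAGG"]
contigs = greedy_assemble(reads)
print(contigs)
```

Here the first three reads tile one region and the last two tile another, so the result is two contigs. Ordering such contigs and estimating the gap between them (the scaffold) needs extra information that this sketch does not model.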
Genome sequencing status [where most of the projects are at the moment]:
• The grid on the right indicates the number of genomes that have been sequenced for
viruses, prokaryotes, and eukaryotes (can be more than one per species)
• For viruses, there are no contigs or scaffolds currently deposited in databases, because it’s
so easy to sequence a viral genome completely due to its short length (one contig usually
covers the entire genome).
• Some sequencing projects are left incomplete (keep contigs, scaffolds or raw reads),
because further sequencing is unnecessary for that particular project and not because the
scientists are inept, especially in the case of prokaryotes.
• Genomes that are 100% sequenced are rare – but there are a few. They’re usually missing
a few bits. Eukaryotic genomes, for example, are almost 100% sequenced.
• Raw reads: Bits of genome that haven’t been assembled yet (no project yet).
• Most viral and prokaryotic genomes have been sequenced [their sequencing status is
relatively stable since we hit all the major disease-causing bacteria and viruses], while
there are still many eukaryotic genomes left to sequence (or individuals of the same
species to re-sequence).
o Some genomes have been sequenced several times. (For humans: 1000s of times,
since the first draft sequence was completed in 2000).
Initial sequencing method, used until about 2008 or so: Dideoxy DNA Sequencing
• Requires a lot of workers, machines and money
• Works very well. The quality of the output is great.
• Now, there is a great push to speed up DNA sequencing.
New sequencing methods (also known as Next/2nd Generation Sequencing) include several
platforms, of which 2 persisted:
1) Roche 454 pyrosequencing (also known as 454 Sequencing or Pyrosequencing)
2) Illumina/Solexa sequencing (also known as Illumina)
Next Generation Sequencing (also known as 2nd Generation Sequencing):
• High-throughput (The method is very efficient):
sumanasinc.com/webcontent/animations/content/highthroughput2.html (with pyrosequencing)
o Massively-parallel: Millions of strands are sequenced at the same time, as
opposed to Dideoxy DNA Sequencing, which sequences only about 50-100 fragments
at a time [the number that could be loaded on one gel, for instance].
o Done with specific techniques (technology improvements):
Microfluidics: moving very small volumes of liquid
Fixed synthesis: DNA strands being sequenced are not transported in a
solution, but actually fixed on a matrix
High-resolution microscopy: It allows us to actually visualize the process
of DNA synthesis with a more detailed image
• Read length: The maximum length of a DNA sequence that can be read and synthesized
from the primer sequence by DNA Polymerase.
o Dideoxy DNA Sequencing: approximately 1000 nucleotides.
o Next Generation Sequencing: shorter length, but the high-throughput/massive