This chapter covers the cutting edge of microbiology: genomics, proteomics, and metagenomics.
A history of genomics is followed by what can be learned from genomics, how we do genomics,
and the post-genomics era. Examples of some interesting and exciting discoveries are given
along the way.
After reading this chapter and attending lecture, you should be able to:
1. Explain the history of genomics and the battle between government and industry
scientists to complete the first genome sequences.
2. Explain how gene content and genome size are related.
3. Explain how complete genome maps are reconstructed from sequences and major pathways are
4. Explain how the origins of genes are identified based on gene sequence analysis and the
limitations associated with pure in silico studies.
5. Describe how genomic DNA is prepared, cloned, and analyzed for complete sequencing.
6. Understand the difference between Sanger sequencing and next-generation sequencing.
7. Understand the concept of comparative genomics and what it can tell us.
8. Describe the process of determining differential gene expression between two cell types
or growth conditions using genomic microarrays.
9. Understand the difference between cDNA and proteomic studies and their limitations.
10. Understand the concept of RNA-Seq.
11. Understand the difference between the “core” genome and the “pan” genome and the role
of genetic “islands.”
12. Explain the theory of environmental metagenomics and why this is a powerful new approach for
describing novel genes and their functions.
I. History of microbial genomics
Microbial genomics is an offshoot of the human genome project (HGP), which started in 1990
and was the first “big science” project focused on biology. The goal of the HGP was to sequence
all 3 x 10^9 nucleotides of the human genome. Two groups raced very publicly to complete the first
genome sequences, including the first human genome sequence: the public US NHGRI,
led by Francis Collins, and the private Institute for Genomic Research (TIGR), led by J.
Craig Venter. They took two very different approaches: 1) the public group used “chromosome
walking”, a slow, steady process requiring a new sequencing primer be designed from the
previous sequence, thus increasing the sequencing information in an iterative fashion, 2) TIGR
used “shotgun cloning”, an approach where random pieces of DNA were sequenced and “stitched together” from overlapping regions into the complete genome. The latter approach was
shown to be much more cost and time efficient, and was used to produce the first genome
sequence—a bacterium called Haemophilus influenzae in 1995. Bacterial genomes are much
smaller than the human genome, and thus this approach demonstrated that it was technically
feasible to sequence an entire genome. The public group abandoned the chromosome walking
approach in favor of the shotgun cloning approach and, working together with TIGR, published
the complete human genome sequence in 2003, in less than half the time that was
expected. Since then, the number of published genomes has been increasing exponentially and
now numbers about 4000—nearly all prokaryotic.
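The “stitching together” step of the shotgun approach can be illustrated with a small sketch. This is a toy greedy assembler, not the method any sequencing group actually used: it assumes error-free reads from a single strand with no repeats, and the function names (overlap, greedy_assemble) and the min_len parameter are invented for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that matches a prefix of b.

    Returns 0 if the best match is shorter than min_len.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments that share the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps any more; leave the contigs separate
        # Join the pair, writing the shared region only once.
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three overlapping random fragments of a short made-up sequence.
reads = ["ACGTTAGC", "ATGGCGTA", "GCGTACGT"]
print(greedy_assemble(reads))
```

Real assemblers must also cope with sequencing errors, reads from both strands, and repeated regions, which is why this should be read only as a picture of the overlap idea.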
Recently, a new approach has also been used: metagenomics, or the sequencing of all of
the DNA from an entire environment (e.g. a water sample, a soil sample, the human gut, the
human mouth, etc.). While in most cases individual genomes cannot be reconstructed in this
way, the inventory of genes and potential metabolic pathways in those environments can be
II. Metagenomics: two examples
1. Proteorhodopsin. In a study of ocean water, Beja and colleagues made a metagenomic
clone library from the sample. They screened the library for 16S rRNA genes—genes which
would provide them with information on “who” the clone came from. One of these clones had a
16S rRNA gene from SAR86—an organism that had never been cultured, but was known to be
widespread and abundant from previous molecular studies. On the same clone that had the
SAR86 16S rRNA gene, they found a gene that was related to the rhodopsins found in halophilic
archaea. In the halophilic archaea, these molecules respond to light to carry out several
different functions: chloride (Cl-) transport, photosynthesis, and phototaxis. Beja and colleagues named the new gene
“proteorhodopsin” and cloned it into E. coli, where they showed it was capable of light-
dependent proton pumping. Thus, its role in the environment is to enable photoheterotrophy in
cells that contain it. Further studies of this gene showed that it is found in many uncultivated
groups of marine bacteria, and that photoheterotrophy is a very common metabolism in the ocean
—something we had no idea about prior to the metagenomic study.
2. Human gut microbiome study. Based on 16S rRNA genes, the two main groups of
bacteria in the human (and mouse) gut are Bacteroidetes and Firmicutes. In mice, a mutation in
the leptin gene makes them obese, even if they are eating the same amount as wild type mice that
are lean. In a study of the gut microflora (the microbes living in the mice guts), Turnbaugh and
colleagues found that in the obese mice, Firmicutes were significantly more abundant and
Bacteroidetes were significantly less abundant than in the wild type mice. In a metagenomic
study of these mice, the obese mouse gut microbiota were found to contain more genes involved
in breaking down otherwise indigestible polysaccharides and those that are involved in the
metabolism of the sugar products. The feces of the obese mice contained significantly less
energy than the feces of the lean (wild type) mice, indicating that the gut microbiota of the obese
mice were more efficient at extracting energy from food. Thus, obesity may in part be due to the
microbiota in the gut and their efficiency at extracting energy from food.
III. What are some things we can learn from studying genomes?
A. ORFs
The first step after obtaining sequence information on a genome is to identify the open reading
frames (ORFs), segments of DNA that have recognizable operon structure: a transcriptional
start site (-10 and -35 regions), a ribosomal binding site (Shine-Dalgarno sequence), a start codon
(ATG) for translation spaced appropriately from the Shine-Dalgarno sequence, and at least 300 base pairs
representing 100 codons before reaching an in-frame stop codon. Certainly, some genes have
fewer codons, but the majority of polypeptide-encoding sequences have greater than 100. These
ORFs generally correspond to genes—but to call them genes, you would have to demonstrate
that they are actually transcribed and translated, and that they play a role in the cell. ORF
content is proportional to genome size in the prokaryotes, but this relationship breaks down in
the eukaryotes. Some organisms have undergone genome reduction—the loss of genes (ORFs)
because they live in environments where those genes are not needed. For example, the genomes
of free living bacteria range from ~1.5-10 million nucleotides, with a median around 4.1 million
base pairs; however, obligate parasites have a range of 800,000 to ~7,000,000 nucleotides with a
median of 2.4 million and obligate symbionts have a range from 400,000 to about 2 million
nucleotides, with a median of about 900,000 nucleotides. Interestingly, the genome sizes of
some of these obligate symbionts are smaller than the genomes of some large viruses.
Computer algorithms have been created that search through the reams of genomic DNA
sequence to not only identify ORFs, but also to predict their function based on DNA and amino-
acid sequence similarity to known genes. After the computer has had a shot, researchers must go
through the information by hand to determine if the computer was correct. This process is called
“annotation” and is extremely labor intensive. Annotation is based on homology to previously
characterized gene sequences; thus, gene function can actually be quite different than predicted.
Thus, genomics can only provide us with hypotheses regarding the organism’s function—
hypotheses that must be tested at a later time.
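As a rough illustration of the ORF search described in this section, the sketch below scans a DNA sequence for an ATG start codon followed by at least 100 codons before an in-frame stop codon. It is a simplification under stated assumptions: it ignores the promoter and Shine-Dalgarno criteria, accepts only ATG as a start, and expects uppercase A/C/G/T input. The names find_orfs and find_orfs_both_strands are invented here and do not refer to any published tool.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite DNA strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 100) -> list[tuple[int, int]]:
    """Find candidate ORFs in the three reading frames of one strand.

    Returns (start, end) pairs, 0-based, where end is just past the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG since the last stop opens a candidate
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


def find_orfs_both_strands(seq: str, min_codons: int = 100):
    """Scan both strands; coordinates are relative to the strand scanned."""
    plus = [("+", s, e) for s, e in find_orfs(seq, min_codons)]
    minus = [("-", s, e) for s, e in find_orfs(reverse_complement(seq), min_codons)]
    return plus + minus


# Toy sequence: ATG, five more codons, then a stop (threshold lowered to 6).
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs_both_strands(toy, min_codons=6))
```

A hit from a scan like this is only a candidate; as noted above, calling it a gene still requires showing that it is transcribed, translated, and functional.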
B. Genome maps
Once a genome sequence is annotated, several maps representing the informational content can
be generated. One of the simplest maps represents the genome as a series of informational
circles. These circles are color-coded and show such things as size of the genome, origin of
replication, numbers and positions of tRNA genes and rRNA operons, numbers and positions of
ORFs on both strands of DNA (remember that genes can be encoded on both strands of the
double helix), transposons or insertional sequence elements (indicating transposition of genes),
and G+C content of the genome. All of this information is basic and allows one to compare the
general characteristics of multiple bacteria with one image.
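The G+C content circle on such a map is usually drawn from values computed in windows along the genome. A minimal sketch of that calculation follows; the window and step sizes are arbitrary choices for illustration, and the function names gc_content and gc_windows are invented here.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq: str, window: int = 1000, step: int = 500) -> list[tuple[int, float]]:
    """G+C fraction for each full window, as (window start, fraction) pairs."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy example: a G+C-rich stretch followed by an A+T-rich stretch.
print(gc_windows("GGGGAAAA", window=4, step=4))
```

Plotting the window values against position gives the kind of G+C track shown on a circular genome map.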
This information has been provided for hundreds of bacteria to date by two main US sources, the
non-profit The Institute for Genomic Research (TIGR), which is now defunct, and the
governmental Joint Genome Institute (JGI) run by the US Department of Energy. Some genome
sequences have also been completed by individual researchers through Genome Canada and by
the governmental French organization, Genoscope. Soon, thousands of complete bacterial
genome sequences will become available due to activities of several more companies
internationally, and also because bacterial genome sequences are being added as “filler” to large
eukaryotic genomic sequencing initiatives.
Synteny, or comparing the order of genes, can tell us about the evolution of the genomes of
different organisms and how often recombination has occurred. These kinds of evolutionary
history studies are one major goal of comparative genomics, where the genome sequences of
closely related strains and species are compared to each other. This provides insights into which
genes or ORFs of unknown function (also known as URFs) are critical (shared among all strains)
and which are ancillary (found among only a few strains). This has led to the concept of the core
genome, consisting of genes shared by all strains of the same species, and the pangenome,
which includes both the core genome and the ancillary genome, which is found in only one or a
few strains of a species.
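In set terms, the core genome is the intersection of the gene sets of all strains, and the pangenome is their union. The sketch below shows this with made-up strain and gene names; it assumes that equivalent genes in different strains have already been given the same name, which in practice requires sequence comparison. The function name core_and_pan is invented for illustration.

```python
def core_and_pan(strain_genes: dict[str, set[str]]) -> tuple[set[str], set[str], set[str]]:
    """Return (core, pan, ancillary) gene sets for a group of strains."""
    gene_sets = list(strain_genes.values())
    if not gene_sets:
        return set(), set(), set()
    core = set.intersection(*gene_sets)  # genes present in every strain
    pan = set.union(*gene_sets)          # genes present in at least one strain
    ancillary = pan - core               # genes missing from at least one strain
    return core, pan, ancillary


# Hypothetical strains and gene names, for illustration only.
strains = {
    "strain_1": {"geneA", "geneB", "geneC"},
    "strain_2": {"geneA", "geneB", "geneD"},
    "strain_3": {"geneA", "geneB", "geneC", "geneE"},
}
core, pan, ancillary = core_and_pan(strains)
print(sorted(core))
print(sorted(pan))
print(sorted(ancillary))
```

As more strains are added, the core set can only stay the same or shrink, while the pangenome can only stay the same or grow.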
Comparative genomics has also led to the recognition of chromosomal islands—regions of the
genome that appear to have come from other organisms. These regions are often flanked by
inverted repeats (a signal that they may have come from transposons or viruses) and have
different G+C content or codon usage relative to the rest of the genome. Many of these
chromosomal islands code for genes important for virulence and are called pathogenicity
islands. Transfer of a pathogenicity island can turn a normally