A Brief Review of the Human Genome Landscape
The Human Genome Project
- The Human Genome Project was officially completed in 2003. In this lecture we will review
what we have learned about our genome. There are some interesting surprises!
Some Facts About the Human Genome
- The human genome has around 3,200 million nucleotide bases (3,200 Mb).
o This compares with:
Escherichia coli: ~4.6 Mb
Saccharomyces cerevisiae (Yeast): ~12.5 Mb
Caenorhabditis elegans (Worm): ~95.5 Mb
Drosophila melanogaster (Fruit fly): ~122 Mb
Fugu rubripes (Puffer fish): ~365 Mb
Oryza sativa (Rice): ~389 Mb
Mus musculus (Mouse): ~3,000 Mb
Allium cepa (Onion): ~15,000 Mb
- The number of genes in the human genome is approximately 21,000, according to a
recent estimate. This is much lower than previous estimates of 80,000-140,000.
- This compares with:
o Escherichia coli: ~4,400
o Saccharomyces cerevisiae (Yeast): ~5,700
o Caenorhabditis elegans (Worm): ~19,800
o Drosophila melanogaster (Fruit fly): ~13,500
o Fugu rubripes (Puffer fish): ~similar to human
o Mus musculus (Mouse): ~similar to human
o Oryza sativa (Rice): ~40,000-60,000
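The two comparisons above can be combined into a gene density (genes per Mb). The sketch below is only an illustration: it uses the approximate figures from these notes, the dictionary and function names are made up for this example, and species without a single numeric value in both lists (puffer fish, mouse, rice, onion) are left out.

```python
# Approximate genome sizes (Mb) and gene counts copied from the notes.
# Only species with a single numeric value in both lists are included.
GENOME_SIZE_MB = {
    "Escherichia coli": 4.6,
    "Saccharomyces cerevisiae": 12.5,
    "Caenorhabditis elegans": 95.5,
    "Drosophila melanogaster": 122.0,
    "Homo sapiens": 3200.0,
}
GENE_COUNT = {
    "Escherichia coli": 4400,
    "Saccharomyces cerevisiae": 5700,
    "Caenorhabditis elegans": 19800,
    "Drosophila melanogaster": 13500,
    "Homo sapiens": 21000,
}


def genes_per_mb(species: str) -> float:
    """Gene density: number of genes divided by genome size in Mb."""
    return GENE_COUNT[species] / GENOME_SIZE_MB[species]


def ranked_by_density() -> list[tuple[str, float]]:
    """Species sorted from the most to the least gene-dense genome."""
    return sorted(
        ((species, genes_per_mb(species)) for species in GENOME_SIZE_MB),
        key=lambda item: item[1],
        reverse=True,
    )


if __name__ == "__main__":
    for species, density in ranked_by_density():
        print(f"{species:28s} {density:8.1f} genes/Mb")
```

With these figures, the human genome comes out at roughly 6.6 genes per Mb, against about 456 in yeast and about 957 in E. coli.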
- Less than 3% of the genome corresponds to protein-coding genes (even less if one
considers only protein-coding exons: 1.2%).
o In the genome, there are gene-rich regions, which typically have a relatively
high GC content, and gene-poor regions, which are richer in A and T bases.
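GC content is simply the fraction of G and C bases in a stretch of sequence. A minimal sketch of how it could be computed along a sequence follows; the function names and the default window size are assumptions chosen for illustration, not values from the lecture.

```python
from typing import Iterator


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) in seq."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G or T bases")
    return sum(b in "GC" for b in bases) / len(bases)


def windowed_gc(seq: str, window: int = 1000) -> Iterator[tuple[int, float]]:
    """Yield (start, GC fraction) for consecutive non-overlapping windows.

    A trailing partial window is ignored, and windows made only of
    ambiguous bases (for example runs of N) are skipped.
    """
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        try:
            yield start, gc_content(chunk)
        except ValueError:
            continue


if __name__ == "__main__":
    toy = "GGGCGCGC" + "ATATTAAT"  # toy sequence: GC-rich half, AT-rich half
    for start, gc in windowed_gc(toy, window=8):
        print(start, f"{gc:.2f}")
```

Plotting the windowed values along a chromosome is one way to see the GC-rich and AT-rich regions described above.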
- Repeated sequences that do not code for proteins make up at least 50% of the human
genome.
- In spite of having fewer genes than initially expected, the human
(vertebrate) proteome is more complex than that of other animals (worm or fly). The main
reasons are:
o More transcripts per gene due to alternative splicing
o More complex protein architecture (in terms of the number and arrangement of
protein domains, that is, regions within proteins with a well-defined set of properties or
function)
- Other interesting facts, based on the Human Genome Project and recent data from the
1000 Genomes Project:
- Mutation rate is higher in males than in females (2:1 ratio), at least in part due to the
higher number of cell divisions required for sperm formation than for eggs.
- When sequencing individual genomes, the 1000 Genomes Project reported that the
mean number of variant SNP sites per individual ranges between 2.8 and 3.4 million
(depending on population) and the mean number of variant indel sites per individual
ranges between 350,000 and 385,000.
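To put these counts in perspective, they can be converted into an average spacing between variant sites. The sketch below assumes the 3,200 Mb genome size quoted earlier in these notes and treats variants as evenly spread, which is a simplification; the function name is made up for this example.

```python
GENOME_SIZE_BP = 3_200_000_000  # 3,200 Mb, the figure used in these notes


def bases_per_variant(n_variants: int, genome_size_bp: int = GENOME_SIZE_BP) -> float:
    """Average number of bases between variant sites, assuming even spacing."""
    return genome_size_bp / n_variants


if __name__ == "__main__":
    for label, count in [
        ("SNP sites (low)", 2_800_000),
        ("SNP sites (high)", 3_400_000),
        ("indel sites (low)", 350_000),
        ("indel sites (high)", 385_000),
    ]:
        print(f"{label:20s} one every ~{bases_per_variant(count):,.0f} bp")
```

Under these assumptions, an individual carries a variant SNP site roughly once every 940 to 1,140 bases, and a variant indel site roughly once every 8,300 to 9,100 bases.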
- Putative functional variants
o An individual typically differs from the reference human genome sequence at:
10,000-11,000 non-synonymous sites
10,000-12,000 synonymous sites (do not change the amino acid)
190-210 in-frame indels
80-100 premature stop codons
40-50 splice-site-disrupting variants
220-250 deletions that shift the reading frame
Navigating the Genomic Landscape
- In the remaining part of the lecture, we will review the human genome landscape in
more detail, highlighting the most interesting findings and surprises.
- We will also compare the human genome with the genomes of other species, such as
yeast, worm, fly and mouse, when relevant.
Broad Genomic Landscape
- Recombination in the human genome:
o As with GC content, recombination rates vary considerably across the human
genome.
o Long chromosome arms have a lower recombination rate than short chromosome
arms.
o Recombination tends to be suppressed near the centromeres, and is higher in
the distal portions of the chromosomes.
- Repeat content of the human genome
o Repeat sequences account for 47% of the human genome (probably more,
because the oldest repeat sequences can no longer be recognized). In contrast,
protein-coding sequences make up less than 2%.
o The portion of the human genome accounted for by repeat sequences (47% or
more) is somewhat higher than in the mouse (37.5%), and much higher than in
invertebrates such as worm (7%) and fly (3%), or in some plants (mustard weed, 11%).
- Main classes of repeat sequences: 1. Transposon-derived repeats
o Long interspersed elements (LINEs): 21% of the genome
o Short interspersed elements (SINEs): 13% of the genome
o LTR transposons (LTRs, where LTR stands for long terminal repeat): 8% of the genome
o DNA transposons: 3% of the genome
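As a quick arithmetic check, the four transposon-derived classes can be added up and converted into absolute sizes. This is a rough sketch using the rounded percentages listed above and the 3,200 Mb genome size from earlier in the notes; the names are hypothetical.

```python
GENOME_SIZE_MB = 3200  # genome size quoted earlier in these notes

# Rounded share of the genome for each transposon-derived class (percent).
TRANSPOSON_CLASSES_PCT = {
    "LINEs": 21,
    "SINEs": 13,
    "LTR transposons": 8,
    "DNA transposons": 3,
}


def size_mb(percent: float, genome_size_mb: float = GENOME_SIZE_MB) -> float:
    """Convert a percentage of the genome into megabases."""
    return genome_size_mb * percent / 100


total_pct = sum(TRANSPOSON_CLASSES_PCT.values())

if __name__ == "__main__":
    for name, pct in TRANSPOSON_CLASSES_PCT.items():
        print(f"{name:16s} {pct:3d}%  ~{size_mb(pct):,.0f} Mb")
    print(f"{'Total':16s} {total_pct:3d}%  ~{size_mb(total_pct):,.0f} Mb")
```

With these rounded values, transposon-derived repeats add up to 45% of the genome, or about 1,440 Mb, and LINEs alone account for about 672 Mb.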
- Main classes of repeat sequences: 2. Simple sequence repeats (SSRs).
o Simple sequence repeats (SSRs) are perfect or slightly imperfect tandem repeats
of a particular core sequence. They comprise about 3% of the human genome.
o SSRs with short repeat units (1-13 bases) are also known as microsatellites.
They are often used in evolutionary and disease mapping studies.
o SSRs with long repeat units (14-500 bases) are known as minisatellites.
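The size thresholds above translate directly into a small classification rule. The sketch below is an illustration only: the function names are made up, and the helper that extracts the repeat unit handles perfect tandem repeats only, not the slightly imperfect ones mentioned above.

```python
MICROSATELLITE_UNIT = range(1, 14)   # repeat units of 1-13 bases
MINISATELLITE_UNIT = range(14, 501)  # repeat units of 14-500 bases


def classify_ssr(unit_length: int) -> str:
    """Name the SSR type for a given repeat unit length (in bases)."""
    if unit_length in MICROSATELLITE_UNIT:
        return "microsatellite"
    if unit_length in MINISATELLITE_UNIT:
        return "minisatellite"
    raise ValueError(f"unit length {unit_length} is outside the 1-500 base range")


def smallest_repeat_unit(seq: str) -> str:
    """Shortest unit whose exact tandem repetition reproduces seq."""
    if not seq:
        raise ValueError("empty sequence")
    n = len(seq)
    for k in range(1, n + 1):
        # k must divide the length, and repeating the prefix must rebuild seq.
        if n % k == 0 and seq[:k] * (n // k) == seq:
            return seq[:k]
    return seq  # not reached: k == n always matches


if __name__ == "__main__":
    unit = smallest_repeat_unit("CACACACACA")
    print(unit, classify_ssr(len(unit)))
```

For example, a (CA)n dinucleotide repeat has a unit length of 2 and is classified as a microsatellite.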
- Main classes of repeat sequences: 3. Segmental duplications.
o Segmental duplications involve the transfer of large blocks of the human
genome (1-200 kb) from one genomic region to other locations in the
genome. They comprise around 3% of the human genome. There are two main
types:
o Interchromosomal duplications (the duplicated region is found on a different
chromosome) and intrachromosomal duplications (the duplicated region is on the
same chromosome as the original sequence).
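The distinction between the two types of segmental duplication comes down to comparing the chromosome of the original block with that of its copy. A minimal sketch, assuming a hypothetical record type with made-up field names (not a real annotation format):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Duplication:
    """Hypothetical record describing one duplicated block."""
    source_chrom: str  # chromosome carrying the original sequence
    target_chrom: str  # chromosome carrying the copy
    length_bp: int     # size of the duplicated block in bases


def duplication_type(dup: Duplication) -> str:
    """Intrachromosomal if the copy is on the source chromosome, else interchromosomal."""
    if dup.source_chrom == dup.target_chrom:
        return "intrachromosomal"
    return "interchromosomal"


def in_segmental_size_range(dup: Duplication) -> bool:
    """True if the block falls in the 1-200 kb range given in the notes."""
    return 1_000 <= dup.length_bp <= 200_000


if __name__ == "__main__":
    examples = [
        Duplication("chr7", "chr7", 50_000),
        Duplication("chr7", "chr16", 120_000),
    ]
    for dup in examples:
        print(dup, duplication_type(dup), in_segmental_size_range(dup))
```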