BIO314 Lab 10 – Introductory Bioinformatics (Part 1)
1 – Databases & other tools
2 – Genetic screening using BLAST queries and alignments
Section 1 – Some important resources – databases, tools, etc.
Nucleotide & Genomic Databases
NCBI Nucleotide (GenBank, RefSeq, etc.): http://www.ncbi.nlm.nih.gov/nucleotide
Ensembl Genome Browser (EMBL): http://www.ensembl.org/index.html
Online Mendelian Inheritance in Man (genes & genetic disorders): http://www.omim.org/
SNPedia (human genetic variations): http://www.snpedia.com/index.php/SNPedia
FlyBase (Drosophila vinegar fly model genome): http://flybase.org/
XenBase (Xenopus frog model genome): http://www.xenbase.org/
WormBase (Caenorhabditis nematode model genome): http://www.wormbase.org/
Protein & Proteomics Databases:
UniProt (sequences, alignment, supporting data, etc.): http://www.uniprot.org/
World-wide Protein Databank (structures): http://www.wwpdb.org/
ProteomeScout (processed proteomic datasets): https://proteomescout.wustl.edu/
GelMap (proteins ID'd on 2D gels): https://gelmap.de/
Expression & Microarray Databases:
Gene Expression Omnibus (NCBI Geo): http://www.ncbi.nlm.nih.gov/geo/
Genevestigator Expression Search Engine: https://genevestigator.com/gv/
Pathways and Metabolic Databases:
Kyoto Encyclopedia of Genes and Genomes (KEGG): http://www.kegg.jp/kegg/pathway.html
Comparative Toxicogenomics Database (interactions between toxins and genes): http://ctdbase.org/
Reactome (reactions, pathways, bioprocesses): http://www.reactome.org/
Enzyme Portal (small molecule chem, pathways, etc.): http://www.ebi.ac.uk/enzymeportal/
BOLD Systems (DNA Barcoding): http://www.boldsystems.org/
USDA PLANTS Database (taxonomy & species ranges in NA, native vs. non-native, benign vs. invasive,
identification keys, etc.): http://plants.usda.gov/java/
NCBI Entrez: https://www.ncbi.nlm.nih.gov/gquery/
When beginning work on a particular research question, in addition to surveying the literature it is
often helpful to examine what other relevant information and pre-existing experimental data are
available from various databases. Generally speaking, it is easiest to begin your search in a meta
database, i.e. a database that itself calls up information from many other, more subject-specific databases.
As an example of how one might do this, we will run a simple query using the meta database Entrez
from the U.S. National Center for Biotechnology Information, or NCBI.
1) Navigate to the NCBI Entrez main page using the link given above. Below the search bar, you will
see the names of all the databases that Entrez searches and a short description of what each contains.
2) Search the term Marfan Syndrome. In the space below the search bar, numbers of results will
appear next to each of the databases that Entrez draws from.
3) Click on the OMIM link. Note how only the second search result here is actually about Marfan
Syndrome (the first concerns another disorder with some similarities in phenotypic outcome). Using
these databases is similar to using internet search engines: you will often sweep up a fair amount of
irrelevant information along with what you are interested in.
4) Click on the Marfan Syndrome OMIM result, and have a quick glance at the report on this heritable
connective tissue disorder. Note its well-developed summary of the primary literature concerning this
disorder. Navigate back to the Entrez search results page and spend some time examining other pages.
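The same kind of search can also be scripted rather than clicked through. The sketch below is one possible approach, not part of the lab procedure: it assumes the NCBI E-utilities ESearch endpoint and its `db`, `term`, `retmax`, and `retmode` parameters, and it assumes `omim` is accepted as a database name. The helper names `build_esearch_url` and `fetch_hit_count` are made up for illustration. Only the URL-building step runs here; the fetch function needs network access and is defined but not called.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db, retmax=5):
    """Return an ESearch URL that searches one Entrez database for a term."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


def fetch_hit_count(term, db):
    """Query NCBI (requires network access) and return the number of hits.

    Assumes the JSON reply stores the hit count under esearchresult -> count.
    """
    with urlopen(build_esearch_url(term, db, retmax=0)) as response:
        result = json.load(response)
    return int(result["esearchresult"]["count"])


# Build (but do not send) the query used in steps 2 and 3 above.
url = build_esearch_url("Marfan Syndrome", "omim")
print(url)
```

Calling `fetch_hit_count("Marfan Syndrome", "omim")` on a networked machine should return a number comparable to the one shown next to OMIM on the Entrez results page, although the exact count changes as the database is updated.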
Section 2 – BLAST queries and alignments & basic genetic testing
Comparing nucleotide or protein sequences from the same or different organisms is a very powerful
tool in molecular biology. We can infer the function of newly sequenced genes, predict new members
of gene families, and explore evolutionary relationships by finding similarities in sequences. Now that
whole genomes are being sequenced, sequence similarity searching can even be used to predict the
location and function of protein-coding and transcription-regulation regions in genomic DNA.
The Basic Local Alignment Search Tool, or BLAST, is perhaps the tool used most frequently for basic
calculations of sequence similarity. There are a number of variations of BLAST for use with different
query sequences against different databases.
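To make the idea of a "local alignment" concrete, here is a minimal sketch of the classic Smith-Waterman scoring scheme. This is not how BLAST itself works internally (BLAST uses faster heuristics to approximate such alignments), and the scoring values (match = 2, mismatch = -1, gap = -2) are arbitrary choices made for illustration.

```python
def local_alignment_score(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score of two sequences."""
    # previous[j] holds the best score of an alignment ending at the
    # previous character of seq_a and the j-th character of seq_b.
    previous = [0] * (len(seq_b) + 1)
    best = 0
    for a in seq_a:
        current = [0]
        for j, b in enumerate(seq_b, start=1):
            diagonal = previous[j - 1] + (match if a == b else mismatch)
            # A score never drops below 0: a local alignment may start anywhere.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


# Two sequences that share only the short stretch "ACGT".
print(local_alignment_score("TTTACGTAAA", "GGGACGTCCC"))
```

Because scores are reset to zero instead of going negative, the shared stretch is found even though the flanking regions do not match at all, which is the sense in which the alignment is "local".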
Most people use BLAST by entering a nucleotide or protein sequence into the textbox on one of the
BLAST web interfaces hosted by the NCBI and submitting it as a query against all (or a subset of) public
sequence databases. The search is performed on the NCBI databases and servers, and after a
processing delay, the results will show in the person's browser in the chosen display format. Many
biotechnology companies, genome scientists, and bioinformatics personnel also use a “stand-alone”
version of BLAST to query their own, local databases. They may also customize BLAST in some way to
make it better suit their needs.
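As a rough illustration of the stand-alone workflow, the sketch below assembles command lines for the NCBI BLAST+ programs `makeblastdb` and `blastn`. It assumes BLAST+ is installed and that the flags shown (`-in`, `-dbtype`, `-out`, `-query`, `-db`, `-evalue`, `-outfmt`) behave as in current BLAST+ releases. The file names and the helper function names are hypothetical. Nothing is executed unless `run_if_installed` is called explicitly.

```python
import shutil
import subprocess


def build_makeblastdb_command(fasta_path, db_name, dbtype="nucl"):
    """Command that formats a FASTA file as a local BLAST database."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]


def build_blastn_command(query_path, db_name, out_path, evalue=1e-5):
    """Command that searches a local nucleotide database with a DNA query."""
    return ["blastn", "-query", query_path, "-db", db_name,
            "-evalue", str(evalue), "-outfmt", "6", "-out", out_path]


def run_if_installed(command):
    """Run the command only if its executable is on the PATH."""
    if shutil.which(command[0]) is None:
        return False
    subprocess.run(command, check=True)
    return True


# Hypothetical file names, shown only to illustrate the shape of the commands.
print(" ".join(build_makeblastdb_command("my_sequences.fasta", "my_db")))
print(" ".join(build_blastn_command("query.fasta", "my_db", "hits.tsv")))
```

Output format 6 is assumed here because its tab-separated table is easy to load into a spreadsheet or script. Other formats may suit other needs.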
There are a number of BLAST variations for different kinds of sequence comparisons, e.g., a DNA query to a DNA database, a protein query to a protein database, and a DNA query, translated in all six
reading frames, to a protein sequence database. There are even more advanced versions of BLAST for
special queries, such as PSI-BLAST (for iterative protein sequence similarity searches using a position-
specific score matrix) and RPS-BLAST (for searching for protein domains in the Conserved Domains
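The phrase "translated in all six reading frames" can be illustrated with a short sketch: three frames are read from the given strand and three from its reverse complement. The code assumes the standard genetic code and an input containing only A, C, G, and T (any other codon is shown as X). The function names are illustrative and are not part of BLAST.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return the translations of frames +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```

Each of the six translations can then be compared against protein sequences, which is why a DNA query can be searched against a protein database without knowing in advance which strand or frame encodes the protein.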
2.2 BRCA1 Mutations, Breast Cancer, and Genetic Testing
There are a number of mechanisms in the body to repair cellular DNA damage, each involving sets of
proteins that repair a different type of damage, such as single-strand insertion errors, single- and
double-strand breaks, and nucleotide dimerization. If such damage is not repaired, one potential
outcome is the development of the cell into pre-cancerous tissue as it divides, and ultimately into
cancerous tissue as more mutations accumulate. Consequently, proteins involved in these processes
are categorized as tumor suppressors, because their functioning helps prevent the formation of
tumors. BRCA1, “breast cancer 1, early onset”, is one such tumor suppressor.
For women, the lifetime risk of developing breast cancer is 12%, meaning 120 women in every 1000 will develop
the disease at some point in their lives. Inherited mutations to the BRCA genes are found in 5% of all
breast cancer cases. When a woman inherits a cancer-causing BRCA mutation, her lifetime risk of developing
breast cancer rises to as much as 85%, depending on the exact mutation. Furthermore, mutations to
BRCA genes also increase the risk of ovarian cancer from the population average of 2% to between 16
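The arithmetic behind the breast cancer figures quoted above can be checked with a few lines of code. This is only a worked illustration using the percentages given in the text (12% lifetime risk, 5% of cases with an inherited BRCA mutation). The cohort size of 1000 and the function name are chosen for convenience.

```python
def cases_per_cohort(risk, cohort_size=1000):
    """Expected number of cases in a cohort, given a risk between 0 and 1."""
    return round(risk * cohort_size)


lifetime_risk = 0.12           # 12% lifetime risk of breast cancer
brca_fraction_of_cases = 0.05  # 5% of cases carry an inherited BRCA mutation

cases = cases_per_cohort(lifetime_risk)
brca_cases = round(cases * brca_fraction_of_cases)

print(f"{cases} of 1000 women are expected to develop breast cancer")
print(f"about {brca_cases} of those cases involve an inherited BRCA mutation")
```

In other words, of the 120 expected cases per 1000 women, roughly 6 would be associated with an inherited BRCA mutation if the 5% figure applies evenly.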
The BRCA1 gene has 22 exons and spans ~110 kb of DNA at the autosomal locus 17q21. It encodes a
nuclear phosphoprotein that combines with other tumor suppressors, DNA damage sensors, and
signal transducers to form a large multi-subunit protein complex known as the BRCA1-associated
genome surveillance complex (BASC). In particular, the BRCA1 protein associates with RNA polymerase II,
and through its C-terminal domain it also interacts with histone deacetylase complexes. Therefore,
this protein plays a role in transcription as well as in DNA repair of double-stranded breaks and in
recombination. Alternative splicing plays an important role in the subcellular localization and
physiological function of this gene. Many alternatively spliced variants have been described for BRCA1,