Genomic Data

References

The following references, which are the sources for most of the current module's figures, provide a tutorial on protein structures and an excellent hyperlinked glossary of genetic terms.

· "Protein Structure Basics" by Bernhard Rupp, UCRL-MI-125269, Lawrence Livermore National Laboratory, 2000, http://ruppweb.dyndns.org/

· "Talking Glossary of Genetic Terms," National Human Genome Research Institute, http://www.nhgri.nih.gov/DIR/VIP/Glossary

Introduction

In the module "Computational Science and Web-Accessed Databases," we introduced a "Genomic Example" involving web-accessed genomic databases. The current module provides the biological background for understanding such databases and a discussion of their characteristics. The next module, "Genomic Sequence Comparison," presents the dynamic programming technique for measuring the similarity of two DNA sequences, and the module "Searching Genomic Databases" considers algorithms for discovering regions of sequence alignment. Thus, with our knowledge of HTML, databases, and CGI programming, we will be able to develop form-based web pages and CGI programs for interfacing between web pages and genomic databases, for processing DNA sequences, and for returning results to the user. Such applications fall in the domain of bioinformatics.

A newly developing area of computational science, called bioinformatics, deals with the organization of biological data, such as in databases, and the analysis of such data. Recently, enormous strides have been made in genetics, due in part to the power of bioinformatics, as the following examples illustrate:

Geneticists in Sweden and with Merck Research Laboratories in West Point, Pennsylvania, studied families with a high occurrence of macular degeneration, which causes loss of vision in old age. "Once the combined team got to within 800,000 bases of it [a section of the chromosome], the researchers searched computer databases for potential genes in that target region. Both labs then looked for mutations in those genes in family members with the disease but not in people with normal vision." One of the groups "found that the mutations consistently appeared" in a particular gene. The researchers are now investigating how mutations in this gene initiate the pathology. ["New Gene Found for Inherited macular Degeneration", Human Genetics News Focus column by Elizabeth Pennisi, v. 281, 3 July 1998, Science, p. 31]

A microbial ribosomal DNA (rDNA) sequence database is particularly valuable because biologists have not been able to culture many bacteria and viruses outside the body. By extracting the rDNA and comparing its genetic makeup with data from the database, scientists have been able to determine organisms that cause Crohn's disease, which is an inflammatory bowel disorder, and several other diseases ["The Search for Unrecognized Pathogens" by David A. Relman, v. 284, 21 May 1999, Science, pp. 1308-1310].

The Educational page of the National Center for Biotechnology Information (NCBI) web site addresses the questions "What is bioinformatics?" and "Why use bioinformatics?".

Proteins
Proteins are the basic building blocks of life, performing many critical functions. Some proteins are the fundamental structural components of tissue, while others (enzymes) are catalysts for chemical reactions. A simple protein is a linear polymer, or chain, of amino acids. Table 1 lists the twenty amino acids common to proteins along with their one-letter and three-letter codes. Each amino acid contains an amino group (-NH₃+) at one end and a carboxyl group (-COO-) at the other, connected by a central carbon (the α-carbon). A variable side-chain (R-group) and a hydrogen are attached to the α-carbon (see Figure 1). The R-group is responsible for the chemical nature of each amino acid. Amino acids are linked into chains by peptide bonds, which form through the interaction of the amino group of one amino acid with the carboxyl group of another (see Figure 2). This interaction is a condensation, which releases a molecule of water. Because each amino acid loses atoms to the water, the part that remains in the chain is referred to as a residue. Because one end of a protein has a free amino group (N-terminal) and the other has a free carboxyl group (C-terminal), we can assign a direction to the chain and list the amino acids from the “beginning” (N-terminal) of the chain to the “end” (C-terminal).

Table 1. The twenty commonly occurring amino acids along with their one-letter and three-letter codes. (Note: B is used when one cannot distinguish between D and N because of amino acid analytical processing. Similarly, Z is used when it is ambiguous whether the amino acid is E or Q. X represents an unknown or nonstandard amino acid.)

One-Letter Code	Three-Letter Code	Name

A	Ala	Alanine
C	Cys	Cysteine
D	Asp	Aspartic Acid
E	Glu	Glutamic Acid
F	Phe	Phenylalanine
G	Gly	Glycine
H	His	Histidine
I	Ile	Isoleucine
K	Lys	Lysine
L	Leu	Leucine
M	Met	Methionine
N	Asn	Asparagine
P	Pro	Proline
Q	Gln	Glutamine
R	Arg	Arginine
S	Ser	Serine
T	Thr	Threonine
V	Val	Valine
W	Trp	Tryptophan
Y	Tyr	Tyrosine
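
Table 1 translates naturally into a lookup table. The following Python sketch is only an illustration of that idea; the names THREE_LETTER and to_three_letter are our own choices and do not come from any library. The sketch assumes that the input uses only the twenty standard one-letter codes, so the ambiguity symbols B, Z, and X would raise a KeyError.

```python
# One-letter to three-letter amino acid codes, copied from Table 1.
THREE_LETTER = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
}

def to_three_letter(protein):
    """Rewrite a one-letter protein sequence (N-terminal first) in three-letter codes."""
    return "-".join(THREE_LETTER[residue] for residue in protein.upper())

print(to_three_letter("MTW"))  # Met-Thr-Trp
```

Because the chain has a direction, the function keeps the residues in the order given, from the N-terminal to the C-terminal.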

Figure 1 Structure of an amino acid ["Protein Structure Basics" by Bernhard Rupp, UCRL-MI-125269, Lawrence Livermore National Laboratory, 2000, http://ruppweb.dyndns.org/Xray/tutorial/protein_structure.htm]

Figure 2 Chain of two amino acids ["Protein Structure Basics" by Bernhard Rupp, UCRL-MI-125269, Lawrence Livermore National Laboratory, 2000, http://ruppweb.dyndns.org/Xray/tutorial/protein_structure.htm]

The linear sequence of residues is called the primary structure of a protein (see Figure 3). However, interactions between hydrogens and oxygens of the amino and carboxyl groups of different amino acids may result in regular arrangements, called the secondary structure of the protein. For example, a helix is one type of secondary structure (see Figure 4). Chemical interactions that take place between R-groups of nonadjacent amino acids maintain this structure. The primary structure governs these higher-order structures of proteins, which are essential in determining the three-dimensional conformation of the protein. A single polymer of amino acids might be more properly called a polypeptide. Some functional proteins are made up of only one polypeptide, whereas many proteins are made up of more than one polypeptide. For instance, hemoglobin consists of two pairs of polypeptides. Proteins that consist of two or more interacting polypeptides are said to exhibit quaternary structure (see Figure 5). These higher-order structures are very important because they determine the overall shape of the protein, the possible bindings of the protein, and, thus, its function.

Figure 3 Primary protein structure ["Talking Glossary of Genetic Terms," National Human Genome Research Institute, http://www.genome.gov/Pages/Hyperion//DIR/VIP/Glossary/Illustration/amino_acid.shtml]

Figure 4 Helix ["Protein Structure Basics" by Bernhard Rupp, UCRL-MI-125269, Lawrence Livermore National Laboratory, 2000, http://ruppweb.dyndns.org/Xray/tutorial/protein_structure.htm]

Figure 5 Protein structures ["Talking Glossary of Genetic Terms," National Human Genome Research Institute, http://www.genome.gov/Pages/Hyperion//DIR/VIP/Glossary/Illustration/protein.shtml]

Quick Review Question

Quick Review Question 1

a. Give the name of the area of computational science that deals with the organization and the analysis of biological data.

b. Give the number of amino acids common to proteins.

Match each of the following phrases with the best term from this list: α-carbon, amino acids, amino group, C-terminal, carboxyl group, enzymes, higher-order structures, N-terminal, peptide bonds, polypeptide, primary structure, proteins, R-group, residue.

c. Basic building blocks of life

d. Proteins that are catalysts for chemical reactions.

e. A simple protein is a linear chain of these

f. Link chains of amino acids

g. Formed through the interaction of an amino group of one amino acid with the carboxyl group of another

h. Amino acid component of a protein

i. Free amino group that is the beginning of the chain of amino acids

j. Free carboxyl group that is the end of the chain of amino acids

k. Linear sequence of residues of a protein

l. A single polymer of amino acids

m. Determine the overall shape of the protein and its function

Nucleic Acids
In the cell, the nucleic acid DNA (deoxyribonucleic acid) contains the encoded information for the manufacture of all the proteins a cell needs. However, DNA does not oversee protein synthesis directly but acts through an intermediary nucleic acid, RNA (ribonucleic acid). The RNA sequences subsequently specify the amino acid sequences of proteins. Both DNA and RNA are polymers of molecules called nucleotides. A nucleotide is a compound molecule made up of a sugar (either deoxyribose or ribose), a phosphate, and a nitrogen base: adenine (A), guanine (G), cytosine (C), and either thymine (T) in DNA or uracil (U) in RNA (see Figure 6). DNA is a double strand of nucleotides, whereas RNA is a single strand. Because each nucleotide contains exactly one base, we can describe a length in either unit; for example, we can say that a particular DNA molecule has 300 bases or 300 nucleotides. As with proteins, because the backbone of a strand always has specific chemical structures at opposite ends, we can canonically give direction to the sequence of nucleotides (or bases) in a strand.

Figure 6 RNA and DNA ["Talking Glossary of Genetic Terms," National Human Genome Research Institute, http://www.genome.gov/Pages/Hyperion//DIR/VIP/Glossary/Illustration/rna2.shtml]

Bases in one strand may bond with bases in another. Because of their structure, A and T always bond together, and C and G always bond together. Each pair is said to be made up of complementary bases and is referred to as a base pair (bp). The number of such base pairs is used to describe the length of a DNA molecule. Because of pairing consistency, by knowing the sequence of bases in one strand, we can deduce the sequence of bases in the other strand through reverse complementation. For example, suppose one sequence is s = ATGAC. Because of the required pairing, A - T and C - G, we know the base pairs must appear as follows:

s:	A	T	G	A	C
	|	|	|	|	|
	T	A	C	T	G

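However, to list the bases in the canonical order, we first reverse s (s’) and then complement the reversed sequence to obtain the sequence GTCAT in the canonical order, as follows:

s:	A	T	G	A	C
s’:	C	A	G	T	A
complement of s’:	G	T	C	A	T

Similarly, for s = GTACCT, reversing and then complementing gives AGGTAC:

s:	G	T	A	C	C	T
s’:	T	C	C	A	T	G
complement of s’:	A	G	G	T	A	C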
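
Reverse complementation is easy to express in code. The following Python sketch is one possible way to write it; the names COMPLEMENT and reverse_complement are our own, and the function assumes an uppercase sequence containing only the bases A, C, G, and T.

```python
# Complementary DNA bases: A pairs with T, and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(s):
    """Reverse the sequence s, then replace each base by its complement."""
    return "".join(COMPLEMENT[base] for base in reversed(s))

print(reverse_complement("ATGAC"))   # GTCAT
print(reverse_complement("GTACCT"))  # AGGTAC
```

Applying the function twice returns the original sequence, which is a convenient check that the two strands really do determine each other.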

In contrast to DNA, RNA (ribonucleic acid) is a single strand of nucleotides made up of ribose sugars and the bases A, C, G, and U, with uracil (U) taking the place of the nitrogen base thymine (T) (see Table 2). Several types of RNA with different functions exist in the cell.

Table 2. Bases in DNA and RNA

Base	Abbreviation	Complement	In DNA	In RNA
adenine	A	T in DNA, U in RNA	yes	yes
guanine	G	C	yes	yes
cytosine	C	G	yes	yes
thymine	T	A	yes	no
uracil	U	A	no	yes

Quick Review Question

Quick Review Question 3 Match all the terms that apply for each of the following parts. The terms are A, C, DNA, G, nucleotide, protein, RNA, T, and U.
a. Contains the encoded information that is stored to direct the manufacture of all the proteins a cell needs
b. An intermediary nucleic acid in protein synthesis
c. Compound molecule made of a sugar, a phosphate, and a nitrogen base
d. Type of molecule in DNA and RNA sequences
e. Bases in DNA
f. Bases in RNA
g. Always bonds with base A in DNA
h. Always bonds with base A in RNA
i. Always bonds with base C in DNA or RNA
j. Always bonds with base T in DNA
k. Always bonds with base U in RNA
l. Always bonds with base G in DNA or RNA
m. Single strand of nucleotides
n. Double strand of nucleotides

From Genes to Proteins
Each cell contains chromosomes, which are very long DNA molecules. A gene is a contiguous section of a chromosome that encodes information to build a protein or an RNA molecule (see Figure 7). In humans, a gene is composed of about 10,000 bp. A chromosome contains genes and contiguous sections that are not part of any gene. Some scientists believe that genes compose only about 10% of a human chromosome. A complete set of chromosomes in a cell is called the genome. For example, a human genome has 46 chromosomes in 23 pairs.

Figure 7 Gene as contiguous section of chromosome [Figure slightly modified from graphics at "Talking Glossary of Genetic Terms," National Human Genome Research Institute, http://www.genome.gov/Pages/Hyperion//DIR/VIP/Glossary/Illustration/gene.shtml]

For simplicity, we assume that a particular protein in an organism corresponds to exactly one gene. In a gene, a sequence of three nucleotides (triplet) specifies an amino acid. For example, the triplet ACG or the triplet ACA encodes the information for the amino acid Threonine (Thr). The genetic code represents such a correspondence between these triplets and the amino acids they specify. With four base choices, a pair of bases could encode information for at most (4)(4) = 16 amino acids, which is too few for the twenty amino acids of Table 1. With three bases, (4)(4)(4) = 64 possible triplets exist. Several triplets, such as ACG and ACA, encode the same amino acid, and three triplets do not encode any amino acid.
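
The counting argument above can be checked by enumerating the possibilities. The following short Python sketch uses itertools.product from the standard library; the variable names are our own.

```python
from itertools import product

BASES = "ACGT"

# Every ordered pair and every ordered triplet of the four bases.
pairs = ["".join(p) for p in product(BASES, repeat=2)]
triplets = ["".join(t) for t in product(BASES, repeat=3)]

print(len(pairs))     # 16
print(len(triplets))  # 64
print("ACG" in triplets and "ACA" in triplets)  # True
```

Sixteen pairs cannot cover twenty amino acids, whereas sixty-four triplets leave room for several triplets to specify the same amino acid.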

Protein synthesis is the process of using genetic code to direct the building of proteins. Synthesis begins in the nucleus, where enzymes catalyze the production of a molecule of RNA, termed messenger RNA or mRNA. As Figure 8 illustrates, each DNA triplet specifies a complementary sequence of three nucleotides, which we call a codon, in the RNA. The synthesis of RNA is called transcription. During transcription, base pairing ensures formation of a strand of RNA that is complementary to the gene sequence with U replacing T.

Figure 8 Codons in RNA ["Talking Glossary of Genetic Terms," National Human Genome Research Institute, http://www.genome.gov/Pages/Hyperion//DIR/VIP/Glossary/Illustration/codon.shtml]
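
As a rough illustration of transcription, the following Python sketch builds the RNA that is complementary to a strand of DNA, with U replacing T. The function name transcribe and the example strand are our own inventions. For simplicity, the sketch pairs bases position by position and ignores strand direction and the later processing of the transcript, so it is a simplification rather than a model of the cell's machinery.

```python
# RNA base that pairs with each DNA base during transcription (U replaces T).
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

def transcribe(dna_strand):
    """Return the RNA complementary to dna_strand, position by position."""
    return "".join(DNA_TO_RNA[base] for base in dna_strand)

# Each DNA triplet yields one complementary codon: TAC -> AUG, TGC -> ACG, TGT -> ACA.
print(transcribe("TACTGCTGT"))  # AUGACGACA
```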

As Figure 9 shows, transcription represents only the first step in protein synthesis. The initial transcript must be processed through a complex series of chemical changes that includes removing portions of the RNA and splicing the loose ends back together. However, this modified mRNA molecule is still in the nucleus, which a double thickness of membrane separates from the cytoplasm, or area surrounding the cell nucleus, whereas protein synthesis must conclude in the cytoplasm on small structures called ribosomes. Therefore, the mRNA molecule must be transported from the nucleus into the cytoplasm before any protein can be synthesized.

Figure 9 Protein synthesis ["Talking Glossary of Genetic Terms," National Human Genome Research Institute, http://www.genome.gov/Pages/Hyperion//DIR/VIP/Glossary/Illustration/mrna.shtml]

After movement to the cytoplasm, the mRNA attaches to a ribosome, which essentially reads the code (sequence of codons) specified by the gene. The ribosome brings together the mRNA strand with various molecules of another type of very small RNA, called transfer RNA (tRNA). There are many different tRNA molecules, each of which attaches very specifically to only one type of amino acid. The ribosomal enzymes must ensure that each codon of the mRNA combines with a tRNA that carries the correct amino acid in the nascent protein sequence. This process is possible because each tRNA molecule contains a triplet code, called an anticodon. The anticodon then base pairs with the complementary codon of the mRNA, ensuring addition of the correct amino acid. The ribosome moves down the mRNA molecule one codon at a time, allowing the correct tRNA with the specified amino acid to base pair. Enzymes of the ribosome also help by catalyzing the formation of peptide bonds between the amino acid and the growing peptide chain (see Figure 10). Eventually, the ribosome reaches a codon that does not code for any amino acid, which signals the ribosome to stop. Voila! With the help of some RNA and ribosomes, we have a protein with an amino acid sequence that the DNA sequence of the gene specified.

Figure 10 Growing peptide chain ["Talking Glossary of Genetic Terms," National Human Genome Research Institute, http://www.genome.gov/Pages/Hyperion//DIR/VIP/Glossary/Illustration/peptide.shtml]
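
The way a ribosome reads an mRNA one codon at a time can be imitated in a few lines of Python. The sketch below is only an analogy: the names CODON_TABLE and translate are our own, the table is the standard genetic code written compactly with "*" marking the three stop codons, and the function assumes that reading starts at the first base of the given string.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna):
    """Read mrna one codon at a time and stop at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # a codon that specifies no amino acid ends the chain
        protein.append(amino_acid)
    return "".join(protein)

print(translate("AUGACGACAUAA"))  # MTT
print(sorted(c for c, a in CODON_TABLE.items() if a == "*"))  # ['UAA', 'UAG', 'UGA']
```

In the example, the codons ACG and ACA both yield T, the one-letter code for Threonine, and the final codon UAA stops the chain.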

Quick Review Question

Quick Review Question 5 Select the best match for each of the following from this list of terms: anticodon, chromosome, codon, cytoplasm, DNA, gene, genome, mRNA, nucleus, peptide bond, protein synthesis, ribosome, transcription, triplet, tRNA.
a. Very long DNA molecule
b. Contiguous section of a chromosome that encodes information to build a protein or an RNA molecule
c. A complete set of chromosomes in a cell
d. Sequence of three nucleotides
e. The process of using genetic code to direct the building of proteins
f. The place in the cell where protein synthesis begins
g. The place in the cell where enzymes catalyze the production of a molecule of RNA
h. A molecule of RNA produced in the nucleus
i. Sequence of three nucleotides in RNA complementary to a DNA triplet
j. The synthesis of RNA
k. Area surrounding the cell nucleus
l. Small structure on which protein synthesis concludes
m. Location in the cell of ribosomes
n. A type of RNA that attaches very specifically to only one type of amino acid
o. Triplet code that a tRNA molecule contains
p. Bond between an amino acid and a growing peptide chain

Studying the Genome
Sequencing is the process of finding the base-pair sequence of a section of DNA. Although a human chromosome contains about 100 million (10⁸) base pairs, scientists have only been able to sequence directly pieces of DNA several thousand bp long. One way to sequence a much larger segment is to split it into smaller pieces, determine the sequences of these fragments, and, using computational techniques, reconstruct the sequence of the larger segment. These computational techniques must handle the errors that are almost always present in the laboratory data.
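
To make the reconstruction step concrete, the following Python sketch merges overlapping fragments greedily. It is a toy under strong assumptions that real data do not satisfy: the fragments contain no errors, they all come from the same strand, and the largest overlap is always the correct one to merge. The function names overlap and greedy_assemble and the example fragments are our own.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(fragments):
    """Repeatedly merge the two fragments that share the largest overlap."""
    pieces = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(pieces) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(pieces):
            for j, b in enumerate(pieces):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more; leave the pieces separate
        merged = pieces[best_i] + pieces[best_j][best_k:]
        pieces = [p for n, p in enumerate(pieces) if n not in (best_i, best_j)]
        pieces.append(merged)
    return pieces

# Three overlapping fragments of the made-up segment ATGACCGTTAGC.
print(greedy_assemble(["ATGACCG", "ACCGTTA", "GTTAGC"]))  # ['ATGACCGTTAGC']
```

The function returns a list because fragments that share no overlap cannot be joined; with laboratory errors, exact matching of overlaps would have to give way to approximate matching.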

Human Genome Project
The Human Genome Project (HGP), which started in 1988, is an international effort to determine the whole human DNA sequence. A working draft of the human genome sequence by a private company, Celera Genomics, and government laboratories was published in February, 2001. Knowing the sequence will help scientists in their work to determine the location of genes and the functions of proteins. Databases of DNA, RNA, and protein sequences and computational techniques to search and analyze the data will help in these efforts. Already, scientists have completely sequenced the DNA for 26 organisms, such as baker’s yeast, several microbes that cause disease, and a common worm (Caenorhabditis elegans), which is the first animal whose entire genome is known.

The National Center for Biotechnology Information (NCBI), USA, maintains GenBank, a database of more than 920,000 plant and animal DNA sequences from over 16,000 species. The database has been doubling about every 18 months. Figure 11 is an example of a GenBank record. The accession number, here M82814, is a key, and field identifiers describe the contents of the fields, which are strings. One can search the database by keyword or sequence. Other databases include the DNA database EMBL, the protein sequence database PIR, the PDB database for three-dimensional structures of proteins, and the EcoCyc Encyclopedia of Escherichia coli Genes and Metabolism.

Figure 11. A GenBank record
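
The idea of a record keyed by accession number, with string fields that can be searched by keyword or by sequence, can be sketched with a Python dictionary. Everything in the sketch is hypothetical: the accession numbers EXAMPLE1 and EXAMPLE2, the field values, and the function names are invented for illustration, and only the field identifiers DEFINITION and ORIGIN are borrowed from the GenBank flat-file layout. The sequence search looks for an exact substring, which is far simpler than the similarity searches that later modules discuss.

```python
# Toy records: accession number (the key) -> fields, all stored as strings.
# The accession numbers and field values are invented placeholders.
RECORDS = {
    "EXAMPLE1": {"DEFINITION": "hypothetical yeast gene fragment", "ORIGIN": "atgacgacataa"},
    "EXAMPLE2": {"DEFINITION": "hypothetical worm gene fragment", "ORIGIN": "ttgcggaatc"},
}

def search_by_keyword(records, keyword):
    """Accession numbers of records whose descriptive fields mention keyword."""
    keyword = keyword.lower()
    return sorted(
        accession
        for accession, fields in records.items()
        if any(keyword in value.lower() for name, value in fields.items() if name != "ORIGIN")
    )

def search_by_sequence(records, query):
    """Accession numbers of records whose sequence contains query exactly."""
    query = query.lower()
    return sorted(a for a, fields in records.items() if query in fields["ORIGIN"])

print(search_by_keyword(RECORDS, "yeast"))    # ['EXAMPLE1']
print(search_by_sequence(RECORDS, "GCGGAA"))  # ['EXAMPLE2']
```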

Genomic Database Characteristics
The data and access characteristics of genomic databases present challenges to computational scientists. First, genomic data is very complex. For example, the yeast genome has a 10-million bp sequence. A genomic database stores subsequences and a wealth of related data, such as locations, references, comments, and features.

This complexity results in another characteristic of genomic databases, which is the necessity for some form of redundancy. For example, there can be several alternative forms of a gene (alleles), say for determining different eye colors, at the same location (locus) on a DNA sequence. Databases handle redundancy in different ways. GenBank makes little attempt to reduce redundancy, while the nr (non-redundant) nucleotide database by NCBI has merged entries that have identical sequences.

A high degree of variability is another aspect of the complexity of biological data. Often there are exceptions, or new discoveries cause revisions or additions to possible values. For example, besides the 20 common amino acids, we use three other symbols for ambiguous or unknown situations. Thus, data types must be flexible with few constraints.

Because of this variability, schemas change rapidly. Database designers must manage extensions to the schemas as scientific knowledge expands. Many genomic databases are re-released annually or semiannually.

To add to the complexity involving data types and schemas, two biologists or two databases may represent the same data in different ways. However, scientists need to be able to compare their findings.

One of the most significant access characteristics of genomic databases is that scientists read, or query, such databases extensively but rarely write to one. For example, each month in 2000, over 10,000 people queried the MITOMAP database, but fewer than five made additions to it.

As another characteristic, these querying users have a limited knowledge of the database schema design. Hence, the web access to the database must be flexible and intuitive to handle a variety of queries. Often, these queries are complex and involve multiple data sets.

Moreover, biologists are interested in the data in context. For example, a biologist might want to know similar sequences to a query sequence, where they are alike, where they are different, how the sequences compare, functions of similar subsequences, etc. By making such comparisons, he or she hopes to deduce functions of the query sequence.

Biologists must query the most recent version of a database but also must be able to access earlier versions to examine data and repeat procedures. Thus, a genomic database must provide access to historic data.

The following is a summary of the main characteristics of genomic databases:

Very complex
Necessity for a form of redundancy
Highly variable data
Rapidly changing schemas
Varied data representations
Extensive use of read only access and limited use of write access
Limited knowledge of schema design by biologists
Importance of complex queries
Importance of data context
Necessity of access to historic data

Quick Review Question

Quick Review Question 6 Select the data and access characteristics of genomic databases
	All versions of data needed	Complex	Context of data important
	Databases may represent the same data in different ways	Data only needed in isolation	Detailed knowledge of schema by users
	Fixed schemas	Intricate queries important	Limited knowledge of schema by users
	Mostly read and write access	Mostly read only access	Mostly write only access
	No data redundancy	One form of data representation	Only latest version of data needed
	Only simple queries	Schemas change often	Simple
	Some data has exceptions	Some data redundancy needed	User-friendly web access needed

Exercises
1. Consider TTGCGGAATC, which is part of a hypothetical DNA sequence.

a. Find its complementary strand.
b. Find the corresponding sequence of bases in mRNA.
c. Give 6 possible sequences of codons for the protein it produces.

2. Repeat Exercise 1 for the sequence CTGGATAGGCCAGT.

3. In GenBank, search for each of the following:

a. The information on Accession Number M82606. Hint: You can use the online search at the GenBank site.
b. The disease-causing agent Vibrio cholerae, which causes cholera. Search the vital sequence section of the database.

4. For the GenBank entity record in Figure 11, determine possible attributes and their types.

Projects
1. Develop a program to read DNA sequences from a file and to write the sequences along with their complementary strands to another file.

2. Develop a program to read DNA sequences from a file and to write to another file each sequence with the six possible sequences of codons for the protein it produces.