Introduction to Databases

Introduction
    Scientific research is producing enormous amounts of data. Not only are scientists experiencing challenges in storing such quantities, but, more importantly, they must find fast ways to access and innovative ways to interpret the flood of data. Projects that produce and analyze large amounts of data occur in all areas of science. For example, the U.S. Geological Survey plans to take and store about 12 terabytes (10¹² bytes or characters) of high resolution aerial photographs of the entire United States; this quantity is more than all the printed material in the U.S. Library of Congress. For at least 15 years, the Large Hadron Collider at the European Laboratory for Particle Physics (CERN) in Switzerland will generate trajectory and energy data for billions of particle collisions each year. The World Ocean Database stores over four million ocean temperature profiles. The Species 2000 project will make it possible to search about 200 data collections for information on any of 1.7 million known species on earth.

    Even on a more modest level, a research hospital needs to access enormous amounts of information about people, scientific studies, and facilities. Doctors, administrators, and researchers maintain their own sets of data, but overlaps exist. For example, the hospital must store each patient's name, ID number, address, phone number, insurance information, charges, and doctors. A patient's doctor wants access to the first four items along with the patient's history and treatment information. If the patient is part of a study, a scientist needs the contact information and data related to the study.

    If a hospital stored all of these data in separate files, a change in a patient's address would result in updating several different files. Inevitably, there would be a case where the address would be correct in one place, but not another. Moreover, the billing office would need to consult two different files to inform the doctor about the status of a patient's account. However, we do not want to give that office the ability to see or change the patient's medical status. The solution to this complex interaction of data is to store the information in a database instead of in separate files. A database is an integrated collection of files. Much data is placed in the database only once with methods of interrelation and retrieval of the information in many different forms.

Definition

A database is an integrated collection of files.

Traditional File Processing
To appreciate the advantages of databases let us consider in greater detail some of the problems associated with traditional file processing. When organizations started buying computers in the 1950's and 60's, they appreciated the ability to store huge volumes of information in files and to reduce the cost, time, and errors inherent with manual processing. As computer use grew, however, problems with the file processing became apparent. Beginning to appreciate the power of the computer, personnel demanded more processing and different combinations of the data in files. The end user, however, usually did not know how to program and could not add the new features. Requests flooded into the already overworked data processing departments.

Realizing the capabilities, a hospital might request a match of information in files from three areas: billing, doctor information, and a particular drug study. Two difficulties can arise in such a situation. Someone must write a program to handle matching data from all the files. Moreover, since programmers created the files independently, some fields, which should hold the same information, might have different formats. The lengths of names could be different, or one field could be of type float in one file and of type double in another. We say that there is a data inconsistency between the files.

Definition

There is data inconsistency between the files if the same data is stored in different formats in two files or if data must be matched between files.

Moreover, these files duplicate some of the data, such as names and addresses. This redundant data compromises data integrity; we have the possibility of a value being changed in one file, but not in another. Often, it is difficult and expensive to rectify such inconsistencies.

Definition

Redundant data is data that is duplicated between files.

Definition

Data integrity is the assurance that data is correct and consistent--that the data correctly reflects the "real" world.

Related to this need for multiple file access is the problem that there is very little data security in traditional file processing. Using a doctor's patient file, personnel in the billing office need only to ascertain charges; they should not have the authority to check or change a patient's condition.

Definition

Data security refers to data being protected so that only authorized personnel can access them.

Another problem arises from the occasional requirement to change the structure of a file, perhaps to add or adjust a field. For example, a few years ago zip codes expanded from five characters to ten (including the dash). The hospital's data processing department had to change every program that accessed zip codes for patients or staff. We say that traditional file processing suffers from data dependence.

Definition

There is data dependence if the format of the data storage is dependent upon the application program.

To summarize, the difficulties with traditional file processing are as follows:

Necessity of programming to access the data
Data inconsistency
Redundant data's compromise of data integrity
Lack of data security
Data dependence

Quick Review Question

Quick Review Question 1
Indicate the problem for each of the following situations:

a. In one file of results from experiments, dates are in the format mm/dd/yy, while in another related file the dates are in the format ddmmyyyy.
Programming necessary	Data inconsistency		Data integrity compromised
Data security		Data dependence

b. A scientist needs a list of all toxic chemicals from a file of chemical data.
Programming necessary	Data inconsistency		Data integrity compromised
Data security	Data dependence

c. A laboratory that stores all its data on files modified the format of experiment ids to contain a date. Consequently, all programs accessing the files had to be modified to accommodate the new format.
Programming necessary	Data inconsistency		Data integrity compromised
Data security	Data dependence

d. Without permission, one scientist was able to look at a file containing the results of lab tests by another scientist.
Programming necessary	Data inconsistency		Data integrity compromised
Data security	Data dependence

e. One file contains the names, addresses, and phone numbers of various laboratories, while another file contains the names, phone numbers, and heads of the laboratories.
Programming necessary	Data inconsistency		Data integrity compromised
Data security	Data dependence

In the 1940s, punch cards and paper tape stored files with data encoded using holes in the medium. Magnetic tape was introduced in the early 1950s as an external storage medium for the UNIVAC, the first commercially produced computer. Some people would not believe that information was recorded on the tape, because they could not see holes as they did on punched cards. One plant worker, who did not know about computer tape, received a delivery of tape he thought was no good because it had no adhesive back!

Database Management System
As an answer to these difficulties, starting in the late 1960's database management system (DBMS) software was developed to facilitate formation, manipulation, and maintenance of databases. A DBMS contains two major functional parts, a data manipulation language and a data definition language. End users employ the data manipulation language (DML) or query language to access and update data in the database. This language is usually very high level, so that the user does not need to be a programmer to manipulate the data. Database managers, system engineers, and programmers use the data definition language (DDL) to design and maintain the database. With the DDL they define the structure of the database; they develop a data dictionary, which contains a list of the data names, types, lengths, and locations; and they establish the security of the system. A schema is a description of the overall database structure, while a subschema holds the definition of part of the data described in the schema. Through the use of subschema, the database administrator can control access to the data. If the billing office only needs to access patients' contact, charge, and insurance information, the subschema for billing contains only these items. Thus, the billing office staff has a simplified view of the database; and they cannot accidentally or maliciously access or change data outside their purview.

Definition

A database management system (DBMS) is database software that contains two major functional parts, a data manipulation language (DML) or query language to access and update data in the database and a data definition language (DDL) to design and maintain the database. A schema is a description of the overall database structure, while a subschema holds the definition of part of the data described in the schema.

There are several major types of DBMSs: hierarchical, network, relational, and object-oriented.

A hierarchical DBMS maintains a tree-like structure of data, much like the hierarchy diagram of an organizational chart. A network database has more involved connections, like the view of airline routes in the U. S. Both of these models, developed in the late 1960's, have complex data definition and manipulation languages. Moreover, the database definition establishes relationships among the data. Thus, queries into the database are restricted to those permitted by these relationships. The hierarchical and network databases are not very flexible in allowing new ways of navigating the system. However, many organizations still use such DBMSs.

Relational Database Overview
In this section, we give an overview of relational database concepts, which we cover in detail in the "Relational Database" module. In the 1970s, the relational model was developed, based on mathematical principles. Not only are its DDL and DML simpler, but relational databases do not restrict queries to prearranged interrelationships of the data. This powerful system had limited use at first because of the insufficient speed and memories of computers. Advances in hardware, however, have now made relational databases very attractive for microcomputers and mainframes alike.

The relational model is based on tables of data. Each table has a name and a finite number of columns. Each column has a name and a domain, or the possible values it can assume. Each row contains values with the value in each column being in that column's domain.

The designer of a relational database establishes relations or two-dimensional tables of data. Various attributes, which correspond to the columns in a table, are like elements or fields in a C++ structure. The domain of an attribute, as with the domain of a function in mathematics, is the set of all possible values of that attribute. Thus, for a Social Security number (SSN) attribute, the domain might be the set of all 9-digit character strings. A tuple is similar to a row in a table or a particular value of a structure variable. For example, suppose the relation PatientAddr has the structure

(SSN, LastName, FirstName, address1, address2, city, state, zip, phone)
with the attributes SSN, LastName, FirstName, address1, address2, city, state, zip, and phone. Figure 1 displays this relation containing the patient address information in a very small hospital. In this relation, one tuple is

("123456789", "Smith", "Mary", "123 Baker Rd.", null, "Atlanta", "GA", "30342", "(404)555-2345")
Because the value of the attribute SSN uniquely identifies a tuple, we call SSN the key or primary key for the relation. One or more colums make up a key that must be unique across all rows.

In the tuple, the address2 attribute has the special null to represent that no value exists. The value null is in every domain and indicates a missing or, as in this case, a nonexistant data value.

Definition

The relational database model is based on mathematical principles. A relation is a set of tuples, or a two-dimensional table of data. An attribute corresponds to a column in a relation. The domain of an attribute is the set of all possible values of that attribute. A tuple corresponds to a row in a relation. An attribute or set of attributes that uniquely identifies a particular entity is a key. The special value null is in every domain and indicates a missing or nonexistant data value.

Figure 1 The relation PatientAddr

*SSN*	*LastName*	*FirstName*	*address1*	*address2*	*city*	*state*	*zip*	*phone*
123456789	Smith	Mary	123 Baker Rd.	null	Atlanta	GA	30342	(404)555-2345
946573218	Hodges	James	847 El Dorado	null	Roswell	GA	30327	(510)555-9876
565324884	Hall	William	Apt. 12	4 Elm St.	Spartanburg	SC	29303	(864)555-3838
657982253	May	Ann	654 Main St.	null	Richland	WA	94550	(912)555-8329
429348203	Hodges	Jane	847 El Dorado	null	Roswell	GA	30327	(510)555-9876
267392093	Smith	Hondre	Apt. 3A	342 Church St.	Atlanta	GA	30320	(404)555-4239
834295377	Martinez	Louis	672 High St.	null	Roswell	GA	30327	(510)555-7722

Quick Review Question

Quick Review Question 2 Consider the following data:

+	------------	+	-------------	+	--------------	+
\|	subject	\|	RhFactor	\|	BloodType	\|
+	------------	+	------------	+	------------	+
\|	SubjectA	\|	+	\|	M	\|
\|	SubjectB	\|	+	\|	N	\|
\|	SubjectD	\|	-	\|	M	\|
\|	SubjectE	\|	+	\|	M	\|
\|	SubjectC	\|	-	\|	MN	\|
+	------------	+	-------------	+	--------------	+

a. In a relational database, pick the name of such a rectangular array of data.
attribute	DML	DDL
domain	relation	tuple

b. Select the number of attributes in this relation.
0	1	2
3	4	5
	Impossible to determine

c. Select the minimum number of values in the domain of the attribute RhFactor.
0	1	2
3	4	5
	Impossible to determine

d. Select the number of tuples in this relation.
0	1	2
3	4	5
	Impossible to determine

Suppose also for each drug study there is a relation like StudyA05 for those in Study A05 indicating their SSN and whether or not they are taking a placebo instead of the tested drug. The structure of this relation might be

(SSN, placebo)

and the relation might be as in Figure 2.

Figure 2 The relation StudyA05

*SSN*	*placebo*
946573218	'y'
123456789	'n'
657982253	'y'
834295377	'n'

Using mathematical operations, we can form new relations from the existing ones. The select command in the DML takes a subset of the set of tuples (rows) in a relation (table). For example, to obtain a table RoswellPatientAddr, of the address information for all patients from Roswell, we write a statement similar to the following:

insert into RoswellPatientAddr
select *
from PatientAddr
where city = 'Roswell'

The resulting relation is like that of Figure 3.

Figure 3 Relation RoswellPatientAddr, created by selecting tuples from PatientAddr where city = 'Roswell'

*SSN*	*LastName*	*FirstName*	*address1*	*address2*	*city*	*state*	*zip*	*phone*
946573218	Hodges	James	847 El Dorado	null	Roswell	GA	30327	(510)555-9876
429348203	Hodges	Jane	847 El Dorado	null	Roswell	GA	30327	(510)555-9876
834295377	Martinez	Louis	672 High St.	null	Roswell	GA	30327	(510)555-7722

The project command takes a subset of the attribute values (columns) in a table, which in mathematics is a projection of the relation. For instance, we might be interested in obtaining a report of the names and phone numbers of all patients. To do so we create a NamePhone relation, as follows:

insert into NamePhone
select FirstName, LastName, phone
from PatientAddr

Figure 4 displays the NamePhone relation. If we only want the names and phone numbers of patients from Roswell, we would perform the projection on RoswellPatientAddr instead of PatientAddr.

Figure 4 Relation NamePhone, created with projecting FirstName, LastName, and phone from PatientAddr

*LastName*	*FirstName*	*phone*
Smith	Mary	(404)555-2345
Hodges	James	(510)555-9876
Hall	William	(864)555-3838
May	Ann	(912)555-8389
Hodges	Jane	(510)555-9876
Smith	Hondre	(404)555-4239
Martinez	Louis	(510)555-7722

We can combine related data from different tables using the join operation. For example, we can form a new relation, AddrStudyA05, having all the information from PatientAddr and StudyA05, matched by Social Security Number, as follows:

insert into AddrStudyA05
select *
from PatientAddr, StudyA05
where PatientAddr.SSN = StudyA05.SSN

Notice we specify to which the relation SSN belongs (PatientAddr or StudyA05) by using a syntax similar to that of structures or objects, which is the relation name followed by a dot and the attribute name (SSN). Figure 5 shows the result of such an operation.

Figure 5 Relation AddrStudyA05 formed by joining PatientAddr and StudyA05 where PatientAddr.SSN = StudyA05.SSN

*SSN*	*LastName*	*FirstName*	*address1*	*address2*	*city*	*state*	*zip*	*phone*	*placebo*
123456789	Smith	Mary	123 Baker Rd.	null	Atlanta	GA	30342	(404)555-2345	'n'
946573218	Hodges	James	847 El Dorado	null	Roswell	GA	94583	(510)555-9876	'y'
657982253	May	Ann	654 Main St.	null	Richland	WA	94550	(912)555-8329	'y'
834295377	Martinez	Louis	672 High St.	null	Roswell	GA	30327	(510)555-7722	'n'

Quick Review Question

Quick Review Question 3 Consider the relation CurrentBioStudy, as follows:

+	------------	+	-------------	+	--------------	+
\|	subject	\|	RhFactor	\|	BloodType	\|
+	------------	+	------------	+	------------	+
\|	SubjectA	\|	+	\|	M	\|
\|	SubjectB	\|	+	\|	N	\|
\|	SubjectD	\|	-	\|	M	\|
\|	SubjectE	\|	+	\|	M	\|
\|	SubjectC	\|	-	\|	MN	\|
+	------------	+	-------------	+	--------------	+

The relation Telephone follows:

+	------------	+	--------------	+
\|	subject	\|	Phone	\|
+	------------	+	--------------	+
\|	SubjectA	\|	803-555-1213	\|
\|	SubjectB	\|	803-555-1213	\|
\|	SubjectD	\|	803-555-4444	\|
\|	SubjectE	\|	803-555-5678	\|
\|	SubjectC	\|	803-555-9021	\|
+	------------	+	--------------	+

Give the type of relational database command to obtain each of the following:

a. A table of all data for each subject from both relations.
join	project	project, then select
select	select, then project

b. A table of all data for Rh-positive subjects.
join	project	project, then select
select	select, then project

c. A table of only the subject and RhFactor data.
join	project	project, then select
select	select, then project

d. A table of subject and BloodType information for Rh-positive subjects.
join	project	project, then select
select	select, then project

With its powerful ability to restructure the data in any number of ways and its relative ease of use, relational databases have become very popular. Some examples of such databases are DB2 for IBM mainframe computers, Oracle and Sybase for UNIX systems, Access for Windows, and FileMaker for Macintosh and Windows.

Exercises

1. Create a row in the PatientAddr table for yourself and a row for a friend of yours.

2. Suppose for this semester there is a relation, Course, storing the courses each patient takes with an attribute structure of (SSN, CourseNumber). Give part of this relation, containing information for you and a friend. Note that your SSN will appear several times, but the course numbers will be different.

3. For the relation PatientAddr, sketch the resulting table that is formed as a result of each of the following commands:

a. insert into A
    select *
    from Study A05
    where placebo = 'n'

b. insert into B
    select SSN, LastName, FirstName
    from PatientAddr

c. insert into C
   select *
   from A, B
   where A.SSN = B.SSN

4. How is data integrity preserved in a relational database?

5. How is data consistency preserved in a relational database?

6. How is data independence preserved in a relational database?

In the following exercises, write relational database commands to create the relations.

7. A relation with the FirstName, LastName and state columns from PatientAddr.

8. A relation with the records from StudyA05 of those taking a placebo.

9. A relation with the SSN and LastName attributes from PatientAddr and the placebo attribute from StudyA05.

Projects

1. How would you represent a table its data values in a program?

2. Design an algorithm to select a subset of the set of rows in a table of data, which is in a file. The algorithm should involve traditional file processing to emulate the select operation of relational databases. We employed the select operation to create RoswellPatientAddr of Figure 3.

3. Design an algorithm to project by selecting a subset of the set of columns in a table of data, which is in a file. The algorithm should involve traditional file processing to emulate the project operation of relational databases. We employed the project operation to create NamePhone of Figure 4.

4. Design an algorithm to join the data in two tables of data, which are in separate files. The algorithm should involve traditional file processing to emulate the join operation of relational databases. We employed the join operation to create AddrStudyA05 of Figure 5.

5. a. The file star.dat contains data on stars. Click here to obtain a description of the file format. Develop a program to read the data from star.dat into an array. Repeatedly present the user with a menu to perform the select or project operations on the data or to quit. If the user chooses to select or project, display a menu of subfields. For the select operation, have the user pick a subfield and then type a value for that subfield. Display all records that have the indicated subfield value. If the user wishes to project the data, display a user-selected subfield for all records.

b. Revise the program from Part a to allow the user to perform the select or project operation on more than one subfield. For the select operation, repeatedly allow the user to indicate a subfield and value for that subfield. Display the records that have all the indicated subfield values. Also, for the project operation, display one or more chosen subfields for all records.