Data, Information and Knowledge
Data
o Data represents unorganized and unprocessed facts.
o Usually data is static in nature.
o It can represent a set of discrete facts about events.
o Data is a prerequisite to information.
o An organization sometimes has to decide on the nature and volume of data that is required for creating the necessary information.
Information
o Information can be considered as an aggregation of data (processed data) which makes decision making easier.
o Information usually has some meaning and purpose.
Knowledge
o By knowledge we mean human understanding of a subject matter that has been acquired through proper study and experience.
o Knowledge is usually based on learning, thinking, and proper understanding of the problem area.
o Knowledge is not information and information is not data.
o Knowledge is derived from information in the same way information is derived from data.
o We can view it as an understanding of information based on its perceived importance or relevance to a problem area.
o It can be considered as the integration of human perceptive processes that helps people to draw meaningful conclusions.
initial user, then you can create more users and change access to your database). There are the following kinds of Sedna database objects:
o Standalone document
o Collection of documents
o Value based index
o Full-text index
o Module
o Trigger
o Metadata document
o Database administrator (DBA user). Formally, a DBA user is a user that has the DBA role.
o Ordinary user (below simply called user).
DBA user:
o has all possible privileges on any object in the database;
o can remove any object in the database;
o can remove any user of the database;
o can grant/revoke any privilege to/from any user of the database;
o can grant the DBA role to a user, thus making that user also a DBA user (not recommended, as a database with multiple DBA users is hard to administer). Any DBA user can also revoke the DBA role from any DBA user.
An ordinary user:
o can act according to the privileges that he has;
o can grant and revoke any privileges on the database object that he owns to any user;
o can remove database objects that he owns and drop users that he has created.
Every user has a name and a password.
Types of Database Users
Users are differentiated by the way they expect to interact with the system:
o Application programmers - interact with the system through DML calls.
o Sophisticated users - form requests in a database query language.
o Specialized users - write specialized database applications that do not fit into the traditional data processing framework.
o Naive users - invoke one of the permanent application programs that have been written previously.
Description of data at some level. Each level has its own schema.
Routines are hardcoded to deal with physical representation. Changes to data structures are difficult to make. Application code becomes complex since it must deal with details. Rapid implementation of new features very difficult.
In the relational model, the conceptual schema presents data as a set of tables.
The DBMS automatically maps data access between the conceptual and physical schemas.
Physical schema can be changed without changing application: DBMS must change mapping from conceptual to physical. Referred to as physical data independence.
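Physical data independence can be illustrated with a small sketch. It assumes Python's built-in sqlite3 module and a hypothetical student table (the table, column, and index names are not from these notes). Adding an index changes the physical schema, yet the application's query text and its result stay the same.

```python
import sqlite3

# Hypothetical conceptual schema: one table of students.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, gpa REAL)")
conn.executemany(
    "INSERT INTO student (id, name, gpa) VALUES (?, ?, ?)",
    [(1, "Ann", 3.9), (2, "Bob", 3.1), (3, "Cy", 3.6)],
)

# The application is written once, against the conceptual schema.
QUERY = "SELECT name FROM student WHERE gpa > 3.5 ORDER BY name"

before = conn.execute(QUERY).fetchall()

# Physical change only: add an index. The query text is not edited;
# the DBMS decides by itself whether to use the new access path.
conn.execute("CREATE INDEX idx_student_gpa ON student (gpa)")

after = conn.execute(QUERY).fetchall()
print(before == after)
```

The application never mentions the index, so dropping or replacing it later would again require no change to the query.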
Students should not see faculty salaries. Faculty should not see billing or payment data.
Information that can be derived from stored data might be viewed as if it were stored.
Applications are written in terms of an external schema. The external view is computed when accessed. It is not stored. Different external schemas can be provided to different categories of users. Translation from external level to
Mapping from external to conceptual must be changed. Referred to as conceptual data independence.
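An external schema of this kind can be sketched as a view. The example assumes Python's sqlite3 module; the faculty table, the view name faculty_for_students, and the fixed reference year 2024 are all hypothetical choices made for illustration. The view hides the salary column and exposes a derived column that is computed on access rather than stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Conceptual schema (hypothetical): faculty records including salary.
conn.execute(
    "CREATE TABLE faculty (id INTEGER PRIMARY KEY, name TEXT, "
    "birth_year INTEGER, salary REAL)"
)
conn.executemany(
    "INSERT INTO faculty VALUES (?, ?, ?, ?)",
    [(1, "Dr. Lee", 1970, 90000.0), (2, "Dr. Ray", 1980, 85000.0)],
)

# External schema for students: no salary column, plus a derived
# column (approx_age) that looks stored but is computed on access.
# The year 2024 is an arbitrary constant chosen for the example.
conn.execute(
    "CREATE VIEW faculty_for_students AS "
    "SELECT id, name, 2024 - birth_year AS approx_age FROM faculty"
)

cursor = conn.execute("SELECT * FROM faculty_for_students ORDER BY id")
columns = [d[0] for d in cursor.description]
rows = cursor.fetchall()

# Changing the stored data changes what the view returns next time,
# because the view itself holds no data.
conn.execute("UPDATE faculty SET birth_year = 1975 WHERE id = 1")
rows_after = conn.execute(
    "SELECT * FROM faculty_for_students ORDER BY id"
).fetchall()
print(columns, rows, rows_after)
```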
Over the years there have been several different ways of constructing databases, amongst which have been the following:
o The Hierarchical Data Model
o The Network Data Model
o The Relational Data Model
Although I will give a brief summary of the first two, the bulk of this document is concerned with The Relational Data Model as it is the most prevalent in today's world.
A hierarchical database consists of the following:
1. It contains nodes connected by branches.
2. The top node is called the root.
3. If multiple nodes appear at the top level, the nodes are called root segments.
4. The parent of node nx is a node directly above nx and connected to nx by a branch.
5. Each node (with the exception of the root) has exactly one parent.
6. The child of node nx is the node directly below nx and connected to nx by a branch.
7. One parent may have many children.
By introducing data redundancy, complex network structures can also be represented as hierarchical databases. This redundancy is eliminated in physical implementation by including a 'logical child'. The logical child contains no data but uses a set of pointers to direct the database management system to the physical child in which the data is actually stored. Associated with a logical child are a physical parent and a logical parent. The logical parent provides an alternative (and possibly more efficient) path to retrieve logical child information.
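The basic parent and child rules of the hierarchical model can be sketched as a tree in Python. The Node class and the Department, Course, Student, and Instructor names are hypothetical and serve only to illustrate that every node except the root has exactly one parent, while a parent may have many children. Logical children are not modelled here.

```python
class Node:
    """A node in a hierarchy: at most one parent, any number of children."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent      # None only for the root
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path_from_root(self):
        """Walk the single chain of parents up to the root."""
        node, path = self, []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return list(reversed(path))


root = Node("Department")
course = Node("Course", parent=root)
instructor = Node("Instructor", parent=root)
student = Node("Student", parent=course)

print(student.path_from_root())
```

Because each node stores a single parent reference, there is exactly one access path from the root to any node, which is the property the network model relaxes.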
Like the Hierarchical Data Model, the Network Data Model also consists of nodes and branches, but a child may have multiple parents within the network structure instead of being restricted to just one.
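The difference from the hierarchical model can be shown with a minimal sketch in which a record keeps a list of parents rather than a single one. The Record class, the link helper, and the Supplier, Part, and Shipment names are hypothetical.

```python
class Record:
    """A node in a network structure: any number of parents and children."""

    def __init__(self, name):
        self.name = name
        self.parents = []
        self.children = []


def link(parent, child):
    """Connect two records with a branch, recorded on both ends."""
    parent.children.append(child)
    child.parents.append(parent)


supplier = Record("Supplier")
part = Record("Part")
shipment = Record("Shipment")

# A shipment belongs to both a supplier and a part: two parents.
link(supplier, shipment)
link(part, shipment)

parent_names = sorted(p.name for p in shipment.parents)
print(parent_names)
```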
I have worked with both hierarchical and network databases, and they both suffered from the following deficiencies (when compared with relational databases):
Access to the database was not via SQL query strings, but by a specific set of APIs, typically for FIND, CREATE, READ, UPDATE and DELETE.
Each API would only access a single table (dataset), so it was not possible to implement a JOIN which would return data from several tables.
It was not possible to provide a variable WHERE clause. The only selection mechanisms available were:
   o read all entries (a full table scan).
   o read a single entry using a specific primary key.
   o read all entries on a child table which were associated with a selected entry on a parent table.
Any further filtering had to be done within the application code.
It was not possible to provide an ORDER BY clause. Data was presented in the order in which it existed in the database. This mechanism could be tuned by specifying sort criteria to be used when each record was inserted, but this had several disadvantages:
   o Only a single sort sequence could be defined for each path (link to a parent), so all records retrieved on that path would be provided in that sequence.
   o It could make inserts rather slow when attempting to insert into the middle of a large collection, or where a table had multiple paths each with its own set of sort criteria.
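For contrast, the three facilities described as missing (JOIN, a variable WHERE clause, and ORDER BY) can be combined in a single SQL query. The sketch below assumes Python's sqlite3 module and hypothetical customer and orders tables; the threshold min_total is supplied at run time, which is what makes the WHERE clause variable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        total       REAL NOT NULL
    );
    INSERT INTO customer VALUES (1, 'Acme'), (2, 'Bolt');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 40.0), (12, 2, 900.0);
    """
)

min_total = 100.0  # selection criterion chosen at run time

rows = conn.execute(
    """
    SELECT c.name, o.order_id, o.total
    FROM customer AS c
    JOIN orders AS o ON o.customer_id = c.customer_id  -- data from two tables
    WHERE o.total >= ?                                 -- variable filter
    ORDER BY o.total DESC                              -- order chosen per query
    """,
    (min_total,),
).fetchall()
print(rows)
```

The filtering and sorting happen inside the database engine, so the application code does not need to scan and sort the records itself.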
The Relation
The Relation is the basic element in a relational data model.
Figure 3 - Relations in the Relational Data Model
A relation is subject to the following rules:
1. Relation (file, table) is a two-dimensional table.
2. Attribute (i.e. field or data item) is a column in the table.
3. Each column in the table has a unique name within that table.
4. Each column is homogeneous. Thus the entries in any column are all of the same type (e.g. age, name, employee-number, etc).
5. Each column has a domain, the set of possible values that can appear in that column.
6. A Tuple (i.e. record) is a row in the table.
7. The order of the rows and columns is not important.
8. Values of a row all relate to some thing or portion of a thing.
9. Repeating groups (collections of logically related attributes that occur multiple times within one record occurrence) are not allowed.
10. Duplicate rows are not allowed (candidate keys are designed to prevent this).
11. Cells must be single-valued (but can be variable length). Single valued means the following:
   o Cannot contain multiple values such as 'A1,B2,C3'.
   o Cannot contain combined values such as 'ABC-XYZ' where 'ABC' means one thing and 'XYZ' another.
A relation may be expressed using the notation R(A,B,C, ...) where:
   o R = the name of the relation.
   o (A,B,C, ...) = the attributes within the relation.
   o A = the attribute(s) which form the primary key.
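Two of these rules, that row order is not important and that duplicate rows are not allowed, follow naturally if a relation is modelled as a set of tuples. The sketch below uses a hypothetical Employee(employee_number, name, age) relation built from Python's NamedTuple. It does not enforce column types or domains; those would need additional checks.

```python
from typing import NamedTuple


class Employee(NamedTuple):
    """One tuple (row) of the hypothetical relation Employee."""
    employee_number: int
    name: str
    age: int


# A relation modelled as a set of tuples.
relation = {
    Employee(1, "Ann", 34),
    Employee(2, "Bob", 41),
}

# Adding a duplicate row has no effect: a set holds each tuple once.
relation.add(Employee(1, "Ann", 34))

# The same rows listed in a different order form the same relation.
same_relation = {
    Employee(2, "Bob", 41),
    Employee(1, "Ann", 34),
}

print(len(relation), relation == same_relation)
```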
Keys
1. A simple key contains a single attribute.
2. A composite key is a key that contains more than one attribute.
3. A candidate key is an attribute (or set of attributes) that uniquely identifies a row. A candidate key must possess the following properties:
   o Unique identification - For every row the value of the key must uniquely identify that row.
   o Non redundancy - No attribute in the key can be discarded without destroying the property of unique identification.
4. A primary key is the candidate key which is selected as the principal unique identifier. Every relation must contain a primary key. The primary key is usually the key selected to identify a row when the database is physically implemented. For example, a part number is selected instead of a part description.
5. A superkey is any set of attributes that uniquely identifies a row. A superkey differs from a candidate key in that it does not require the non redundancy property.
6. A foreign key is an attribute (or set of attributes) that appears (usually) as a non key attribute in one relation and as a primary key attribute in another relation. I say usually because it is possible for a foreign key to also be the whole or part of a primary key:
   o A many-to-many relationship can only be implemented by introducing an intersection or link table which then becomes the child in two one-to-many relationships. The intersection table therefore has a foreign key for each of its parents, and its primary key is a composite of both foreign keys.
   o A one-to-one relationship requires that the child table has no more than one occurrence for each parent, which can only be enforced by letting the foreign key also serve as the primary key.
7. A semantic or natural key is a key for which the possible values have an obvious meaning to the user or the data. For example, a semantic primary key for a COUNTRY entity might contain the value 'USA' for the occurrence describing the United States of America. The value 'USA' has meaning to the user.
8. A technical or surrogate or artificial key is a key for which the possible values have no obvious meaning to the user or the data. These are used instead of semantic keys for any of the following reasons:
   o When the value in a semantic key is likely to be changed by the user, or can have duplicates. For example, on a PERSON table it is unwise to use PERSON_NAME as the key as it is possible to have more than one person with the same name, or the name may change such as through marriage.
   o When none of the existing attributes can be used to guarantee uniqueness. In this case adding an attribute whose value is generated by the system, e.g. from a sequence of numbers, is the only way to provide a unique value. Typical examples would be ORDER_ID and INVOICE_ID. The value '12345' has no meaning to the user as it conveys nothing about the entity to which it relates.
9. A key functionally determines the other attributes in the row, thus it is always a determinant.
10. Note that the term 'key' in most DBMS engines is implemented as an index which does not allow duplicate entries.
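The difference between a superkey and a candidate key can be checked mechanically on sample data. The sketch below is illustrative only: the function names and the sample part rows are hypothetical, and a check against sample rows can only refute a proposed key. Whether a key holds for every possible row is a design decision about the schema.

```python
from itertools import combinations


def is_superkey(rows, attrs):
    """True if no two rows share the same values for attrs."""
    seen = set()
    for row in rows:
        value = tuple(row[a] for a in attrs)
        if value in seen:
            return False
        seen.add(value)
    return True


def is_candidate_key(rows, attrs):
    """A superkey from which no attribute can be discarded (non redundancy)."""
    if not is_superkey(rows, attrs):
        return False
    for size in range(1, len(attrs)):
        for subset in combinations(attrs, size):
            if is_superkey(rows, subset):
                return False  # a smaller set already identifies rows
    return True


# Hypothetical sample rows for a PART relation.
parts = [
    {"part_no": "P1", "description": "bolt", "colour": "red"},
    {"part_no": "P2", "description": "bolt", "colour": "blue"},
    {"part_no": "P3", "description": "nut", "colour": "red"},
]

print(is_candidate_key(parts, ("part_no",)))           # simple key
print(is_superkey(parts, ("part_no", "colour")))       # unique but redundant
print(is_candidate_key(parts, ("part_no", "colour")))  # fails non redundancy
print(is_superkey(parts, ("description",)))            # 'bolt' appears twice
```

On these three rows the pair (description, colour) also passes the candidate key test, which shows why sample data alone cannot establish a key: a fourth part could easily repeat that combination.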
Relationships
One table (relation) may be linked with another in what is known as a relationship. Relationships may be built into the database structure to facilitate the operation of relational joins at runtime.
1. A relationship is between two tables in what is known as a one-to-many or parent-child or master-detail relationship where an occurrence on the 'one' or 'parent' or 'master' table may have any number of associated occurrences on the 'many' or 'child' or 'detail' table. To achieve this the child table must contain fields which link back to the primary key on the parent table. These fields on the child table are known as a foreign key, and the parent table is referred to as the foreign table (from the viewpoint of the child).
2. It is possible for a record on the parent table to exist without corresponding records on the child table, but it should not be possible for an entry on the child table to exist without a corresponding entry on the parent table.
3. A child record without a corresponding parent record is known as an orphan.
4. It is possible for a table to be related to itself. For this to be possible it needs a foreign key which points back to the primary key. Note that these two keys cannot be comprised of exactly the same fields, otherwise the record could only ever point to itself.
5. A table may be the subject of any number of relationships, and it may be the parent in some and the child in others.
6. Some database engines allow a parent table to be linked via a candidate key, but if this were changed it could result in the link to the child table being broken.
7. Some database engines allow relationships to be managed by rules known as referential integrity or foreign key constraints. These will prevent entries on child tables from being created if the foreign key does not exist on the parent table, or will deal with entries on child tables when the entry on the parent table is updated or deleted.
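Referential integrity can be demonstrated with a minimal sketch, assuming Python's sqlite3 module and hypothetical customer (parent) and invoice (child) tables. SQLite only enforces foreign keys when PRAGMA foreign_keys is switched on for the connection. The ON DELETE CASCADE rule is one possible way of dealing with child entries when the parent is deleted; other engines and designs may prefer to reject the delete instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite
conn.executescript(
    """
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE invoice (
        invoice_id  INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL
            REFERENCES customer(customer_id) ON DELETE CASCADE
    );
    INSERT INTO customer VALUES (1, 'Acme');
    INSERT INTO invoice VALUES (100, 1);
    """
)

# An orphan (child with no matching parent) is rejected.
try:
    conn.execute("INSERT INTO invoice VALUES (101, 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Deleting the parent removes its children because of ON DELETE CASCADE.
conn.execute("DELETE FROM customer WHERE customer_id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM invoice").fetchone()[0]
print(orphan_rejected, remaining)
```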
Relational Database
A relational database consists of a collection of tables that store particular sets of data. The invention of this database system has standardized the way that data is stored and processed. The concept of a relational database derives from the principles of relational algebra, realized as a whole by the father of relational databases, E. F. Codd. Most of the database systems in use today are based on the relational system.
The history of the relational database began with Codd's 1970 paper, A Relational Model of Data for Large Shared Data Banks. This theory established that data should be independent of any hardware or storage system, and provided for automatic navigation between the data elements. In practice, this meant that data should be stored in tables and that relationships would exist between the different data sets, or tables.
The relation, which is a two-dimensional table, is the primary unit of storage in a relational database. A relational database can contain one or more of these tables, with each table consisting of a unique set of rows and columns. A single record is stored in a table as a row, also known as a tuple, while attributes of the data are defined in columns, or fields, in the table. The characteristics of the data, that is the columns, relate one record to another. Each column has a unique name and the content within it must be of the same type.
Tables can be related to each other in a variety of ways. A functional dependency exists when the value of one attribute (or set of attributes) determines the value of another attribute. The simplest relationship is the one-to-one relationship, in which one record in a table is related to at most one record in a separate table. A one-to-many relationship is one in which one record in a table is related to multiple records in another table. A many-to-one relationship defines the reverse situation; more than one record in a single table relates to only one record in another table. Finally, in a many-to-many relationship, more than one record in a table relates to more than one record in another table.
A key is an attribute, or set of attributes, in a table that distinguishes one row of data from another. The key may be a single column, or it may consist of a group of columns that uniquely identifies a record. Tables can contain primary keys which differentiate records from one another, and primary keys can be an individual attribute, or a combination of attributes. Foreign keys relate tables in the database to one another. A foreign key in one table refers to the primary key of another; the foreign keys generally define parent-to-child relationships between tables.
The data stored in tables is organized logically for a particular purpose, in a way that minimizes duplication, reduces data anomalies, and reinforces data integrity. The process by which data is organized logically is called normalization. Normalization simplifies the way data is defined and regulates its structure. Five normal forms are commonly described, with each form meeting a more stringent condition. The first normal form, 1NF, has the least data integrity, while the fifth normal form, or 5NF, structures the data with the fewest anomalies and best integrity.
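A many-to-many relationship is usually implemented with a link table that is the child in two one-to-many relationships. The sketch below assumes Python's sqlite3 module and hypothetical student, course, and enrolment tables; the link table's primary key is the composite of its two foreign keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT NOT NULL);

    -- Link table: one row per (student, course) pair.
    CREATE TABLE enrolment (
        student_id INTEGER NOT NULL REFERENCES student(student_id),
        course_id  INTEGER NOT NULL REFERENCES course(course_id),
        PRIMARY KEY (student_id, course_id)
    );

    INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO course  VALUES (10, 'Databases'), (20, 'Networks');
    INSERT INTO enrolment VALUES (1, 10), (1, 20), (2, 10);
    """
)

# One student relates to many courses...
courses_for_ann = [
    row[0]
    for row in conn.execute(
        "SELECT c.title FROM course AS c "
        "JOIN enrolment AS e ON e.course_id = c.course_id "
        "WHERE e.student_id = 1 ORDER BY c.title"
    )
]

# ...and one course relates to many students.
students_in_databases = [
    row[0]
    for row in conn.execute(
        "SELECT s.name FROM student AS s "
        "JOIN enrolment AS e ON e.student_id = s.student_id "
        "WHERE e.course_id = 10 ORDER BY s.name"
    )
]
print(courses_for_ann, students_in_databases)
```

The composite primary key also stops the same student from being enrolled on the same course twice, which is the kind of duplication that normalization aims to rule out.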
Stored data is manipulated using a language called Structured Query Language, or SQL. Many dialects of SQL exist. SQL is based on set theory; logical operators such as AND, OR, NOT, and IN are used in the conditions that select which data to operate on. The operations that can be used in a relational database include insert, select, update, and delete. Today, the relational database management system (RDBMS) is the most commonly used type of database. Oracle Corporation created the first commercial relational database in 1979. IBM followed suit in 1982 with the SQL Data System. Microsoft was the last major company to jump in with SQL Server 4.2 in 1992.
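The four operations and the logical operators can be seen together in a short sketch. It assumes Python's sqlite3 module and a hypothetical part table; other SQL dialects may differ in details but support the same statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE part (part_no TEXT PRIMARY KEY, colour TEXT, qty INTEGER)"
)

# INSERT: add rows.
conn.executemany(
    "INSERT INTO part VALUES (?, ?, ?)",
    [("P1", "red", 5), ("P2", "blue", 0), ("P3", "green", 12)],
)

# UPDATE: change the rows that match a condition.
conn.execute("UPDATE part SET qty = qty + 10 WHERE part_no = 'P2'")

# DELETE: remove the rows that match a condition.
conn.execute("DELETE FROM part WHERE colour = 'green'")

# SELECT: the condition combines IN, AND and NOT.
rows = conn.execute(
    "SELECT part_no, qty FROM part "
    "WHERE colour IN ('red', 'blue') AND NOT qty = 0 "
    "ORDER BY part_no"
).fetchall()
print(rows)
```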
possible without changing the conceptual or external schemas. In other words, physical data independence indicates that the physical storage structures or devices used for storing the data could be changed without necessitating a change in the conceptual view or any of the external views. Logical data independence is more difficult to achieve than physical data independence, as it requires flexibility in the design of the database, and the programmer has to foresee future requirements or modifications to the design.