
Data, Information and Knowledge


Data
o Data represents unorganized and unprocessed facts.
o Data is usually static in nature.
o It can represent a set of discrete facts about events.
o Data is a prerequisite to information.
o An organization sometimes has to decide on the nature and volume of data that is required for creating the necessary information.

Information
o Information can be considered as an aggregation of data (processed data) which makes decision making easier.
o Information usually has some meaning and purpose.

Knowledge
o By knowledge we mean human understanding of a subject matter that has been acquired through proper study and experience.
o Knowledge is usually based on learning, thinking, and proper understanding of the problem area.
o Knowledge is not information and information is not data.
o Knowledge is derived from information in the same way information is derived from data.
o We can view it as an understanding of information based on its perceived importance or relevance to a problem area.
o It can be considered as the integration of human perceptive processes that helps people to draw meaningful conclusions.

1.0 Relational Databases


This is the most common of all the different types of databases. The data in a relational database is stored in various data tables. Each table has a key field which is used to connect it to other tables, so the tables are related to each other through these key fields. Relational databases are used extensively in various industries and are the type you are most likely to come across when working in IT. Examples of relational databases are Oracle, Sybase and Microsoft SQL Server, and they are often key parts of the process of software development. You should therefore include any work required on the database as part of your project when creating a project plan and estimating project costs.

2.0 Operational Databases


In its day-to-day operation, an organisation generates a huge amount of data. Think of things such as inventory management, purchases, transactions and financials. All this data is collected in a database which is known by several names, such as operational or production database, subject-area database (SADB) or transaction database. Operational databases are usually hugely important to organisations because they include the customer database, personal database and inventory database, i.e. the details of how much of a product the company has as well as information on the customers who buy it. The data stored in operational databases can be changed and manipulated depending on what the company requires.

3.0 Data Warehouses


Organisations are required to keep all relevant data for several years. In the UK it can be as long as 6 years. This data is also an important source of information for analysing and comparing the current year's data with that of past years, which makes it easier to determine key trends. All this data from previous years is stored in a data warehouse (also called a database warehouse). Since the stored data has already gone through screening, editing and integration, it does not need any further editing or alteration. With this type of database, ensure that the software requirements specification (SRS) is formally approved as part of the project quality plan.

4.0 Distributed Databases


Many organisations have several office locations, manufacturing plants, regional offices, branch offices and a head office at different geographic locations. Each of these work groups may have its own database, and together these form the main database of the company. This is known as a distributed database.

5.0 End-User Databases


A variety of data is available at the workstations of the end users of any organisation. Each workstation is like a small database in itself, which includes data in spreadsheets, presentations, word-processing files, notes and downloaded files. All such small databases together form a different type of database called the end-user database.

3.1 Database Users


Database user names are global across a database (and not across all Sedna databases). Database users interact with database objects. Every database object has its owner - the user that created it. Every user and role (roles are discussed in Section 3.2) has its creator. In order to bootstrap the database, a freshly created database always contains one predefined DBA user with name "SYSTEM" and password "MANAGER". To start your work with the database, you first have to connect as this initial user; then you can create more users and change the default password (if you care about preventing unauthorized access to your database).

There are the following kinds of Sedna database objects:

o Standalone document
o Collection of documents
o Value based index
o Full-text index
o Module
o Trigger
o Metadata document

There are two types of Sedna database users:

o Database administrator (DBA user). Formally, a DBA user is a user that has the DBA role.
o Ordinary user (below we call it simply a user).

A DBA user:

o has all possible privileges on any object in the database;
o can remove any object in the database;
o can remove any user of the database;
o can grant/revoke any privilege to/from any user of the database;
o can grant the DBA role to a user, thus making that user also a DBA user (not recommended, as a database with multiple DBA users is hard to administer). Any DBA user can also revoke the DBA role from any DBA user.

An ordinary user:

o can act according to the privileges that he has;
o can grant and revoke any privileges on the database objects that he owns to any user;
o can remove database objects that he owns and drop users that he has created.

Every user has its name and password.

Types of Database Users

Users are differentiated by the way they expect to interact with the system:

o Application programmers - interact with the system through DML calls.
o Sophisticated users - form requests in a database query language.
o Specialized users - write specialized database applications that do not fit into the traditional data processing framework.
o Naive users - invoke one of the permanent application programs that have been written previously.

Three Level Database Architecture


last updated 30-aug-11

Data and Related Structures


Data are actually stored as bits, or numbers and strings, but it is difficult to work with data at this level. It is necessary to view data at different levels of abstraction.

Schema: a description of data at some level. Each level has its own schema.

We will be concerned with three forms of schemas: physical, conceptual, and external.

Physical Data Level


The physical schema describes details of how data is stored: files, indices, etc. on the random access disk system. It also typically describes the record layout of files and the type of files (hash, b-tree, flat). Early applications worked at this level and dealt explicitly with these details, e.g. minimizing physical distances between related data and organizing the data structures within the file (blocked records, linked lists of blocks, etc.).

Problems:

o Routines are hardcoded to deal with the physical representation.
o Changes to data structures are difficult to make.
o Application code becomes complex since it must deal with details.
o Rapid implementation of new features is very difficult.

Conceptual Data Level


Also referred to as the logical level. It hides the details of the physical level.

In the relational model, the conceptual schema presents data as a set of tables.

The DBMS maps data access between the conceptual and physical schemas automatically.

The physical schema can be changed without changing the application: only the DBMS mapping from conceptual to physical must change. This is referred to as physical data independence.
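Physical data independence can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the `student` table, its columns and the index name `idx_student_major` are hypothetical and do not come from the notes. Adding an index is a physical-level change, and the application query stays exactly the same.

```python
import sqlite3

# Hypothetical schema for illustration; table and index names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, major TEXT)")
conn.executemany(
    "INSERT INTO student (id, name, major) VALUES (?, ?, ?)",
    [(1, "Ann", "Physics"), (2, "Bob", "History"), (3, "Cy", "Physics")],
)

# The application only ever issues this query against the conceptual schema.
QUERY = "SELECT id, name FROM student WHERE major = ? ORDER BY id"


def run_query():
    """Run the unchanged application query."""
    return conn.execute(QUERY, ("Physics",)).fetchall()


def query_plan():
    """Return SQLite's description of how it will execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, ("Physics",)).fetchall()
    return " ".join(row[3] for row in rows)  # column 3 holds the plan text


rows_before = run_query()
plan_before = query_plan()

# Physical-level change: add an index. The application query is untouched.
conn.execute("CREATE INDEX idx_student_major ON student (major)")

rows_after = run_query()
plan_after = query_plan()

print(rows_before == rows_after)  # same answer before and after
print(plan_before)
print(plan_after)
```

The rows returned are identical; only the access path chosen by the DBMS differs, which is the mapping change the text describes.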

External Data Level


In the relational model, the external schema also presents data as a set of relations. An external schema specifies a view of the data in terms of the conceptual level. It is tailored to the needs of a particular category of users.

Portions of the stored data should not be seen by some users. Hiding them begins to implement a level of security and simplifies the view for these users. Examples:

o Students should not see faculty salaries.
o Faculty should not see billing or payment data.

Information that can be derived from stored data might be viewed as if it were stored. Example:

o GPA is not stored; it is calculated when needed.

Applications are written in terms of an external schema. The external view is computed when accessed. It is not stored. Different external schemas can be provided to different categories of users. Translation from the external level to the conceptual level is done automatically by the DBMS at run time.

The conceptual schema can be changed without changing the application: only the mapping from external to conceptual must be changed. This is referred to as conceptual data independence.
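The two examples above (hidden salaries, GPA computed on demand) can be sketched as SQL views using Python's sqlite3 module. The table names, view names and sample rows are assumptions made for this illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Conceptual schema (hypothetical tables for illustration).
CREATE TABLE faculty (id INTEGER PRIMARY KEY, name TEXT, salary REAL);
CREATE TABLE grade (student_id INTEGER, course TEXT, points REAL);
INSERT INTO faculty VALUES (1, 'Dr. Lee', 90000.0);
INSERT INTO grade VALUES (10, 'DB', 4.0), (10, 'OS', 3.0), (11, 'DB', 2.0);

-- External schema for students: faculty data without the salary column.
CREATE VIEW faculty_public AS SELECT id, name FROM faculty;

-- Derived data: GPA is computed when the view is read, never stored.
CREATE VIEW student_gpa AS
    SELECT student_id, AVG(points) AS gpa FROM grade GROUP BY student_id;
""")

# Columns visible through the student-facing view.
cursor = conn.execute("SELECT * FROM faculty_public")
public_columns = [column[0] for column in cursor.description]

# GPA per student, computed at access time.
gpas = dict(conn.execute("SELECT student_id, gpa FROM student_gpa"))

# A new grade shows up in the view immediately because nothing is cached.
conn.execute("INSERT INTO grade VALUES (11, 'OS', 4.0)")
gpas_after_insert = dict(conn.execute("SELECT student_id, gpa FROM student_gpa"))

print(public_columns)
print(gpas, gpas_after_insert)
```

A view alone does not enforce security in SQLite; in a multi-user DBMS the view would be combined with privileges so that students can query `faculty_public` but not `faculty`.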

Over the years there have been several different ways of constructing databases, amongst which have been the following:

o The Hierarchical Data Model
o The Network Data Model
o The Relational Data Model

Although I will give a brief summary of the first two, the bulk of this document is concerned with The Relational Data Model as it is the most prevalent in today's world.

The Hierarchical Data Model


The Hierarchical Data Model structures data in a tree of records, with each record having one parent record and many children. It can be represented as follows:

[Figure 1 - The Hierarchical Data Model]

A hierarchical database consists of the following:

1. It contains nodes connected by branches.
2. The top node is called the root.
3. If multiple nodes appear at the top level, the nodes are called root segments.
4. The parent of node nx is a node directly above nx and connected to nx by a branch.
5. Each node (with the exception of the root) has exactly one parent.
6. The child of node nx is the node directly below nx and connected to nx by a branch.
7. One parent may have many children.

By introducing data redundancy, complex network structures can also be represented as hierarchical databases. This redundancy is eliminated in physical implementation by including a 'logical child'. The logical child contains no data but uses a set of pointers to direct the database management system to the physical child in which the data is actually stored. Associated with a logical child are a physical parent and a logical parent. The logical parent provides an alternative (and possibly more efficient) path to retrieve logical child information.
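The parent and child rules listed above can be sketched as a small in-memory tree. The `Node` class and the sample record names are hypothetical; this models the structure only, not any particular hierarchical DBMS.

```python
class Node:
    """A record in a hierarchy: exactly one parent (None for the root)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent      # a single parent, per the rules above
        self.children = []        # one parent may have many children
        if parent is not None:
            parent.children.append(self)

    def path_from_root(self):
        """Walk the parent links upward and return the names root-first."""
        node, path = self, []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return list(reversed(path))


# Hypothetical sample data.
root = Node("Company")
sales = Node("Sales", parent=root)
support = Node("Support", parent=root)
order = Node("Order 42", parent=sales)

print(order.path_from_root())
print([child.name for child in root.children])
```

Because each node stores a single `parent`, a record such as `Order 42` cannot belong to both `Sales` and `Support` without being duplicated, which is the redundancy the paragraph above refers to.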

The Network Data Model


The Network Data Model uses a lattice structure in which a record can have many parents as well as many children. It can be represented as follows:

[Figure 2 - The Network Data Model]

Like the Hierarchical Data Model, the Network Data Model also consists of nodes and branches, but a child may have multiple parents within the network structure instead of being restricted to just one.

I have worked with both hierarchical and network databases, and they both suffered from the following deficiencies (when compared with relational databases):

o Access to the database was not via SQL query strings, but by a specific set of APIs, typically for FIND, CREATE, READ, UPDATE and DELETE.
o Each API would only access a single table (dataset), so it was not possible to implement a JOIN which would return data from several tables.
o It was not possible to provide a variable WHERE clause. The only selection mechanisms available were:
    o read all entries (a full table scan).
    o read a single entry using a specific primary key.
    o read all entries on a child table which were associated with a selected entry on a parent table.
  Any further filtering had to be done within the application code.
o It was not possible to provide an ORDER BY clause. Data was presented in the order in which it existed in the database. This mechanism could be tuned by specifying sort criteria to be used when each record was inserted, but this had several disadvantages:
    o Only a single sort sequence could be defined for each path (link to a parent), so all records retrieved on that path would be provided in that sequence.
    o It could make inserts rather slow when attempting to insert into the middle of a large collection, or where a table had multiple paths each with its own set of sort criteria.
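For contrast, here is a minimal sketch of what a relational engine allows in a single statement: a JOIN across two tables, a variable WHERE clause and an ORDER BY chosen at query time. It uses Python's sqlite3 module; the `customer` and `orders` tables and their contents are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical parent and child tables.
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    total REAL
);
INSERT INTO customer VALUES (1, 'Acme'), (2, 'Bolt');
INSERT INTO orders VALUES (100, 1, 250.0), (101, 2, 40.0), (102, 1, 75.0);
""")


def orders_over(minimum):
    """Join, filter and sort inside the database, not in application code."""
    return conn.execute(
        """
        SELECT c.name, o.order_id, o.total
        FROM orders AS o
        JOIN customer AS c ON c.customer_id = o.customer_id
        WHERE o.total >= ?          -- variable selection criterion
        ORDER BY o.total DESC       -- sort order chosen at query time
        """,
        (minimum,),
    ).fetchall()


big_orders = orders_over(50)
print(big_orders)
```

With the older APIs described above, the same result would need a scan of one dataset, a keyed read of the other for each entry, and filtering and sorting in the application.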

The Relational Data Model


The Relational Data Model has the relation at its heart, together with a whole series of rules governing keys, relationships, joins, functional dependencies, transitive dependencies, multi-valued dependencies, and modification anomalies.

The Relation
The Relation is the basic element in a relational data model.

[Figure 3 - Relations in the Relational Data Model]

A relation is subject to the following rules:

1. Relation (file, table) is a two-dimensional table.
2. Attribute (i.e. field or data item) is a column in the table.
3. Each column in the table has a unique name within that table.
4. Each column is homogeneous. Thus the entries in any column are all of the same type (e.g. age, name, employee-number, etc).
5. Each column has a domain, the set of possible values that can appear in that column.
6. A Tuple (i.e. record) is a row in the table.
7. The order of the rows and columns is not important.
8. Values of a row all relate to some thing or portion of a thing.
9. Repeating groups (collections of logically related attributes that occur multiple times within one record occurrence) are not allowed.
10. Duplicate rows are not allowed (candidate keys are designed to prevent this).
11. Cells must be single-valued (but can be variable length). Single valued means the following:
    o Cannot contain multiple values such as 'A1,B2,C3'.
    o Cannot contain combined values such as 'ABC-XYZ' where 'ABC' means one thing and 'XYZ' another.

A relation may be expressed using the notation R(A,B,C, ...) where:

o R = the name of the relation.
o (A,B,C, ...) = the attributes within the relation.
o A = the attribute(s) which form the primary key.

Keys
1. A simple key contains a single attribute.
2. A composite key is a key that contains more than one attribute.
3. A candidate key is an attribute (or set of attributes) that uniquely identifies a row. A candidate key must possess the following properties:
    o Unique identification - For every row the value of the key must uniquely identify that row.
    o Non redundancy - No attribute in the key can be discarded without destroying the property of unique identification.
4. A primary key is the candidate key which is selected as the principal unique identifier. Every relation must contain a primary key. The primary key is usually the key selected to identify a row when the database is physically implemented. For example, a part number is selected instead of a part description.
5. A superkey is any set of attributes that uniquely identifies a row. A superkey differs from a candidate key in that it does not require the non redundancy property.
6. A foreign key is an attribute (or set of attributes) that appears (usually) as a non key attribute in one relation and as a primary key attribute in another relation. I say usually because it is possible for a foreign key to also be the whole or part of a primary key:
    o A many-to-many relationship can only be implemented by introducing an intersection or link table which then becomes the child in two one-to-many relationships. The intersection table therefore has a foreign key for each of its parents, and its primary key is a composite of both foreign keys.
    o A one-to-one relationship requires that the child table has no more than one occurrence for each parent, which can only be enforced by letting the foreign key also serve as the primary key.
7. A semantic or natural key is a key for which the possible values have an obvious meaning to the user or the data. For example, a semantic primary key for a COUNTRY entity might contain the value 'USA' for the occurrence describing the United States of America. The value 'USA' has meaning to the user.
8. A technical or surrogate or artificial key is a key for which the possible values have no obvious meaning to the user or the data. These are used instead of semantic keys for any of the following reasons:
    o When the value in a semantic key is likely to be changed by the user, or can have duplicates. For example, on a PERSON table it is unwise to use PERSON_NAME as the key as it is possible to have more than one person with the same name, or the name may change such as through marriage.
    o When none of the existing attributes can be used to guarantee uniqueness. In this case adding an attribute whose value is generated by the system, e.g. from a sequence of numbers, is the only way to provide a unique value. Typical examples would be ORDER_ID and INVOICE_ID. The value '12345' has no meaning to the user as it conveys nothing about the entity to which it relates.
9. A key functionally determines the other attributes in the row, thus it is always a determinant.
10. Note that the term 'key' in most DBMS engines is implemented as an index which does not allow duplicate entries.
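The difference between a superkey and a candidate key (points 3 and 5 above) can be sketched in Python. One loud assumption: these helper functions test uniqueness against a sample of rows only, whereas a real key is a rule about every row the relation may ever hold. The function names and the sample PART rows are hypothetical.

```python
from itertools import combinations


def is_superkey(rows, attrs):
    """True if no two sample rows share the same values for attrs."""
    seen = set()
    for row in rows:
        value = tuple(row[a] for a in attrs)
        if value in seen:
            return False
        seen.add(value)
    return True


def is_candidate_key(rows, attrs):
    """A superkey from which no attribute can be discarded (non redundancy)."""
    if not is_superkey(rows, attrs):
        return False
    for size in range(1, len(attrs)):
        for subset in combinations(attrs, size):
            if is_superkey(rows, subset):
                return False  # a smaller set already identifies every row
    return True


# Hypothetical sample rows for a PART relation.
parts = [
    {"part_no": "P1", "description": "bolt", "supplier": "S1"},
    {"part_no": "P2", "description": "bolt", "supplier": "S2"},
    {"part_no": "P3", "description": "nut", "supplier": "S1"},
]

print(is_candidate_key(parts, ("part_no",)))
print(is_superkey(parts, ("part_no", "description")))
print(is_candidate_key(parts, ("part_no", "description")))
print(is_superkey(parts, ("description",)))
```

Here `("part_no", "description")` is a superkey but not a candidate key, because `description` can be discarded without losing unique identification.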


Relationships
One table (relation) may be linked with another in what is known as a relationship. Relationships may be built into the database structure to facilitate the operation of relational joins at runtime.

1. A relationship is between two tables in what is known as a one-to-many or parent-child or master-detail relationship where an occurrence on the 'one' or 'parent' or 'master' table may have any number of associated occurrences on the 'many' or 'child' or 'detail' table. To achieve this the child table must contain fields which link back to the primary key on the parent table. These fields on the child table are known as a foreign key, and the parent table is referred to as the foreign table (from the viewpoint of the child).
2. It is possible for a record on the parent table to exist without corresponding records on the child table, but it should not be possible for an entry on the child table to exist without a corresponding entry on the parent table.
3. A child record without a corresponding parent record is known as an orphan.
4. It is possible for a table to be related to itself. For this to be possible it needs a foreign key which points back to the primary key. Note that these two keys cannot consist of exactly the same fields, otherwise the record could only ever point to itself.
5. A table may be the subject of any number of relationships, and it may be the parent in some and the child in others.
6. Some database engines allow a parent table to be linked via a candidate key, but if this were changed it could result in the link to the child table being broken.
7. Some database engines allow relationships to be managed by rules known as referential integrity or foreign key constraints. These will prevent entries on child tables from being created if the foreign key does not exist on the parent table, or will deal with entries on child tables when the entry on the parent table is updated or deleted.
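Point 7 can be sketched with SQLite through Python's sqlite3 module. The `parent` and `child` tables are hypothetical. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`, and `ON DELETE CASCADE` is just one of several possible ways of dealing with child entries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite
conn.executescript("""
-- Hypothetical parent and child tables.
CREATE TABLE parent (parent_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE child (
    child_id INTEGER PRIMARY KEY,
    parent_id INTEGER NOT NULL
        REFERENCES parent(parent_id) ON DELETE CASCADE,
    name TEXT
);
INSERT INTO parent VALUES (1, 'order 1');
INSERT INTO child VALUES (10, 1, 'line 1');
""")


def try_insert_orphan():
    """Attempt to create a child row whose parent does not exist."""
    try:
        conn.execute("INSERT INTO child VALUES (11, 99, 'orphan')")
    except sqlite3.IntegrityError:
        return False  # the engine refused to create an orphan
    return True


orphan_inserted = try_insert_orphan()

# Deleting the parent removes its children because of ON DELETE CASCADE.
conn.execute("DELETE FROM parent WHERE parent_id = 1")
remaining_children = conn.execute("SELECT COUNT(*) FROM child").fetchone()[0]

print(orphan_inserted, remaining_children)
```

Without the constraint, both the orphan insert and the parent delete would succeed and leave child rows pointing at nothing.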

Relational Database

A relational database consists of a collection of tables that store particular sets of data. The invention of this database system has standardized the way that data is stored and processed. The concept of a relational database derives from the principles of relational algebra, realized as a whole by the father of relational databases, E. F. Codd. Most of the database systems in use today are based on the relational system.

The history of the relational database began with Codd's 1970 paper, A Relational Model of Data for Large Shared Data Banks. This theory established that data should be independent of any hardware or storage system, and provided for automatic navigation between the data elements. In practice, this meant that data should be stored in tables and that relationships would exist between the different data sets, or tables.

The relation, which is a two-dimensional table, is the primary unit of storage in a relational database. A relational database can contain one or more of these tables, with each table consisting of a unique set of rows and columns. A single record is stored in a table as a row, also known as a tuple, while attributes of the data are defined in columns, or fields, in the table. The characteristics of the data, or the columns, relate one record to another. Each column has a unique name and the content within it must be of the same type.

Tables can be related to each other in a variety of ways. A functional dependency exists when the value of one attribute (or set of attributes) determines the value of another attribute. The simplest relationship is the one-to-one relationship, in which one record in a table is related to one record in a separate table. A one-to-many relationship is one in which one record in a table is related to multiple records in another table. A many-to-one relationship defines the reverse situation: more than one record in a single table relates to only one record in another table. Finally, in a many-to-many relationship, more than one record in a table relates to more than one record in another table.

A key is an attribute, or set of attributes, in a table that distinguishes one row of data from another. The key may be a single column, or it may consist of a group of columns that uniquely identifies a record. Tables can contain primary keys which differentiate records from one another, and a primary key can be an individual attribute or a combination of attributes. Foreign keys relate tables in the database to one another. A foreign key in one table is a primary key in another; the foreign keys generally define parent-to-child relationships between tables.

The data stored in tables is organized logically, based on a particular purpose, in a way that minimizes duplication, reduces data anomalies, and reinforces data integrity. The process by which data is organized logically is called normalization. Normalization simplifies the way data is defined and regulates its structure. There are five forms in the normalization process, with each form meeting a more stringent condition. The first normal form, 1NF, imposes the weakest conditions, while the fifth normal form, or 5NF, structures the data with the fewest anomalies and the best integrity.

Stored data is manipulated using a language called Structured Query Language, or SQL. Many varieties of SQL exist. SQL is based on set theory; operators such as AND, OR, NOT, and IN are used to express conditions on the data. The operations that can be used in a relational database include insert, select, update, and delete.

Today, the relational database management system (RDBMS) is the most commonly used database format. Oracle Corporation created the first commercial relational database in 1979. IBM followed suit in 1982 with the SQL Data System. Microsoft followed with SQL Server, releasing version 4.2 in 1992.
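The four basic operations mentioned above (insert, select, update, delete) and the AND and IN operators can be sketched with Python's sqlite3 module. The `part` table and its rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; part_no acts as the primary key.
conn.execute(
    "CREATE TABLE part (part_no TEXT PRIMARY KEY, description TEXT, qty INTEGER)"
)

# INSERT: add rows.
conn.executemany(
    "INSERT INTO part VALUES (?, ?, ?)",
    [("P1", "bolt", 10), ("P2", "nut", 0), ("P3", "washer", 5)],
)

# UPDATE: change a value in the rows selected by the WHERE clause.
conn.execute("UPDATE part SET qty = qty + 5 WHERE part_no = ?", ("P1",))

# DELETE: remove the rows selected by the WHERE clause.
conn.execute("DELETE FROM part WHERE qty = 0")

# SELECT: combine conditions with AND and IN.
in_stock = conn.execute(
    """
    SELECT part_no, qty
    FROM part
    WHERE qty > 0 AND description IN ('bolt', 'washer')
    ORDER BY part_no
    """
).fetchall()

print(in_stock)
```

Each statement works on the set of rows that match its condition, rather than on one record at a time.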

What is data independence? Explain different types of data independence.


The ability to modify a schema definition in one level without affecting a schema definition in a higher level is called data independence. There are two kinds of data independence:

o Logical data independence
o Physical data independence

Logical data independence

The ability to modify the conceptual schema without affecting the existing external schemas is called logical data independence. In logical data independence, the users are shielded from changes in the logical structure of the data or changes in the choice of relations to be stored. Changes to the conceptual schema, such as the addition and deletion of entities, addition and deletion of attributes, or addition and deletion of relationships, must be possible without changing existing external schemas or having to rewrite application programs. Only the view definition and the mapping need to be changed in a DBMS that supports logical data independence.

Physical data independence

The ability to modify the internal schema without having to change the conceptual or external schemas is called physical data independence. In physical data independence, the conceptual schema insulates the users from changes in the physical storage of the data. Changes to the internal schema, such as using different file organizations or storage structures, using different storage devices, or modifying indexes or hashing algorithms, must be possible without changing the conceptual or external schemas. In other words, physical data independence indicates that the physical storage structures or devices used for storing the data could be changed without necessitating a change in the conceptual view or any of the external views.

Logical data independence is more difficult to achieve than physical data independence, as it requires flexibility in the design of the database, and the programmer has to foresee future requirements or modifications in the design.
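Logical data independence can be sketched with a view that keeps the same shape while the tables underneath it are reorganized. This uses Python's sqlite3 module; the table and view names are hypothetical, and the copy, drop and rename steps are just one possible way to split a table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Original conceptual schema (hypothetical) and the external view on it.
CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, phone TEXT);
INSERT INTO student VALUES (1, 'Ann', '555-0100'), (2, 'Bob', '555-0101');
CREATE VIEW student_directory AS SELECT id, name, phone FROM student;
""")

# The application is written only against the external view.
APP_QUERY = "SELECT name, phone FROM student_directory ORDER BY id"
before = conn.execute(APP_QUERY).fetchall()

# Conceptual change: move phone numbers into their own table, then
# redefine the view (the external-to-conceptual mapping) to match.
conn.executescript("""
DROP VIEW student_directory;
CREATE TABLE student_contact (student_id INTEGER PRIMARY KEY, phone TEXT);
INSERT INTO student_contact SELECT id, phone FROM student;
CREATE TABLE student_new (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO student_new SELECT id, name FROM student;
DROP TABLE student;
ALTER TABLE student_new RENAME TO student;
CREATE VIEW student_directory AS
    SELECT s.id, s.name, c.phone
    FROM student AS s
    JOIN student_contact AS c ON c.student_id = s.id;
""")

after = conn.execute(APP_QUERY).fetchall()
print(before == after)
```

Only the view definition changed; `APP_QUERY` was not touched, which matches the statement above that only the view definition and the mapping need to be changed.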
