SBM College of Engineering and Technology
STUDY MATERIALS
UNIT I
In a relational database, a weak entity is an entity that cannot be uniquely
identified by its attributes alone; therefore, it must use a foreign key in
conjunction with its attributes to create a primary key. The foreign key is
typically the primary key of an entity it is related to.
A link is created between two tables where the primary key of one table is
associated with the foreign key of another table using database relationships.
Example
The Book table (pk_book_id, title, ISBN) is associated with the Author table
(pk_author_id, author_name, phone_no, fk_book_id).
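As a small sketch, a weak entity can be declared in SQL with a composite primary key that borrows the owner's key. The employee/dependent tables below are invented for this illustration, not part of the syllabus example:

-- 'dependent' is a weak entity owned by 'employee' (hypothetical schema)
create table employee (
    emp_id   int primary key,
    emp_name varchar(50)
);
create table dependent (
    emp_id     int,                    -- foreign key to the owner entity
    dep_name   varchar(50),            -- partial (discriminator) key
    birth_date date,
    primary key (emp_id, dep_name),    -- identity borrows the owner's key
    foreign key (emp_id) references employee(emp_id)
);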
File manager
Buffer manager
Data integrity manager
Transaction manager
1. one-to-one (1:1)
2. one-to-many (1:M)
3. many-to-many (M:N)
There are three types of relationship constraints:
1. Structural Constraints
o Participation Constraints
o Cardinality Ratio
2. Overlap Constraints
3. Covering Constraints
Structural constraints apply to binary relationships, while overlap and
covering constraints apply to EER diagrams (Extended ER Diagrams).
1. Total/Mandatory Participation
2. Partial/Optional Participation
Protection of data against accidental or intentional loss, destruction, or
misuse
Firewalls
Establishment of user privileges
Schema: The overall design of the database is called the database schema.
1) Physical schema
2) Logical schema
Physical schema: The physical schema describes the database design at the
physical level, which is the lowest level of abstraction describing how the data
are actually stored.
Logical schema: The logical schema describes the database design at the logical
level, which describes what data are stored in the database and what relationships
exist among the data.
Data redundancy leads to higher storage and database access cost.
Updates may not be applied consistently to the various copies of the same data.
Difficulty in accessing data
Required data cannot be retrieved in a convenient and efficient manner
Ex.
Requirement: find the names of all customers who live in a particular
postal-code area.
Given: An application program to generate the list of all customers.
Solution:
Obtain the list of all customers and extract the required information manually
Ask a system programmer to write the necessary application program
Need to write a new program to carry out each new task
Data isolation
Because data are scattered in various files, possibly in different formats,
writing new application programs to retrieve the appropriate data is difficult.
Integrity problems
Integrity constraints (e.g., account balance > 0) become "buried" in
program code rather than being stated explicitly. It is hard to add new
constraints or to change existing ones.
Atomicity of updates
Failures may leave database in an inconsistent state with partial updates
carried out
Example:
Transfer of funds from one account to another should either complete or not
happen at all
Concurrent access by multiple users
Concurrent access is needed for performance.
Uncontrolled concurrent accesses can lead to inconsistencies
Example: Two people reading a balance and updating it at the same time
Security problems
Hard to provide user access to some, but not all, data
Database systems offer solutions to all the above problems
Transaction manager, which ensures that the database remains in a
consistent (correct) state despite system failures, and that concurrent
transaction executions proceed without conflicting.
File manager, which manages the allocation of space on disk storage and
the data structures used to represent information stored on disk.
Buffer manager, which is responsible for fetching data from disk storage
into main memory, and deciding what data to cache in main memory. The
buffer manager is a critical part of the database system, since it enables the
database to handle data sizes that are much larger than the size of main
memory.

The storage manager implements several data structures as part of
the physical system implementation:
Data files, which store the database itself.
Data dictionary, which stores metadata about the structure of the
database, in particular the schema of the database.
Indices, which provide fast access to data items that hold particular values.
The Query Processor
The query processor components include DDL interpreter, which
interprets DDL statements and records the definitions in the data dictionary.
DML compiler, which translates DML statements in a query language into
an evaluation plan consisting of low-level instructions that the query
evaluation engine understands. A query can usually be translated into any
of a number of alternative evaluation plans that all give the same result. The
DML compiler also performs Query optimization, that is, it picks the
lowest cost evaluation plan from among the alternatives.
Query evaluation engine, which executes low-level instructions generated
by the DML compiler.
Application Architectures
Most users of a database system today are not present at the site of the
database system, but connect to it through a network.
We can therefore differentiate between client machines, on which
remote database users work, and server machines, on which the database
system runs. Database applications are usually partitioned into two or three
parts, as in Figure 1.5. In a two-tier architecture, the application is
partitioned into a component that resides at the client machine, which
invokes database system functionality at the server machine through
query language statements.
In contrast, in a three-tier architecture, the client machine acts as
merely a front end and does not contain any direct database calls.
Instead, the client end communicates with an application server,
usually through a forms interface.
3. Define data model. Explain the different types of data models with
relevant examples. (16)
Data Models
Data model: a collection of conceptual tools for describing data, data
relationships, data semantics, and consistency constraints.
The various data models that have been proposed fall into three
different groups:
1. Object based logical models
2. Record based logical models
3. Physical models
Object-based logical models
Object-based logical models are used in describing data at the logical and
view levels. There are many different models, and more are likely to come:
Entity–relationship model
Object-oriented model
Semantic data model
Functional data model
Entity–relationship model
The entity-relationship (E-R) data model is based on a perception of a
real world that consists of a collection of basic objects, called entities, and
of relationships among these objects.
ENTITY
An entity is a “thing” or “object” in the real world that is distinguishable
from other objects
For example,
Each person is an entity, and bank accounts can be considered as entities.
ATTRIBUTES
Entities are described in a database by a set of attributes
For example,
The attributes account-number and balance may describe one particular
account in a bank, and they form attributes of the account entity set. Similarly,
attributes customer-name, customer-street, and customer-city may describe a
customer entity.
RELATIONSHIP
A relationship is an association among several entities
For example,
A depositor relationship associates a customer with each account that she
has. The set of all entities of the same type and the set of all relationships of the
same type are termed an entity set and relationship set, respectively
E-R diagram
Rectangles, which represent entity sets
Ellipses, which represent attributes
Diamonds, which represent relationships among entity sets
Lines, which link attributes to entity sets and entity sets to
relationships
Record-based logical models are so named because the database is structured
in fixed-format records of several types.
Each record type defines a fixed number of fields, or attributes, and each
field is usually of a fixed length.
The three most widely accepted record-based models are the relational,
network, and hierarchical models.
Relational Model
The relational model uses a collection of tables to represent both data and
the relationships among those data. Each table has multiple columns, and
each column has a unique name.
Figure 1.3 presents a sample relational database comprising three tables:
One shows details of bank customers, the second shows accounts, and the
third shows which accounts belong to which customers.
The first table, the customer table, shows, for example, that the customer
identified by customer-id 192-83-7465 is named Johnson and lives at 12 Alma
St. in Palo Alto.
The second table, account, shows, for example, that account A-101 has a
balance of $500, and A-201 has a balance of $900.
The third table shows which accounts belong to which customers. For
example, account number A-101 belongs to the customer whose customer-id
is 192-83-7465, namely Johnson, and customers 192-83-7465 (Johnson) and
019-28-3746 (Smith) share account number A-201 (they may share a business
venture).
Other Data Models
The object-oriented data model is another data model that has seen
increasing attention. The object-oriented model can be seen as extending
the E-R model with notions of encapsulation, methods (functions), and
object identity.
Data in the hierarchical model are represented by collections of trees rather
than arbitrary graphs.
Physical level. The lowest level of abstraction describes how the data are
actually stored. The physical level describes complex low-level data
structures in detail.
Logical level. The next-higher level of abstraction describes what data are
stored in the database, and what relationships exist among those data. The
logical level thus describes the entire database in terms of a small number
of relatively simple structures.
View level. The highest level of abstraction describes only part of the entire
database. Even though the logical level uses simpler structures, complexity
remains because of the variety of information stored in a large database.
Instances and Schemas
The collection of information stored in the database at a particular
moment is called an instance of the database. The overall design of the
database is called the database schema. Schemas are changed infrequently,
if at all.
Types of schema
Database systems have several schemas, partitioned according to the
levels of abstraction. The physical schema describes the database design at
the physical level, while the logical schema describes the database design
at the logical level. A database may also have several schemas at the view
level, sometimes called subschemas, that describe different views of the
database.
Data independence
The ability to modify a schema definition in one level without
affecting a schema definition at the next higher level is called data
independence.
Two levels of Data independence
Physical Data independence
Logical Data independence
Basic Concepts
Entity Sets
An entity is a “thing” or “object” in the real world that is distinguishable
from all other objects.
An entity set is a set of entities of the same type that share the same
properties, or attributes. The set of all persons who are customers at a given
bank, for example, can be defined as the entity set customer
For each attribute, there is a set of permitted values, called the domain, or
value set, of that attribute. The domain of attribute customer-name might be
the set of all text strings of a certain length.
Types of Attributes
Simple and composite attributes.
Single-valued and multivalued attributes
Derived attributes
Null attributes
The attributes considered so far have been simple; that is, they are not divided
into subparts. E.g.: customer-id
Composite attributes,
on the other hand, can be divided into subparts (that is, other attributes).
For example, customer-name could be structured as a composite attribute
consisting of first-name, middle-initial, and last-name.
Multivalued attributes
An employee may have zero, one, or several phone numbers, and different
employees may have different numbers of phones. This type of attribute is said to
be multivalued
Formally, a relationship set is a set
{(e1, e2, . . . , en) | e1 ∈ E1, e2 ∈ E2, . . . , en ∈ En}, where (e1, e2, . . . , en) is a
relationship.
Consider the two entity sets customer and loan in Figure 2.1. We define the
relationship set borrower to denote the association between customers and the
bank loans that the customers have.
Constraints
An E-R enterprise schema may define certain constraints to which the contents of
a database must conform. In this section, we examine mapping cardinalities and
participation constraints, which are two of the most important types of
constraints.
Mapping Cardinalities
• One to one. An entity in A is associated with at most one entity in B, and an
entity in B is associated with at most one entity in A. (See Figure 2.4a.)
• One to many. An entity in A is associated with any number (zero or more) of
entities in B. An entity in B, however, can be associated with at most one entity
in A. (See Figure 2.4b.)
• Many to one. An entity in A is associated with at most one entity in B. An
entity in B, however, can be associated with any number (zero or more) of
entities in A. (See Figure 2.5a.)
• Many to many. An entity in A is associated with any number (zero or more) of
entities in B, and an entity in B is associated with any number (zero or more) of
entities in A.
Participation Constraints
The participation of an entity set E in a relationship set R is total if every
entity in E participates in at least one relationship in R; if only some entities in
E participate, the participation is partial.
Keys
The values of the attributes of an entity must be such that they can
uniquely identify the entity. In other words, no two entities in an entity set are
allowed to have exactly the same value for all attributes.
A key allows us to identify a set of attributes that suffice to distinguish
entities from each other. Keys also help uniquely identify relationships, and thus
distinguish relationships from each other
Types of key
Super key
Candidate key
Primary key
Super key
A super key is a set of one or more attributes that, taken collectively, allow
us to identify uniquely an entity in an entity set.
Ex: (cus_id, cus_name)
Candidate key
A candidate key is a minimal super key, one for which no proper subset is
itself a super key.
Ex: (cus_name, cus_street)
Primary key
A primary key is the candidate key chosen by the database designer as the
principal means of uniquely identifying entities within an entity set.
Ex: (cus_id)
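As a small SQL sketch of the keys above (assuming, purely for illustration, that the pair (cus_name, cus_street) happens to be unique):

-- cus_id is the chosen primary key; (cus_name, cus_street) is
-- another candidate key, declared unique here for illustration
create table customer (
    cus_id     int primary key,
    cus_name   varchar(50) not null,
    cus_street varchar(50),
    unique (cus_name, cus_street)
);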
Entity-Relationship Diagram
Rectangles, which represent entity sets
Ellipses, which represent attributes
Diamonds, which represent relationship sets
Lines, which link attributes to entity sets and entity sets to relationship sets
Double ellipses, which represent multivalued attributes
Dashed ellipses, which denote derived attributes
Double lines, which indicate total participation of an entity in a
relationship set
Double rectangles, which represent weak entity sets
Constraints on Generalizations
Condition-defined.
In condition-defined lower-level entity sets, membership is evaluated on the basis
of whether or not an entity satisfies an explicit condition or predicate
User-defined.
User-defined lower-level entity sets are not constrained by a membership
condition; rather, the database user assigns entities to a given entity set
Disjoint.
A disjointness constraint requires that an entity belong to no more than one
lower-level entity set.
Overlapping.
In overlapping generalizations, the same entity may belong to more than
one lower-level entity set within a single generalization
Total generalization or specialization
Each higher-level entity must belong to a lower-level entity set.
Partial generalization or specialization
Some higher-level entities may not belong to any lower-level entity set
Aggregation
The best way to model a situation such as the one just described is to use
aggregation. Aggregation is an abstraction through which relationships are
treated as higher-level entities. Thus, for our example, we regard the relationship
set works-on (relating the entity sets employee, branch, and job) as a higher-level
entity set called works-on.
Such an entity set is treated in the same manner as is any other entity set.
We can then create a binary relationship manages between works-on and
manager to represent who manages what tasks.
6. Draw an E-R diagram for Banking, University, Company, Airlines,
ATM, Hospital, Library, Supermarket, and Insurance Company. (16)
A database system provides a data definition language to specify the
database schema and a data manipulation language to express database queries
and updates. In practice, the data definition and data manipulation languages are
not two separate languages; instead they simply form parts of a single database
language, such as the widely used SQL language.
Data-Definition Language
We specify a database schema by a set of definitions expressed by a special
language called a data-definition language (DDL).
For instance, the following statement in the SQL language defines the account
table:
create table account (account-number char(10), balance integer)
Execution of the above DDL statement creates the account table. In
addition, it updates a special set of tables called the data dictionary or
data directory.
A data dictionary contains metadata—that is, data about data. The schema
of a table is an example of metadata. A database system consults the data
dictionary before reading or modifying actual data.
We specify the storage structure and access methods used by the database
system by a set of statements in a special type of DDL called a data
storage and definition language.
The data values stored in the database must satisfy certain consistency
constraints. For example, suppose the balance on an account should not fall
below $100. The DDL provides facilities to specify such constraints. The
database system checks these constraints every time the database is
updated.
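As a hedged sketch, standard SQL lets the balance rule above be declared directly in the schema (using the account table from the earlier DDL example), so the system checks it on every update:

-- consistency constraint: balance must not fall below $100
create table account (
    account_number char(10) primary key,
    balance        integer check (balance >= 100)
);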
Data-Manipulation Language
Data manipulation is
• The retrieval of information stored in the database
• The insertion of new information into the database
• The deletion of information from the database
• The modification of information stored in the database
A data-manipulation language (DML) is a language that enables users to
access or manipulate data as organized by the appropriate data model. There are
basically two types:
• Procedural DMLs require a user to specify what data are needed and how to
get those data.
• Declarative DMLs (also referred to as nonprocedural DMLs) require a user to
specify what data are needed without specifying how to get those data.
Database Access from Application Programs
Application programs are programs that are used to interact with the database.
Application programs are usually written in a host language, such as Cobol, C,
C++, or Java.
To access the database, DML statements need to be executed from the host
language. There are two ways to do this:
• By providing an application program interface (set of procedures) that can be
used to send DML and DDL statements to the database, and retrieve the results.
The Open Database Connectivity (ODBC) standard defined by Microsoft for use
with the C language is a commonly used application program interface standard.
The Java Database Connectivity (JDBC) standard provides corresponding
features to the Java language.
• By extending the host language syntax to embed DML calls within the host
language program. Usually, a special character prefaces DML calls, and a
preprocessor, called the DML precompiler, converts the DML statements to
normal procedure calls in the host language.
8. Discuss database users and administrators. (8)
Database Users and Administrators
A primary goal of a database system is to retrieve information from and
store new information in the database. People who work with a database can be
categorized as database users or database administrators.
Database Users and User Interfaces
Naive users
Naive users are unsophisticated users who interact with the system by
invoking one of the application programs that have been written previously. For
example, a bank teller who needs to transfer $50 from account A to account B
invokes a program called transfer.
Application programmers
Application programmers are computer professionals who write
application programs. Application programmers can choose from many tools to
develop user interfaces. Rapid application development (RAD) tools are tools
that enable an application programmer to construct forms and reports without
writing a program.
Sophisticated users
Sophisticated users interact with the system without writing programs.
Instead, they form their requests in a database query language. They submit each
such query to a query processor, whose function is to break down DML
statements into instructions that the storage manager understands. Analysts who
submit queries to explore data in the database fall in this category.
Online analytical processing (OLAP)
Online analytical processing (OLAP) tools simplify analysts' tasks by
letting them view summaries of data in different ways. For instance, an analyst
can see total sales by region (for example, North, South, East, and West), or by
product, or by a combination of region and product (that is, total sales of each
product in each region).
Another class of tools for analysts is data mining tools, which help them find
certain kinds of patterns in data
Specialized users
Specialized users are sophisticated users who write specialized database
applications that do not fit into the traditional data-processing framework.
Among these applications are computer-aided design systems, knowledge base
and expert systems, systems that store data with complex data types (for
example, graphics data and audio data), and environment-modeling systems.
Database Administrator
One of the main reasons for using DBMSs is to have central control of both
the data and the programs that access those data. A person who has such central
control over the system is called a database administrator (DBA). The
functions of a DBA include:
• Schema definition. The DBA creates the original database schema by
executing a set of data definition statements in the DDL.
• Storage structure and access-method definition.
• Schema and physical-organization modification. The DBA carries out
changes to the schema and physical organization to reflect the changing needs of
the organization, or to alter the physical organization to improve performance.
• Granting of authorization for data access. By granting different types of
authorization, the database administrator can regulate which parts of the database
various users can access. The authorization information is kept in a special
system structure that the database system consults whenever someone attempts to
access the data in the system.
• Routine maintenance. Examples of the database administrator’s routine
maintenance activities are:
Periodically backing up the database, either onto tapes or onto remote
servers, to prevent loss of data in case of disasters such as flooding.
Ensuring that enough free disk space is available for normal operations,
and upgrading disk space as required.
Monitoring jobs running on the database and ensuring that performance is
not degraded by very expensive tasks submitted by some users.
000024 novelist
000024 playwright
000034 magazine columnist
002345 novella
002345 newpaper columnist
Second Normal Form:
=================
*Tables are said to be in second normal form when:
*The tables meet the criteria for first normal form.
*If the primary key is a composite of attributes (contains multiple
columns), the non-key attributes (columns) must depend on the whole
key.
Note: Any table with a primary key that is composed of a single
attribute (column) is automatically in second normal form.
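A minimal sketch of a 2NF violation and its repair (table and column names are invented for this illustration): in a table keyed by (order_id, product_id), an attribute such as product_name depends only on product_id, a proper subset of the key, so the table is split:

-- before (violates 2NF): product_name depends only on product_id
--   order_item(order_id, product_id, product_name, quantity)
-- after: move the partially dependent attribute to its own table
create table product (
    product_id   int primary key,
    product_name varchar(50)
);
create table order_item (
    order_id   int,
    product_id int,
    quantity   int,
    primary key (order_id, product_id),
    foreign key (product_id) references product(product_id)
);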
Dependency preservation
Computer data – information in a form suitable for use with a computer. Data is
often distinguished from programs. A program is a sequence of instructions that
detail a task for the
computer to perform. In this sense, data is everything in software that is not
program code.
A relation schema R is in Boyce–Codd normal form if and only if for every one
of its dependencies X → Y, at least one of the following conditions holds: [4]
X → Y is a trivial functional dependency (Y ⊆ X)
X is a superkey for schema R
Fifth normal form (5NF), also known as project-join normal form (PJ/NF), is
a level of database normalization designed to reduce redundancy in relational
databases recording multi-valued facts by isolating semantically related multiple
relationships. A table is said to be in the 5NF if and only if every join
dependency in it is implied by the candidate keys.
Only in rare situations does a 4NF table not conform to 5NF. These are situations
in which a
complex real-world constraint governing the valid combinations of attribute
values in the 4NF table is not implicit in the structure of that table. If such a
table is not normalized to 5NF, the burden of maintaining the logical consistency
of the data within the table must be carried partly by the application responsible
for insertions, deletions, and updates to it; and there is a heightened risk that the
data within the table will become inconsistent. In contrast, the 5NF design
excludes the possibility of such inconsistencies.
In file System, each user maintains separate files and programs to manipulate
these files because each requires some data not available from other user‘s files.
This redundancy in defining and storage of data results in
wasted storage space,
redundant efforts to maintain common update,
higher storage and access cost, and
inconsistency of data (i.e., various copies of the same data may not
agree).
(ii) Self-Describing Nature of Database System: In a file system, the structure
of the data files is embedded in the access programs. A database system contains
not only the database itself but also a complete definition or description of the
database structure and constraints. This definition is stored in the system catalog,
which contains information such as the structure of each file, the type and storage
format of each data item, and various constraints on the data. Information stored
in the catalog is called meta-data. A DBMS is not written for specific
applications; hence it must refer to the catalog to know the structure of each file,
and so it can work equally well with any number of database applications.
(iv) Enforcing Integrity Constraints: The data values must satisfy certain types
of consistency constraints. In a file system, developers enforce constraints by
adding appropriate code in the application programs. When new constraints are
added, it is difficult to change the programs to enforce them. In a database system,
the DBMS provides capabilities for defining and enforcing constraints. The
constraints are maintained in the system catalog. Therefore application programs
work independently of the addition or modification of constraints. Hence integrity
problems are avoided.
(vi) Concurrent Access or Sharing of Data: When multiple users update the
data simultaneously, it may result in inconsistent data. The system must maintain
supervision, which is difficult because data may be accessed by many different
application programs that have not been coordinated previously. The DBMS
includes concurrency-control software to ensure that several programs/users
trying to update the same data do so in a controlled manner, so that the result of
the updates is correct.
(vii) Security: Not every user of the database system should be able to access all
the data. But since application programs are added to the system in an ad hoc
manner, enforcing such security constraints is difficult in a file system. A DBMS
provides a security and authorization subsystem, which the DBA uses to create
accounts and to specify account restrictions.
(viii) Support for Multiple Views of Data: The database approach supports
multiple views of data. A database has many users, each of whom may require a
different view of the database. A view may be a subset of the database or virtual
data retrieved from the database that is not explicitly stored. The DBMS provides
multiple views of the data, and different application programs can be written for
different views of the data.
UNIT II
1. List the string operations supported by SQL.
SQL supports pattern matching on strings with the like operator, using the
wildcards % (matches any substring) and _ (matches any single character),
along with operations such as concatenation (||) and functions such as upper,
lower, and trim.
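A brief illustration of like (the customer table follows the banking examples used elsewhere in this material):

-- % matches any substring, _ matches exactly one character
select customer_name
from customer
where customer_street like '%Main%';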
3. What are aggregate functions? List the aggregate functions supported
by SQL.
Aggregate functions compute a single value from a collection of input values.
SQL supports:
AVG() – returns the average value
COUNT() – returns the number of rows
FIRST() – returns the first value
LAST() – returns the last value
MAX() – returns the largest value
MIN() – returns the smallest value
SUM() – returns the sum
The GROUP BY clause follows the WHERE clause in a SELECT statement and
precedes the ORDER BY clause.
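A small example combining an aggregate with GROUP BY (an account table with a branch_name column is assumed from the banking examples):

-- average balance per branch; WHERE precedes GROUP BY,
-- and ORDER BY comes last
select branch_name, avg(balance) as avg_balance
from account
where balance > 0
group by branch_name
order by avg_balance desc;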
7. Write a SQL statement to find the names and loan numbers of all
customers who have a loan at Chennai branch.
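A plausible answer, assuming the textbook schema borrower(customer_name, loan_number) and loan(loan_number, branch_name, amount):

-- names and loan numbers of customers with a loan at the Chennai branch
select b.customer_name, b.loan_number
from borrower b, loan l
where b.loan_number = l.loan_number
  and l.branch_name = 'Chennai';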
A list of common commands performed on a SQL table to change its
structure. Commands that access the content, such as SELECT, or that add or
change rows, such as INSERT and UPDATE, are not addressed here.
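As a hedged sketch, typical structure-changing (DDL) commands look like the following (the account table is assumed from earlier examples):

alter table account add branch_name varchar(30);   -- add a column
alter table account drop column branch_name;       -- remove a column
alter table account rename to accounts;            -- rename the table
drop table accounts;                               -- delete table and contents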
VARCHAR(n) or CHARACTER VARYING(n) – character string, variable
length, maximum length n.
While the DBMS is designed to process these low-level operations efficiently,
it can be quite a burden for a user to submit requests to the DBMS in these
formats. There are three phases that a query passes through during the DBMS's
processing of that query:
1. Parsing and translation
2. Optimization
3. Evaluation
The first step in processing a query submitted to a DBMS is to convert the
query into a form usable by the query processing engine. High-level query
languages such as SQL represent a query as a string, or sequence, of characters.
Certain sequences of characters represent various types of tokens such as
keywords, operators, operands, literal strings, etc. Like all languages, there are
rules (syntax and grammar) that govern how the tokens can be combined into
understandable (i.e., valid) statements.
Advantages of Views:
1. Provide automatic security for hidden data.
2. Different views of same data for different users.
3. Provide logical data independence.
4. Provides the principle of interchangeability and principle of database
relativity.
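A brief sketch (the view name and columns are assumed from the banking examples) showing how a view presents a tailored subset while hiding the rest:

-- exposes only names and cities; other customer columns stay hidden
create view customer_public as
select customer_name, customer_city
from customer;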
The select operation selects tuples that satisfy a given predicate. We use the
lowercase Greek letter sigma (σ) to denote selection. The predicate appears as a
subscript to σ.
The argument relation is in parentheses after the σ. Thus, to select those
tuples of the loan relation where the branch is "Perryridge," we write
σbranch-name = "Perryridge" (loan)
Then the relation that results from the preceding query is as shown in Figure
3.10.
We can find all tuples in which the amount lent is more than $1200 by writing
σamount>1200 (loan)
Suppose we want to list all loan numbers and the amount of the loans, but
do not care about the branch name.
Π loan-number, amount(loan)
Consider a query to find the names of all bank customers who have either
an account or a loan or both. Note that the customer relation does not contain the
information, since a customer does not need to have either an account or a loan at
the bank.
To answer this query, we need the information in the depositor relation (Figure
3.5) and in the borrower relation (Figure 3.7). We know how to find the names of
all customers with a loan in the bank:
Πcustomer-name (borrower)
We also know how to find the names of all customers with an account in the
bank:
Πcustomer-name (depositor)
To answer the query, we need the union of these two sets; that is, we need all
customer names that appear in either or both of the two relations. We find these
data by the binary operation union, denoted, as in set theory, by ∪. So the
expression needed
is Πcustomer-name (borrower) ∪ Πcustomer-name (depositor)
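As an aside, the three relational-algebra operations above map directly to SQL (schemas as in the figures; written here as a hedged sketch):

-- selection:  σ amount>1200 (loan)
select * from loan where amount > 1200;
-- projection: Π loan-number, amount (loan)
select loan_number, amount from loan;
-- union:      Π customer-name (borrower) ∪ Π customer-name (depositor)
select customer_name from borrower
union
select customer_name from depositor;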
1. In static SQL, how the database will be accessed is predetermined in the
embedded SQL statement (hard coded). In dynamic SQL, how the database
will be accessed is determined at run time.
Parsing and Translation:
Translates the query into its internal form, which is then translated into
relational algebra.
The parser checks syntax and verifies relations.
Evaluation
The query-execution engine takes a query-evaluation plan, executes that
plan, and returns the answers to the query.
Optimization
A relational algebra expression may have many equivalent expressions.
E.g., σbalance<2500(Πbalance(account)) is equivalent to
Πbalance(σbalance<2500(account)).
Each relational algebra operation can be evaluated using one of several
different algorithms.
A query evaluation plan (or simply plan) consists of an extended
relational algebra tree, with additional annotations at
each node indicating:
The access methods to use for each table.
The implementation method.
SELECT S.name FROM Reserves R, Sailors S WHERE
R.sid = S.sid AND R.bid = 100 AND S.rating > 5
This query can be expressed in relational algebra as:
ΠS.name(σbid=100 ∧ rating>5(Reserves ⋈ Sailors))
When the query involves several operators, sometimes the result of one
operator is pipelined into the next.
In this case, no temporary relation is written to disk (materialized).
The result is fed to the next operator as soon as it is available.
It is cheaper.
When the input table to a unary operator is pipelined into it, we say it
is applied on-the-fly.
8. Since indices speed query processing, why might they not be kept on
several search keys? List as many reasons as possible.
Reasons for not keeping several search indices include:
a. Every index requires additional CPU time and disk I/O overhead
during insertions and deletions.
b. Indices on non-primary keys might have to be changed on updates,
although an index on the primary key might not (this is because updates
typically do not modify the primary-key attributes).
c. Each extra index requires additional storage space.
d. For queries which involve conditions on several search keys, efficiency
might not be bad even if only some of the keys have indices on
them. Therefore, database performance is improved less by adding indices
when many indices already exist.
11. Consider the employee database, where the primary keys are
underlined.
employee(empname,street,city)
works(empname,companyname,salary)
company(companyname,city)
manages(empname,management)
Give an expression in the relational algebra for each request.
1) Find the names of all employees who work for First Bank
Corporation.
2) Find the names, street addresses and cities of residence of all
employees who work for First Bank Corporation and earn
more than 200000 per annum.
3) Find the names of all employees in this database who live in
the same city as the company for which they work.
4) Find the names of all employees who earn more than every
employee of Small Bank Corporation.
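Hedged answer sketches in relational algebra (using the schema above; the fourth query uses a rename ρ to compare salaries within works):

1) Πempname (σcompanyname = "First Bank Corporation" (works))
2) Πempname, street, city (σcompanyname = "First Bank Corporation" ∧ salary > 200000 (works ⋈ employee))
3) Πempname (employee ⋈ works ⋈ company) — the natural join equates the shared city attribute of employee and company.
4) Πempname (works) − Πworks.empname (works ⋈works.salary ≤ works2.salary (ρworks2 (σcompanyname = "Small Bank Corporation" (works))))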
UNIT III
A transaction is an action, or series of actions, carried out by a user or
application, which accesses or changes the contents of the database. It transforms
the database from one consistent state to another, although consistency may be
violated during the transaction.
Deadlock
Starvation
Active
Partially committed
Failed
Aborted
Committed
Terminated
8. What is a shadow copy scheme?
It is a simple, but efficient, scheme called the shadow copy scheme. It is
based on making copies of the database, called shadow copies, and assumes that
only one transaction is active at a time. The scheme also assumes that the
database is simply a file on disk.
15. What are the two methods for dealing with the deadlock problem?
Deadlock prevention, and deadlock detection and recovery.
(Drawbacks of the shadow-copy scheme: commit overhead, data
fragmentation, and garbage collection.)
23. Define blocks.
The database system resides permanently on nonvolatile storage, and is
partitioned into fixed-length storage units called blocks
28. Differentiate strict two-phase locking protocol and rigorous two-phase
locking protocol.
In the strict two-phase locking protocol, all exclusive-mode locks taken by a
transaction are held until that transaction commits.
The rigorous two-phase locking protocol requires that all locks be held until
the transaction commits.
32. How are timestamps implemented?
• Use the value of the system clock as the timestamp; that is, a transaction's
timestamp is equal to the value of the clock when the transaction enters the system.
• Use a logical counter that is incremented after a new timestamp has been
assigned; that is, the timestamp is equal to the value of the counter.
33. What are the timestamps associated with each data item?
• W-timestamp (Q) denotes the largest timestamp of any transaction that
executed WRITE (Q) successfully.
• R-timestamp (Q) denotes the largest timestamp of any transaction that executed
READ (Q) successfully.
If a schedule S can be transformed into a schedule S′ by a series of
swaps of non-conflicting instructions, we say that S and S′ are conflict
equivalent.
A schedule S is conflict serializable if it is conflict equivalent to
a serial schedule.
View Serializability
Let S and S′ be two schedules with the same set of transactions. S and S′ are
view equivalent if the following three conditions are met:
1. For each data item Q, if transaction Ti reads the initial value of Q in
schedule S, then transaction Ti must, in schedule S′, also read the initial value of Q.
2. For each data item Q, if transaction Ti executes read (Q) in schedule S, and
that value was produced by transaction Tj (if any), then transaction Ti must, in
schedule S′, also read the value of Q that was produced by transaction Tj.
3. For each data item Q, the transaction (if any) that performs the final write (Q)
operation in schedule S must perform the final write (Q) operation in schedule S′.
A schedule S is view serializable if it is view equivalent to a serial schedule.
Every conflict serializable schedule is also view serializable. The following
schedule is view-serializable but not conflict serializable.
2) Immediate database modification
The immediate modification technique allows database
modifications to be output to the database while the transaction is still in
the active state.
The recovery scheme uses two recovery procedures:
i) Undo (Ti): It restores the value of all data items updated by
transaction Ti to the old values.
ii) Redo (Ti): It sets the value of all data items updated by transaction
Ti to the new values.
After a failure has occurred, the recovery scheme consults the log to determine
which transactions need to be redone and which need to be undone:
• Transaction Ti needs to be undone if the log contains the record <Ti
start> but does not contain the record <Ti commit>.
• Transaction Ti needs to be redone if the log contains both the record
<Ti start> and the record <Ti commit>.
Let us reconsider the banking system, in which transactions T0 and T1 are
executed in the order T0 followed by T1. Suppose that the system crashes before
the completion of the transactions.
Recoverability
If a transaction Ti fails, we need to undo the effect of this transaction to
ensure the atomicity property of the transaction. In a system that allows
concurrent execution, it is necessary to ensure that any transaction T j that is
dependent on Ti is also aborted. To achieve this, we need to place
restrictions on the type of schedules permitted in the system.
Types of schedules that are acceptable from the view point of recovery from
transaction failure are:
• Recoverable schedules
• Cascadeless schedules
1) Recoverable schedules
A recoverable schedule is one where, for each pair of transactions
Ti and T j such that T j reads a data item previously written by Ti, the
commit operation of Ti appears before the commit operation of T j.
Consider schedule 11 in Fig. 4.29, in which T9 is a transaction that
performs only one instruction: read (A). Suppose that the system allows
T9 to commit immediately after executing the read (A) instruction. Thus,
T9 commits before T8 does. Suppose that T8 fails before it commits.
Since T9 has read the value of data item A written by T8, we must abort
T9 to ensure transaction atomicity. However, T9 has already committed
and cannot be aborted. Thus, it is impossible to recover correctly from
the failure of T8. Thus, schedule 11 is a non-recoverable schedule, which
should not be allowed. Most database systems require that all schedules
be recoverable.
T8              T9
Read (A)
Write (A)
                Read (A)
Read (A)
2) Cascadeless schedules
Even if a schedule is recoverable, to recover correctly from the
failure of a transaction Ti, we may have to roll back several transactions.
Such situations occur if transactions have read data written by Ti.
Consider schedule 12 of Fig. 4.30. Transaction T1 writes a
value of A that is read by transaction T2. Transaction T2 writes a value
of A that is read by T3. Suppose that, at this point, T1 fails; T1 must be
rolled back. Since T2 is dependent on T1, T2 must be rolled back.
Similarly, as T3 is dependent on T2, T3 should also be rolled back. This
phenomenon, in which a single transaction failure leads to a series of
transaction rollbacks, is called cascading rollback.
Cascading rollback is undesirable, since it leads to the
undoing of a significant amount of work. Therefore, schedules should
not contain cascading rollbacks. Such schedules are called cascadeless
schedules.
A cascadeless schedule is one where, for each pair of
transactions Ti and Tj such that Tj reads a data item previously written
by Ti, the commit operation of Ti appears before the read operation of
Tj.
T1              T2              T3
Read (A)
Read (B)
Write (A)
                Read (A)
                Write (A)
                                Read (A)
Fig. 4.30 Schedule 12
Strict two-phase locking protocol: This protocol requires that locking should be
two-phase, and that all exclusive-mode locks taken by a transaction be held
until the transaction commits. This requirement prevents any transaction from
reading data written by any uncommitted transaction.
Timestamp-based protocols
Timestamp-based protocols ensure serializability by selecting an ordering
among transactions in advance using timestamps.
Timestamps
With each transaction in the system, a unique fixed timestamp is
associated. It is denoted by TS (Ti). This timestamp is assigned by the database
system before the transaction Ti starts execution. If a transaction Ti has been
assigned timestamp TS (Ti), and a new transaction Tj enters the system, then
TS (Ti) < TS (Tj).
Two methods are used for implementing timestamps:
i) Use the value of the system clock as the timestamp; that is, a
transaction's timestamp is equal to the value of the clock when the
transaction enters the system.
ii) Use a logical counter; that is, a transaction's timestamp is equal to
the value of the logical counter when the transaction enters the system.
After assigning a new timestamp, the value of the counter is increased.
The timestamps of the transactions determine the serializability order. Thus, if
TS (Ti) < TS (Tj), then the system must ensure that in the produced schedule,
transaction Ti appears before transaction Tj.
To implement this scheme, two timestamps are associated with each data
item Q.
i) W-timestamp (Q) denotes the largest timestamp of any transaction
that executed write (Q) successfully.
ii) R-timestamp (Q) denotes the largest timestamp of any transaction
that executed read (Q) successfully.
These timestamps are updated whenever a new read (Q) or write (Q)
instruction is executed.
1. Suppose that transaction Ti issues read (Q).
a) If TS (Ti) < W-timestamp (Q), then Ti needs to read a value of Q
that was already overwritten. Hence, the read operation is rejected,
and Ti is rolled back.
b) If TS (Ti) ≥ W-timestamp (Q), then the read operation is
executed, and R-timestamp (Q) is set to the maximum of R-
timestamp (Q) and TS (Ti).
2. Suppose that transaction Ti issues write (Q).
a) If TS (Ti) < R-timestamp (Q), then the value of Q that Ti is
producing was needed previously, and the system assumed that
the value would never be produced. Hence, the system rejects the write
operation and rolls Ti back.
b) If TS (Ti) < W-timestamp (Q), then Ti is attempting to write an
obsolete value of Q. Hence, the system rejects this write
operation and rolls back Ti.
c) Otherwise, the system executes the write operation and sets W-
timestamp (Q) to TS (Ti).
If a transaction Ti is rolled back by the concurrency-control scheme, the system
assigns it a new timestamp and restarts it.
Advantages
1) The timestamp ordering protocol ensures conflict serializability. This
is because conflicting operations are processed in timestamp order.
2) The protocol ensures freedom from deadlock, since no transaction
ever waits.
Disadvantage
1) There is a possibility of starvation of long transactions if a sequence of
conflicting short transactions causes repeated restarting of the long
transaction. If a transaction is found to be getting restarted repeatedly,
conflicting transactions need to be temporarily blocked to enable the
transaction to finish.
2) The protocol can generate schedules that are not recoverable.
Before the execution of transaction Ti the values of accounts A and
B are $1000 and $2000, respectively.
Suppose the transaction fails due to a power failure, hardware
failure, or system error; then transaction Ti will not execute successfully.
If the failure happens after the write (A) operation but before the write (B)
operation, the database will have the values $950 and $2000.
The system has destroyed $50 as a result of the failure, leaving the system
in an inconsistent state.
The basic idea of atomicity is: The database system keeps track of the
old values of any data on which a transaction performs a write, if the
transaction does not terminate successfully then the database system
restores the old values.
Atomicity is handled by transaction-management component.
Concurrency Control
Time stamp based protocols
Binary Locks: a lock on data item can be in two states; it is either locked
or unlocked.
Shared/exclusive: this type of locking mechanism differentiates locks based
on their use. If a lock is acquired on a data item to perform a write
operation, it is an exclusive lock, because allowing more than one transaction
to write on the same data item would lead the database into an inconsistent
state. Read locks are shared because no data value is being changed.
Simplistic
Simplistic lock-based protocols allow a transaction to obtain a lock on every
object before a write operation is performed; the transaction may unlock the
data item after completing the write operation.
Pre-claiming
In pre-claiming, the transaction requests the system for all the locks it needs
beforehand. If all the locks are granted, the transaction executes and releases
all the locks when all its operations are over. If all the locks are not granted,
the transaction rolls back and waits until all the locks are granted.
Two-phase locking has two phases: a growing phase, during which all locks
are being acquired by the transaction, and a shrinking phase, during which the
locks held by the transaction are being released.
Lock-based protocols manage the order between conflicting pairs among
transactions at the time of execution, whereas timestamp-based protocols start
working as soon as a transaction is created.
Every transaction has a timestamp associated with it, and the ordering is
determined by the age of the transaction. A transaction created at clock
time 0002 would be older than all other transactions that come after it. For example,
any transaction 'y' entering the system at 0004 is two seconds younger, and
priority may be given to the older one.
In addition, every data item is given the latest read and write timestamps. This
lets the system know when the last read and write operations were made on the
data item.
UNIT IV
1. What is an index?
An index is a data structure that enables a query to run in sublinear time.
Instead of having to go through all records one by one to identify those that
match its criteria, the query uses the index to filter out those that don't and
focus on those that do.
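As a quick illustration (the index and table names are assumed from the banking examples), an index is created in SQL as:

-- lets queries filtering on branch_name avoid a full scan of account
create index idx_account_branch on account(branch_name);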
8. What are the ways in which the variable-length records arise in database
systems?
Storage of multiple record types in a file.
Record types that allow variable lengths for one or more fields.
Record types that allow repeating fields (used in some older data models).
Byte-string representation
Attach an end-of-record (⊥) control character to the end of each record.
Difficulty with deletion.
Difficulty with growth.
Variable-Length Records: Slotted Page Structure
10. What are the techniques to be evaluated for both ordered indexing and
hashing?
Ordered indices: search keys are stored in sorted order.
Hash indices: search keys are distributed uniformly across "buckets" using a
"hash function".
13. How does a B-tree differ from a B+-tree? Why is a B+-tree usually
preferred as an access structure to a data file?
Advantage of B+-tree index files: they automatically reorganize themselves
with small, local changes in the face of insertions and deletions, so
reorganization of the entire file is not required to maintain performance.
The advantages of B+-trees outweigh their disadvantages, and they are used
extensively.
14. What are the types of transparencies that a distributed database must
support? Why?
Fragmentation transparency
Replication transparency
Location transparency
17. What are structured data types? What are collection types in particular?
Structured data types allow composite attributes of E-R designs to be
represented directly. Collection types include sets, arrays, and multisets.
» RAID Level 3: Bit-Interleaved Parity.
b. Faster data transfer than with a single disk, but fewer I/Os per second, since every disk has
to participate in every I/O.
c. Subsumes Level 2 (provides all its benefits, at lower cost).
» RAID Level 4: Block-Interleaved Parity; uses block-level striping, and keeps a parity block on
a separate disk for corresponding blocks from N other disks.
a. Provides higher I/O rates for independent block reads than Level 3
i. block read goes to a single disk, so blocks stored on different disks can be read
in parallel
b. Provides higher transfer rates for reads of multiple blocks than no-striping.
c. Before writing a block, parity data must be computed.
i. More efficient for writing large amounts of data sequentially.
» RAID Level 5: Block-Interleaved Distributed Parity; partitions data and parity among all N +
1 disks, rather than storing data in N disks and parity in 1 disk.
a. E.g., with 5 disks, parity block for nth set of blocks is stored on disk (n mod 5) + 1, with
the data blocks stored on the other 4 disks.
b. Higher I/O rates than Level 4.
i. Block writes occur in parallel if the blocks and their parity blocks are on different
disks.
c. Subsumes Level 4: provides same benefits, but avoids bottleneck of parity disk.
» RAID Level 6: P+Q Redundancy scheme; similar to Level 5, but stores extra redundant
information to guard against multiple disk failures.
a. Better reliability than Level 5 at a higher cost; not used as widely.
i. Including time taken to rebuild failed disk
» RAID 0 is used only when data safety is not important
a. E.g. data can be recovered quickly from other sources
» Level 2 and 4 never used since they are subsumed by 3 and 5
» Level 3 is not used anymore since bit-striping forces single block reads to access all disks,
wasting disk arm movement, which block striping (level 5) avoids
» Level 6 is rarely used since levels 1 and 5 offer adequate safety for almost all applications
» So competition is between 1 and 5 only
» Level 1 provides much better write performance than level 5
» Level 1 has higher storage cost than level 5
» Level 5 is preferred for applications with low update rate, and large amounts of data
» Level 1 is preferred for all other applications.
» Heap – a record can be placed anywhere in the file where there is space
» Sequential – store records in sequential order, based on the value of the search key of each record
» Hashing – a hash function is computed on some attribute of each record; the result specifies in
which block of the file the record should be placed
» Records of each relation may be stored in a separate file. In a clustering file organization,
records of several different relations can be stored in the same file
o Motivation: store related records on the same block to minimize I/O
Figure: Sequential file for account records
Clustering File Organization :
» Simple file structure stores each relation in a separate file
» Can instead store several relations in one file using a clustering file organization
o good for queries involving depositor ⋈ customer, and for queries involving one single customer
and his accounts
o bad for queries involving only customer
o results in variable-size records
Clustering File Structure with Pointer Chains
Figure: Clustering file structure with pointer chains
Data Warehouse:
Large organizations have complex internal organizations, and have data stored at
different locations, on different operational (transaction processing) systems, under
different schemas.
Data sources often store only current data, not historical data. Corporate decision
making requires a unified view of all organizational data, including historical data.
A data warehouse is a repository (archive) of information gathered from
multiple sources, stored under a unified schema, at a single site
Greatly simplifies querying, permits study of historical trends
Shifts decision support query load away from transaction processing systems
Components of a Data Warehouse
WAREHOUSE SCHEMA
Distributed networks have disadvantages, and these must be considered before a system
is decentralized. They include:
COMPLEXITY: Distributed databases that hide the distributed nature from the
user and provide an acceptable level of performance, reliability, and availability are more
complex than centralized DBMSs. Data replication, failure recovery, network
management, etc., make the system more complex.
COST: Increased complexity means increased manpower (skilled professionals)
requirements, complex and costly hardware, and high procurement and
maintenance costs. Since a distributed DBMS needs more people and more hardware, both of
which are costly, running and maintaining the system can be more expensive than
a centralized system.
TECHNICAL PROBLEMS OF CONNECTING DISSIMILAR MACHINES:
Technical problems can sometimes be overwhelming for a distributed system.
Additional layers of operating system software are needed to translate and
coordinate the flow of data between machines. Sometimes a link between
mainframes and microcomputers can be difficult to establish.
NEED FOR A SOPHISTICATED COMMUNICATION SYSTEM: Distributed
processing requires the development of a data communication system. Such a system
can be costly to develop and use; in addition, its maintenance can be a costly
affair.
DATA INTEGRITY AND SECURITY PROBLEMS: Because data maintained
by a distributed system can be accessed at many locations in the network, controlling
the integrity of a database can be difficult.
LACK OF PROFESSIONAL SUPPORT: Finally, distributed computers are often
placed at locations where little or no data processing support is available. Consequently,
they will be run by non-professionals. Another aspect is that the communication
system also requires highly trained personnel for its maintenance.
6. Describe the structure of the B+ tree and give the algorithm for search in a
B+ tree with an example.
TREE INDEXES:
In the case of an index-sequential file, performance degrades as the file grows. This is
because both index lookups and sequential scans take more time as more records are
added to the file. Although this performance degradation can be overcome (to a
certain extent) by reorganizing the file, frequent file reorganizations are undesirable
and add to the file-maintenance overheads. One of the index structures that
maintains its efficiency even with the insertion and deletion of data is the B+-tree
index structure. A B+-tree index takes the form of a balanced tree in which every
path from the root to a leaf of the tree is of the same length. Each non-leaf node in the
tree has between n/2 and n children (n/2 <= c <= n), where n is fixed for a
particular tree.
In a leaf node, the pointers (p1, p2, ..., pn-1) point either to a file record with the
corresponding search-key value (k1, k2, ..., kn-1) or to a bucket of pointers, each of which
points to a file record with that search-key value. The bucket structure is used only if the
search key does not form a primary key or if the file is not sorted in search-key order.
The pointer pn is used to chain together (as shown in the figure) the leaf nodes in
search-key order, allowing efficient sequential processing of the file.
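The search algorithm asked for above can be stated compactly. Below is a minimal sketch in Python; the node layout and all names are assumptions for illustration, not a fixed standard. Starting from the root, at each internal node we find the first key greater than the search value and descend into the child just before it; at the leaf we look for the key directly.

import bisect

class Node:
    """B+-tree node: internal nodes carry children, leaves carry records."""
    def __init__(self, keys, children=None, records=None, next_leaf=None):
        self.keys = keys              # sorted search-key values
        self.children = children     # len(keys)+1 subtrees (internal only)
        self.records = records       # record pointers (leaves only)
        self.next_leaf = next_leaf   # pn: link to right sibling (leaves only)

def search(root, key):
    """Follow one root-to-leaf path, then scan the leaf for the key."""
    node = root
    while node.children is not None:             # still at an internal node
        i = bisect.bisect_right(node.keys, key)  # first key > search value
        node = node.children[i]
    for k, rec in zip(node.keys, node.records):
        if k == key:
            return rec
    return None                                  # key not present

# Example: root keys 10 and 20 separate three leaves.
leaf3 = Node([20, 25], records=["r20", "r25"])
leaf2 = Node([10, 15], records=["r10", "r15"], next_leaf=leaf3)
leaf1 = Node([3, 7], records=["r3", "r7"], next_leaf=leaf2)
root = Node([10, 20], children=[leaf1, leaf2, leaf3])
assert search(root, 15) == "r15" and search(root, 4) is None

Since the tree is balanced, a lookup follows a single root-to-leaf path, so its cost grows only logarithmically with the number of search-key values.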
7. What are the types of knowledge discovered during data mining? Explain with
suitable examples.
Data Mining
Classification
Given a training set consisting of items belonging to different classes, and a new item
whose class is unknown, predict which class it belongs to.
Regression formulae
Given a set of parameter-value to function-result mappings for an unknown function,
predict the function-result for a new parameter-value
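A tiny illustration of both tasks, with made-up data and a deliberately simple method for each (a single-nearest-neighbour classifier and a least-squares line; real data-mining systems use far more sophisticated models):

# Classification: predict the class of a new item from labelled examples,
# here with its single nearest neighbour on a one-dimensional feature.
def classify(training, new_item):
    _, label = min(training, key=lambda ex: abs(ex[0] - new_item))
    return label

training = [(20000, "bad risk"), (55000, "good risk"), (80000, "good risk")]
print(classify(training, 60000))           # -> good risk

# Regression: fit y = a*x + b by least squares, then predict a new result.
def fit_line(points):
    n = len(points)
    sx = sum(x for x, _ in points); sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points); sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return a, (sy - a * sx) / n

a, b = fit_line([(1, 2.1), (2, 3.9), (3, 6.2)])
print(a * 4 + b)                           # predicted result for parameter 4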
A multidimensional database is a specific type of database that has been optimized for data
warehousing and OLAP (online analytical processing). A multidimensional database combines
data drawn from various sources and supports networks, hierarchies, arrays, and other
data-formatting methods. In a multidimensional database, the data is presented to its users
through multidimensional arrays, and each individual value of data is contained within a cell
which can be accessed by multiple indexes.
This section introduces the concepts of outlines, dimensions, and members within a
multidimensional database. If you understand dimensions and members, you are well on
your way to understanding the power of a multidimensional database.
A dimension represents the highest consolidation level in the database outline.
The database outline presents dimensions and members in a tree structure to indicate a
consolidation relationship. Standard dimensions represent the core components of a
business plan and often relate to departmental functions; typical standard dimensions are Time,
Accounts, Product Line, Market, and Division. Dimensions change less frequently than
members.
Attribute dimensions are associated with standard dimensions. Members are the individual
components of a dimension. For example, Product A, Product B, and Product C might be
members of the Product dimension. Each member has a unique name. Essbase can store the
data associated with a member (referred to as a stored member in this chapter), or it can
dynamically calculate the data when a user retrieves it.
PARALLEL DATABASES
Data can be partitioned across multiple disks for parallel I/O. Individual relational operations
(e.g., sort, join, aggregation) can be executed in parallel: data can be partitioned, and each
processor can work independently on its own partition.
Queries are expressed in a high-level language (SQL, translated to relational algebra), which
makes parallelization easier. Different queries can be run in parallel with each other; concurrency
control takes care of conflicts. Thus, databases naturally lend themselves to parallelism.
Partitioning the relations across multiple disks reduces the time required to retrieve them
from disk. Horizontal partitioning: the tuples of a relation are divided among many disks
such that each tuple resides on one disk. Partitioning techniques (number of disks = n):
Round-robin: Send the ith tuple inserted in the relation to disk i mod n.
Hash partitioning: Choose one or more attributes as the partitioning attributes. Choose a hash
function h with range 0…n-1. Let i denote the result of hash function h applied to the
partitioning-attribute value of a tuple; send the tuple to disk i.
Range partitioning: Choose an attribute as the partitioning attribute. A partitioning vector
[v0, v1, ..., vn-2] is chosen. Let v be the partitioning-attribute value of a tuple. Tuples such
that vi <= v < vi+1 go to disk i + 1. Tuples with v < v0 go to disk 0, and tuples with v >= vn-2 go
to disk n-1. E.g., with a partitioning vector [5,11], a tuple with partitioning-attribute value
2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go
to disk 2.
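All three techniques fit in a few lines. This sketch assumes three disks and the vector from the example above; Python's built-in hash stands in for the chosen hash function h:

import bisect

N = 3  # number of disks (illustrative)

def round_robin(i):
    """Round-robin: the i-th tuple inserted goes to disk i mod n."""
    return i % N

def hash_partition(value):
    """Hash partitioning: hash the partitioning-attribute value.
    (Python salts string hashes per process; a real system uses a
    stable hash function.)"""
    return hash(value) % N

VECTOR = [5, 11]  # partitioning vector [v0, v1] from the example

def range_partition(value):
    """Range partitioning: count how many vector entries are <= value."""
    return bisect.bisect_right(VECTOR, value)

# The example from the text: values 2, 8, 20 land on disks 0, 1, 2.
assert [range_partition(v) for v in (2, 8, 20)] == [0, 1, 2]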
Cache coherency has to be maintained: reads and writes of data in the buffer must find the latest
version of the data.
Multimedia databases provide features that allow users to store and query different types of
multimedia information, which includes images (such as photos or drawings), video clips (such as movies,
newsreels, or home videos), audio clips (such as songs, phone messages, or speeches), and documents (such
as books or articles). The main types of database queries that are needed involve locating multimedia sources
that contain certain objects of interest.
For example, one may want to locate all video clips in a video database that include a certain person,
say Michael Jackson. One may also want to retrieve video clips based on certain activities included in
them, such as video clips where a soccer goal is scored by a certain player or team. The above types of
queries are referred to as content-based retrieval, because the multimedia source is being retrieved based
on its containing certain objects or activities.
Hence, a multimedia database must use some model to organize and index the multimedia sources
based on their contents. Identifying the contents of multimedia sources is a difficult and time-consuming
task. There are two main approaches.
The first is based on automatic analysis of the multimedia sources to identify certain
mathematical characteristics of their contents. This approach uses different techniques
depending on the type of multimedia source (image, video, audio, or text).
The second approach depends on manual identification of the objects and activities of
interest in each multimedia source and on using this information to index the sources. This
approach can be applied to all multimedia sources.
An image is typically stored either in raw form as a set of pixel or cell values, or in compressed form
to save space. The image shape descriptor describes the geometric shape of the raw image, which is typically
a rectangle of cells of a certain width and height. Hence, each image can be represented by an m by n grid of
cells. Each cell contains a pixel value that describes the cell content.
Analysis of multimedia sources is critical to support any type of query or search interface. We need to
represent multimedia source data such as images in terms of features that enable us to define
similarity. The work done so far in this area uses low-level visual features such as color, texture, and shape,
which are directly related to the perceptual aspects of image content. These features are easy to extract and
represent, and it is convenient to design similarity measures based on their statistical properties.
o Color is one of the most widely used visual features in content-based image retrieval since it
does not depend upon image size or orientation.
o Retrieval based on color similarity is mainly done by computing a color histogram for each
image that identifies the proportion of pixels within an image for the three color channels
(red, green, blue: RGB); a small sketch follows this list.
o However, RGB representation is affected by the orientation of the object with respect to
illumination and camera direction.
o Therefore, current image retrieval techniques compute color histograms using competing
invariant representations such as HSV (hue, saturation, value).
o HSV describes colors as points in a cylinder whose central axis ranges from black at the
bottom to white at the top with neutral colors between them.
o The angle around the axis corresponds to the hue, the distance from the axis corresponds to
the saturation, and the distance along the axis corresponds to the value (brightness).
o Texture refers to the patterns in an image that present the properties of homogeneity that do
not result from the presence of a single color or intensity value.
o The notion of implicit tagging is an important one for image recognition and comparison.
Multiple tags may attach to an image or a subimage: for instance, in the example we referred
to above, tags such as “tiger,” “jungle,” “green,” and “stripes” may be associated with that
image.
o Most image search techniques retrieve images based on user-supplied tags that are often not
very accurate or comprehensive. To improve search quality, a number of recent systems aim
at automated generation of these image tags.
o In the case of multimedia data, most of its semantics is present in its content. These systems use
image-processing and statistical-modeling techniques to analyze image content and generate
accurate annotation tags that can then be used to retrieve images by content. However, because
different annotation schemes use different vocabularies to annotate images, the quality of image
retrieval can be poor.
o To solve this problem, recent research techniques have proposed the use of concept
hierarchies, taxonomies, or ontologies using OWL (Web Ontology Language), in which
terms and their relationships are clearly defined.
o These can be used to infer higher-level concepts based on tags. Concepts like “sky” and
“grass” may be further divided into “clear sky” and “cloudy sky” or “dry grass” and “green
grass” in such a taxonomy.
o These approaches generally come under semantic tagging and can be used in conjunction
with the above feature-analysis and object-identification strategies.
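As a concrete illustration of the color-histogram computation mentioned in the first bullet above, here is a minimal sketch in pure Python; the bin count, pixel data, and intersection measure are illustrative choices, and real systems use finer bins and usually the HSV space:

BINS = 4  # cut each 0..255 channel into 4 bins -> a 64-bin signature

def histogram(pixels):
    """pixels: iterable of (r, g, b) tuples; returns {bin: fraction}."""
    counts = {}
    for r, g, b in pixels:
        key = (r * BINS // 256, g * BINS // 256, b * BINS // 256)
        counts[key] = counts.get(key, 0) + 1
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

def similarity(h1, h2):
    """Histogram intersection: 1.0 means identical color distributions."""
    return sum(min(h1.get(k, 0.0), h2.get(k, 0.0))
               for k in h1.keys() | h2.keys())

img_a = [(200, 30, 30)] * 90 + [(20, 200, 20)] * 10  # mostly red, some green
img_b = [(210, 40, 25)] * 85 + [(10, 10, 250)] * 15  # mostly red, some blue
print(similarity(histogram(img_a), histogram(img_b)))  # high: both mostly red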
Audio sources are broadly classified into speech, music, and other audio data. Each of these
is significantly different from the others; hence, the different types of audio data are treated
differently.
Audio data must be digitized before it can be processed and stored. Indexing and retrieval of
audio data is arguably the toughest among all types of media, because, like video, it is
continuous in time and lacks the easily measurable characteristics of text.
Clarity of sound recordings is easy for humans to perceive but hard to quantify for machine
learning. Interestingly, speech recognition techniques are often applied to speech data to
transcribe the actual audio content, since this can make indexing the data a lot easier and more accurate.
This is sometimes referred to as text-based indexing of audio data. The speech metadata is
typically content dependent, in that the metadata is generated from the audio content, for
example, the length of the speech, the number of speakers, and so on.
However, some of the metadata might be independent of the actual content, such as the
format in which the data is stored.
Music indexing, on the other hand, is done based on the statistical analysis of the audio
signal, also known as content-based indexing. Content-based indexing often makes use of the
key features of sound: intensity, pitch, timbre, and rhythm.
It is possible to compare different pieces of audio data and retrieve information from them
based on the calculation of certain features, as well as application of certain transforms.
10. Explain the architecture of mobile and web databases with a neat sketch.
Wireless Communications –
The wireless medium has bandwidth significantly lower than that of a wired network.
The current generation of wireless technology has data rates ranging from tens to
hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second
(wireless Ethernet, popularly known as WiFi).
Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of
megabits per second.
Other characteristics that distinguish the wireless connectivity options include:
interference,
locality of access,
range,
support for packet switching,
seamless roaming throughout a geographical region.
Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the
frequency spectrum, which may cause interference with other appliances, such as cordless
telephones.
Modern wireless networks can transfer data in units called packets, as are used in wired
networks, in order to conserve bandwidth.
Client/Network Relationships –
Mobile units can move freely within a geographic mobility domain, an area that is
circumscribed by wireless network coverage.
To manage the mobility, the entire domain is divided into one or more smaller domains, called
cells, each of which is supported by at least one base station.
Mobile units move unrestricted throughout the cells of the domain while maintaining
information-access contiguity.
The communication architecture described earlier is designed to give the mobile unit the
impression that it is attached to a fixed network, emulating a traditional client-server
architecture.
Wireless communications, however, make other architectures possible.
In a MANET, co-located mobile units do not need to communicate via a fixed network,
but instead form their own network using cost-effective technologies such as Bluetooth.
In a MANET, mobile units are responsible for routing their own data, effectively acting as
base stations as well as clients.
Moreover, they must be robust enough to handle changes in the network topology, such as
the arrival or departure of other mobile units.
MANET applications can be considered as peer-to-peer, meaning that a mobile unit is
simultaneously a client and a server.
Transaction processing and data consistency control become more difficult since there is
no central control in this architecture.
Resource discovery and data routing by mobile units make computing in a MANET even
more complicated.
Sample MANET applications are multi-user games, shared whiteboard, distributed
calendars, and battle information sharing.
Communication latency
Intermittent connectivity
Limited battery life
Changing client location
The server may not be able to reach a client.
A client may be unreachable because it is dozing (an energy-conserving state in which
many subsystems are shut down) or because it is out of range of a base station.
In either case, neither client nor server can reach the other, and modifications must be
made to the architecture in order to compensate for this case.
Proxies for unreachable components are added to the architecture.
For a client (and symmetrically for a server), the proxy can cache updates intended for the
server.
Mobile computing poses challenges for servers as well as clients.
The latency involved in wireless communication makes scalability a problem.
Since latency due to wireless communications increases the time to service each client
request, the server can handle fewer clients.
One way servers relieve this problem is by broadcasting data whenever possible. A
server can simply broadcast data periodically.
Broadcast also reduces the load on the server, as clients do not have to maintain active
connections to it. Client mobility also poses many data management challenges.
Servers must keep track of client locations in order to efficiently route messages to them.
Client data should be stored in the network location that minimizes the traffic necessary
to access it.
The act of moving between cells must be transparent to the client.
The server must be able to gracefully divert the shipment of data from one base station to
another, without the client noticing.
Client mobility also allows new applications that are location-based.
WEB DATABASES
A web database is a system for storing information that can then be accessed via a website.
For example, an online community may have a database that stores the username, password,
and other details of all its members.
The most commonly used database system for the internet is MySQL, due to its integration
with PHP, one of the most widely used server-side programming languages.
At its most simple level, a web database is a set of one or more tables that contain data. Each
table has different fields for storing information of various types. These tables can then be
linked together in order to manipulate data in useful or interesting ways. In many cases, a
table will use a primary key, which must be unique for each entry and allows for
unambiguous selection of data.
A web database can be used for a range of different purposes. Each field in a table has to
have a defined data type. For example, numbers, strings, and dates can all be inserted into a
web database. Proper database design involves choosing the correct data type for each field
in order to reduce memory consumption and increase the speed of access. Although for small
databases this often isn't so important, big web databases can grow to millions of entries and
need to be well designed to work effectively.
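A small sketch of such a table, using Python's built-in sqlite3 module so it runs without a server (the table layout and names are invented for the example; the SQL fragment itself is essentially the same in MySQL):

import sqlite3

con = sqlite3.connect(":memory:")  # throwaway in-memory database
con.execute("""
    CREATE TABLE member (
        member_id INTEGER PRIMARY KEY,  -- unique for each entry
        username  TEXT NOT NULL UNIQUE,
        pw_hash   TEXT NOT NULL,        -- store a hash, never the password
        joined    DATE
    )""")
con.execute(
    "INSERT INTO member (username, pw_hash, joined) VALUES (?, ?, ?)",
    ("alice", "not-a-real-hash", "2014-06-01"))
print(con.execute("SELECT member_id, username FROM member").fetchone())
# -> (1, 'alice'); SQLite fills member_id automatically for the primary key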
UNIT V
1. What is a mobile database?
A mobile database is a database that resides on a mobile device such as a
PDA, a smart phone, or a laptop. Such devices are often limited in resources such
as memory, computing power, and battery power.
2. List the markup languages which are suitable for web databases.
The markup languages commonly used with web databases are HTML
(HyperText Markup Language) and XML (eXtensible Markup Language); XML in
particular is widely used to represent and exchange web data.
8. Specify the advantages of Data warehousing.
A data warehouse is a repository (archive) of information gathered from
multiple sources, stored under a unified schema at a single site.
Greatly simplifies querying, permits study of historical trends
Shifts decision support query load away from transaction processing
systems
Marketing
Finance
Resource optimization
Image Analysis
LOSS OF INTEGRITY
LOSS OF AVAILABILITY
LOSS OF CONFIDENTIALITY
STRUCTURE OF XML
Example
<bank-1>
  <customer>
    <account>
      <branch_name> Perryridge </branch_name>
      <balance> 400 </balance>
    </account>
    <account> …
    </account>
  </customer>
</bank-1>
Figure: Components of a data warehouse. Data from multiple data sources (data source 1,
data source 2, ...) flows through data loaders into the warehouse DBMS, on which the
query and analysis tools operate.
DESIGN ISSUES
When and how to gather data: In a source-driven architecture, data sources
transmit new information to the warehouse, either continuously or periodically (e.g.,
at night). In a destination-driven architecture, the warehouse periodically requests new
information from the data sources. Keeping the warehouse exactly synchronized with the
data sources (e.g., using two-phase commit) is too expensive.
What schema to use
Schema integration
Data cleansing
E.g. correct mistakes in addresses (misspellings, zip errors)
Merge address lists from different sources and purge duplicates
How to propagate updates
Warehouse schema may be a (materialized) view of schema from data sources
What data to summarize
Raw data may be too large to store on-line; aggregate values (totals/subtotals)
often suffice. Queries on raw data can often be transformed by the query optimizer
to use aggregate values.
Dimension values are usually encoded using small integers and mapped to full
values via dimension tables. The resultant schema is called a star schema. More
complicated schema structures also exist, such as the snowflake schema, which uses
multiple levels of dimension tables.
DATA MINING
Data mining is the process of semi-automatically analyzing large databases to
find useful patterns. Examples: predict whether a credit-card applicant poses a good
credit risk, based on some attributes (income, job type, age, ...) and past history; or
predict whether a pattern of phone calling-card usage is likely to be fraudulent.
3) Information Retrieval
Definition:
Information Retrieval is a problem-oriented discipline, concerned
with the problem of the effective and efficient transfer of desired information
between human generator and human user.
Components of IR:
Three major components
1. Document Subsystem
a) Acquisition
b) Representation
c) File Organization
2. User Subsystem
a) Problem
b) Representation
c) Query
3. Searching / Retrieval Subsystem
a) Matching
b) Retrieved Object
Traditional IR System
Figure: Traditional IR system. The system side handles acquisition and document
representation; the user side moves from a problem to its representation as a query;
matching the two produces the retrieved object.
Crawling
Overview
A Web Crawler is software for downloading pages from the web. It is
also known as a Web Spider, Web Robot, or simply a Bot.
The crawler starts by downloading a set of seed pages, which are parsed
and scanned for new links.
Features a crawler must provide
Robustness
The web contains servers that create spider traps, which are
generators of web pages that mislead crawlers into getting stuck
fetching an infinite number of pages in a particular domain. Crawlers
must be designed to be resilient to such traps.
Politeness
Web servers have both implicit and explicit policies regulating the
rate at which a crawler can visit them. These politeness policies must
be respected.
Freshness
In many applications, the crawler should operate in continuous mode.
It should obtain fresh copies of previously fetched pages. A search engine
crawler, for instance, can thus ensure that the search engine’s index contains a
fairly current representation of each indexed web page.
Extensibility
Crawlers should be designed to be extensible in many ways – to cope
with new data formats, new fetch protocols, and so on. This demands that the
crawler architecture be modular.
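A toy version of the crawl loop described above, with the caveats stated up front: a regular expression stands in for a real HTML parser, a fixed delay and page budget stand in for genuine politeness policies, and robots.txt handling is omitted entirely.

import re, time, urllib.request
from urllib.parse import urljoin

def crawl(seeds, max_pages=20, delay=1.0):
    """Breadth-first crawl: fetch a page, harvest its links, enqueue new ones."""
    frontier, seen, pages = list(seeds), set(seeds), {}
    while frontier and len(pages) < max_pages:
        url = frontier.pop(0)
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode(
                "utf-8", errors="replace")
        except (OSError, ValueError):
            continue  # unreachable or malformed URL: skip it (robustness)
        pages[url] = html
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)  # resolve relative links
            if link.startswith("http") and link not in seen:
                seen.add(link)         # never fetch the same page twice
                frontier.append(link)
        time.sleep(delay)              # crude politeness: rate-limit requests
    return pages

pages = crawl(["https://example.com/"], max_pages=5)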
4) Data Classification
Database security concerns the use of a broad range of information
security controls to protect databases (potentially including the data, the database
applications or stored functions, the database systems, the database servers, and
the associated network links) against compromises of their confidentiality,
integrity, and availability.
It involves the various types or categories of controls, such as technical,
procedural/administrative and physical. Database security is a specialist topic
within the broader realms of computer security, information security, and risk
management. Security risks to database systems include, for example:
Unauthorized or unintended activity or misuse by authorized database
users, database administrators, or network/systems managers, or by
unauthorized users or hackers (e.g. inappropriate access to sensitive
data, metadata or functions within databases, or inappropriate changes
to the database programs, structures or security configurations);
Overloads, performance constraints and capacity issues resulting in the
inability of authorized users to use databases as intended;
Types of security
Legal and ethical issues
Policy issues
System-related issues
4. Weak authentication: Weak authentication models allow attackers to employ
strategies such as social engineering and brute force to obtain database login
credentials and assume the identity of legitimate database users.
5. Weak audit trails: A weak audit logging mechanism in a database server
represents a critical risk to organizations, especially in retail, financial, healthcare,
and other industries with stringent regulatory compliance requirements.
5) Cryptography
A DBMS can use encryption to protect information in certain situations where the
normal security mechanisms of the DBMS are not adequate; for example,
hackers might otherwise access data without permission.
Cipher text
In encryption, the message to be encrypted is known as plaintext. The Plaintext is
transformed by a function that is parameterized by a key. The output of the
encryption is known as the cipher text.
Ciphertext is then transmitted over the network. The process of converting
plaintext to ciphertext is called encryption, and the process of converting
ciphertext back to plaintext is called decryption.
Techniques used for encryption: The following techniques are used in the
encryption process: substitution ciphers and transposition ciphers.
Substitution Ciphers: In a substitution cipher, each letter or group of letters is
replaced by another letter or group of letters to mask them. For example: a is
replaced with D, b is replaced with E, c with F, and z with C. In this way, attack
becomes DWWDFN. Substitution ciphers are not very secure, because an
intruder can easily guess the substitution characters.
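Since the example shifts every letter three positions (a to D, b to E), it is a Caesar-style substitution; a minimal sketch:

def substitute(plaintext, shift=3):
    """Caesar substitution: shift each letter forward by `shift` places."""
    out = []
    for ch in plaintext.lower():
        if ch.isalpha():
            out.append(chr((ord(ch) - ord('a') + shift) % 26 + ord('A')))
        else:
            out.append(ch)               # leave spaces and digits alone
    return "".join(out)

print(substitute("attack"))              # -> DWWDFN, as in the text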
Transposition Ciphers: Substitution ciphers preserve the order of the plaintext
symbols but mask them. A transposition cipher, in contrast, reorders the letters
but does not mask them. For this process a key is used. For example: iliveinqadian
may be coded as divienaniqnli. Transposition ciphers are more secure than
substitution ciphers.
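A minimal sketch of one common keyed reordering, the columnar transposition (the key and message here are invented; the text's own example may have used a different keying scheme):

def transpose(plaintext, key):
    """Columnar transposition: write the message row by row under the key,
    then read the columns in the alphabetical order of the key's letters."""
    cols = len(key)
    rows = [plaintext[i:i + cols] for i in range(0, len(plaintext), cols)]
    order = sorted(range(cols), key=lambda c: key[c])
    return "".join(row[c] for c in order for row in rows if c < len(row))

print(transpose("meetmeatdawn", "key"))  # letters reordered, none masked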
Data Encryption Standards (DES): It uses both a substitution of characters
and a rearrangement of their order on the basis of an encryption key. The main
weakness of this approach is that authorized users must be told the encryption
key, and the mechanism for communicating this transformation is vulnerable to
clever intruders.
Public Key Encryption: Each authorized user has a public encryption key,
known to everyone and a private decryption key (used by the decryption
algorithm), chosen by the user and known only to him or her.
Disadvantages of encryption:
There are following problems of Encryption:
Even in a system that supports encryption, data must often be
processed in plaintext form. Thus sensitive data may still be accessible
to transaction programs.