Self Unit 2
A file is a collection of records. Records can be accessed using the primary key. The type and frequency of access are determined by the file organization used for a given set of records.
File organization defines the logical relationship among the records of a file and how those records are mapped onto disk blocks. In other words, it describes the way records are stored in blocks and the way those blocks are placed on the storage medium.
The first approach to mapping the database onto files is to use several files, each storing only fixed-length records. An alternative approach is to structure the files so that they can hold records of varying lengths. Files of fixed-length records are easier to implement than files of variable-length records.
When a file is created using the heap file organization, the operating system allocates a memory area to the file without any further accounting details; managing the records is the responsibility of the database software. A heap file does not support any ordering, sequencing, or indexing on its own.
In a heap file organization, records are stored in no particular order. When a new record is inserted, it is simply appended to the end of the file. This method is simple, but retrieval can be inefficient because the entire file must be scanned to locate a specific record.
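A minimal sketch of this behaviour (the record layout and field names are illustrative assumptions, not part of the text):

```python
# Heap (unordered) file sketch: records are appended in arrival order,
# and lookup requires a linear scan of the whole file.

heap_file = []  # the file, simplified to one list of records

def insert(record):
    heap_file.append(record)        # O(1): always appended at the end

def find(key):
    for record in heap_file:        # O(n): linear search, no ordering
        if record["id"] == key:
            return record
    return None

insert({"id": 3, "name": "Asha"})
insert({"id": 1, "name": "Ravi"})
print(find(1))  # scans both records before finding id 1
```

Insertion is cheap precisely because nothing about the record's key influences where it is placed.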
(ii). Sequential File Organization:
Unordered Files: An unordered file, sometimes called a heap file, is the simplest type of file
organization. Records are placed in the file in the same order as they are inserted. A new record
is inserted in the last page of the file; if there is insufficient space in the last page, a new page is
added to the file. This makes insertion very efficient. However, as a heap file has no particular
ordering with respect to field values, a linear search must be performed to access a record. A
linear search involves reading pages from the file until the required record is found. This makes
retrievals from heap files that have more than a few pages relatively slow, unless the retrieval
involves a large proportion of the records in the file.
To delete a record, the required page first has to be retrieved, the record marked as deleted,
and the page written back to disk. The space with deleted records is not reused. Consequently,
performance progressively deteriorates as deletions occur. This means that heap files have to
be periodically reorganized by the Database Administrator (DBA) to reclaim the unused space
of deleted records.
Ordered Files: The records in a file can be sorted on the values of one or more of the fields,
forming a key-sequenced data set. The resulting file is called an ordered or sequential file. The
field(s) that the file is sorted on is called the ordering field. If the ordering field is also a key of
the file, and therefore guaranteed to have a unique value in each record, the field is also called
the ordering key for the file.
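Sorting on the ordering field is what makes binary search possible; a minimal sketch under illustrative key values:

```python
import bisect

# Ordered (sequential) file sketch: the file is kept sorted on its
# ordering key, so membership can be tested in O(log n).
ordering_keys = [10, 20, 30, 40, 70, 80]  # sorted ordering-field values

def exists(key):
    i = bisect.bisect_left(ordering_keys, key)  # binary search
    return i < len(ordering_keys) and ordering_keys[i] == key
```

Compare this with the heap file, where the same test costs a full linear scan.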
The various types of file organizations are:
a. Heap File Organization.
b. Sequential File Organization.
c. Hash File Organization.
d. Clustered File Organization.
(Heap and sequential file organization are explained above; hashing is covered in the hashing sections below.)
Structure of Index
An index in a database management system (DBMS) is stored as a small table of entries, each consisting of a search-key value and a pointer to the record or disk block that contains it.
Types of indexes
A primary index is built on a candidate key of the data file, so each index entry carries a unique value.
A secondary index is a type of dense index, also called a non-clustering index. To keep the mapping small, it can contain another level of indexing (two-level indexing), which minimizes the size of the mapping.
Primary Index:
When the index is based on the primary key of the table, it is called a primary index. A primary index can be either dense or sparse. A dense index contains an index record for every search-key value in the data file; a sparse index contains index records for only some data items.
In a sparse primary index, the number of entries in the index file equals the number of blocks in the main file. For example, if the main file has three blocks holding the key pairs (10, 20), (30, 40), and (70, 80), the index has three entries, one per block.
Dense Index:
In a dense index, an index record is created for every search-key value in the database.
Dense indexing makes searching faster but needs more space to store the index records.
In dense indexing, each index record contains the search-key value and a pointer to the actual record on disk.
Sparse Index:
A sparse index contains index records for only some of the search-key values in the file.
Sparse indexing helps resolve the space overhead of dense indexing.
In the sparse indexing technique, a range of search-key values shares the same data-block address; when data needs to be retrieved, that block address is fetched and the block is scanned.
Sparse indexing needs less space and less maintenance overhead for insertions and deletions, but it is slower than a dense index at locating records.
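The dense/sparse contrast can be sketched as follows (block contents are illustrative):

```python
import bisect

# Data file of sorted blocks; a dense index has one entry per key,
# a sparse index has one entry per block (the block's first key).
blocks = [[10, 20], [30, 40], [70, 80]]

# Dense index: key -> block number, for every key.
dense = {k: b for b, blk in enumerate(blocks) for k in blk}

# Sparse index: just the first key of each block.
sparse_keys = [blk[0] for blk in blocks]  # [10, 30, 70]

def sparse_lookup(key):
    b = bisect.bisect_right(sparse_keys, key) - 1  # block that may hold key
    return b if b >= 0 and key in blocks[b] else None

print(dense[40], sparse_lookup(40))  # both locate block 1
```

The dense index answers in one dictionary probe but stores six entries; the sparse index stores three entries and pays an extra scan inside the chosen block.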
Ordered indices
The indices are usually sorted to make searching faster. The indices which are sorted are known
as ordered indices.
Example: Suppose we have an employee table with thousands of records, each 10 bytes long. If the IDs start at 1, 2, 3, ... and we have to search for the employee with ID 543:
o With no index, we must scan the disk blocks from the start until ID 543 is reached, so the DBMS reads 543*10 = 5430 bytes before finding the record.
o With an index (assuming 2-byte index entries), we search using the index and the DBMS reads 542*2 = 1084 bytes, far less than in the previous case.
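The arithmetic above, reproduced directly (the 10-byte record size and 2-byte index-entry size are the text's assumptions):

```python
# Bytes read before reaching the record with ID 543.
no_index_bytes = 543 * 10    # scan full records up to ID 543
with_index_bytes = 542 * 2   # scan small index entries instead
print(no_index_bytes, with_index_bytes)  # 5430 1084
```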
Searching, inserting, and deleting a record are done the same way as in a B+ tree. Since it is a balanced tree, the DBMS first searches for the position of the record in the file and then fetches/inserts/deletes it. If it finds that an insert, delete, or update would leave the tree unbalanced, it rearranges the nodes so that the definition of the B+ tree is not violated.
B+ Tree
o The B+ tree is a balanced multiway search tree (not a binary tree). It follows a multi-level index format.
o In the B+ tree, leaf nodes hold the actual data pointers. The B+ tree ensures that all leaf nodes remain
at the same level.
o In the B+ tree, the leaf nodes are connected by a linked list. Therefore, a B+ tree can support
random access as well as sequential access.
Structure of B+ Tree
o In the B+ tree, every leaf node is at equal distance from the root node. The B+ tree is of the
order n where n is fixed for every B+ tree.
o It contains internal nodes and leaf nodes.
Internal node
o An internal node of the B+ tree contains at least n/2 record pointers (the root node is exempt from this minimum).
o At most, an internal node of the tree contains n pointers.
Leaf node
o The leaf node of the B+ tree contains at least n/2 record pointers and n/2 key values.
o At most, a leaf node contains n record pointers and n key values.
o Every leaf node of the B+ tree contains one block pointer P that points to the next leaf node.
For example, suppose we search for 55 in a tree whose intermediary node holds the keys 50 and 75. The search follows the branch between 50 and 75 and is redirected to the corresponding leaf node. There, the DBMS performs a sequential search to find 55.
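The two-step walk above can be sketched as follows (the key values and node layout are illustrative assumptions, since the original diagram is not reproduced):

```python
import bisect

# Two-level search sketch: an internal node routes the search with its
# keys, then the chosen leaf is scanned sequentially.
internal_keys = [50, 75]                             # keys in the internal node
leaves = [[30, 40], [50, 55, 65, 70], [75, 80, 85]]  # leaf nodes, left to right

def search(key):
    child = bisect.bisect_right(internal_keys, key)  # pick the branch
    return key in leaves[child]                      # sequential scan of that leaf
```

With this layout, search(55) routes between the keys 50 and 75 and finds 55 by scanning the middle leaf.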
B+ Tree Insertion
Suppose we want to insert a record with key 60 into this structure. It belongs in the 3rd leaf node, after 55. The tree is balanced, but that leaf node is already full, so we cannot insert 60 there directly.
In this case, we have to split the leaf node, so that it can be inserted into tree without affecting
the fill factor, balance and order.
With 60 included, the 3rd leaf node would hold the values (50, 55, 60, 65, 70), and its branch key in the parent is 50. We split the leaf node in the middle so that the balance is not altered, grouping (50, 55) and (60, 65, 70) into two leaf nodes.
If these two are to be leaf nodes, the intermediate node cannot branch only at 50: 60 must be added to it, together with a pointer to the new leaf node.
This is how we can insert an entry when there is overflow. In a normal scenario, it is very easy
to find the node where it fits and then place it in that leaf node.
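The split described above can be sketched as follows (a simplified leaf split on key lists only; parent-pointer bookkeeping is omitted):

```python
# Overflowing leaf split: sort the keys including the new one, cut at the
# middle, and promote the first key of the right half into the parent.
def split_leaf(leaf, new_key):
    keys = sorted(leaf + [new_key])
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]  # promoted separator key

left, right, sep = split_leaf([50, 55, 65, 70], 60)
print(left, right, sep)  # [50, 55] [60, 65, 70] 60
```

This reproduces the grouping in the text: (50, 55) and (60, 65, 70), with 60 added to the intermediate node.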
B+ Tree Deletion
Suppose we want to delete 60 from the example above. In this case, 60 must be removed from the intermediate node as well as from the 4th leaf node. Simply removing it from the intermediate node would violate the rules of the B+ tree, so we must modify the tree to keep it balanced.
After deleting node 60 from above B+ tree and re-arranging the nodes, it will show as follows:
B+ Tree Extensions: As the number of records in the database grows, the intermediary and leaf nodes need to be split and spread out widely to keep the tree balanced. This is called B+ tree extension. Because the tree spreads out widely rather than deeply, searching for records becomes faster.
The main goal of the B+ tree is fast traversal of records. As the branches spread out, fewer disk I/Os are needed to reach a record: any record can be fetched in a logarithmic number of node accesses. Suppose we have K search-key values and each intermediary node holds up to n pointers; then any record in the B+ tree can be reached in about log base (n/2) of K node accesses.
Suppose each index entry takes 40 bytes and each disk block is 4 KB. That means a node can hold about 100 entries (n = 100). Say we have 1 million search-key values. Then log base 50 of 1,000,000 is about 4, so only 4 nodes are accessed to reach any record. At roughly 1 millisecond per node access, this costs only about 4 milliseconds to fetch any node in the tree. Now we can see the advantage of extending the B+ tree into more intermediary nodes: the more the intermediary nodes spread out, the more efficient record fetching becomes. Compare the two diagrams to see the difference that B+ tree extensions make.
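The fan-out arithmetic above can be checked directly (4 KB blocks and 40-byte entries are the text's assumptions):

```python
import math

# Entries per node: block size divided by entry size.
n = (4 * 1024) // 40  # 102; the text rounds this to 100

# Node accesses to reach any of 1 million keys, with fan-out n/2 = 50.
levels = math.ceil(math.log(1_000_000, 100 // 2))
print(levels)  # 4
```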
SQL INDEX: An index in SQL is a special lookup table used to speed up searches of the data in database tables, including frequent retrieval of large amounts of data. The index requires its own space on the hard disk.
The index concept in SQL is the same as the index of a novel or a book. It is one of the best SQL techniques for improving query performance. The drawback of indexes is that they slow down the execution of UPDATE and INSERT statements; their advantage is that they speed up SELECT statements and WHERE-clause lookups.
Create an INDEX: In SQL, we can easily create an index using the following CREATE statement:
CREATE INDEX Index_Name ON Table_Name (Column_Name);
Here, Index_Name is the name of the index we want to create, and Table_Name is the name of the table on which the index is to be created. Column_Name is the name of the column on which the index is applied.
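A quick way to try this is SQLite's in-memory database (the table, column, and index names below are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employee (id INTEGER, name TEXT)")
con.execute("CREATE INDEX idx_employee_id ON employee (id)")  # plain index

# List the indexes now defined on the table.
names = [row[1] for row in con.execute("PRAGMA index_list('employee')")]
print(names)  # ['idx_employee_id']
```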
Create UNIQUE INDEX: A unique index is similar to a primary key in SQL. A unique index does not allow duplicate values to be inserted into the indexed column. This makes it a good way to help maintain the data integrity of SQL tables.
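A sketch of duplicate rejection, again using SQLite in memory (names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (email TEXT)")
con.execute("CREATE UNIQUE INDEX idx_emp_email ON emp (email)")
con.execute("INSERT INTO emp VALUES ('a@x.com')")
try:
    con.execute("INSERT INTO emp VALUES ('a@x.com')")  # duplicate value
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the unique index blocked the duplicate
print(rejected)  # True
```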
Rename an INDEX: We can rename the index of a table in a relational database using the ALTER command (for example, PostgreSQL and Oracle support ALTER INDEX old_name RENAME TO new_name; the exact syntax varies by DBMS).
Alter an INDEX: An index of a table can likewise be modified using the ALTER command; again, the exact syntax depends on the DBMS.
Search Complexity:
Ordered Indexing: Searching in ordered indexing is typically performed using binary search or
interpolation search, which has a logarithmic time complexity of O(log n), where n is the
number of indexed records.
Hashing: Hashing allows direct access to the desired record using a hash function. In ideal
cases, the search complexity is O(1), providing constant-time access. However, collisions can
occur, requiring additional steps to resolve them, which may increase the search complexity.
Insertion and Deletion:
Ordered Indexing: Insertion and deletion operations in ordered indexing require maintaining
the sorted order of the index. Insertion may require shifting existing records, resulting in
additional time complexity, typically O(n). Deletion also requires reordering the index, making it
a costly operation.
Hashing: Insertion and deletion in hashing involve computing the hash value and placing the
record in the corresponding bucket. In general, the time complexity for these operations is
considered O(1). However, in the case of collisions, additional steps such as probing or chaining
may be required, affecting the overall complexity.
Range Queries:
Ordered Indexing: Ordered indexing excels in range queries. Since the data is sorted, it is easy
to find records within a specified range by traversing the index sequentially. Range queries have
a complexity of O(k + log n), where k is the number of records in the range.
Hashing: Hashing is not optimized for range queries since the records are not stored in a
specific order. To perform range queries, all buckets need to be scanned, resulting in a time
complexity of O(m), where m is the total number of buckets.
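The contrast can be sketched as follows (key values and bucket count are illustrative assumptions):

```python
import bisect

# Ordered index: binary search into a sorted key list, then a short walk.
keys = sorted([7, 3, 19, 11, 5, 23, 2])

def range_ordered(lo, hi):  # O(log n + k)
    i = bisect.bisect_left(keys, lo)
    j = bisect.bisect_right(keys, hi)
    return keys[i:j]

# Hash index: keys scattered over buckets; a range query must scan them all.
buckets = {}
for k in keys:
    buckets.setdefault(k % 4, []).append(k)

def range_hashed(lo, hi):  # O(m): every bucket is examined
    return sorted(k for b in buckets.values() for k in b if lo <= k <= hi)

print(range_ordered(5, 19), range_hashed(5, 19))
```

Both return the same answer, but the hashed version touches every bucket regardless of how narrow the range is.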
Space Efficiency:
Ordered Indexing: Ordered indexing typically requires additional storage space to store the
index structure. The size of the index is proportional to the number of indexed records,
resulting in higher space requirements.
Hashing: Hashing can be more space-efficient since it only requires space for the hash table
itself and the records. However, depending on the level of collisions, additional space might be
needed to handle chaining or probing.
Handling Updates:
Ordered Indexing: Ordered indexing handles updates well, as it requires updating the index
structure and maintaining the sorted order. However, frequent updates can result in overhead
due to the need for reordering the index.
Hashing: Hashing can handle updates efficiently, especially when collisions are minimal.
Updates only require accessing the appropriate bucket and modifying the record. However,
excessive collisions can impact performance and require additional steps to resolve.
Static Hashing:
In static hashing, when a search-key value is provided, the hash function always computes the same
address. For example, if mod-4 hash function is used, then it shall generate only 5 values. The output
address shall always be same for that function. The number of buckets provided remains unchanged at
all times.
Operation:
Insertion − When a record is required to be entered using static hash, the hash
function h computes the bucket address for search key K, where the record will be
stored.
Bucket address = h(K)
Search − When a record needs to be retrieved, the same hash function can be used to
retrieve the address of the bucket where the data is stored.
Delete − This is simply a search followed by a deletion operation.
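The three operations above can be sketched with a fixed bucket array (the bucket count and keys are illustrative):

```python
# Static hashing with h(K) = K mod 4: the bucket address for a key never
# changes, and the number of buckets is fixed.
NUM_BUCKETS = 4
buckets = [[] for _ in range(NUM_BUCKETS)]

def h(key):
    return key % NUM_BUCKETS        # bucket address = h(K)

def insert(key):
    buckets[h(key)].append(key)     # store the record in its bucket

def search(key):
    return key in buckets[h(key)]   # the same function locates the bucket

def delete(key):
    if search(key):                 # search followed by removal
        buckets[h(key)].remove(key)

for k in (3, 7, 12):
    insert(k)
print(h(7), search(7))  # 3 True
```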
1. Open Hashing:
When a hash function generates an address at which data is already stored, the next bucket is allocated to the new record. This mechanism is called linear probing. (Note that many textbooks use the opposite naming: probing schemes are called closed hashing or open addressing, while chaining schemes are called open hashing.)
For example: suppose R3 is a new address which needs to be inserted, the hash
function generates address as 112 for R3. But the generated address is already full. So
the system searches next available data bucket, 113 and assigns R3 to it.
2. Close Hashing:
When a bucket is full, a new data bucket is allocated for the same hash result and is linked after the previous one. This mechanism is known as overflow chaining.
For example: Suppose R3 is a new address which needs to be inserted into the table,
the hash function generates address as 110 for it. But this bucket is full to store the new
data. In this case, a new bucket is inserted at the end of 110 buckets and is linked to it.
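Both overflow mechanisms can be sketched side by side (integer keys and four slots/buckets are simplifying assumptions):

```python
# Linear probing: on collision, place the record in the next free slot.
slots = [None] * 4

def insert_probing(key):
    i = key % 4
    while slots[i] is not None:  # slot full: probe the next one
        i = (i + 1) % 4
    slots[i] = key

# Overflow chaining: on collision, link the record after the bucket's chain.
chains = [[] for _ in range(4)]

def insert_chaining(key):
    chains[key % 4].append(key)

for k in (8, 12, 5):             # 8 and 12 both hash to bucket 0
    insert_probing(k)
    insert_chaining(k)
print(slots)   # [8, 12, 5, None]  (12 probed from slot 0 to slot 1)
print(chains)  # [[8, 12], [5], [], []]
```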
Dynamic Hashing
o The dynamic hashing method is used to overcome the problems of static hashing like
bucket overflow.
o In this method, data buckets grow or shrink as the number of records increases or decreases. This
method is also known as the extendable hashing method.
o This method makes hashing dynamic, i.e., it allows insertion or deletion without resulting
in poor performance.
o For example, suppose the directory uses the last two bits of each key. The last two bits of 4 are 00, so it goes into bucket B0. The last two bits of 1 and 5 are 01, so they go into bucket B1. The last two bits of 2 and 6 are 10, so they go into bucket B2. The last two bits of 3 and 7 are 11, so they go into bucket B3.
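The bucket assignment above can be sketched as follows (taking the last two bits of the key itself as its hash value, an illustrative assumption):

```python
# Directory lookup by the last two bits of the key (key & 0b11, i.e. key mod 4).
def bucket(key):
    return key & 0b11  # last two bits select bucket B0..B3

assignment = {}
for k in (1, 2, 3, 4, 5, 6, 7):
    assignment.setdefault(bucket(k), []).append(k)
print(assignment)  # {1: [1, 5], 2: [2, 6], 3: [3, 7], 0: [4]}
```

When a bucket overflows, extendable hashing starts distinguishing keys by one more bit, which is how the buckets grow dynamically.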