INLS 623 – DATABASE SYSTEMS II – FILE STRUCTURES, INDEXING, AND HASHING
Instructor: Jason Carter

REVIEW
Databases
Logically coherent collection of related data
A database has tables, and there are relationships between the tables
Where are those tables physically stored?

MEMORY
• Primary Memory
– Random Access Memory (RAM)
• Secondary Memory
– Disk (Hard Disk)
– Tape
– Solid State Devices (SSD)
– DVD/Blu-ray
• How are those tables stored in memory?

FILE STORAGE
• Which type of memory do we typically store files in, and why?
– Secondary storage
• Secondary storage is persistent and cheaper than primary storage
• Primary memory is faster
• We choose persistence and cost over speed

DISK STORAGE DEVICES (CONTD.)
• A track is divided into smaller blocks or sectors because it usually contains a large amount of information.
• The division of a track into sectors is hard-coded on the disk surface and cannot be changed.
– One type of sector organization calls a portion of a track that subtends a fixed angle at the center a sector.
• A track is also divided into blocks.
– The block size B is fixed for each system.
– Typical block sizes range from B = 512 bytes to B = 4096 bytes.
– Whole blocks are transferred between disk and main memory for processing.
RECORDS
• Records = rows in a table
• Records can be fixed length or variable length
• Records contain fields (attributes), which have values of a particular type
– E.g., amount, date, time, age
• Fields themselves may be fixed length or variable length
• Variable-length fields can be mixed into one record:
– Separator characters or length fields are needed so that the record can be “parsed.”
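To make the role of length fields concrete, here is a minimal Python sketch. The record layout (a 2-byte length prefix in front of every field) and the names encode_record and decode_record are assumptions chosen for illustration; they are not the format of any particular DBMS.

```python
import struct

def encode_record(fields):
    """Pack a list of strings into bytes, each preceded by a 2-byte length field."""
    out = bytearray()
    for value in fields:
        data = value.encode("utf-8")
        out += struct.pack(">H", len(data))  # length field: big-endian unsigned short
        out += data
    return bytes(out)

def decode_record(record):
    """Walk the bytes, using each length field to find where the next value ends."""
    fields, pos = [], 0
    while pos < len(record):
        (length,) = struct.unpack_from(">H", record, pos)
        pos += 2
        fields.append(record[pos:pos + length].decode("utf-8"))
        pos += length
    return fields

# Hypothetical record with three variable-length fields
record = encode_record(["42", "Database Systems", "2015-01-01"])
print(len(record))
print(decode_record(record))
```

Separator characters would work similarly, but then the separator must never occur inside a field value (or must be escaped), which is why length fields are often the simpler choice.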
BLOCKING
• Blocking:
– Refers to storing a number of records in one block on the disk.
• Blocking factor (bfr) refers to the number of records per block.
– Remember: the block size is a constant for a device
• Spanned records:
– Refers to records that exceed the size of one or more blocks and hence span a number of blocks.
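For fixed-length, unspanned records, the blocking factor is commonly computed as bfr = floor(B / R), where B is the block size and R is the record size. The short Python sketch below works through one case; the record size and record count are made-up numbers used only for illustration.

```python
import math

def blocking_factor(block_size, record_size):
    """Records per block for fixed-length, unspanned records: floor(B / R)."""
    return block_size // record_size

def blocks_needed(num_records, block_size, record_size):
    """Blocks required to hold the whole file: ceil(r / bfr)."""
    return math.ceil(num_records / blocking_factor(block_size, record_size))

B = 4096    # block size in bytes
R = 100     # assumed record size in bytes
r = 30_000  # assumed number of records in the file

print(blocking_factor(B, R))           # 40 records per block
print(blocks_needed(r, B, R))          # 750 blocks
print(B - blocking_factor(B, R) * R)   # 96 bytes left unused in each block
```

The unused bytes per block are the price of unspanned blocking; spanned blocking would use them at the cost of records that straddle two blocks.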
FILES OF RECORDS
• A file is a sequence of records, where each record is a collection of data values (or data items).
– Think of a file as a table, though one can have multiple tables in a file
• A file descriptor (or file header) includes information that describes the file, such as the field names and their data types, and the addresses of the file blocks on disk.
• Records are stored on disk blocks.
• The blocking factor bfr for a file is the (average) number of file records stored in a disk block.
• A file can have fixed-length records or variable-length records.
FILES OF RECORDS (CONTD.)
• File records can be unspanned or spanned
– Unspanned: no record can span two blocks
– Spanned: a record can be stored in more than one block
• The physical disk blocks that are allocated to hold the records of a file can be contiguous, linked, or indexed.
• In a file of fixed-length records, all records have the same format. Usually, unspanned blocking is used with such files.
• Files of variable-length records require additional information to be stored in each record, such as separator characters and field types.
– Usually, spanned blocking is used with such files.
Unordered Files
• Also called a heap or a pile file.
• New records are inserted at the end of the file.
• Deletion can be done by marking a record as invalid
– Later compaction can be done to recover space.
• A linear search through the file records is necessary to search for a record, since the file is unordered
– This requires reading and searching half the file blocks on average, and is hence quite expensive.
• Record insertion is quite efficient.
• Reading the records in order of a particular field requires sorting the file records after reading.
Ordered Files
• Also called a sequential file.
• File records are kept sorted by the values of an ordering field (e.g., SSN)
• Insertion is expensive: records must be inserted in the correct order.
– It is common to keep a separate unordered overflow (or transaction) file for new records to improve insertion efficiency; this is periodically merged with the main ordered file.
• A binary search can be used to search for a record on its ordering field value.
– This requires reading and searching about log2(b) of the b file blocks on average, an improvement over linear search.
• Reading the records in order of the ordering field is quite efficient.
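The difference between the two search costs can be sketched by counting block reads. In the Python sketch below, a "file" is simply a list of blocks and each block is a list of keys; this representation and the function names are assumptions for illustration only.

```python
def linear_search_blocks(blocks, key):
    """Heap file: read blocks one after another until the key is found."""
    reads = 0
    for block in blocks:
        reads += 1
        if key in block:
            break
    return reads

def binary_search_blocks(blocks, key):
    """Ordered file: binary search over blocks using each block's first and last key."""
    reads, lo, hi = 0, 0, len(blocks) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        reads += 1                      # one block read per probe
        block = blocks[mid]
        if key < block[0]:
            hi = mid - 1
        elif key > block[-1]:
            lo = mid + 1
        else:
            break                       # the key, if present, is in this block
    return reads

# 1,024 blocks holding 10 keys each (keys 0..10239)
blocks = [list(range(i * 10, i * 10 + 10)) for i in range(1024)]
key = 10_239                            # worst case for the linear search: last block

print(linear_search_blocks(blocks, key))  # 1024 block reads
print(binary_search_blocks(blocks, key))  # 11 block reads, i.e. log2(1024) + 1
```

The linear search does not depend on the keys being sorted, which is why it is the only option for a heap file.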
HOW DOES A DATABASE MANIPULATE DATA ON DISK?
ITEMS TABLE
Field       Data Type
item_id     int
title       varchar
long_text   text
item_date   datetime
deleted     enum('Y','N')
category    int
FINDING DATA
SELECT * FROM items WHERE category=4;
How does MySQL know where to find and return the data for this query?
1. Start at the beginning of the file
2. Read in enough to know where the category data field starts
3. Read in the category value
4. Determine if it satisfies the WHERE condition
5. If it does, add that record to the return set
6. If it doesn't, figure out where the next record is and repeat
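The steps above can be sketched as a scan over a byte string that stands in for the data file. The fixed-length layout used here (a 4-byte item_id, a 4-byte category, and a 20-byte title) is an assumption for illustration; it is not MySQL's actual on-disk format.

```python
import struct

# Assumed fixed-length layout: item_id (4 bytes), category (4 bytes), title (20 bytes)
RECORD = struct.Struct("<ii20s")

def full_scan(data, wanted_category):
    """Return (matching rows, number of records examined)."""
    results, examined = [], 0
    offset = 0                                    # 1. start at the beginning of the file
    while offset < len(data):
        # 2-3. the layout tells us where the category field is; read it
        item_id, category, title = RECORD.unpack_from(data, offset)
        examined += 1
        if category == wanted_category:           # 4. test the WHERE condition
            results.append((item_id, title.rstrip(b"\x00").decode()))  # 5. keep the row
        offset += RECORD.size                     # 6. move on to the next record
    return results, examined

# Hypothetical table contents
rows = [(1, 2, b"hammer"), (2, 4, b"novel"), (3, 7, b"lamp"), (4, 4, b"atlas")]
data = b"".join(RECORD.pack(*row) for row in rows)

matches, examined = full_scan(data, 4)
print(matches)    # [(2, 'novel'), (4, 'atlas')]
print(examined)   # 4
```

Every record is examined no matter how many of them match, which is the inefficiency the following slides address.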
FINDING DATA (CONTINUED)
• The database will read the entire data file off disk
– It does not matter how many rows satisfy the WHERE clause
• This is very inefficient!
Using a SQL command, how can we make this process more efficient?
MAKING DATA FINDING MORE EFFICIENT
• Use the LIMIT keyword
– SELECT * FROM items WHERE category=4 LIMIT 1;
When does this query stop reading from disk?
After the first matching row is found.
If that row is at the end of the table, we still waste time reading the disk.
Can we make reading data more efficient?
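How much LIMIT helps depends entirely on where the matching row happens to sit. The Python sketch below counts the rows read by a scan that stops after the requested number of matches; the function name and the two sample tables are hypothetical, and the sketch assumes a plain scan with no index and no ORDER BY.

```python
def rows_read_with_limit(categories, wanted, limit):
    """Count the rows a scan reads before it has found `limit` matching rows."""
    found = 0
    for rows_read, category in enumerate(categories, start=1):
        if category == wanted:
            found += 1
            if found == limit:
                return rows_read      # stop early, like LIMIT
    return len(categories)            # fewer than `limit` matches: whole table was read

match_first = [4] + [1] * 999         # the only category-4 row is the first row
match_last = [1] * 999 + [4]          # the only category-4 row is the last row

print(rows_read_with_limit(match_first, 4, 1))  # 1
print(rows_read_with_limit(match_last, 4, 1))   # 1000
```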

INDEX: MAKING DATA FINDING MORE EFFICIENT
• An index is a data structure that makes finding data faster
– Requires additional storage space and extra writes to disk to maintain the index data structure
– Holds a field value and a pointer to the record it relates to
• Indexes are sorted
What is a data structure?
A way of organizing data in a computer so that it can be used efficiently
DATA STRUCTURES
• Array
• Hash table / Dictionary / Associative array
• Tuple
• Graphs
• Trees
• Object
ARRAY: DATA STRUCTURES
A collection of elements (values or variables), each identified by at least one array index or key
WHAT ARE INDEXES?
• An index on a file speeds up selections on the search key fields for the index
– Any subset of fields from a table can be a search key.
– The search key is not necessarily the same as the table's key (the minimal set of fields that uniquely identifies a record in a relation).
• An index contains a collection of data entries, and supports efficient retrieval of all data entries k* with a given key value k.
– Given data entry k*, we can find the record with key k in at most one disk I/O.
INDEXING
• Have we ever used indexes before?
– When we set primary keys
ARRAYS FOR INDEXING
• An index holds a field value and a pointer to the record it relates to
• Indexes are sorted
Can an array be used for indexing?
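One way to explore the question is to build an index as a sorted array of (field value, pointer) pairs. The Python sketch below uses the standard bisect module; the class name SortedArrayIndex and the byte offsets used as pointers are assumptions for illustration.

```python
import bisect

class SortedArrayIndex:
    """Index kept as two parallel sorted arrays: field values and record pointers."""

    def __init__(self):
        self.keys = []
        self.pointers = []

    def insert(self, key, pointer):
        pos = bisect.bisect_left(self.keys, key)   # find the slot: O(log n)
        self.keys.insert(pos, key)                 # shift later entries: O(n)
        self.pointers.insert(pos, pointer)

    def lookup(self, key):
        pos = bisect.bisect_left(self.keys, key)   # binary search: O(log n)
        result = []
        while pos < len(self.keys) and self.keys[pos] == key:
            result.append(self.pointers[pos])
            pos += 1
        return result

# Hypothetical (byte offset, category) pairs for four records
index = SortedArrayIndex()
for offset, category in [(0, 7), (28, 4), (56, 2), (84, 4)]:
    index.insert(category, offset)

print(index.keys)        # [2, 4, 4, 7]
print(index.lookup(4))   # [84, 28]
```

Lookups are fast, but every insert may shift a large part of the array. That cost is one reason to consider tree-based structures for indexes instead of plain arrays.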

TWO TYPES OF INDEXES
• Clustered index
• Unclustered / non-clustered index
CLUSTERED INDEX
• Determines the order in which the rows of a table are stored on disk
– The rows of a table are stored on disk in exactly the same order as the clustered index
• Only one clustered index per table
• In MySQL (InnoDB), the primary key is the clustered index by default
CLUSTERED EXAMPLE
• Owners
– Owner_ID (PK)
– name
– age
• Cars
– Car_ID (PK)
– Owner_ID (PK)
– type
CLUSTERED EXAMPLE
Owners
OwnerID | name | age
1       | J    | 42
2       | K    | 35
Cars
CarID | OwnerID | type
1     | 1       | Ford
2     | 1       | Mustang
• We frequently run a query that gets an owner and his or her cars.
• What column(s) in the Cars table should be clustered?
CLUSTERED INDEX EXAMPLE
• Create a clustered index on the Cars table with ownerID as the leading column, for example (ownerID, carID)
– With carID first, as in (carID, ownerID), rows would be ordered by car rather than by owner
• A given ownerID then has all of his or her car entries stored right next to each other on disk
• The query that frequently gets an owner and all of his or her cars runs extremely fast
Is there a disadvantage to using clustered indexes?
• If we update one of the values of a clustered index, the database has to re-sort the rows
– This involves deleting and inserting, which is a performance hit!
• Typically, clustered indexes are on PK and FK columns because those values are not updated much
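The effect of the clustering key on locality can be sketched by sorting rows by a candidate key, cutting them into pages, and counting how many pages hold a given owner's cars. The page size of two rows, the function name pages_touched, and the four sample rows are all made up for illustration.

```python
def pages_touched(rows, key, owner_id, rows_per_page=2):
    """Store rows in clustering-key order, then count the pages holding the owner's cars."""
    ordered = sorted(rows, key=key)
    pages = [ordered[i:i + rows_per_page] for i in range(0, len(ordered), rows_per_page)]
    return sum(1 for page in pages if any(row["owner_id"] == owner_id for row in page))

# Hypothetical Cars rows in which the two owners' cars are interleaved by car_id
cars = [
    {"car_id": 1, "owner_id": 1, "type": "Ford"},
    {"car_id": 2, "owner_id": 2, "type": "Honda"},
    {"car_id": 3, "owner_id": 1, "type": "Mustang"},
    {"car_id": 4, "owner_id": 2, "type": "Toyota"},
]

clustered_by_car = pages_touched(cars, key=lambda r: r["car_id"], owner_id=1)
clustered_by_owner = pages_touched(cars, key=lambda r: (r["owner_id"], r["car_id"]), owner_id=1)

print(clustered_by_car)     # 2 pages: owner 1's cars are spread out
print(clustered_by_owner)   # 1 page: owner 1's cars sit next to each other
```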
UNCLUSTERED/NONCLUSTERED INDEX
• The index is stored separately from the table data
– It stores the value of the indexed column and a pointer to the row where the data is stored
• A table can have multiple unclustered indexes
• Called secondary indexes in MySQL
• Unclustered indexes are faster to update, because the table rows do not have to be re-sorted

HASH-BASED INDEXING
• Place all records with a common attribute together.
• The index is a collection of buckets.
– Bucket = primary page plus zero or more overflow pages
– Buckets contain data entries.
• Hashing function h(r): a mapping from the index's search key to the bucket in which the (data entry for) record r belongs.
HASHING INDEX EXAMPLE
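A hash index with primary and overflow pages can be sketched in a few lines of Python. The class name HashIndex, the modulo hash function, the four buckets, and the page capacity of two entries are assumptions chosen to keep the example small.

```python
class HashIndex:
    """Static hash index: each bucket is a primary page plus optional overflow pages."""

    def __init__(self, num_buckets=4, page_capacity=2):
        self.page_capacity = page_capacity
        # Each bucket is a list of pages; pages[0] is the primary page.
        self.buckets = [[[]] for _ in range(num_buckets)]

    def _h(self, key):
        return key % len(self.buckets)   # assumed hash function for integer keys

    def insert(self, key, pointer):
        pages = self.buckets[self._h(key)]
        if len(pages[-1]) == self.page_capacity:
            pages.append([])             # last page is full: chain an overflow page
        pages[-1].append((key, pointer))

    def lookup(self, key):
        pages = self.buckets[self._h(key)]   # only one bucket is ever inspected
        return [ptr for page in pages for k, ptr in page if k == key]

index = HashIndex()
for pointer, key in enumerate([4, 8, 12, 5, 4]):
    index.insert(key, pointer)

print(index.lookup(4))          # [0, 4]
print(len(index.buckets[0]))    # 2: the primary page plus one overflow page
```

A lookup touches only the pages of one bucket, which makes equality searches cheap. Because the buckets are not ordered by key, a hash index of this kind does not help with range queries.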
B TREES FOR INDEXING
• A tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time
– O(log N) means the time goes up linearly while N goes up exponentially. For example, if it takes 1 second to process 10 elements, it will take 2 seconds to process 100 elements, 3 seconds to process 1,000 elements, and so on.
B TREE AND INDEXING EXAMPLE
Index for item_id
• 4 sorted values representing the range of item_ids
• The child nodes have the same range values
• Last-level nodes contain the final item_id value and a pointer to the byte in the disk file where the record lies
Looking for item_id 4
Is this really more efficient?
• We needed 3 hops to get to item_id 4.
• We had to look at the entire index for item_id
Looking for item_id 20
• We again needed 3 hops to get to item_id 20.
• The number of hops required increases logarithmically with respect to database size
– Logarithmic growth is the opposite of exponential growth
– Logarithmic growth shoots up in the beginning, but then slows
– Exponential growth is slow at the beginning, but then shoots up rapidly
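The slow growth in hops can be made concrete by counting how many levels are needed before a tree can address a given number of keys. In the Python sketch below, the fan-out of 4 mirrors the four values per node in the example, while the fan-out of 100 is an assumed value for a wider node; the function name hops is hypothetical.

```python
def hops(num_keys, fanout):
    """Levels to descend when every node narrows the search to 1/fanout of its range."""
    levels, reach = 1, fanout
    while reach < num_keys:
        reach *= fanout      # each extra level multiplies the reachable keys
        levels += 1
    return levels

for n in (64, 1_000, 1_000_000, 1_000_000_000):
    print(n, hops(n, fanout=4), hops(n, fanout=100))
# 64 3 1
# 1000 5 2
# 1000000 10 3
# 1000000000 15 5
```

Going from a thousand to a billion keys multiplies the data by a million but only triples the number of hops at fan-out 4, which is the "shoots up, then slows" shape described above.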
AN EXAMPLE OF AN INSERTION IN A B-TREE
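As a rough sketch of insertion, the Python code below implements the common textbook B-tree algorithm, which splits any full node it meets on the way down. This is an assumption about the variant; production databases such as MySQL use B+-tree variations with more machinery. The class names, the minimum degree t = 2 (at most 3 keys per node), and the inserted keys are illustrative.

```python
class Node:
    def __init__(self, leaf=True):
        self.keys, self.children, self.leaf = [], [], leaf

class BTree:
    def __init__(self, t=2):
        self.t = t                      # minimum degree: a node holds at most 2t - 1 keys
        self.root = Node()

    def insert(self, key):
        if len(self.root.keys) == 2 * self.t - 1:    # full root: the tree grows by one level
            old_root, self.root = self.root, Node(leaf=False)
            self.root.children.append(old_root)
            self._split_child(self.root, 0)
        self._insert_nonfull(self.root, key)

    def _split_child(self, parent, i):
        """Split the full child parent.children[i]; its median key moves up into parent."""
        t, child = self.t, parent.children[i]
        right = Node(leaf=child.leaf)
        median = child.keys[t - 1]
        right.keys, child.keys = child.keys[t:], child.keys[:t - 1]
        if not child.leaf:
            right.children, child.children = child.children[t:], child.children[:t]
        parent.keys.insert(i, median)
        parent.children.insert(i + 1, right)

    def _insert_nonfull(self, node, key):
        i = len(node.keys)
        while i > 0 and key < node.keys[i - 1]:      # find the position or child for key
            i -= 1
        if node.leaf:
            node.keys.insert(i, key)
            return
        if len(node.children[i].keys) == 2 * self.t - 1:
            self._split_child(node, i)               # make room before descending
            if key > node.keys[i]:
                i += 1
        self._insert_nonfull(node.children[i], key)

    def in_order(self, node=None):
        """Return all keys in sorted order."""
        node = node or self.root
        if node.leaf:
            return list(node.keys)
        out = []
        for i, child in enumerate(node.children):
            out += self.in_order(child)
            if i < len(node.keys):
                out.append(node.keys[i])
        return out

tree = BTree(t=2)
for key in [10, 20, 5, 6, 12, 30, 7, 17]:
    tree.insert(key)

print(tree.root.keys)     # [10, 20]
print(tree.in_order())    # [5, 6, 7, 10, 12, 17, 20, 30]
```

Two splits happen in this run: the root splits when 6 arrives, and the leaf holding 12, 20, and 30 splits when 17 arrives, pushing 20 up into the root.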
INDEXING: GENERAL RULES OF THUMB
• Index fields in the WHERE clause of a SELECT query
• User table
– ID (INT) PK
– Email_address
• During login, MySQL must locate the correct ID by searching for an email address
• Without an index, every record is checked in sequence until the email address is found
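The login lookup can be tried out with Python's built-in sqlite3 module. Note the substitution: this sketch uses SQLite rather than MySQL, so the query-plan wording is SQLite's, and the table contents and the index name idx_users_email are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email_address TEXT)")
conn.executemany(
    "INSERT INTO users (email_address) VALUES (?)",
    [(f"user{i}@example.com",) for i in range(1000)],
)

def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT id FROM users WHERE email_address = 'user500@example.com'"

before = plan(query)                      # no index yet: SQLite reports a table scan
conn.execute("CREATE INDEX idx_users_email ON users (email_address)")
after = plan(query)                       # now the plan names the new index

print(before)
print(after)
print(conn.execute(query).fetchone())     # (501,)
```

The exact plan text differs between SQLite versions, but the first plan starts with SCAN and the second one names idx_users_email. In MySQL, the EXPLAIN statement serves the same purpose.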
Should we add an index to every field?
• No, because indexes must be updated on every table INSERT or UPDATE
– This hurts write performance
• Only add indexes when necessary
• Indexes should not be used on small tables.
• Indexes should not be used on tables that have frequent, large batch update or insert operations.
• Indexes should not be used on columns that contain a high number of NULL values.
• Columns that are frequently manipulated should not be indexed.
