Data Structure Unit 3 Notes
Queue
Queue is a linear data structure that follows the FIFO (First In First Out) principle, so the
first element inserted is the first one to be removed.
FIFO Principle in Queue:
FIFO Principle states that the first element added to the Queue will be the first one to be
removed or processed. A Queue is like a line of people waiting to purchase tickets, where
the first person in line is the first person served (i.e. First Come First Serve).
Basic Terminologies of Queue
Front: Position of the entry in a queue ready to be served, that is, the first entry that
will be removed from the queue, is called the front of the queue. It is also referred to as
the head of the queue.
Rear: Position of the last entry in the queue, that is, the one most recently added, is
called the rear of the queue. It is also referred to as the tail of the queue.
Size: Size refers to the current number of elements in the queue.
Capacity: Capacity refers to the maximum number of elements the queue can hold.
The four types of Queue are: Simple Queue, Double-ended queue, Circular Queue
and Priority Queue.
Array Representation of Queue
In this representation the queue is implemented using an array. The variables used in this
case are:
FRONT - index of the first element of the array storing the queue.
REAR - index of the last element of the array storing the queue.
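A minimal C++ sketch of this array representation (the capacity of 5, the sample values, and the function names enqueue/dequeue are illustrative, not from the notes):

#include <iostream>
using namespace std;

#define CAPACITY 5
int queue_arr[CAPACITY];
int FRONT = -1, REAR = -1; // -1 means the queue is empty

// Insert a new element at the rear
void enqueue(int x) {
    if (REAR == CAPACITY - 1) { cout << "Overflow\n"; return; }
    if (FRONT == -1) FRONT = 0;
    queue_arr[++REAR] = x;
}

// Remove and return the element at the front
int dequeue() {
    if (FRONT == -1 || FRONT > REAR) { cout << "Underflow\n"; return -1; }
    return queue_arr[FRONT++];
}

int main() {
    enqueue(10); enqueue(20); enqueue(30);
    cout << dequeue() << "\n"; // prints 10: the first element inserted is removed first
}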
Linked List Representation of Queue
In this representation the queue is implemented using the dynamic data structure Linked List.
Using linked list for creating a queue makes it flexible in terms of size and storage. You don’t
have to define the maximum number of elements in the queue.
Two pointers (links) store the addresses of nodes that define the queue:
FRONT- address of the first element of the Linked list storing the Queue.
REAR- address of the last element of the Linked list storing the Queue.
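A minimal C++ sketch of this linked list representation using the FRONT and REAR pointers described above (the node layout and sample values are illustrative):

#include <iostream>
using namespace std;

struct Node { int data; Node* next; };
Node* FRONT = nullptr; // address of the first node of the linked list
Node* REAR = nullptr;  // address of the last node of the linked list

// Insert a new node at the rear of the linked list
void enqueue(int x) {
    Node* n = new Node{x, nullptr};
    if (REAR == nullptr) FRONT = REAR = n; // queue was empty
    else { REAR->next = n; REAR = n; }
}

// Remove and return the value at the front of the linked list
int dequeue() {
    if (FRONT == nullptr) { cout << "Underflow\n"; return -1; }
    Node* tmp = FRONT;
    int val = tmp->data;
    FRONT = FRONT->next;
    if (FRONT == nullptr) REAR = nullptr; // queue became empty
    delete tmp;
    return val;
}

int main() {
    enqueue(1); enqueue(2); enqueue(3);
    cout << dequeue() << "\n"; // prints 1; no fixed maximum size is needed
}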
Circular Queue
There was one limitation in the array implementation of a queue: if the rear reaches the end
position of the queue, some vacant spaces may be left at the beginning which cannot be
utilized. To overcome this limitation, the concept of the circular queue was introduced.
Consider, for example, a queue of size five in which the rear is at the last position and the
front is pointing somewhere other than the 0th position. The array holds only two elements
and the other three positions are empty, yet because the rear is at the last position, an
attempt to insert an element reports that there are no empty spaces in the queue.
One solution to avoid such wastage of memory space is to shift both elements to the left and
adjust the front and rear ends accordingly. This is not a practically good approach, because
shifting all the elements consumes a lot of time. The efficient approach to avoid the wastage
of memory is to use the circular queue data structure.
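A minimal C++ sketch of a circular queue (the size of 5 and the sample values are illustrative); the modulo operator wraps the rear back to index 0 so freed slots at the beginning are reused:

#include <iostream>
using namespace std;

#define SIZE 5
int items[SIZE];
int front = -1, rear = -1;

// Wrap rear around with modulo so vacant slots at the start are reused
void enqueue(int x) {
    if ((rear + 1) % SIZE == front) { cout << "Queue is full\n"; return; }
    if (front == -1) front = 0;
    rear = (rear + 1) % SIZE;
    items[rear] = x;
}

int dequeue() {
    if (front == -1) { cout << "Queue is empty\n"; return -1; }
    int val = items[front];
    if (front == rear) { front = -1; rear = -1; } // last element removed
    else front = (front + 1) % SIZE;
    return val;
}

int main() {
    for (int i = 1; i <= 5; i++) enqueue(i * 10); // fill the queue: 10..50
    dequeue(); dequeue();      // free two slots at the front
    enqueue(60); enqueue(70);  // reuse those slots by wrapping around
    cout << dequeue() << "\n"; // prints 30
}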
Deque
Deque or Double Ended Queue is a type of queue in which insertion and removal of elements
can either be performed from the front or the rear. Thus, it does not follow FIFO rule (First In
First Out).
Types of Deque
Input Restricted Deque: Insertion is allowed only at one end, while deletion can be
performed at both ends.
Output Restricted Deque: Deletion is allowed only at one end, while insertion can be
performed at both ends.
Operations on a Deque
Below is the circular array implementation of deque. In a circular array, if the array is full, we
start from the beginning.
But in a linear array implementation, if the array is full, no more elements can be inserted. In
each of the operations below, if the array is full, "overflow message" is thrown.
Before performing the following operations, these steps are followed.
1. Take an array (deque) of size n.
2. Set two pointers front = -1 and rear = 0.
Insert at the Front
1. If the deque is full (see Check Full below), insertion cannot be performed (overflow condition).
2. If the deque is empty (i.e. front = -1), set front = 0 and rear = 0.
3. Else if front is at the first index (i.e. front = 0), set front = n - 1.
4. Else, decrease the front (front = front - 1).
5. Add the new element at position front.
Insert at the Rear
1. If the deque is full, insertion cannot be performed (overflow condition).
2. If the deque is empty (i.e. front = -1), set front = 0 and rear = 0.
3. Else if rear is at the last index (i.e. rear = n - 1), set rear = 0.
4. Else, increase the rear (rear = rear + 1).
5. Add the new element at position rear.
Delete from the Front
1. If the deque is empty (i.e. front = -1), deletion cannot be performed (underflow condition).
2. If the deque has only one element (i.e. front = rear), set front = -1 and rear = -1.
3. Else if front is at the last index (i.e. front = n - 1), set front = 0.
4. Else, increase the front (front = front + 1).
Delete from the Rear
1. If the deque is empty (i.e. front = -1), deletion cannot be performed (underflow condition).
2. If the deque has only one element (i.e. front = rear), set front = -1 and rear = -1; else
follow the steps below.
3. If rear is at the first index (i.e. rear = 0), reinitialize rear = n - 1.
4. Else, decrease the rear (rear = rear - 1).
Check Empty
This operation checks if the deque is empty. If front = -1, the deque is empty.
Check Full
This operation checks if the deque is full. If front = 0 and rear = n - 1, OR front = rear + 1,
the deque is full.
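A minimal C++ sketch of the circular array deque operations above (n = 5 and the sample values are illustrative); it implements insert at the front, delete from the rear, and the two checks:

#include <iostream>
using namespace std;

#define N 5
int deque_arr[N];
int front = -1, rear = 0;

bool isEmpty() { return front == -1; }
bool isFull() { return (front == 0 && rear == N - 1) || front == rear + 1; }

// Insert at the front, wrapping around the circular array
void insertFront(int x) {
    if (isFull()) { cout << "Overflow\n"; return; }
    if (front == -1) { front = 0; rear = 0; } // first element
    else if (front == 0) front = N - 1;       // wrap to the last index
    else front = front - 1;
    deque_arr[front] = x;
}

// Delete from the rear, following the steps listed above
void deleteRear() {
    if (isEmpty()) { cout << "Underflow\n"; return; }
    if (front == rear) { front = -1; rear = 0; } // only one element
    else if (rear == 0) rear = N - 1;            // wrap to the last index
    else rear = rear - 1;
}

int main() {
    insertFront(3);
    insertFront(7);
    deleteRear();                     // removes 3
    cout << deque_arr[front] << "\n"; // prints 7
}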
Priority Queue
A priority queue is a type of queue that arranges elements based on their priority values.
Each element has a priority associated. When we add an item, it is inserted in a position
based on its priority.
Elements with higher priority are typically retrieved or removed before elements with
lower priority.
Ascending Order Priority Queue : In this queue, elements with lower values have
higher priority. For example, with elements 4, 6, 8, 9, and 10, 4 will be dequeued first
since it has the smallest value, and the dequeue operation will return 4.
Descending Order Priority Queue : Elements with higher values have higher priority.
When implemented with a max-heap, the root of the heap is the highest element, and it is
dequeued first. The queue adjusts by maintaining the heap property after each insertion or
deletion.
1) Insertion : If the newly inserted item is of the highest priority, then it is inserted at the
top. Otherwise, it is inserted in such a way that it is accessible after all higher priority items
are accessed.
2) Deletion : We typically remove the highest priority item, which is typically available at
the top. Once we remove this item, the item with the next highest priority moves to the top.
3) Peek : This operation only returns the highest priority item (which is typically available
at the top) and does not make any change to the priority queue.
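These three operations map directly onto C++'s std::priority_queue; a short sketch (the sample values are illustrative, and std::priority_queue is one common way to realize a priority queue, not the only one):

#include <iostream>
#include <queue>
#include <vector>
using namespace std;

int main() {
    // Max-heap by default: higher values have higher priority (descending order)
    priority_queue<int> pq;
    pq.push(4); pq.push(10); pq.push(6); // Insertion
    cout << pq.top() << "\n";  // Peek: prints 10 without changing the queue
    pq.pop();                  // Deletion: removes the highest priority item
    cout << pq.top() << "\n";  // the next highest priority item (6) is now at the top

    // greater<int> gives a min-heap: an ascending order priority queue
    priority_queue<int, vector<int>, greater<int>> apq;
    apq.push(4); apq.push(10); apq.push(6);
    cout << apq.top() << "\n"; // prints 4, the smallest value
}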
Sorting in data structures is the process of arranging data elements in a specific order
(ascending or descending) based on a particular criterion, crucial for efficient searching,
organization, and analysis.
Searching in data structures refers to finding a specific element within a collection of data,
and common algorithms include linear search, binary search, interpolation search, and
algorithms using data structures like binary search trees and hash tables.
1. Linear Search:
Concept: Sequentially checks each element of the collection until the target element is found
or the end of the collection is reached.
Data Structures: Works on both sorted and unsorted arrays, lists, and other linear structures.
Time Complexity: O(n).
2. Binary Search:
Concept: Efficiently searches sorted arrays or lists by repeatedly dividing the search interval
in half.
Data Structures: Works best on sorted arrays or lists.
Time Complexity: O(log n).
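A minimal C++ sketch of binary search (the array values are illustrative):

#include <iostream>
using namespace std;

// Repeatedly halve the search interval of a sorted array
int binarySearch(int arr[], int n, int target) {
    int low = 0, high = n - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2; // midpoint, written to avoid overflow
        if (arr[mid] == target) return mid;
        if (arr[mid] < target) low = mid + 1; // target is in the right half
        else high = mid - 1;                  // target is in the left half
    }
    return -1; // not found
}

int main() {
    int arr[] = {2, 4, 6, 8, 9, 10};
    cout << binarySearch(arr, 6, 9) << "\n"; // prints index 4
}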
3. Interpolation Search:
Concept: An improvement over binary search, it estimates the position of the target element
based on its value and the values of the first and last elements in the search interval.
Data Structures: Works best on uniformly distributed data.
Time Complexity: O(log log n) in the best case, O(n) in the worst case.
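A hedged C++ sketch of interpolation search: the probe position is estimated as lo + (target - arr[lo]) * (hi - lo) / (arr[hi] - arr[lo]), assuming roughly uniformly distributed sorted values (the array here is illustrative):

#include <iostream>
using namespace std;

// Estimate the probe position from the target's value instead of always
// probing the middle, as binary search does
int interpolationSearch(int arr[], int n, int target) {
    int lo = 0, hi = n - 1;
    while (lo <= hi && target >= arr[lo] && target <= arr[hi]) {
        if (arr[hi] == arr[lo]) // avoid division by zero
            return (arr[lo] == target) ? lo : -1;
        // Position estimate based on the value range and index range
        int pos = lo + (int)((long long)(target - arr[lo]) * (hi - lo) / (arr[hi] - arr[lo]));
        if (arr[pos] == target) return pos;
        if (arr[pos] < target) lo = pos + 1;
        else hi = pos - 1;
    }
    return -1; // not found
}

int main() {
    int arr[] = {10, 20, 30, 40, 50, 60};
    cout << interpolationSearch(arr, 6, 40) << "\n"; // prints index 3
}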
4. Data Structures for Searching:
Binary Search Trees (BSTs):
Concept: A tree-based data structure where each node has a value, and the values in the left
subtree are less than or equal to the node's value, while the values in the right subtree are
greater than the node's value.
Searching: BSTs allow for efficient searching by traversing the tree based on the values of the
nodes.
Hash Tables:
Concept: A data structure that uses a hash function to map keys to values, allowing for fast
lookups.
Searching: Hash tables provide near-constant time (O(1)) for searching, insertion, and deletion
operations, on average.
Ternary Search Trees:
Concept: A variation of a binary search tree in which each node has three children, used to
store strings and perform prefix searches efficiently.
Searching: Ternary search trees allow for fast searching and retrieval of strings.
Linked Lists:
Concept: A linear data structure where elements are stored in nodes, and each node contains a
pointer to the next node.
Searching: Searching in linked lists involves traversing the list sequentially, which can be slow
for large lists.
Fibonacci Search:
Concept: Uses the Fibonacci sequence to divide the array into sections and searches for the
target element.
Data Structures: Primarily used when the data structure prohibits direct access to elements,
such as in distributed data systems.
Jump Search:
Concept: A search algorithm that skips a fixed number of elements (block size) at each step
instead of checking elements one by one like Linear Search.
Data Structures: Designed for sorted arrays.
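A minimal C++ sketch of jump search with block size sqrt(n) (the array is illustrative):

#include <algorithm>
#include <cmath>
#include <iostream>
using namespace std;

// Jump ahead in blocks of size sqrt(n), then scan linearly within one block
int jumpSearch(int arr[], int n, int target) {
    int step = (int)sqrt((double)n);
    int prev = 0;
    // Jump forward until we reach the block that could contain the target
    while (prev < n && arr[min(prev + step, n) - 1] < target)
        prev += step;
    // Linear scan inside that block
    for (int i = prev; i < min(prev + step, n); i++)
        if (arr[i] == target) return i;
    return -1; // not found
}

int main() {
    int arr[] = {1, 3, 5, 7, 9, 11, 13};
    cout << jumpSearch(arr, 7, 9) << "\n"; // prints index 4
}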
Exponential Search:
Concept: Combines binary search with a preliminary phase that helps find the range where the
target element lies.
Data Structures: Useful when the array is unbounded or when the size of the array is
unknown.
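A minimal C++ sketch of exponential search: double the upper bound until it passes the target, then binary search within that range (the array is illustrative):

#include <algorithm>
#include <iostream>
using namespace std;

// Binary search restricted to the index range [lo, hi]
int binarySearch(int arr[], int lo, int hi, int target) {
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (arr[mid] == target) return mid;
        if (arr[mid] < target) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1;
}

// Double the bound until it passes the target, then binary search that range
int exponentialSearch(int arr[], int n, int target) {
    if (arr[0] == target) return 0;
    int bound = 1;
    while (bound < n && arr[bound] < target) bound *= 2;
    return binarySearch(arr, bound / 2, min(bound, n - 1), target);
}

int main() {
    int arr[] = {2, 3, 4, 10, 40};
    cout << exponentialSearch(arr, 5, 10) << "\n"; // prints index 3
}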
Selection sort
The Selection Sort algorithm finds the lowest value in an array and moves it to the front of
the array.
The algorithm looks through the array again and again, moving the next lowest values to the
front, until the array is sorted.
How it works:
1. Go through the array to find the lowest value.
2. Move the lowest value to the front of the unsorted part of the array.
3. Go through the array again as many times as there are values in the array.
Continue reading to fully understand the Selection Sort algorithm and how to implement it
yourself.
Before we implement the Selection Sort algorithm in a programming language, let's manually
run through a short array only one time, just to get the idea.
Step 1: We start with an unsorted array.
[ 7, 12, 9, 11, 3]
Step 2: Go through the array, one value at a time. Which value is the lowest? 3, right?
[ 7, 12, 9, 11, 3]
Step 3: Move the lowest value 3 to the front of the array.
[ 3, 7, 12, 9, 11]
Step 4: Look through the rest of the values, starting with 7. 7 is the lowest value, and already
at the front of the array, so we don't need to move it.
[ 3, 7, 12, 9, 11]
Step 5: Look through the rest of the array: 12, 9 and 11. 9 is the lowest value.
[ 3, 7, 12, 9, 11]
Step 6: Move 9 to the front of the remaining unsorted values.
[ 3, 7, 9, 12, 11]
Step 7: Looking through the last two values, 12 and 11, 11 is the lowest value.
[ 3, 7, 9, 12, 11]
Step 8: Move 11 in front of 12.
[ 3, 7, 9, 11, 12]
Finally, the array is sorted.
#include <iostream>
using namespace std;

// Repeatedly find the minimum of the unsorted part and swap it into place
void selectionSort(int array[], int size) {
    for (int step = 0; step < size - 1; step++) {
        int min_idx = step;
        for (int i = step + 1; i < size; i++)
            if (array[i] < array[min_idx]) min_idx = i;
        // Swap the minimum into the front of the unsorted part
        int temp = array[step];
        array[step] = array[min_idx];
        array[min_idx] = temp;
    }
}

// Print the elements of the array
void printArray(int array[], int size) {
    for (int i = 0; i < size; i++) cout << array[i] << " ";
    cout << "\n";
}

// driver code
int main() {
    int data[] = {20, 12, 10, 15, 2};
    int size = sizeof(data) / sizeof(data[0]);
    selectionSort(data, size);
    cout << "Sorted array in Ascending Order:\n";
    printArray(data, size);
}
Insertion sort
The Insertion Sort algorithm uses one part of the array to hold the sorted
values, and the other part of the array to hold values that are not sorted yet.
The algorithm takes one value at a time from the unsorted part of the array
and puts it into the right place in the sorted part of the array, until the array
is sorted.
How it works:
1. Take the first value from the unsorted part of the array.
2. Move the value into the correct place in the sorted part of the array.
3. Go through the unsorted part of the array again as many times as there
are values.
Continue reading to fully understand the Insertion Sort algorithm and how to
implement it yourself.
Step 1: We start with an unsorted array.
[ 7, 12, 9, 11, 3]
Step 2: We can consider the first value as the initial sorted part of the array.
If it is just one value, it must be sorted, right?
[ 7, 12, 9, 11, 3]
Step 3: The next value 12 should now be moved into the correct position in
the sorted part of the array. But 12 is higher than 7, so it is already in the
correct position.
[ 7, 12, 9, 11, 3]
Step 4: The next value to move into the correct position in the sorted part is 9.
[ 7, 12, 9, 11, 3]
Step 5: The value 9 must now be moved into the correct position inside the
sorted part of the array, so we move 9 in between 7 and 12.
[ 7, 9, 12, 11, 3]
Step 6: The next value to insert into the sorted part of the array is 11.
Step 7: The value 11 is moved in between 9 and 12 in the sorted part of the array.
[ 7, 9, 11, 12, 3]
Step 8: The last value to insert into the sorted part of the array is 3.
[ 7, 9, 11, 12, 3]
Step 9: We insert 3 in front of all other values because it is the lowest value.
[ 3, 7, 9, 11, 12]
Finally, the array is sorted.
#include <iostream>
using namespace std;

// Print the elements of the array
void printArray(int array[], int size) {
    for (int i = 0; i < size; i++) cout << array[i] << " ";
    cout << "\n";
}

void insertionSort(int array[], int size) {
    for (int step = 1; step < size; step++) {
        int key = array[step];
        int j = step - 1;
        // Compare key with each element on the left of it until an element smaller than
        // it is found.
        // For descending order, change key < array[j] to key > array[j].
        while (j >= 0 && key < array[j]) {
            array[j + 1] = array[j];
            --j;
        }
        array[j + 1] = key;
    }
}

// Driver code
int main() {
    int data[] = {9, 5, 1, 4, 3};
    int size = sizeof(data) / sizeof(data[0]);
    insertionSort(data, size);
    cout << "Sorted array in ascending order:\n";
    printArray(data, size);
}
Merge sort
Divide: The algorithm starts with breaking up the array into smaller and
smaller pieces until one such sub-array only consists of one element.
Conquer: The algorithm merges the small pieces of the array back together
by putting the lowest values first, resulting in a sorted array.
The breaking down and building up of the array to sort the array is done
recursively.
How it works:
1. Divide the unsorted array into two sub-arrays, half the size of the original.
2. Continue to divide the sub-arrays as long as the current piece of the array
has more than one element.
3. Merge two sub-arrays together by always putting the lowest value first.
4. Keep merging until there are no sub-arrays left.
Seen from a different perspective, the array is split into smaller and
smaller pieces until it is merged back together. And as the merging happens,
values from each sub-array are compared so that the lowest value comes
first.
Manual Run Through
Let's try to do the sorting manually, just to get an even better understanding
of how Merge Sort works before actually implementing it in a programming
language.
Step 1: We start with an unsorted array, and we know that it splits in half
until the sub-arrays only consist of one element. The Merge Sort function
calls itself two times, once for each half of the array. That means that the
first sub-array will split into the smallest pieces first.
[ 12, 8, 9, 3, 11, 5, 4]
[ 12, 8, 9] [ 3, 11, 5, 4]
[ 12] [ 8, 9] [ 3, 11, 5, 4]
[ 12] [ 8] [ 9] [ 3, 11, 5, 4]
Step 2: The splitting of the first sub-array is finished, and now it is time to
merge. 8 and 9 are the first two elements to be merged. 8 is the lowest
value, so that comes before 9 in the first merged sub-array.
[ 12] [ 8, 9] [ 3, 11, 5, 4]
Step 3: The sub-array [ 12] is merged with [ 8, 9]. Values are compared so that
the lowest value comes first in the merged sub-array.
[ 8, 9, 12] [ 3, 11, 5, 4]
Step 4: The second sub-array is now split into smaller pieces.
[ 8, 9, 12] [ 3, 11] [ 5, 4]
[ 8, 9, 12] [ 3] [ 11] [ 5, 4]
Step 5: 3 and 11 are merged back together in the same order as they are
shown because 3 is lower than 11.
[ 8, 9, 12] [ 3, 11] [ 5, 4]
Step 6: Sub-array with values 5 and 4 is split, then merged so that 4 comes
before 5.
[ 8, 9, 12] [ 3, 11] [ 5] [ 4]
[ 8, 9, 12] [ 3, 11] [ 4, 5]
Step 7: The two sub-arrays on the right are merged. Comparisons are done
to create elements in the new merged array:
1. 3 is lower than 4
2. 4 is lower than 11
3. 5 is lower than 11
4. 11 is the last remaining value
[ 8, 9, 12] [ 3, 4, 5, 11]
Step 8: The two last remaining sub-arrays are merged. Let's look at how the
comparisons are done in more detail to create the new merged and finished
sorted array:
1. 3 is lower than 8, so 3 comes first
2. 4 is lower than 8
3. 5 is lower than 8
4. 8 is lower than 11
5. 9 is lower than 11
6. 11 is lower than 12
7. 12 is the last remaining value
[ 3, 4, 5, 8, 9, 11, 12]
The sorting is finished.
#include <iostream>
#include <vector>
using namespace std;

// Merge the two sorted subarrays arr[l..m] and arr[m+1..r]
void merge(int arr[], int l, int m, int r) {
    vector<int> L(arr + l, arr + m + 1);     // copy of the left subarray
    vector<int> R(arr + m + 1, arr + r + 1); // copy of the right subarray
    int i = 0, j = 0, k = l;
    // Repeatedly pick the smaller front element so the lowest value comes first
    while (i < (int)L.size() && j < (int)R.size())
        arr[k++] = (L[i] <= R[j]) ? L[i++] : R[j++];
    // Copy any remaining elements
    while (i < (int)L.size()) arr[k++] = L[i++];
    while (j < (int)R.size()) arr[k++] = R[j++];
}

// Divide the array into two subarrays, sort them and merge them
void mergeSort(int arr[], int l, int r) {
    if (l < r) {
        // m is the point where the array is divided into two subarrays
        int m = l + (r - l) / 2;
        mergeSort(arr, l, m);
        mergeSort(arr, m + 1, r);
        // Merge the sorted subarrays
        merge(arr, l, m, r);
    }
}

// Print the elements of the array
void printArray(int arr[], int size) {
    for (int i = 0; i < size; i++) cout << arr[i] << " ";
    cout << "\n";
}

// Driver program
int main() {
    int arr[] = {6, 5, 12, 10, 9, 1};
    int size = sizeof(arr) / sizeof(arr[0]);
    mergeSort(arr, 0, size - 1);
    cout << "Sorted array: \n";
    printArray(arr, size);
    return 0;
}
Efficiency of sorting methods
Sorting algorithm efficiency is typically judged by time complexity. Algorithms like Merge
Sort, Quick Sort, and Heap Sort are generally considered efficient, offering O(n log n)
average time complexity, while simpler algorithms like Bubble Sort and Selection Sort
have O(n^2) time complexity.
Big O notations
Big O notation is a mathematical notation used in computer science to describe the limiting
behavior of a function when the argument tends towards a particular value or infinity,
specifically used to analyze the efficiency of algorithms by focusing on their time and space
complexity as the input size grows.
What it is:
Mathematical Notation:
Big O notation, also known as asymptotic notation, provides a way to classify algorithms
based on how their runtime or space requirements grow with increasing input size.
Focus on Growth Rate:
It doesn't concern itself with exact execution time or memory usage, but rather the rate at
which these resources scale.
Upper Bound:
Big O notation typically represents the worst-case scenario, providing an upper bound on
the algorithm's performance.
Ignoring Constants and Lower-Order Terms:
It focuses on the dominant term in the complexity expression, discarding constants and
lower-order terms.
Why it's important:
Algorithm Comparison:
Big O notation allows developers to compare the efficiency of different algorithms and
choose the most optimal one for a given task.
Performance Optimization:
By understanding the time and space complexity of algorithms, developers can identify
potential bottlenecks and optimize code for better performance.
Predicting Scalability:
It helps predict how an algorithm's performance will degrade as the input size increases.
Common Big O Notations:
O(1) (Constant Time): The algorithm's runtime/space remains constant regardless of the
input size.
O(log n) (Logarithmic Time): The runtime/space grows proportionally to the logarithm of
the input size.
O(n) (Linear Time): The runtime/space grows proportionally to the input size.
O(n log n): The runtime/space grows proportionally to the product of the input size and its
logarithm.
O(n^2) (Quadratic Time): The runtime/space grows proportionally to the square of the
input size.
O(2^n) (Exponential Time): The runtime/space grows exponentially with the input size.
Examples:
Linear Search: Finding an element in an unsorted array (O(n) - worst case).
Binary Search: Finding an element in a sorted array (O(log n) - worst case).
Bubble Sort: Sorting an array (O(n^2) - worst case).
Merge Sort: Sorting an array (O(n log n) - worst case).
Hash tables
The reason Hash Tables are sometimes preferred over arrays or linked lists is that
searching for, adding, and deleting data can be done really quickly, even for large amounts of
data.
In a Linked List, finding a person "Bob" takes time because we would have to go from one
node to the next, checking each node, until the node with "Bob" is found.
And finding "Bob" in an Array could be fast if we knew the index, but when we only know
the name "Bob", we need to compare each element (like with Linked Lists), and that takes
time.
With a Hash Table however, finding "Bob" is done really fast because there is a way to go
directly to where "Bob" is stored, using something called a hash function.
To get the idea of what a Hash Table is, let's try to build one from scratch, to store unique
first names inside it.
To find "Bob" in this array, we need to compare each name, element by element, until we
find "Bob".
If the array was sorted alphabetically, we could use Binary Search to find a name quickly, but
inserting or deleting names in the array would mean a big operation of shifting elements in
memory.
To make interacting with the list of names really fast, let's use a Hash Table for this instead,
or a Hash Set, which is a simplified version of a Hash Table.
To keep it simple, let's assume there are at most 10 names in the list, so the array must be a
fixed size of 10 elements. When talking about Hash Tables, each of these elements is called
a bucket.
my_hash_set = [None,None,None,None,None,None,None,None,None,None]
Now comes the special way we interact with the Hash Set we are making.
We want to store a name directly into its right place in the array, and this is where the hash
function comes in.
A hash function can be made in many ways; it is up to the creator of the Hash Table. A
common way is to convert the value into a number that equals one of the Hash
Set's index numbers, in this case a number from 0 to 9. In our example we will take the
Unicode number of each character, sum them, and do a modulo 10 operation to get
index numbers 0-9.
Example
def hash_function(value):
    sum_of_chars = 0
    for char in value:
        sum_of_chars += ord(char)
    return sum_of_chars % 10
print("'Bob' has hash code:",hash_function('Bob'))
The character "B" has Unicode code point 66, "o" has 111, and "b" has 98. Adding those
together we get 275. Modulo 10 of 275 is 5, so "Bob" should be stored as an array element at
index 5.
The number returned by the hash function is called the hash code.
Unicode number: Everything in our computers is stored as numbers, and the Unicode code
point is a unique number that exists for every character. For example, the character A has
Unicode number (also called Unicode code point) 65.
After storing "Bob" where the hash code tells us (index 5), our array now looks like this:
my_hash_set = [None,None,None,None,None,'Bob',None,None,None,None]
We can use the hash function to find out where to store the other names "Pete", "Jones",
"Lisa", and "Siri" as well.
After using the hash function to store those names in the correct position, our array looks like
this:
my_hash_set = [None,'Jones',None,'Lisa',None,'Bob',None,'Siri','Pete',None]
We have now established a super basic Hash Set, because we do not have to check the array
element by element anymore to find out if "Pete" is in there, we can just use the hash function
to go straight to the right element!
To find out if "Pete" is stored in the array, we give the name "Pete" to our hash function, we
get back hash code 8, we go directly to the element at index 8, and there he is. We found
"Pete" without checking any other elements.
Example
my_hash_set = [None,'Jones',None,'Lisa',None,'Bob',None,'Siri','Pete',None]

def hash_function(value):
    sum_of_chars = 0
    for char in value:
        sum_of_chars += ord(char)
    return sum_of_chars % 10

def contains(name):
    index = hash_function(name)
    return my_hash_set[index] == name

print("Contains 'Pete':", contains('Pete'))
When deleting a name from our Hash Set, we can also use the hash function to go straight to
where the name is, and set that element value to None.
We give "Stuart" to our hash function, and we get the hash code 3, meaning "Stuart" should
be stored at index 3.
Trying to store "Stuart" creates what is called a collision, because "Lisa" is already stored at
index 3.
To fix the collision, we can make room for more elements in the same bucket, and solving the
collision problem in this way is called chaining. We can give room for more elements in the
same bucket by implementing each bucket as a linked list, or as an array.
After implementing each bucket as an array, to give room for potentially more than one name
in each bucket, "Stuart" can also be stored at index 3, and our Hash Set now looks like this:
my_hash_set = [
[None],
['Jones'],
[None],
['Lisa', 'Stuart'],
[None],
['Bob'],
[None],
['Siri'],
['Pete'],
[None]
]
Searching for "Stuart" in our Hash Set now means that using the hash function we end up
directly in bucket 3, but then be must first check "Lisa" in that bucket, before we find "Stuart"
as the second element in bucket 3.
To complete our very basic Hash Set code, let's have functions for adding and searching for
names in the Hash Set, which is now a two dimensional array.
Run the code example below, and try it with different values to get a better understanding of
how a Hash Set works.
Example
my_hash_set = [
    [None],
    ['Jones'],
    [None],
    ['Lisa'],
    [None],
    ['Bob'],
    [None],
    ['Siri'],
    ['Pete'],
    [None]
]

def hash_function(value):
    sum_of_chars = 0
    for char in value:
        sum_of_chars += ord(char)
    return sum_of_chars % 10

def add(value):
    index = hash_function(value)
    bucket = my_hash_set[index]
    bucket.append(value)

def contains(value):
    index = hash_function(value)
    bucket = my_hash_set[index]
    return value in bucket

add('Stuart')
print(my_hash_set)
print('Contains Stuart:', contains('Stuart'))
The most important reason why Hash Tables are great for these things is that Hash Tables are
very fast compared to Arrays and Linked Lists, especially for large sets. Arrays and Linked
Lists have time complexity O(n) for search and delete, while Hash Tables have
just O(1) on average.
Hash Set vs. Hash Map
A Hash Table can be a Hash Set or a Hash Map. Here's how Hash Sets and Hash Maps are
different and similar:
A Hash Set stores unique elements (keys) only, while a Hash Map stores key-value pairs in
which each unique key maps to a value. Both use a hash function to compute a bucket index
from the key.
Every Hash Table element has a unique part called the key.
A collision happens when two Hash Table elements have the same hash code, because that
means they belong to the same bucket. A collision can be solved in two ways.
Chaining is the way collisions are solved in this tutorial, by using arrays or linked lists to
allow more than one element in the same bucket.
Open Addressing is another way to solve collisions. With open addressing, if we want to
store an element but there is already an element in that bucket, the element is stored in the
next available bucket. This can be done in many different ways, but we will not explain open
addressing any further here.
Hashing techniques
There are numerous hashing algorithms, each with distinct advantages and disadvantages.
The most popular algorithms include the following:
MD5: A widely used hashing algorithm that produces a 128-bit hash value.
SHA-1: A popular hashing algorithm that produces a 160-bit hash value.
SHA-256: A more secure hashing algorithm that produces a 256-bit hash value.
What is MD5?
MD5 (message-digest algorithm) is a cryptographic protocol used for authenticating
messages as well as content verification and digital signatures. MD5 is based on a hash
function that verifies that a file you sent matches the file received by the person you sent it to.
Previously, MD5 was used for data security purposes, but now it's used primarily for authentication.
How does MD5 work?
MD5 runs entire files through a mathematical hashing algorithm to generate a signature that
can be matched with an original file. That way, a received file can be authenticated as
matching the original file that was sent, ensuring that the right files get where they need to
go.
The MD5 hashing algorithm converts data into a string of 32 characters. For example, the
word “frog” always generates this hash: 938c2cc0dcc05f2b68c4287040cfcf71. Similarly, a
file of 1.2 GB also generates a hash with the same number of characters. When you send that
file to someone, their computer authenticates its hash to ensure it matches the one you sent.
If you change just one bit in a file, no matter how large the file is, the hash output will be
completely and irreversibly changed. Nothing less than an exact copy will pass the MD5 test.
What is MD5 used for?
MD5 is primarily used to authenticate files. It’s much easier to use the MD5 hash to check a
copy of a file against an original than to check bit by bit to see if the two copies match.
MD5 was once used for data security, but these days its primary use is
authentication. Because a hacker can create a file that has the exact same hash as an entirely
different file, MD5 is not secure in the event that someone tampers with a file. But if you’re
simply copying a file from one place to another, MD5 will do the job.
How is an MD5 hash calculated?
The MD5 hashing algorithm uses a complex mathematical formula to create a hash. It
converts data into blocks of specific sizes and manipulates that data a number of times. While
this is happening, the algorithm adds a unique value into the calculation and converts the
result into a small signature or hash.
MD5 algorithm steps are incredibly complex for a reason — you cannot reverse this process
and generate the original file from the hash. But the same input will always produce the same
output, also known as the MD5 sum, hash, or the checksum. That’s what makes them so
useful for data validation.
An MD5 hash example looks like this: 0cc175b9c0f1b6a831c399e269772661. That’s the
hash for the letter “a.”
SHA-1
SHA-1 (Secure Hash Algorithm 1) is a cryptographic hash function that produces a 160-bit
hash value (message digest) from an input message of any size, designed by the NSA and
published by NIST. It is a one-way function, meaning it's computationally infeasible to derive
the original message from its hash value.
Purpose:
SHA-1 is used for verifying data integrity and authenticity, ensuring that data hasn't been
tampered with.
How it works:
Input: SHA-1 takes an input message (data) of any length, up to 2^64 bits.
Padding: The input message is padded to make its length a multiple of 512 bits.
Processing: The padded message is processed in 80 rounds using a series of logical functions
and bit operations.
Output: The final hash value is a 160-bit (20-byte) hash digest.
Weaknesses:
While SHA-1 was once considered secure, it has been found to have vulnerabilities,
including collision attacks, meaning that different inputs can produce the same hash value.
Alternatives:
Due to these weaknesses, SHA-1 is no longer considered secure for most cryptographic
applications, and newer algorithms like SHA-256 and SHA-3 are recommended.
Applications:
SHA-1 was widely used in security protocols like TLS, SSL, PGP, SSH, IPsec, and
S/MIME, but its use is now declining.
SHA-256
SHA-256 is a cryptographic hash function that produces a 256-bit (32-byte) hash value from
any input data, widely used for data integrity verification, digital signatures, and blockchain
technology.
Key Features:
Cryptographic Hash Function:
SHA-256, part of the SHA-2 family, is a cryptographic hash function, meaning it takes an
input (data of any length) and produces a fixed-size output (256-bit hash).
One-Way Function:
It's designed to be computationally infeasible to reverse the process, meaning you can't
determine the original input from the hash value.
Fixed-Size Output:
Regardless of the input size, the SHA-256 algorithm always produces a 256-bit hash value.
Security:
SHA-256 is considered a secure hash function, resistant to collision attacks (where two
different inputs produce the same hash).
Applications:
Data Integrity: Verifies that data hasn't been tampered with or corrupted.
Digital Signatures: Ensures the authenticity and integrity of digital documents.
Blockchain Technology: Used in cryptocurrencies like Bitcoin for securing transactions and
maintaining the integrity of the blockchain.
SSL/TLS Certificates: Used to secure web communications by verifying the integrity of
certificates.
Password Hashing: While not recommended for direct password storage due to its speed,
SHA-256 is sometimes used in combination with other techniques like salting and key
stretching.
How it works:
SHA-256 takes an input message and processes it through a series of mathematical
operations, including bitwise operations, addition, and shifting.
These operations are performed in rounds, with the output of each round feeding into the
next.
The final output is a 256-bit hash value, which is a unique "fingerprint" of the input data.
Why is it important?
Data Integrity:
By comparing the hash of a file or message before and after transmission or storage, you
can verify that it hasn't been altered.
Security:
SHA-256 helps protect sensitive data by ensuring its authenticity and integrity, making it
difficult for attackers to tamper with or forge data.
Blockchain Technology:
SHA-256 is a cornerstone of blockchain technology, ensuring the security and immutability
of transactions.
Collision resolution techniques
When two items hash to the same slot, we must have a systematic method for placing the
second item in the hash table. This process is called collision resolution. As we stated earlier,
if the hash function is perfect, collisions will never occur.
When two or more keys hash to the same hash table slot, a collision occurs. One way to
resolve this is to assign the next available empty slot to the colliding hash value. The most
common methods are the open addressing, chaining, probabilistic hashing, perfect hashing
and coalesced hashing techniques.
a) Chaining:
This technique implements a linked list and is the most popular of all the collision
resolution techniques. For example, with a table of size 7, the keys 50, 85 and 92 all hash to
the same slot, since each leaves remainder 1 when divided by 7. That slot keeps the first
item, 50, and a linked list is assigned to include the other 2 items {85, 92}. When you use the
chaining technique, the insertion or deletion of items with the hash table is fairly simple and
high-performing. Likewise, a chain hash table inherits the pros and cons of a linked list.
Alternatively, chaining can use dynamic arrays instead of linked lists.
b) Open Addressing:
This technique depends on space usage and can be done with linear or quadratic probing
techniques. As the name says, this technique tries to find an available slot to store the record.
It can be done in one of 3 ways:
Linear probing – Here, the probe interval is fixed to 1, so the search steps through
consecutive slots. It supports the best caching but suffers badly from clustering. A sketch of
linear probing follows below.
Quadratic probing – The interval between probes grows quadratically (1, 4, 9, ...). This
reduces clustering at the cost of somewhat worse caching.
Double hashing – A second hash function determines the probe interval for each key. It
gives the least clustering but the poorest caching of the three.
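A minimal C++ sketch of linear probing (the table size of 7 reuses the keys 50, 85 and 92 from the chaining example; the helper name insertKey is illustrative):

#include <iostream>
using namespace std;

#define TABLE_SIZE 7
int table_[TABLE_SIZE]; // -1 marks an empty slot

// Linear probing: on a collision, step forward one slot at a time
// (probe interval fixed to 1) until an empty slot is found
void insertKey(int key) {
    int index = key % TABLE_SIZE;
    int probes = 0;
    while (table_[index] != -1 && probes < TABLE_SIZE) {
        index = (index + 1) % TABLE_SIZE;
        probes++;
    }
    if (probes < TABLE_SIZE) table_[index] = key;
    else cout << "Table is full\n";
}

int main() {
    for (int i = 0; i < TABLE_SIZE; i++) table_[i] = -1;
    // 50, 85 and 92 all hash to slot 1 (remainder 1 mod 7);
    // probing places them in slots 1, 2 and 3
    insertKey(50); insertKey(85); insertKey(92);
    cout << table_[1] << " " << table_[2] << " " << table_[3] << "\n"; // 50 85 92
}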
c) Probabilistic hashing:
This is memory-based hashing that implements caching. When a collision occurs, either the
old record is replaced by the new or the new record may be dropped. Although this scenario
has a risk of losing data, it is still preferred due to its ease of implementation and high
performance.
d) Perfect hashing:
When every key maps to its own unique slot, collisions cannot occur at all. However, perfect
hashing requires the set of keys to be known in advance, and it can only be done where there
is a lot of spare memory.
e) Coalesced hashing:
This technique is a combination of the open addressing and chaining methods. A chain of
items is stored in the table itself when there is a collision; the next available table space is
used to store the colliding items.