Mini Project HPC
Mini Project : 1
Mini Project: Evaluate performance enhancement of parallel Quicksort Algorithm using MPI
Objective: Sort a large dataset of numbers using parallel Quicksort Algorithm with MPI and
compare its performance with the serial version of the algorithm.
Approach: We will use Python and MPI to implement the parallel version of Quicksort Algorithm
and compare its performance with the serial version of the algorithm.
Requirements:
Python 3.x
mpi4py
Theory :
Similar to mergesort, QuickSort uses a divide-and-conquer strategy and is one of the fastest sorting
algorithms; it can be implemented in a recursive or iterative fashion. The divide and conquer is a
general algorithm design paradigm and key steps of this strategy can be summarized as follows:
• Divide: Divide the input data set S into disjoint subsets S1, S2, S3…Sk.
• Recursion: Solve the sub-problems associated with S1, S2, S3…Sk.
• Conquer: Combine the solutions for S1, S2, S3…Sk. into a solution for S.
• Base case: The base case for the recursion is generally subproblems of size 0 or 1.
Many studies [2] have revealed that, in order to sort N items, QuickSort has an average
running time of O(N log N). The worst-case running time for QuickSort occurs when the pivot is the
unique minimum or maximum element, and as stated in [2], the worst-case running time for
QuickSort on N items is O(N²). These running times are influenced by the input
distribution (uniform, sorted or semi-sorted, unsorted, duplicates) and the choice of the pivot element.
Here is a simple pseudocode of the QuickSort algorithm adapted from Wikipedia [1].
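The sketch below is a standard recursive formulation with Lomuto partitioning; the exact wording of the pseudocode in [1] may differ slightly.
function quicksort(A, lo, hi):
    if lo < hi:
        p = partition(A, lo, hi)        // place one pivot element in its final position
        quicksort(A, lo, p - 1)         // recursively sort the elements left of the pivot
        quicksort(A, p + 1, hi)         // recursively sort the elements right of the pivot

function partition(A, lo, hi):
    pivot = A[hi]                       // choose the last element as the pivot
    i = lo
    for j = lo to hi - 1:
        if A[j] < pivot:
            swap A[i] and A[j]
            i = i + 1
    swap A[i] and A[hi]                 // move the pivot between the two parts
    return i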
We have made use of Open MPI as the backbone library for parallelizing the QuickSort algorithm. In
fact, learning the Message Passing Interface (MPI) allows us to strengthen our fundamental knowledge of
parallel programming, given that MPI is lower level than comparable libraries such as OpenMP. As its
name suggests, the basic idea behind MPI is that messages are passed or exchanged among
different processes in order to perform a given task. An illustration is a master process that splits a
huge task into chunks, distributes them to its slave processes, and coordinates their work through
messages. Open MPI is developed and maintained by a consortium of academic, research and
industry partners; it combines the expertise, technologies and resources from across the high
performance computing community [11]. As elaborated in [4], MPI has two types of communication
routines: point-to-point communication routines and collective communication routines. Collective
routines, as explained in the implementation section, have been used in this study.
Algorithm :
In general, the overall algorithm used here to perform QuickSort with MPI works as follows: the input
of size SIZE is divided by the number of participating processes npes, so that each process receives a
chunk of size local_size; every process sorts its chunk locally and the sorted chunks are merged back
on the root process.
1. Initialize MPI:
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
2. Implement the serial Quicksort Algorithm:
def quicksort_serial(arr):
    # Base case: arrays of length 0 or 1 are already sorted
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort_serial(left) + middle + quicksort_serial(right)
3. Implement the parallel Quicksort Algorithm (each process sorts one chunk with the serial
algorithm, and the root merges the sorted chunks using the collective scatter/gather routines):
import heapq

def quicksort_parallel(arr):
    # Root splits the input into one chunk per process
    if rank == 0:
        chunk_size = (len(arr) + size - 1) // size
        chunks = [arr[i * chunk_size:(i + 1) * chunk_size] for i in range(size)]
    else:
        chunks = None
    comm.barrier()
    # Scatter the chunks (lowercase scatter handles arbitrary Python objects)
    local_chunk = comm.scatter(chunks, root=0)
    # Each process sorts its own chunk
    local_sorted = quicksort_serial(local_chunk)
    # Gather the sorted chunks on the root and merge them into one sorted list
    gathered = comm.gather(local_sorted, root=0)
    if rank == 0:
        sorted_arr = list(heapq.merge(*gathered))
        return sorted_arr
    return None
4. Generate a random dataset and time both versions:
import random
import time
arr = [random.randint(0, 1000000) for _ in range(100000)]
start_time = time.time()
if rank == 0:
    quicksort_serial(arr)
serial_time = time.time() - start_time
start_time = time.time()
quicksort_parallel(arr)
parallel_time = time.time() - start_time
5. Compare the performance of the serial and parallel versions of the algorithm:
if rank == 0:
    print(f"Serial Quicksort Algorithm time: {serial_time:.4f} seconds")
    print(f"Parallel Quicksort Algorithm time: {parallel_time:.4f} seconds")
Output:
Mini Project : 2
Title - Implement Huffman Encoding on GPU
Theory - Huffman Encoding is a lossless data compression algorithm that works by assigning
variable-length codes to the characters in a given text or data stream based on their frequency of
occurrence. This encoding scheme can be implemented on GPU to speed up the encoding process.
The variable-length codes assigned to input characters are Prefix Codes, meaning the codes (bit
sequences) are assigned in such a way that the code assigned to one character is not a prefix of the
code assigned to any other character. This is how Huffman Coding makes sure that there is no
ambiguity when decoding the generated bitstream.
Let us understand prefix codes with a counter example. Let there be four characters a, b, c and d, and
let their corresponding variable-length codes be 00, 01, 0 and 1. This coding leads to ambiguity because
the code assigned to c is a prefix of the codes assigned to a and b. If the compressed bit stream is 0001,
the de-compressed output may be “cccd” or “ccb” or “acd” or “ab”. In contrast, a prefix-free
assignment such as a = 0, b = 10, c = 110 and d = 111 can be decoded in only one way.
To optimize the implementation for GPU, we can use parallel programming techniques such as
CUDA, OpenCL, or HIP to parallelize the calculation of character frequencies, construction of the
Huffman tree, and generation of Huffman codes.
Here are some specific optimizations that can be applied to each step:
1. Calculating character frequencies:
Use parallelism to split the input text into chunks and count the frequencies of each character in
parallel on different threads.
Reduce the results of each thread into a final frequency count on the GPU (a minimal kernel sketch is
shown after this list).
2. Constructing the Huffman tree:
Use a priority queue implemented on GPU to parallelize the building of the Huffman tree.
Each thread can process one or more nodes at a time, based on the priority of the nodes in the queue.
3. Generating Huffman codes:
Use parallelism to traverse the Huffman tree and generate Huffman codes for each character in
parallel.
Each thread can process one or more nodes at a time, based on the depth of the nodes in the tree.
4. Encoding the input text:
Use parallelism to split the input text into chunks and encode each chunk in parallel on different
threads.
Merge the encoded chunks into a final output on the GPU.
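As an illustration of step 1, the following is a minimal CUDA sketch of a parallel frequency count; the
kernel name count_frequencies, the NUM_SYMBOLS constant and the launch configuration are
illustrative assumptions rather than part of the project's source code.
#include <cuda_runtime.h>

#define NUM_SYMBOLS 256  // assumed alphabet size (one byte per character)

// Each thread walks the input with a grid-stride loop and atomically
// increments the global count of every character it sees.
__global__ void count_frequencies(const unsigned char* text, int n, unsigned int* freq) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (; i < n; i += stride) {
        atomicAdd(&freq[text[i]], 1u);
    }
}

// Host-side launch (illustrative):
// count_frequencies<<<(n + 255) / 256, 256>>>(d_text, n, d_freq_count);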
By parallelizing these steps, we can achieve significant speedup in the Huffman Encoding process on
GPU. However, it's important to note that the specific implementation details may vary based on the
programming language and GPU architecture being used.
Source Code -
// Free memory
delete[] output_text;
cudaFree(d_freq_count);
cudaFree(d_output_text);
delete root;
return 0;
}
Output -
Input text: Hello, world!
Encoded text: 01000110 11010110 10001011 10101110 11110100 11011111 00101101 01000000
11111010
Mini Project : 3
Title - Implement Parallelization of Database Query Optimization
Theory -
Query processing is the process through which a Database Management System (DBMS) parses,
verifies, and optimizes a given query before creating low-level code that the DB understands.
Query Processing in DBMS, like any other High-Level Language (HLL) where code is first generated
and then executed to perform various operations, has two phases: compile-time and runtime.
The use of declarative query languages together with query optimization is one of the main factors
contributing to the success of RDBMS technology. Any database allows users to create queries to
request specific data, and the database then uses effective methods to locate the requested data.
A database optimization approach based on CMP has been studied by numerous other academics, but
the majority of their effort focused on optimizing join operations while taking into account the L2
cache and the parallel buffers of the shared main memory.
The following techniques can be used to make a query parallel:
• I/O parallelism
• Internal parallelism of queries
• Parallelism among queries
• Within-operation parallelism
• Parallelism in inter-operation
I/O parallelism :
This type of parallelism involves partitioning the relations among the discs in order to speed up
the retrieval of relations from the disc.
The input data is divided into partitions, and each partition is processed simultaneously; after all of
the partitioned data has been processed, the results are combined. This is also known as data partitioning.
Hash partitioning is best suited for point queries that are based on the partitioning attribute and has
the benefit of offering an even distribution of data across the discs.
It should be mentioned that partitioning is beneficial for sequential scans of a full table stored
on n discs: scanning the table takes roughly 1/n of the time required on a single-disc system.
In I/O parallelism, there are four different methods of partitioning:
Hash partitioning :
A hash function is a quick mathematical operation that is applied to the partitioning attributes of each
row in the original relation; its result determines the disc on which the row is placed.
Let’s say that the data is to be partitioned across 4 drives, numbered disk1, disk2, disk3, and disk4.
If the function returns 3 for a row, that row is stored on disk3.
Range partitioning :
With range partitioning, each disc receives a contiguous range of attribute values. For instance, if
we are range partitioning across three discs numbered 0, 1, and 2, tuples with a value of less than 5
are written to disk0, values from 5 to 40 are sent to disk1, and values above 40 are written to disk2.
It has several benefits, such as keeping tuples whose attribute values fall within a specified range
together on one disc.
Round-robin partitioning :
In this method, the relation is read in any order and the ith tuple is sent to disc number (i % n).
Therefore, discs receive new rows of data in turn. For applications that want to read the full
relation sequentially for each query, this strategy ensures an even distribution of tuples across the discs.
Schema Partitioning :
Various tables inside a database are put on different discs using a technique called schema
partitioning.
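As a small illustration of the first three partitioning methods, here is a hypothetical C++ sketch of the
tuple-to-disc mapping each of them uses; the function names, the number of discs n and the range
boundaries (5 and 40, taken from the example above) are assumptions for illustration only.
#include <functional>

// Hash partitioning: hash the partitioning attribute and take it modulo the number of discs
int hash_partition(int key, int n) { return static_cast<int>(std::hash<int>{}(key) % n); }

// Range partitioning: contiguous value ranges map to discs (boundaries from the example above)
int range_partition(int key) {
    if (key < 5)   return 0;  // values below 5 go to disk0
    if (key <= 40) return 1;  // values from 5 to 40 go to disk1
    return 2;                 // values above 40 go to disk2
}

// Round-robin partitioning: the ith tuple goes to disc (i % n)
int round_robin_partition(int i, int n) { return i % n; }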
Intra-query parallelism :
Using a shared-nothing parallel architecture, intra-query parallelism refers to the processing of a
single query in parallel on many CPUs.
This employs two different strategies:
First method — In this method, each CPU executes a duplicate of the task on a small portion of the
data.
Second method — In this method, the task is broken up into various sub-tasks, with each CPU
carrying out a separate subtask.
Inter-query parallelism :
With inter-query parallelism, each CPU executes numerous transactions; this is known as parallel
transaction processing. To support inter-query parallelism, the DBMS leverages transaction
dispatching.
We can also employ a variety of supporting techniques, such as efficient lock management. Without
inter-query parallelism, each query runs sequentially, which slows down the overall running time.
In such circumstances, the DBMS must be aware of the locks that different transactions running in
different processes have acquired. Inter-query parallelism on a shared-storage architecture works well
when simultaneous transactions do not access the same data.
Additionally, the throughput of transactions is boosted, and it is the simplest form of parallelism in
DBMS.
Intra-operation parallelism :
In this type of parallelism, we execute each individual operation of a task, such as sorting, joins,
projections, and so forth, in parallel. Intra-operation parallelism has a very high parallelism level.
Database systems naturally employ this kind of parallelism. Consider the following SQL example:
SELECT * FROM vehicles ORDER BY model_number;
Since a relation might contain a large number of records, the relational operation in the
aforementioned query is sorting.
Because this operation can be performed on distinct subsets of the relation on several processors, it
takes less time to sort the data (see the sketch below).
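The following is a minimal OpenMP sketch of this idea, assuming the relation has already been split
into subsets; the function name parallel_sort and the use of std::sort are illustrative choices, not taken
from a specific DBMS.
#include <algorithm>
#include <vector>
#include <omp.h>

// Sort each subset of the relation on a separate thread; the sorted runs
// would then be merged into one fully ordered result (merge step omitted).
void parallel_sort(std::vector<std::vector<int>>& subsets) {
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(subsets.size()); i++) {
        std::sort(subsets[i].begin(), subsets[i].end());
    }
}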
Inter-operation parallelism :
This term refers to the concurrent execution of many operations within a query expression. They
come in two varieties:
Pipelined parallelism — In pipelined parallelism, a second operation consumes the rows produced by a
first operation before the first operation has finished producing its whole output.
It is also feasible to run these two operations concurrently on different CPUs, so that one operation
consumes tuples in parallel with the other producing them, thereby reducing the overall execution time.
It is advantageous for systems with a limited number of CPUs and avoids storing intermediate
results on a disc.
Independent parallelism — In this form of parallelism, operations contained within a query expression
that are independent of one another may be carried out concurrently. This kind of parallelism is most
helpful when the available degree of parallelism is low.
The development of parallel database systems is an example of how database management and
parallel computing can work together. In a parallel database system, parallel query optimization
(PQO) divides a given SQL statement into components that can run concurrently on several
processors in a multi-processor machine.
Full table scans, sorting, sub-queries, data loading, and other common operations can all be
performed in parallel.
As a form of parallel database optimization, Parallel Query enables the division of SELECT or DML
operations into many smaller chunks that can be executed by PQ slaves on different CPUs in a single
box.
In the first phase, sorting and rewriting, the order of joins and the method for computing each join
are fixed. The second phase, parallelization, turns the query tree into a parallel plan; this phase is
divided into two parts: extraction of parallelism and scheduling.
Optimizing database queries is an important task in database management systems to improve the
performance of database operations. Parallelization of database query optimization can significantly
improve query execution time by dividing the workload among multiple processors or nodes.
1. Partitioning: The first step is to partition the data into smaller subsets. The partitioning can be done
based on different criteria, such as range partitioning, hash partitioning, or list partitioning. This can
be done in parallel by assigning different processors or nodes to handle different parts of the
partitioning process.
2. Query optimization: Once the data is partitioned, the next step is to optimize the queries. Query
optimization involves finding the most efficient way to execute the query by considering factors such
as index usage, join methods, and filtering. This can also be done in parallel by assigning different
processors or nodes to handle different parts of the query optimization process.
3. Query execution: After the queries are optimized, the final step is to execute the queries. The
execution can be done in parallel by assigning different processors or nodes to handle different parts
of the execution process. The results can then be combined to generate the final result set.
Here's an example of how we can parallelize the query optimization process using OpenMP:
// C++
#include <omp.h>
#include <vector>

// Partition the data; partition_data(data, i, num_partitions) is assumed to
// return the i-th of num_partitions chunks of the input data
int num_partitions = omp_get_max_threads();
std::vector<std::vector<int>> partitions(num_partitions);
#pragma omp parallel for
for (int i = 0; i < num_partitions; i++) {
    // Each thread fills its own slot, so no synchronization is needed
    partitions[i] = partition_data(data, i, num_partitions);
}
In this example, we first partition the data into smaller subsets using OpenMP parallelism. The query
optimization and query execution steps can be parallelized in the same way, by assigning different
threads (or, in a distributed setting, different nodes) to handle different partitions and then combining
the partial results (see the sketch below).
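A minimal sketch of that execution step follows; run_query is an assumed helper that evaluates the
already optimized query against one partition, and the merge is a simple concatenation of the partial
result sets.
// Execute the optimized query on every partition in parallel
std::vector<std::vector<int>> partial_results(num_partitions);
#pragma omp parallel for
for (int i = 0; i < num_partitions; i++) {
    partial_results[i] = run_query(partitions[i]);  // each thread scans its own partition
}

// Combine the partial results into the final result set
std::vector<int> result;
for (const auto& part : partial_results) {
    result.insert(result.end(), part.begin(), part.end());
}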
Parallelization of database query optimization can significantly improve the performance of database
operations and reduce query execution time. However, it requires careful consideration of the
workload distribution, synchronization, and communication between processors or nodes.
Mini Project : 4
Title - Implement Non-Serial Polyadic Dynamic Programming with GPU Parallelization.
Theory -
Here's an example code that implements non-serial polyadic dynamic programming with GPU
parallelization using CUDA:
#include <iostream>
#include <cuda_runtime.h>

// Problem sizes (example values; the actual sizes depend on the DP instance)
#define N 1024
#define M 1024
#define K 1024

int main() {
    // Allocate memory for the input arrays on the CPU
    float* x = new float[N];
    float* y = new float[M];
    float* z = new float[K];

    // Allocate device memory and copy the z array to the GPU
    float* d_z;
    cudaMalloc(&d_z, K * sizeof(float));
    cudaMemcpy(d_z, z, K * sizeof(float), cudaMemcpyHostToDevice);