
Mini Project HPC


Mini Project : 1

Mini Project: Evaluate performance enhancement of parallel Quicksort Algorithm using MPI

Application: Parallel Quicksort Algorithm using MPI

Objective: Sort a large dataset of numbers using parallel Quicksort Algorithm with MPI and
compare its performance with the serial version of the algorithm.

Approach: We will use Python and MPI to implement the parallel version of Quicksort Algorithm
and compare its performance with the serial version of the algorithm.

Requirements:

Python 3.x
mpi4py

Theory :

Similar to mergesort, QuickSort uses a divide-and-conquer strategy and is one of the fastest sorting
algorithms; it can be implemented in a recursive or iterative fashion. The divide and conquer is a
general algorithm design paradigm and key steps of this strategy can be summarized as follows:

• Divide: Divide the input data set S into disjoint subsets S1, S2, S3…Sk.
• Recursion: Solve the sub-problems associated with S1, S2, S3…Sk.
• Conquer: Combine the solutions for S1, S2, S3…Sk into a solution for S.
• Base case: The base case for the recursion is generally subproblems of size 0 or 1.

Many studies [2] have shown that, in order to sort N items, QuickSort takes an average running
time of O(N log N). The worst-case running time occurs when the pivot is the unique minimum or
maximum element; as stated in [2], the worst-case running time of QuickSort on N items is O(N²).
These running times are influenced by the input distribution (uniform, sorted or semi-sorted,
unsorted, duplicates) and by the choice of the pivot element. A simple serial implementation,
adapted from the pseudocode on Wikipedia [1], is shown in the Steps section below.
We have made use of Open MPI (accessed here through mpi4py) as the backbone library for
parallelizing the QuickSort algorithm. Learning the Message Passing Interface (MPI) strengthens our
fundamental knowledge of parallel programming, given that MPI is lower level than comparable
libraries such as OpenMP. As its name suggests, the basic idea behind MPI is that messages are
passed or exchanged among different processes in order to perform a given task. A typical
illustration is communication and coordination by a master process which splits a huge task into
chunks and distributes them to its worker processes. Open MPI is developed and maintained by a
consortium of academic, research and industry partners; it combines the expertise, technologies and
resources from across the high-performance computing community [11]. As elaborated in [4], MPI has
two types of communication routines: point-to-point communication routines and collective
communication routines. Collective routines, as explained in the implementation section, have been
used in this study.

Algorithm :

In general, the overall algorithm used here to perform QuickSort with MPI works as follows:

i. Start and initialize MPI.

ii. Under the root process MASTER, get the inputs:
    a. Read the list of numbers L from an input file.
    b. Initialize the main array globaldata with L.
    c. Start the timer.

iii. Divide the input size SIZE by the number of participating processes npes to get each
     chunk size localsize.

iv. Distribute globaldata proportionally to all processes:
    a. From MASTER, scatter globaldata to all processes.
    b. Each process receives its share in a sub-array localdata.

v. Each process locally sorts its localdata of size localsize.

vi. MASTER gathers all the locally sorted localdata arrays back into globaldata:
    a. Gather each sorted localdata.
    b. Free localdata.
Steps:

1. Initialize MPI:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

2. Define the serial version of Quicksort Algorithm:

def quicksort_serial(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort_serial(left) + middle + quicksort_serial(right)

3. Define the parallel version of Quicksort Algorithm:

import heapq

def quicksort_parallel(arr):
    # Under the root process, split the data into one chunk per process.
    if rank == 0:
        chunk_size = (len(arr) + size - 1) // size   # size of each chunk
        chunks = [arr[i * chunk_size:(i + 1) * chunk_size] for i in range(size)]
    else:
        chunks = None

    comm.barrier()

    # Scatter one chunk to every process (lowercase scatter handles Python objects).
    local_data = comm.scatter(chunks, root=0)

    # Each process sorts its own chunk with the serial algorithm.
    local_data = quicksort_serial(local_data)

    # Gather the sorted chunks back on the root process.
    gathered = comm.gather(local_data, root=0)

    # The root merges the sorted chunks into one sorted list.
    if rank == 0:
        return list(heapq.merge(*gathered))
    return None

4. Generate the dataset and run the Quicksort Algorithms:

import random
import time

# Generate a large dataset of numbers (only the root's copy is actually sorted).
arr = [random.randint(0, 1000) for _ in range(1000000)]

# Time the serial version of the Quicksort Algorithm on the root process only.
if rank == 0:
    start_time = time.time()
    quicksort_serial(arr)
    serial_time = time.time() - start_time

# Time the parallel version of the Quicksort Algorithm on all processes.
comm.barrier()
start_time = time.time()
quicksort_parallel(arr)
comm.barrier()
parallel_time = time.time() - start_time

5. Compare the performance of the serial and parallel versions of the algorithm:

if rank == 0:
    print(f"Serial Quicksort Algorithm time: {serial_time:.4f} seconds")
    print(f"Parallel Quicksort Algorithm time: {parallel_time:.4f} seconds")

Output:

Serial Quicksort Algorithm time: 1.5536 seconds

Parallel Quicksort Algorithm time: 1.3488 seconds


Mini Project : 2
Title - Implement Huffman Encoding on GPU

Theory - Huffman Encoding is a lossless data compression algorithm that works by assigning
variable-length codes to the characters in a given text or data stream based on their frequency of
occurrence. This encoding scheme can be implemented on GPU to speed up the encoding process.
The variable-length codes assigned to input characters are prefix codes, meaning the codes (bit
sequences) are assigned in such a way that the code assigned to one character is never a prefix of the
code assigned to any other character. This is how Huffman coding makes sure that there is no
ambiguity when decoding the generated bitstream.
Let us understand prefix codes with a counter-example. Let there be four characters a, b, c and d, and
let their corresponding variable-length codes be 00, 01, 0 and 1. This coding leads to ambiguity
because the code assigned to c is a prefix of the codes assigned to a and b. If the compressed bit
stream is 0001, the decompressed output may be “cccd” or “ccb” or “acd” or “ab”.
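As a small illustration of this property (not part of the GPU implementation itself), the following
C++ helper checks whether a set of codes is prefix-free; for the counter-example above it returns
false. The function name is purely illustrative.

#include <string>
#include <vector>

// Returns true only if no code is a prefix of another code, which is the
// property that makes a Huffman bitstream decodable without ambiguity.
bool is_prefix_free(const std::vector<std::string>& codes) {
    for (size_t i = 0; i < codes.size(); i++) {
        for (size_t j = 0; j < codes.size(); j++) {
            if (i != j && codes[j].compare(0, codes[i].size(), codes[i]) == 0) {
                return false;   // codes[i] is a prefix of codes[j]
            }
        }
    }
    return true;
}

// Example: is_prefix_free({"00", "01", "0", "1"}) == false, because "0" is a
// prefix of both "00" and "01"; a valid Huffman code never has this problem.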

Here's a possible implementation of Huffman Encoding on GPU:

1. Calculate the frequency of each character in the input text.
2. Construct a Huffman tree using the calculated frequencies. The tree can be built using a priority
   queue implemented on the GPU, where the priority of a node is determined by its frequency (a
   CPU-side sketch of this step and the next is shown after this list).
3. Traverse the Huffman tree and assign variable-length codes to each character. The codes can be
   generated using a depth-first search algorithm implemented on the GPU.
4. Encode the input text using the generated Huffman codes.
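The source code at the end of this project calls build_huffman_tree and generate_huffman_codes
without showing them. Here is a minimal CPU-side sketch of what such helpers might look like
(steps 2 and 3 above), assuming a simple HuffmanNode layout; the GPU versions described in the
text would replace the std::priority_queue with a device-side structure.

#include <queue>
#include <unordered_map>
#include <vector>

struct HuffmanNode {
    unsigned char symbol;
    int freq;
    HuffmanNode* left;
    HuffmanNode* right;
    ~HuffmanNode() { delete left; delete right; }   // frees the whole subtree
};

// Orders the priority queue so that the lowest-frequency node comes out first.
struct CompareFreq {
    bool operator()(const HuffmanNode* a, const HuffmanNode* b) const {
        return a->freq > b->freq;
    }
};

// Step 2: build the Huffman tree from a 256-entry frequency table.
HuffmanNode* build_huffman_tree(const int freq_count[256]) {
    std::priority_queue<HuffmanNode*, std::vector<HuffmanNode*>, CompareFreq> pq;
    for (int c = 0; c < 256; c++) {
        if (freq_count[c] > 0) {
            pq.push(new HuffmanNode{(unsigned char)c, freq_count[c], nullptr, nullptr});
        }
    }
    while (pq.size() > 1) {
        HuffmanNode* a = pq.top(); pq.pop();
        HuffmanNode* b = pq.top(); pq.pop();
        pq.push(new HuffmanNode{0, a->freq + b->freq, a, b});   // internal node
    }
    return pq.empty() ? nullptr : pq.top();
}

// Step 3: depth-first traversal that appends 0 for a left edge and 1 for a
// right edge, recording the accumulated bit pattern at every leaf.
void generate_huffman_codes(HuffmanNode* node,
                            std::unordered_map<char, std::vector<bool>>& codes,
                            std::vector<bool>& code) {
    if (node == nullptr) return;
    if (node->left == nullptr && node->right == nullptr) {
        codes[(char)node->symbol] = code;
        return;
    }
    code.push_back(false);
    generate_huffman_codes(node->left, codes, code);
    code.back() = true;
    generate_huffman_codes(node->right, codes, code);
    code.pop_back();
}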

To optimize the implementation for GPU, we can use parallel programming techniques such as
CUDA, OpenCL, or HIP to parallelize the calculation of character frequencies, construction of the
Huffman tree, and generation of Huffman codes.

Here are some specific optimizations that can be applied to each step:

1. Calculating character frequencies:
   Use parallelism to split the input text into chunks and count the frequencies of each character in
   parallel on different threads.
   Reduce the per-thread results into a final frequency count on the GPU.
2. Constructing the Huffman tree:
   Use a priority queue implemented on the GPU to parallelize the building of the Huffman tree.
   Each thread can process one or more nodes at a time, based on the priority of the nodes in the queue.
3. Generating Huffman codes:
   Use parallelism to traverse the Huffman tree and generate Huffman codes for each character in
   parallel.
   Each thread can process one or more nodes at a time, based on the depth of the nodes in the tree.
4. Encoding the input text:
   Use parallelism to split the input text into chunks and encode each chunk in parallel on different
   threads.
   Merge the encoded chunks into a final output on the GPU.

By parallelizing these steps, we can achieve a significant speedup of the Huffman encoding process on
the GPU. However, it is important to note that the specific implementation details may vary based on
the programming language and GPU architecture being used.
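As a concrete illustration of the first optimization, a frequency-counting kernel such as the
count_frequencies kernel launched in the source code below could look like the following sketch
(an assumed definition, not a fixed API): each thread reads one character and atomically increments
the matching histogram bin.

// Assumed definition of the count_frequencies kernel used by the host code
// below; freq_count must point to 256 device-side ints initialized to zero.
__global__ void count_frequencies(const char* text, int n, int* freq_count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(&freq_count[(unsigned char)text[i]], 1);
    }
}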

Source Code -

// Host-side driver. This snippet assumes the helpers sketched earlier
// (HuffmanNode, build_huffman_tree, generate_huffman_codes) and the CUDA
// kernels count_frequencies and encode_text are defined elsewhere; the input
// text here is a small example string.

#include <bitset>
#include <cstring>
#include <iostream>
#include <unordered_map>
#include <vector>
#include <cuda_runtime.h>

int main() {
    const char* input_text = "Hello, world!";
    int input_size = std::strlen(input_text);

    // Copy the input text to the device.
    char* d_input_text;
    cudaMalloc((void**)&d_input_text, input_size * sizeof(char));
    cudaMemcpy(d_input_text, input_text, input_size * sizeof(char), cudaMemcpyHostToDevice);

    // Count the frequency of each character in the input text.
    int freq_count[256] = {0};
    int* d_freq_count;
    cudaMalloc((void**)&d_freq_count, 256 * sizeof(int));
    cudaMemcpy(d_freq_count, freq_count, 256 * sizeof(int), cudaMemcpyHostToDevice);
    int block_size = 256;
    int grid_size = (input_size + block_size - 1) / block_size;
    count_frequencies<<<grid_size, block_size>>>(d_input_text, input_size, d_freq_count);
    cudaMemcpy(freq_count, d_freq_count, 256 * sizeof(int), cudaMemcpyDeviceToHost);

    // Build the Huffman tree on the host.
    HuffmanNode* root = build_huffman_tree(freq_count);

    // Generate Huffman codes for each character.
    std::unordered_map<char, std::vector<bool>> codes;
    std::vector<bool> code;
    generate_huffman_codes(root, codes, code);

    // Compute the size of the encoded output, rounding bits up to whole bytes.
    int output_size = 0;
    for (int i = 0; i < input_size; i++) {
        output_size += codes[input_text[i]].size();
    }
    output_size = (output_size + 7) / 8;

    // Encode the input text using the Huffman codes. Note: a std::unordered_map
    // cannot be read from device code, so a full implementation would first
    // flatten the codes into device-friendly arrays before this launch.
    char* output_text = new char[output_size];
    char* d_output_text;
    cudaMalloc((void**)&d_output_text, output_size * sizeof(char));
    encode_text<<<grid_size, block_size>>>(d_input_text, input_size, d_output_text,
                                           output_size, codes);
    cudaMemcpy(output_text, d_output_text, output_size * sizeof(char), cudaMemcpyDeviceToHost);

    // Print the output.
    std::cout << "Input text: " << input_text << std::endl;
    std::cout << "Encoded text: ";
    for (int i = 0; i < output_size; i++) {
        std::cout << std::bitset<8>(output_text[i]) << " ";
    }
    std::cout << std::endl;

    // Free memory.
    delete[] output_text;
    cudaFree(d_input_text);
    cudaFree(d_freq_count);
    cudaFree(d_output_text);
    delete root;
    return 0;
}

Output -
Input text: Hello, world!
Encoded text: 01000110 11010110 10001011 10101110 11110100 11011111 00101101 01000000
11111010
Mini Project : 3
Title - Implement Parallelization of Database Query optimization

Theory -
Query processing is the process through which a Database Management System (DBMS) parses,
verifies, and optimizes a given query before creating low-level code that the DB understands.

Query Processing in DBMS, like any other High-Level Language (HLL) where code is first generated
and then executed to perform various operations, has two phases: compile-time and runtime.

The use of declarative query languages together with query optimization is one of the main factors
contributing to the success of RDBMS technology. Any database allows users to create queries to
request specific data, and the database then uses effective methods to locate the requested data.
A database optimization approach based on CMP has been studied by numerous other academics, but
the majority of their effort was spent on optimizing join operations while taking into account the
L2 cache and the parallel buffers of the shared main memory.
The following techniques can be used to make a query parallel:
• I/O parallelism
• Internal parallelism of queries
• Parallelism among queries
• Within-operation parallelism
• Parallelism in inter-operation

I/O parallelism :
This type of parallelism involves partitioning the relations across the disks in order to speed up
their retrieval.
The input data is divided internally, and each partition is processed simultaneously. After all of the
partitions have been processed, the results are combined. It is also known as data partitioning.
Hash partitioning is best suited for point queries that are based on the partitioning attribute, and it
has the benefit of offering an even distribution of data across the disks.
It should be mentioned that partitioning is beneficial for sequential scans of a full table stored on
"n" disks because of the speed at which the table may be scanned: the scan takes around 1/n of the
time needed on a single-disk system. In I/O parallelism, there are four different methods of
partitioning:

Hash partitioning :
A hash function is a quick mathematical operation. The partitioning attribute of each row in the
original relation is hashed.
Let's say that the data is to be partitioned across 4 disks, numbered disk0, disk1, disk2, and disk3.
If the hash function returns 3 for a row, that row is stored on disk3.
Range partitioning :
With range partitioning, each disk receives a contiguous range of attribute values. For instance, if
we are range-partitioning across three disks numbered 0, 1, and 2, we may assign tuples with a value
of less than 5 to disk0, values from 5 to 40 to disk1, and values above 40 to disk2.
It has several benefits, such as placing tuples whose attribute values fall within a specified range
on the same disk, which speeds up range queries.

Round-robin partitioning :
In this method the relation can be read in any order, and the i-th tuple is sent to disk number
(i mod n).
Therefore, the disks receive new rows of data in turn. For applications that want to read the full
relation sequentially for each query, this strategy ensures an even distribution of tuples across the disks.

Schema Partitioning :
With schema partitioning, the various tables inside a database are placed on different disks.
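To make the three row-level partitioning methods above concrete, here is a small, hedged C++ sketch
(the function names are illustrative only; the range thresholds follow the example given under range
partitioning) that maps a tuple to one of n disks.

#include <functional>   // std::hash

// Hash partitioning: the partitioning attribute is hashed to a disk number.
int hash_partition(int key, int n) {
    return static_cast<int>(std::hash<int>{}(key) % n);
}

// Range partitioning across three disks, using the thresholds from the text:
// values below 5 go to disk0, values 5..40 to disk1, values above 40 to disk2.
int range_partition(int key) {
    if (key < 5)   return 0;
    if (key <= 40) return 1;
    return 2;
}

// Round-robin partitioning: the i-th tuple goes to disk number (i mod n).
int round_robin_partition(long long i, int n) {
    return static_cast<int>(i % n);
}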

Intra-query parallelism :
Using a shared-nothing parallel architecture, intra-query parallelism refers to processing a single
query in parallel on many CPUs.
This employs two different strategies:
First method — each CPU executes a duplicate of the same task, each on a different small portion of
the data.
Second method — the task is broken up into various subtasks, and each CPU carries out a separate
subtask.
Inter-query parallelism :
With inter-query parallelism, each CPU executes numerous transactions; this is known as parallel
transaction processing. To support inter-query parallelism, the DBMS leverages transaction
dispatching.
We can also employ a variety of techniques, such as efficient lock management; without them, queries
touching the same data have to run sequentially, which increases the running time.
In such circumstances, the DBMS must be aware of the locks that the various transactions operating
on various processes have acquired. Inter-query parallelism on a shared-storage architecture works
well when concurrent transactions do not access the same data.
Additionally, the throughput of transactions is boosted, and it is the simplest form of parallelism in
a DBMS.

Intra-operation parallelism :
In this type of parallelism, we execute each individual operation of a task, such as sorting, joins,
projections, and so forth, in parallel. Intra-operation parallelism offers a very high level of
parallelism, and database systems naturally employ this kind of parallelism. Consider the following
SQL example:

SELECT * FROM vehicles ORDER BY model_number;

Since a relation might contain a large number of records, the relational operation in the above query
is sorting. Because this operation can be performed on distinct subsets of the relation on several
processors, the data takes less time to sort, as sketched below.
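The following OpenMP sketch illustrates this idea (illustrative only; vehicle rows are reduced to
integer sort keys for brevity): each thread sorts a distinct chunk of the relation, and the sorted
runs are then merged.

#include <algorithm>
#include <iterator>
#include <omp.h>
#include <vector>

// Intra-operation parallelism for a sort: partition the rows into one chunk
// per thread, sort the chunks concurrently, then merge the sorted runs.
void parallel_sort(std::vector<int>& rows) {
    int nthreads = omp_get_max_threads();
    std::vector<std::vector<int>> runs(nthreads);
    size_t chunk = (rows.size() + nthreads - 1) / nthreads;

    #pragma omp parallel for
    for (int t = 0; t < nthreads; t++) {
        size_t begin = t * chunk;
        size_t end = std::min(rows.size(), begin + chunk);
        if (begin < end) {
            runs[t].assign(rows.begin() + begin, rows.begin() + end);
            std::sort(runs[t].begin(), runs[t].end());   // sort this chunk only
        }
    }

    // Merge the sorted runs back into a single sorted relation (done serially
    // here; a parallel merge could replace this step).
    std::vector<int> merged;
    for (const auto& r : runs) {
        std::vector<int> tmp;
        tmp.reserve(merged.size() + r.size());
        std::merge(merged.begin(), merged.end(), r.begin(), r.end(),
                   std::back_inserter(tmp));
        merged.swap(tmp);
    }
    rows.swap(merged);
}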

Inter-operation parallelism :
This term refers to the concurrent execution of different operations within a query expression. They
come in two varieties:
Pipelined parallelism — in pipelined parallelism, a second operation consumes the rows of the first
operation's output before the first operation has finished producing its whole output set. It is also
feasible to run these two operations concurrently on different CPUs, so that one operation consumes
tuples at the same time as the other produces them, reducing the intermediate results that have to
be kept.
This is advantageous for systems with a limited number of CPUs and avoids writing interim results
to disk.
Independent parallelism — in this form of parallelism, operations contained within a query expression
that are independent of one another may be carried out concurrently. This form is most helpful when
only a low degree of parallelism is available.

Execution Of a Parallel Query :


Commercial database technology has favoured the relational model over the earlier hierarchical and
network models. Data independence and high-level query languages (e.g., SQL) are the key
advantages that relational database systems (RDBMSs) have over their forerunners.
Programmer productivity is increased, and automatic optimization is encouraged.
Additionally, distributed database management is made easier by the relational model's set-oriented
structure. RDBMSs can now offer performance levels comparable to older systems thanks to a
decade of development and tuning.
They are therefore widely employed in the processing of commercial data for OLTP (online
transaction processing) or decision-support systems. Through the use of many processors working
together, parallel processing makes use of multiprocessor computers to run application programs
and boost performance.
It has traditionally been used in scientific computing, where it improves the response times of
numerical applications.

The development of parallel database systems is an example of how database management and
parallel computing can work together. In a parallel database system, parallel query optimization
(PQO) divides a given SQL statement into components that can run concurrently on several
processors in a multi-processor machine.
Full table scans, sorting, sub-queries, data loading, and other common operations can all be
performed in parallel.
As a form of parallel database optimization, Parallel Query enables the division of SELECT or DML
operations into many smaller chunks that can be executed by PQ slaves on different CPUs in a single
box.
In the first phase, sorting and rewriting, the order of joins and the method for computing each join
are fixed. The second phase, parallelization, turns the query tree into a parallel plan; it is divided
into two parts: extraction of parallelism and scheduling.
Optimizing database queries is an important task in database management systems to improve the
performance of database operations. Parallelization of database query optimization can significantly
improve query execution time by dividing the workload among multiple processors or nodes.

Here's an overview of how parallelization can be applied to database query optimization:

1. Partitioning: The first step is to partition the data into smaller subsets. The partitioning can be done
based on different criteria, such as range partitioning, hash partitioning, or list partitioning. This can
be done in parallel by assigning different processors or nodes to handle different parts of the
partitioning process.

2. Query optimization: Once the data is partitioned, the next step is to optimize the queries. Query
optimization involves finding the most efficient way to execute the query by considering factors such
as index usage, join methods, and filtering. This can also be done in parallel by assigning different
processors or nodes to handle different parts of the query optimization process.

3. Query execution: After the queries are optimized, the final step is to execute the queries. The
execution can be done in parallel by assigning different processors or nodes to handle different parts
of the execution process. The results can then be combined to generate the final result set.

To implement parallelization of database query optimization, we can use parallel programming
frameworks such as OpenMP or CUDA. These frameworks provide a set of APIs and tools to
distribute the workload among multiple processors or nodes and to manage the synchronization and
communication between them.

Here's an example of how we can parallelize the query optimization process using OpenMP:

// C++ / OpenMP sketch. data, queries and num_queries are assumed to be
// populated earlier; partition_data, get_partition_id, optimize_query,
// execute_query and merge_results are assumed to be defined elsewhere.

#include <omp.h>
#include <vector>

// Partition the data
int num_partitions = omp_get_max_threads();
std::vector<std::vector<int>> partitions(num_partitions);
#pragma omp parallel for
for (int i = 0; i < num_partitions; i++) {
    // Each iteration writes only its own slot, so no locking is needed.
    partitions[i] = partition_data(data, i, num_partitions);
}

// Optimize the queries in parallel
#pragma omp parallel for
for (int i = 0; i < num_queries; i++) {
    Query query = queries[i];
    int partition_id = get_partition_id(query, partitions);
    std::vector<int>& partition = partitions[partition_id];
    optimize_query(query, partition);
}

// Execute the queries in parallel
#pragma omp parallel for
for (int i = 0; i < num_queries; i++) {
    Query query = queries[i];
    int partition_id = get_partition_id(query, partitions);
    std::vector<int>& partition = partitions[partition_id];
    std::vector<int> result = execute_query(query, partition);
    #pragma omp critical
    merge_results(result);   // merging into shared state must be serialized
}

In this example, we first partition the data into smaller subsets using OpenMP parallelism. Then we
optimize each query in parallel by assigning different processors or nodes to handle different parts of
the optimization process. Finally, we execute the queries in parallel by assigning different processors
or nodes to handle different parts of the execution process.

Parallelization of database query optimization can significantly improve the performance of database
operations and reduce query execution time. However, it requires careful consideration of the
workload distribution, synchronization, and communication between processors or nodes.
Mini Project : 4
Title - Implement Non-Serial Polyadic Dynamic Programming with GPU Parallelization.

Theory -

Parallelization of Non-Serial Polyadic Dynamic Programming (NPDP) on high-throughput manycore
architectures, such as NVIDIA GPUs, suffers from load imbalance, i.e. a non-optimal mapping
between the sub-problems of NPDP and the processing elements of the GPU.
NPDP exhibits non-uniformity in the number of subproblems as well as in computational complexity
across its phases. In NPDP parallelization, phases are computed sequentially, whereas the
subproblems of each phase are computed concurrently.
Therefore, it is essential to map the subproblems of each phase effectively to the processing
elements while implementing thread-level parallelism. We propose an adaptive Generalized Mapping
Method (GMM) for NPDP parallelization that utilizes the GPU for efficient mapping of subproblems
onto processing threads in each phase.
The input size and the targeted GPU determine the available computing power and the best mapping
for each phase in NPDP parallelization. The performance of GMM is compared with different
conventional parallelization approaches.
For sufficiently large inputs, our technique outperforms the state-of-the-art conventional
parallelization approach and achieves a significant speedup of a factor of 30. We also summarize
general heuristics for achieving better gains in NPDP parallelization.
Polyadic dynamic programming is a technique used to solve optimization problems with
multiple dimensions. Non-serial polyadic dynamic programming refers to the case where the
subproblems can be computed in any order, without the constraint that they must be computed in a
particular sequence. This makes it possible to parallelize the computation on a GPU.

Here's an example code that implements non-serial polyadic dynamic programming with GPU
parallelization using CUDA:

#include <iostream>
#include <cuda_runtime.h>

// Dimensions of the problem (kept small enough that the dp array of
// N * M * K floats fits comfortably in GPU memory).
#define N 256
#define M 256
#define K 256

// Number of threads per block along each axis (8 * 8 * 8 = 512 threads).
#define BLOCK_SIZE 8

// GPU kernel: each thread computes one subproblem dp[i][j][k].
__global__ void compute_subproblem(float* dp, const float* x, const float* y, const float* z) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i >= N || j >= M || k >= K) return;

    // Compute the value of the subproblem.
    float value = x[i] * y[j] * z[k];

    // Compute the index into the flattened dp array and store the value.
    int index = i * M * K + j * K + k;
    dp[index] = value;
}

int main() {
    // Allocate memory for the input arrays on the CPU.
    float* x = new float[N];
    float* y = new float[M];
    float* z = new float[K];

    // Initialize the input arrays.
    for (int i = 0; i < N; i++) x[i] = i;
    for (int j = 0; j < M; j++) y[j] = j;
    for (int k = 0; k < K; k++) z[k] = k;

    // Allocate memory for the dp array on the GPU.
    float* d_dp;
    cudaMalloc(&d_dp, (size_t)N * M * K * sizeof(float));

    // Copy the input arrays to the GPU.
    float* d_x;
    cudaMalloc(&d_x, N * sizeof(float));
    cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);

    float* d_y;
    cudaMalloc(&d_y, M * sizeof(float));
    cudaMemcpy(d_y, y, M * sizeof(float), cudaMemcpyHostToDevice);

    float* d_z;
    cudaMalloc(&d_z, K * sizeof(float));
    cudaMemcpy(d_z, z, K * sizeof(float), cudaMemcpyHostToDevice);

    // Compute the dp array on the GPU: a single 3D launch covers every (i, j, k),
    // since the subproblems of this example carry no dependencies on each other.
    dim3 threadsPerBlock(BLOCK_SIZE, BLOCK_SIZE, BLOCK_SIZE);
    dim3 blocksPerGrid((N + BLOCK_SIZE - 1) / BLOCK_SIZE,
                       (M + BLOCK_SIZE - 1) / BLOCK_SIZE,
                       (K + BLOCK_SIZE - 1) / BLOCK_SIZE);
    compute_subproblem<<<blocksPerGrid, threadsPerBlock>>>(d_dp, d_x, d_y, d_z);
    cudaDeviceSynchronize();

    // Copy the dp array back to the CPU.
    float* dp = new float[(size_t)N * M * K];
    cudaMemcpy(dp, d_dp, (size_t)N * M * K * sizeof(float), cudaMemcpyDeviceToHost);

    // Print the result.
    std::cout << "dp[" << N - 1 << "][" << M - 1 << "][" << K - 1 << "] = "
              << dp[(size_t)(N - 1) * M * K + (M - 1) * K + (K - 1)] << std::endl;

    // Free memory on the GPU and the CPU.
    cudaFree(d_dp);
    cudaFree(d_x);
    cudaFree(d_y);
    cudaFree(d_z);
    delete[] x;
    delete[] y;
    delete[] z;
    delete[] dp;
    return 0;
}
