Papers by Ripon Patgiri
Every activity these days generates data, from purchasing a movie ticket to researching the benefits of a product. Various models are immediately trained on this data. Due to privacy, security, and usability concerns, a user may want their privately owned data to be forgotten by the system. Machine unlearning for security is studied in this context. Several spam email detection methods exist, each of which employs a different algorithm to detect undesired spam emails. But these models are vulnerable to attacks. Many attackers exploit a model by polluting its training data in various ways. To act deftly in such situations, the model needs to readily unlearn the polluted data without retraining. Retraining is impractical in most cases, as the model has already been trained on a massive amount of data that would have to be trained again just to remove a small amount of polluted data, often significantly less than 1%. This pro...
The world has been evolving with new technologies and advances day by day. With the advent of various learning technologies in every field, the research community is able to provide solutions in every aspect of life through applications of Artificial Intelligence, Machine Learning, Deep Learning, Computer Vision, etc. However, despite such high achievements, these systems lag behind in their ability to explain their predictions. The current situation is that these modern technologies can predict and decide upon various cases more accurately and speedily than a human, but fail to provide an answer when asked why their predictions should be trusted. To attain a deeper understanding of this rising trend, we explore a very recent and much-discussed contribution which provides rich insight into a prediction being made -- ``Explainability.'' The main premise of this survey is to provide an overview of the research explored in the d...
ICST Transactions on Scalable Information Systems
Distributed Denial-of-Service (DDoS) is a menace to service providers and a prominent issue in network security. Defeating or defending against DDoS is a prime challenge. A DDoS attack makes a service unavailable for a certain time, which harms the service provider and hence causes loss of business revenue. Therefore, DDoS is a grand challenge to defeat. There are numerous mechanisms to defend against DDoS; this paper surveys the deployment of the Bloom Filter in defending against DDoS attacks. The Bloom Filter is a probabilistic data structure for membership queries that returns either true or false, and it uses a tiny amount of memory to store information about large data. Therefore, packet information is stored in a Bloom Filter to defend against and defeat DDoS. This paper presents a survey of DDoS defence techniques using the Bloom Filter.
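As background for the survey above, the sketch below shows the basic membership-query behaviour of a Bloom filter: k hash positions are set on insertion, and all k must be set for a positive query, so false positives are possible but false negatives are not. The double-hashing scheme, the class name, and the packet-source-IP usage are illustrative assumptions, not a construction from any particular surveyed paper.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch for membership queries (illustration only).
public class BloomFilter {
    private final BitSet bits;
    private final int m;   // number of bits in the filter
    private final int k;   // number of hash functions

    public BloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int position(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (item + "#salt").hashCode();   // hypothetical second hash
        return Math.floorMod(h1 + i * h2, m);
    }

    // Insert: set all k positions, e.g. for a packet's source IP.
    public void add(String item) {
        for (int i = 0; i < k; i++) bits.set(position(item, i));
    }

    // Query: true means "possibly seen" (false positives possible),
    // false means "definitely not seen" -- the property used to filter packets.
    public boolean mightContain(String item) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(item, i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilter seen = new BloomFilter(1 << 20, 7);
        seen.add("10.0.0.1");
        System.out.println(seen.mightContain("10.0.0.1"));    // true
        System.out.println(seen.mightContain("192.168.1.9")); // almost surely false
    }
}
```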
In this exabyte-scale era, data increases at an exponential rate, which in turn generates a massive amount of metadata in the file system. Hadoop is the most widely used framework to deal with Big Data, but due to this growth of metadata, the efficiency of Hadoop has been questioned numerous times by many researchers. Therefore, it is essential to create an efficient and scalable metadata management scheme for Hadoop. Hash-based mapping and subtree partitioning are suitable for distributed metadata management. Subtree partitioning does not uniformly distribute the workload among the metadata servers, and metadata needs to be migrated to keep the load roughly balanced. Hash-based mapping suffers from a constraint on the locality of metadata, though it uniformly distributes the load among NameNodes, the metadata servers of Hadoop. In this paper, we present a circular metadata management mechanism named Dynamic Circular Metadata Splitting (DCMS). DCMS preserves metadata locality using consistent hashing as well as locality-preserving hashing, keeps replicated metadata for excellent reliability, and dynamically distributes metadata among the NameNodes to keep the load balanced. The NameNode is the centralized heart of Hadoop: it keeps the directory tree of all files, and its failure therefore constitutes a Single Point of Failure (SPOF). DCMS removes Hadoop's SPOF and provides efficient and scalable metadata management. The new framework is named 'Dr.Hadoop' after the name of the authors.
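To make the consistent-hashing part of the abstract concrete, here is a minimal ring-placement sketch: each NameNode appears at several virtual points on a hash ring, and a directory's metadata is owned by the next NameNode clockwise from the hash of its path. The class and method names are hypothetical; DCMS's locality-preserving hashing and replication are not shown.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of consistent-hash placement of directory metadata on NameNodes.
public class MetadataRing {
    private final SortedMap<Integer, String> ring = new TreeMap<>();

    // Place a NameNode at several virtual points so load spreads evenly
    // and only a small key range moves when a node joins or leaves.
    public void addNameNode(String nameNode, int virtualPoints) {
        for (int v = 0; v < virtualPoints; v++) {
            ring.put(hash(nameNode + "#" + v), nameNode);
        }
    }

    public void removeNameNode(String nameNode, int virtualPoints) {
        for (int v = 0; v < virtualPoints; v++) {
            ring.remove(hash(nameNode + "#" + v));
        }
    }

    // The metadata of a directory is owned by the first NameNode found
    // clockwise from the hash of its path.
    public String ownerOf(String directoryPath) {
        int h = hash(directoryPath);
        SortedMap<Integer, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private int hash(String key) {
        return key.hashCode() & 0x7fffffff;   // simple non-negative hash
    }

    public static void main(String[] args) {
        MetadataRing ring = new MetadataRing();
        ring.addNameNode("nn1", 100);
        ring.addNameNode("nn2", 100);
        ring.addNameNode("nn3", 100);
        System.out.println(ring.ownerOf("/user/alice/logs"));
    }
}
```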
Smart Innovation, Systems and Technologies, 2015
Big Data, the buzz around the globe in recent days, refers to large-scale data with huge volume, variety, and a genuinely complex structure. The last few years of internet technology, as well as the computer world, have seen a lot of growth and popularity in the field of cloud computing. As a consequence, these cloud applications are continually generating big data. There are various burning problems associated with big data in the research field, such as how to store, analyze, and visualize it to generate further outcomes. This paper initially points out recently developed information technologies in the field of big data. Later on, the paper outlines the major key problems regarding big data, such as proper load balancing, storage and processing of small files, and de-duplication.
The size of the data used in today's enterprises has been expanding enormously over the last few years. Simultaneously, the need to process and analyze these large volumes of data has also increased. The Hadoop Distributed File System (HDFS) is an open-source Apache implementation designed to run on commodity hardware and handle applications with large datasets (TBs, PBs). The HDFS architecture is based on a single master (NameNode), which handles the metadata for a large number of slaves. For maximum efficiency, the NameNode stores all of the metadata in its RAM. So, when dealing with a huge number of small files, the NameNode often becomes a bottleneck for HDFS as it may run out of memory. Apache Hadoop uses the Hadoop ARchive (HAR) to deal with small files, but it is not efficient in a multi-NameNode environment, which requires automatic scaling of metadata. In this paper, we design a hashtable-based architecture, Hadoop ARchive Plus (HAR+), which uses SHA-256 as the key and is a modification of the existing HAR. HAR+ is designed to provide more reliability as well as auto-scaling of metadata. Instead of using one NameNode to store the metadata, HAR+ uses multiple NameNodes. Our results show that HAR+ reduces the load on a single NameNode by a significant amount, which makes the cluster more scalable, more robust, and less prone to failure than the Hadoop ARchive.
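The core idea above, keying archived-file metadata by SHA-256 and spreading it over several NameNodes, can be sketched as follows. The modulo placement rule and all names here are illustrative assumptions rather than HAR+'s actual design.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: hash a small file's path with SHA-256 and pick one of n NameNodes.
public class Sha256Placement {

    // Hash the file path to a fixed 256-bit key.
    static byte[] sha256(String filePath) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        return md.digest(filePath.getBytes(StandardCharsets.UTF_8));
    }

    // Choose a NameNode from the key so metadata load is spread across
    // several NameNodes instead of piling up on a single one.
    static int nameNodeFor(String filePath, int numNameNodes) throws NoSuchAlgorithmException {
        BigInteger key = new BigInteger(1, sha256(filePath));
        return key.mod(BigInteger.valueOf(numNameNodes)).intValue();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        String smallFile = "/archive/2014/sensor-000123.txt";   // hypothetical path
        System.out.println("metadata goes to NameNode " + nameNodeFor(smallFile, 4));
    }
}
```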
2014 International Conference on High Performance Computing and Applications (ICHPCA), 2014
The size of the data used in today's enterprises has been growing at exponential rates over the last few years. Simultaneously, the need to process and analyze these large volumes of data has also increased. To handle and analyze large datasets, Hadoop, an open-source Apache framework, is widely used nowadays. For managing and storing all the resources across its cluster, Hadoop provides a distributed file system called the Hadoop Distributed File System (HDFS). HDFS is written completely in Java and is designed so that it can store Big Data reliably and stream it to user applications at high throughput. In recent days, Hadoop has been used widely by popular organizations like Yahoo, Facebook, and various online shopping vendors. On the other hand, experiments on data-intensive computation are ongoing to parallelize the processing of data, yet none of them has achieved the desired performance. Hadoop, with its Map-Reduce parallel data processing capability, can achieve these goals efficiently [1]. This paper initially provides an overview of HDFS in detail. Later on, the paper reports experimental work on Hadoop with big data and suggests the various factors that affect Hadoop cluster performance. The paper concludes with the different real-world challenges of Hadoop in recent days and the scope for future work.
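The Map-Reduce processing model referred to above is easiest to see in the classic word-count job, sketched below with the standard org.apache.hadoop.mapreduce API. This is the textbook example, not the workload measured in the paper's experiments.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs in parallel over HDFS blocks, emitting (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sums the counts for each word after the shuffle.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```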
Undoubtedly, MapReduce is the most powerful programming paradigm in distributed computing. Enhancing MapReduce is essential, as it can make computing faster. Therefore, there are many scheduling algorithms to discuss based on their characteristics, and many shortcomings still to be discovered in this field. In this article, we present state-of-the-art scheduling algorithms to enhance the understanding of these algorithms. The algorithms are presented systematically so that this article can point to many future possibilities in scheduling. We provide in-depth insight into MapReduce scheduling algorithms and, in addition, discuss various issues of MapReduce schedulers developed for large-scale computing as well as heterogeneous environments.
inproceedings by Ripon Patgiri
Undoubtedly, Big Data is the most promising technology for serving an organization in a better way. It provides an organized way to think about data, whatever the data size and whatever the data type. Moreover, Big Data provides a platform for making decisions and for analyzing future possibilities using past and present data. Big Data technology makes large datasets easier to store, process, and manage. Big Data is the most fashionable trendsetter in the world of computing, i.e., the most popular buzzword around the globe, upon which the future of most IT industries depends. This paper presents a study of the numerous research issues and challenges of Big Data as employed on very large datasets. The paper uncovers the nuts and bolts of Big Data and provides rich insight into it.
Big Data, the most popular paradigm, consists of a huge set of data. Big Data storage is built using block stores, file systems, object stores, and/or hybrids of these systems, held together by metadata. In this paper, we present the role of the metadata server (MDS) in a file system for Big Data. The metadata of a Big Data storage system is large, and therefore a standalone metadata server (MDS) cannot hold the entire metadata of the storage system. Thus, metadata servers are augmented to form a clustered metadata server. Moreover, the clustered MDS defines the retrieval process of data from the clustered storage media. Besides, the clustered MDS defines the scalability of Big Data storage and offers high efficiency and fast access to data in large-scale data storage. The size of Big Data is typically petabytes to exabytes. Therefore, there is a strong requirement for an efficient and effective clustered MDS to define the access mechanism of Big Data storage. In this paper, we sketch the effect of the MDS on very large-scale file systems.
In this paper, we present a probabilistic data structure for membership filtering, called rFilter. The rFilter is an extension of the Bloom filter data structure and is a simple yet powerful membership filter of its kind. The rFilter requires constant time and very low space. In addition, it is highly scalable and space-efficient compared to other variants of the Bloom filter, and it avoids complex hashing overhead. The rFilter improves space performance by 16x (93.75%) over an integer representation of an array and by 4x (75%) over a character representation of an array. Moreover, the rFilter drastically reduces the chance of false positives compared to the conventional method.
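As a rough check of the quoted space factors, 16x (93.75%) over an integer array and 4x (75%) over a character array are exactly what one gets by storing each filter cell in 2 bits instead of a 32-bit integer or an 8-bit character. The sketch below only illustrates that packing arithmetic; the 2-bit-per-cell representation is an assumption, and this is not the rFilter construction itself.

```java
// Illustration of the space arithmetic only: packing 2-bit cells into long[].
// 32 bits / 2 bits = 16x over int[]; 8 bits / 2 bits = 4x over a C-style char[].
public class TwoBitCells {
    private final long[] words;   // each 64-bit word holds 32 two-bit cells

    public TwoBitCells(int numCells) {
        this.words = new long[(numCells + 31) / 32];
    }

    public void set(int cell, int value) {        // value in 0..3
        int word = cell / 32, shift = (cell % 32) * 2;
        words[word] = (words[word] & ~(0b11L << shift)) | ((long) (value & 0b11) << shift);
    }

    public int get(int cell) {
        int word = cell / 32, shift = (cell % 32) * 2;
        return (int) ((words[word] >>> shift) & 0b11);
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        TwoBitCells cells = new TwoBitCells(n);
        cells.set(42, 3);
        System.out.println(cells.get(42));                             // 3
        System.out.println("int[]  bytes: " + 4L * n);                 // 4,000,000
        System.out.println("char[] bytes: " + 1L * n);                 // treating char as 1 byte (C-style)
        System.out.println("2-bit  bytes: " + 8L * cells.words.length); // ~250,000
    }
}
```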
In this paper, we present an elastic array, called Elastica (Elastic Array). Elastica is the most space-efficient resizable array, that is, it allows the array size to grow and shrink. Elastica is an array-of-arrays data structure that supports growing and shrinking the array dynamically. Most interestingly, Elastica allocates memory block-wise and provides access to an element in O(1) time. Elastica allows the memory blocks to be non-contiguous without violating the properties of a conventional single-dimensional array. It offers the most space-efficient resizable array, with O(log2 n) extra space complexity. Elastica can be used to implement various data structures in the same way as a conventional array, and it has additional advantages when implementing existing data structures.
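The array-of-arrays idea with O(1) element access can be illustrated with geometrically growing blocks: element i lives in block floor(log2(i+1)) at offset (i+1) - 2^block, and the block directory holds only O(log2 n) pointers. This is a hedged sketch of the general technique under those assumptions, not the authors' exact Elastica layout.

```java
// Sketch of a resizable array built from non-contiguous, doubling blocks.
public class BlockedArray {
    private long[][] blocks = new long[1][];   // directory of blocks (O(log n) entries)
    private int numBlocks = 0;                 // blocks currently allocated
    private int size = 0;                      // logical number of elements

    // Append one element, allocating a new block only when the last one is full.
    public void add(long value) {
        int capacity = (1 << numBlocks) - 1;   // 2^0 + 2^1 + ... + 2^(k-1)
        if (size == capacity) {
            if (numBlocks == blocks.length) {  // grow the tiny directory
                long[][] bigger = new long[blocks.length * 2][];
                System.arraycopy(blocks, 0, bigger, 0, blocks.length);
                blocks = bigger;
            }
            blocks[numBlocks] = new long[1 << numBlocks];
            numBlocks++;
        }
        set(size, value);
        size++;
    }

    // O(1) index translation: block = floor(log2(i+1)), offset = (i+1) - 2^block.
    public long get(int i) {
        int block = 31 - Integer.numberOfLeadingZeros(i + 1);
        return blocks[block][(i + 1) - (1 << block)];
    }

    public void set(int i, long value) {
        int block = 31 - Integer.numberOfLeadingZeros(i + 1);
        blocks[block][(i + 1) - (1 << block)] = value;
    }

    // Shrink: drop the last block once it becomes empty again.
    public void removeLast() {
        if (size == 0) throw new IllegalStateException("empty");
        size--;
        if (size == (1 << (numBlocks - 1)) - 1) {
            blocks[--numBlocks] = null;        // free the now-empty block
        }
    }

    public int size() { return size; }
}
```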
It is the Big Data era. In this epoch, terabytes of data are a dime a dozen, and so are large metadata sizes. Large-scale metadata becomes a barrier in the era of exascale computation. However, fine-tuning and well-designed metadata can enhance the performance of a file system. Therefore, large-scale metadata server (MDS) design has become a key research topic nowadays. Designing an MDS becomes a prominent challenge when the metadata grows and becomes unmanageable, and metadata management is also a challenging task when serving very large-scale metadata. Distributed Metadata Servers (dMDS) are becoming mature enough to serve huge metadata. In this paper, we present state-of-the-art metadata server technology and cover the issues and challenges of dMDS.
Big Data is the most prominent paradigm nowadays. Big Data began its rule slowly from 2003 and is expected to rule and dominate the IT industries at least up to 2030. Furthermore, Big Data has won the technological war and easily captured the entire market since 2009. Big Data is exploding everywhere around the world, in every domain. Big Data, a massive amount of data, is able to generate billions in revenue, and the secret behind these billions is its ever-growing volume. This paper presents a redefinition of the volume of Big Data. Volume is redefined by engaging three other V's, namely voluminosity, vacuum, and vitality. Furthermore, this paper adds two new V's to the Big Data paradigm, namely vendee and vase. The paper explores all the V's of Big Data; there is a lot of controversy and confusion regarding them, and this paper clears up the confusion around the V family of Big Data.
inbooks by Ripon Patgiri
The size of the data used in today's enterprises has been growing at exponential rates over the last few years. Simultaneously, the need to process and analyze these large volumes of data has also increased. To handle and analyze large-scale datasets, Hadoop, an open-source Apache framework, is widely used nowadays. For managing and storing all the resources across its cluster, Hadoop provides a distributed file system called the Hadoop Distributed File System (HDFS). HDFS is written completely in Java and is designed so that it can store Big Data reliably and stream it to user applications at high throughput. Hadoop has been widely used in recent days by popular organizations like Yahoo, Facebook, and various online shopping vendors. On the other hand, experiments on data-intensive computation are ongoing to parallelize the processing of data, yet none of them has achieved the desired performance. Hadoop, with its Map-Reduce parallel data processing capability, can achieve these goals efficiently. This chapter initially provides an overview of HDFS in detail. The next portion of the chapter evaluates Hadoop's performance under various factors in different environments. The chapter shows how files smaller than the block size affect Hadoop's read/write performance and how the execution time of a job depends on the block size and the number of reducers. The chapter concludes with the different real challenges of Hadoop in recent days and the scope for future work.
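The two knobs studied in the chapter, the HDFS block size and the number of reducers, would be set roughly as below with the standard Hadoop client API. The property name dfs.blocksize is the Hadoop 2.x name for the HDFS block size; the 256 MB and 4-reducer values are arbitrary examples, not settings recommended by the chapter.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Small sketch of tuning the block size and reducer count for a Hadoop job.
public class TuningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Block size decides how a file is split; files much smaller than a
        // block still cost a full metadata entry on the NameNode.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);   // 256 MB

        Job job = Job.getInstance(conf, "block-size-and-reducers demo");

        // The number of reduce tasks directly affects job execution time.
        job.setNumReduceTasks(4);

        System.out.println("block size = " + conf.getLong("dfs.blocksize", 0));
        System.out.println("reducers   = " + job.getNumReduceTasks());
    }
}
```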