DOI: 10.3233/JIFS-223295
IOS Press
M.R. Sundarakumar a,∗, G. Mahadevan b, R. Natchadalingam c, G. Karthikeyan d, J. Ashok e,
J. Samuel Manoharan f, V. Sathya g and P. Velmurugadass h
a Research Scholar, AMC Engineering College, Bangalore, India
b AMC Engineering College, Bangalore, India
c School of Computing and Information Technology, Reva University, Bengaluru, India
d Department of EEE, Sona College of Technology, Salem, Tamil Nadu, India
e Department of ECE, V.S.B Engineering College, Karur, Tamil Nadu, India
f Department of ECE, Sir Isaac Newton College of Engineering and Technology, Nagapattinam, Tamil Nadu,
India
g Department of Artificial Intelligence and Data Science, Panimalar Engineering College, Chennai, India
h Department of Computer Science & Engineering, Kalasalingam Academy of Research and Education, Tamil Nadu, India
Abstract. In the modern era, processing the huge volume of digital data held in repositories is challenging because of the variety of data formats and extraction techniques available. The accuracy and speed of data processing on larger networks are limited, even with modern tools, when quick results are needed. The major problems in extracting data from a repository are finding the data location and handling dynamic changes in the existing data. Although many researchers have created tools and algorithms for processing such data from the warehouse, they have not given accurate results and suffer from high latency, an outcome of batch processing over larger networks. Database scalability has to be tuned with a powerful distributed framework and programming languages so that the latest real-time applications can process huge datasets over the network. In big data analytics, such data processing is done effectively using the modern tools HADOOP and SPARK. Moreover, a recent programming language such as Python provides solutions based on the concepts of map reduction and erasure coding. But these still face challenges and limitations on huge datasets across network clusters. This review deals with the features of Hadoop and Spark, along with their challenges and limitations over different criteria such as file size, file formats, and scheduling techniques. It surveys in detail the challenges and limitations that occur during the processing phase in big data analytics and provides solutions by selecting suitable languages and techniques using modern tools. This paper offers guidance to researchers working in big data analytics for improving the speed of data processing with a proper algorithm over digital data in huge repositories.
1. Introduction

In this digital world, people generate a huge volume of information as data for their real-world applications and needs. Every day, plenty of data is created in various domains like healthcare, retail, banking, industries, and companies [1]. The data warehouse was created to store it, but the time taken to retrieve it on time is miserable. Multiple methods and algorithms are used in a data warehouse as a mining process, but they are not apt for plenty of situations. Later, data analytics was disseminated to the market for managing large amounts of data from various repositories [2]. Those data are accessed using modern tools like Hadoop and Spark [3], after which data mining algorithms are applied for analytics, and the modified data is stored in various places according to user requirements. The major problems occur during the extraction phase, due to the data location and the volume of the repository over the network [4]. Figure 1 explains the main V's used in big data and how

1.1. Literature survey

For parallel processing of data over networks, the Hadoop framework is used. Data processing between the nodes is based on their location and on migration between places, so aligning and arranging all the individual nodes to perform distributed data processing at one time is complicated [10] using normal client-server or peer-to-peer networks. A distributed framework was used to disseminate the data from the repositories, but accuracy and latency are the factors affected in large networks while accessing huge data sets. To overcome this problem, the Hadoop framework arrived and manages all those critical situations easily on commodity hardware. It is a vertical-storage data processing system that is fast in recovering data elements, even from a large dataset or huge repository.

1.2. History of HADOOP

In earlier days, distributed network node files were kept in a
database that was not archived quickly. So Google introduced the concept of the Google File System (GFS) [14], with a file-access index table for referencing the files in a network. Based on that index-searching method, instead of web crawlers, the entire distributed network finishes the element-searching task within optimal time. After that, Google introduced the Map Reduce programming concept to optimize the searched elements from a huge database with map() and reduce() functions. A Distributed File System (DFS) was introduced to store a large volume of data from the commodity hardware nodes using Hadoop, so it is called the Hadoop Distributed File System (HDFS). Yahoo supported 1000 individual nodes as a cluster to distribute a database in parallel. But when Hadoop came into this scenario, all nodes are

Hadoop has introduced its commercial product under the name Apache Hadoop with basic versions. Though Hadoop supports parallel distributed databases on the Shared Nothing Architecture (SNA) [15] principle, it also supports some modern tools for doing data processing. This is called the Hadoop Eco System, which supports all data processing and analytics work. Figure 2 gives a detailed history of Hadoop and its limitations.

1.3. HADOOP ecosystem

Hadoop supports many data mining algorithms and methods for accessing data from a huge data set, with modern tools as a supporting system. Data collected from different resources and stored in a warehouse has to be controlled and monitored for data flow access. This helps to find a minimal or optimal solution for time-consumption issues in the Hadoop framework. Nevertheless, data generation and extraction have to be monitored using the tools of the Hadoop ecosystem, which give the user the required data on time among the clusters. Researchers find difficulty with the network optimization time of the ETL (Extraction, Transformation, and Loading) process, normally because of the CAP (Consistency, Availability, Partition Tolerance) theorem concepts [16]. If any node fails, data alterations are quickly reflected in the cluster by the Hadoop Eco-System tools. So this system deals with the entire big data analytics concept via various tools. Table 1 shows the entire Hadoop Eco-System structure.

The Hadoop Distributed File System (HDFS) consists of a Name Node (NN) and Data Nodes (DN) in a single-node or multi-node cluster setup. Classic Hadoop contains a Job Tracker in the name node and a task tracker in each data node to follow the flow of data access. But the limitations of Hadoop led this architecture to a new concept called replication: for each and every input, once the job completes, the output data is stored in 3 data nodes as replicas [17]. The metadata of the output data is stored to avoid software or hardware faults during transmission time. If any node fails in a cluster, the other nodes are activated and the data is transferred without delay. In later versions of Hadoop, a Secondary Name Node (SNN) was introduced to avoid failure of the name node; its data is copied as an FSImage. Figure 3 denotes the architecture of HDFS and its replication principles.
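The 3-replica behaviour described above can be sketched in a few lines. This is an illustrative simplification, not HDFS's real placement policy (which is rack-aware); the node names and functions are invented for the example.

```python
import random

# Toy model: each block is placed on 3 of the available data nodes,
# so the loss of any single node leaves two readable replicas.

def place_block(block_id, data_nodes, replication=3):
    # Pick 3 distinct nodes to hold replicas of this block.
    return random.sample(data_nodes, replication)

nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
placement = {b: place_block(b, nodes) for b in ["blk_1", "blk_2"]}

failed = placement["blk_1"][0]                  # one replica's node fails
survivors = [n for n in placement["blk_1"] if n != failed]
assert len(survivors) == 2                      # the block is still readable
```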
5234 M.R. Sundarakumar et al. / Review of tuning performance on database scalability
Table 1
Hadoop Eco System

Name | Sqoop | FLUME | HIVE | HBASE | PIG | R
Functions | Structured data | Collects the logs | SQL queries | Stores data in the data warehouse | Latin programming | Refines data from the warehouse
Language | DML | JAVA | JAVA | JAVA | Latin | R
Database Model | RDBMS | NoSQL | JSON | JSON | NoSQL | RDBMS, NoSQL
Consistency Concepts | Yes | Yes | Yes | Yes | Yes | Yes
Concurrency | Yes | Yes | Yes | Yes | Yes | Yes
Durability | Yes | Yes | Yes | Yes | Yes | Yes
Replication | Default | Default | No | No | No | No
Storage Method | LOCAL | HDFS | HDFS | HDFS | HDFS | HDFS
Fig. 3. HDFS Architecture.

1.5. HADOOP versions

The Hadoop framework is used to provide parallel distributed database access with a basic Java programming paradigm [18]. It emphasizes work simplified by Map-Reduce concepts operating among the clusters. Hadoop was developed by Apache, and the basic version was released with several features for doing data processing within a short time. Initially, the Hadoop framework was designed only for performing data processing tasks on a parallel distributed database, and the entire framework runs as a cluster-based network.

• Hadoop 1.X
Hadoop 1.X is the basic version, explained by two major components: Map Reduce and HDFS storage. Map-Reduce is a programming model in which the input file is divided into a number of maps and converted into key-value pairs. The combiner gets these maps as input and reduces them according to the keys produced by the mappers. Finally, the reduced data is stored in HDFS storage, a reliable and redundant storage system for a distributed database. It has a replication factor of 3 by default in its master-slave architecture, and the data nodes create 64 MB blocks to store input data in HDFS.

• Hadoop 2.X
Hadoop is a master-slave architecture by nature, controlled by the Name Node (NN) as the master. The remaining nodes connected to this Name Node are called Data Nodes (DN), the slaves. If the NN fails or is disconnected from the cluster, the entire system collapses. For this critical situation, the Name Node keeps a photocopy of its data in a different node, called the Secondary Name Node (SNN), over the network to satisfy the CAP theorem concepts. This additional feature is available in Hadoop 2.X under the name YARN (Yet Another Resource Negotiator). Here also the replication factor is 3, but the block size is 128 MB for input data storage [17, 19]. Table 2 gives the technical differences between these versions.

• Hadoop 3-version
Hadoop 3.X is the latest version of Apache Hadoop, developed to overcome the problems of the previous versions, which lie mainly in the number of blocks allocated for input data. For example, if 6 blocks are needed for storing the input data, 6 × 3 = 18 blocks are needed for replication. The storage overhead is calculated as the extra blocks divided by the original blocks, multiplied by 100, which gives a 200 percent result. This extra memory-space allocation causes cost problems for businesses. So in Hadoop 3.X, erasure coding [20, 21] is used to reduce that extra memory space to a 50 percent overhead. Figures 4 and 5 explain it.

The diagrams describe erasure coding as a Hadoop 3.X feature. The replication across 3 nodes can be divided and combined onto two nodes using the XOR function as parity-block storage. The same 6 blocks were taken for input file storage: instead of 18 blocks, only 9 blocks are allocated, which means 3 blocks of extra storage. So the overhead storage is 3 divided by 6, multiplied by 100, which gives only 50%. Here the storage is denoted as the Data Lake
version of Hadoop, which improves the speed of data processing in big data analytics. Table 3 denotes all the technical features of these versions.

Table 2
Hadoop versions differences

HADOOP 1.X | HADOOP 2.X
4,000 nodes per cluster | 10,000 nodes per cluster
Job Tracker work is the bottleneck | A YARN cluster is used
One namespace | Multiple namespaces in HDFS
Static maps and reducers | Not restricted
Only map-reduce jobs | Any application that integrates with HADOOP
Works based on the number of tasks in a cluster | Works based on cluster size
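The replication-versus-erasure-coding arithmetic described for Hadoop 3.X can be sketched as follows. This is illustrative only: real HDFS erasure coding uses Reed-Solomon schemes (e.g., RS(6,3)) rather than a single XOR parity block, but XOR shows the recovery idea in its simplest form.

```python
# Storage overhead = extra blocks / original blocks * 100 (as in the text).
def overhead_percent(extra_blocks, data_blocks):
    return extra_blocks / data_blocks * 100

# 3-way replication of 6 data blocks: 18 stored, 12 extra -> 200 %
# erasure coding of 6 data blocks:     9 stored,  3 extra ->  50 %
assert overhead_percent(12, 6) == 200.0
assert overhead_percent(3, 6) == 50.0

# XOR parity: losing either data block is recoverable from the other + parity.
block_a = bytes([0x12, 0x34])
block_b = bytes([0x56, 0x78])
parity = bytes(a ^ b for a, b in zip(block_a, block_b))

recovered_a = bytes(p ^ b for p, b in zip(parity, block_b))
assert recovered_a == block_a
```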
1.6. Schedulers used in HADOOP

Table 3

Fig. 6. Hadoop 3. X YARN Architecture.
Table 4
Schedulers' drawbacks

Type of Scheduler | Pros | Cons | Remarks
FIFO | Effective implementation | Poor data locality | Static allocation
FAIR | Short response time | Unbalanced workload | Homogeneous system
CAPACITY | Uses unused capacity for jobs | Complex implementation | Homogeneous system, non-preemptive
Delay | Simple scheduling | Does not work in all situations | Homogeneous system, static
Matchmaking | Good data locality | More response time | Homogeneous system, static
LATE | Handles heterogeneity | Lack of reliability | Homogeneous & heterogeneous systems
Deadline Constraint | Optimizes timing | Cost is high | Homogeneous/heterogeneous, dynamic
Resource Aware | Monitors cluster nodes | Extra time for monitoring | Homogeneous/heterogeneous, dynamic
HPCA | High hit rate and redundancy | Cluster state changes | Homogeneous/heterogeneous, dynamic
Round Robin | Proper work completion | No priority is given | Homogeneous/heterogeneous, dynamic
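The FIFO-versus-FAIR contrast in Table 4 can be illustrated with a toy simulation. This is an assumed simplification invented for the example; real YARN schedulers allocate containers and resources, not abstract "task slots".

```python
from collections import deque

def fifo(jobs):
    # FIFO: run each job to completion in arrival order.
    order, queue = [], deque(jobs)
    while queue:
        job, remaining = queue.popleft()
        order.extend([job] * remaining)
    return order

def fair(jobs):
    # FAIR: round-robin one task slot at a time across all waiting jobs.
    order, queue = [], deque(jobs)
    while queue:
        job, remaining = queue.popleft()
        order.append(job)
        if remaining > 1:
            queue.append((job, remaining - 1))
    return order

jobs = [("A", 3), ("B", 1)]
# fifo(jobs) -> ['A', 'A', 'A', 'B']  (short job B waits behind all of A)
# fair(jobs) -> ['A', 'B', 'A', 'A']  (B gets a slot early: short response time)
```

The short job finishing early under FAIR is exactly the "short response time" pro listed in the table; the cost is that long jobs are interleaved, which can unbalance the workload.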
some limitations while distributed data processing is running inside the clusters [27]. Multiple factors affect the Hadoop features and reduce the performance of Hadoop in distributed data processing scenarios. Some of these points are discussed below with their major parameters.
Table 5
Hadoop Features

Features | Usage
Various Data Sources | Multiple networks
Availability | Hadoop has a replication feature, which means data stored in a node is replicated on three different nodes, so there is no problem with availability.
Scalable | Many nodes can be connected in a cluster, as single-node or multi-node, at any time and anywhere.
Cost-Effective | Hadoop is an open-source framework for use by all companies that create huge volumes of data dynamically.
Low Network Traffic | Traffic does not affect the data processing task, because of the connectivity among cluster nodes.
High Throughput | The Map-Reduce programming paradigm provides high throughput between the nodes connected in Hadoop through its divide-and-conquer job processing.
Compatibility | Hadoop is a framework that accepts all operating system platforms, programming languages, and modern tools of the Hadoop ecosystem.
Multiple Language Support | Hadoop is suitable for all object-oriented programming languages, such as Java, Python, and Scala. Moreover, it integrates effectively with Hadoop ecosystem tools.
• While accessing small files [28], due to the default block size their speed is low and the allocation of memory is huge. To avoid this, merging of small files, HAR extension files (Hadoop Archives), and HBase tools can be used.
• High-level data storage and network-level problems are raised when we talk about security concerns [29] in a larger network; these can be solved using HDFS ACLs for authentication purposes and YARN (Yet Another Resource Negotiation) as the Application Manager.
• Batch-wise data input processing works, but not real-time data access. Tools like SPARK and FLINK are used to handle that.
• More lines of code (120,000) [30] cannot be accessed, but using SPARK and FLINK it is possible.
• It does not support repetitive computations or delta iterations, but the SPARK tool supports all of these with its in-memory analytics technique.
• No caching and abstraction features run in the Hadoop framework, whereas SPARK provides them.

2.2. Tuning Hadoop performance

Hadoop is used to perform parallel distributed data processing in different clusters, but it has a lot of problems with parallel processing among nodes. There are some bottlenecks that affect the performance of Hadoop processing over the network [31]. Hadoop tuning problems in data processing are discussed below with solutions.

◦ A large volume of source data can be tuned at the map stage's heavy I/O input [32] with LZO/LZ4 codecs.
◦ Spilled records in the partition and sort phases use a circular memory buffer, sized by the formula Sort Size (MB) = (16 + R) × N / 1,048,576, where 16 bytes is the per-record metadata, R is the size of a map output record, and N is the map output records divided by the number of map tasks; the default sort buffer is 100 MB.
◦ Network traffic on the map and reduce sides can be tuned by writing small snippets to enable or disable it in the map-reduce program, and by a default replication factor of 1, 3, 5, or 7 nodes in single- and multi-node cluster configurations.
◦ Insufficient parallel tasks [33] on idle resources are handled by adjusting the number of map and reduce tasks and their memory; the defaults are 2 map and reduce tasks, 1 CPU vcore, and 1024 MB of memory
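The spill-buffer arithmetic in the sort-size formula above can be checked with a short calculation. A hedged sketch: the variable names are illustrative, and the interpretation of R (average record size in bytes) and N (records per map task) follows the surrounding text.

```python
# Sort Size (MB) = (16 + R) * N / 1,048,576
# 16 bytes of accounting metadata per record; 1,048,576 = 2**20 converts
# bytes to megabytes.

def sort_buffer_mb(avg_record_bytes, total_records, num_map_tasks):
    records_per_task = total_records / num_map_tasks   # N
    return (16 + avg_record_bytes) * records_per_task / 1_048_576

# Example: 100-byte records, 10 million map outputs over 10 map tasks:
size = sort_buffer_mb(100, 10_000_000, 10)
# (16 + 100) * 1,000,000 / 2**20 ≈ 110.6 MB, which exceeds a 100 MB default
# buffer, so records would spill to disk unless the buffer is enlarged.
```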
with low-cost open-source. Though data warehouse engines work effectively, the speed of data retrieval is the major problem [34] in analytics. To improve the speed of data processing in big data analytics, the above-said tuning parameters of Hadoop can be implemented with any of the latest algorithms, such as deep learning, machine learning, artificial intelligence, genetic algorithms, data mining, data warehouse algorithms, and block-chain [35, 36] concepts. Hence the huge datasets of big data are what many companies handle in real-world scenarios, and their worry is to maintain them with a low-cost server configuration while consistency is controlled on time. Retrieval of data from the data warehouse is improved with the Hadoop framework when high throughput is achieved.

Map Reduce is used in the Hadoop framework to access a high volume of data in parallel by disseminating the whole work into individual tasks, so that the input file can be accessed by map-reduce functions to minimize the size of the output file with compression [37]. After this process, the user or client gets the exact files they expected from the large volume of datasets.

3.1. Importance of map-reduce

Map-reduce is used to access, in parallel, a huge dataset stored in HDFS. In increasing the velocity and reliability of the cluster, map-reduce plays a major role in processing. The latency and throughput of the entire system depend on the time taken to complete the job.

Splitting is the first step of Map Reduce. According to the data size, the entire file is disseminated into individual tasks by a splitter. The input text format is changed into key-value pairs by the record reader function. The combiner takes care of key matches and makes partitions over the HDFS disk based on the file size. The partitions are stored as the intermediate data of the mapper function to give the output to the next phase. But alignment is the major problem, leading to latency or throughput problems, so shuffling of key-value pairs for each partition runs on the HDFS disk. The next important process in Map Reduce is sorting [40, 41] based on keys from HDFS. Using index-searching techniques, the sorted values are generated for the next phase. The reducer is important in map-reduce for optimizing all the values into an appropriate format.

The nodes of a cluster may vary, named as single-node or multi-node clusters with master-slave architecture. The main problem of Map Reduce is extracting data from a huge dataset within a stipulated time, which is not achieved because of the size of the input file from HDFS. The challenge in map-reduce is to minimize or optimize the whole volume of data into a compressed, low-volume format. But the time to complete that process is very high; in other words, latency is high and throughput is very low. Normal data extraction from the data warehouse is a little slower because of the patterns and algorithms used for processing [42].

3.4. Read/write operations in map reduce

Map Reduce runs with batch processing on the Hadoop cluster's data input format, which means once
the input has been taken, another input waits for the completion of the previous task. This is the most important problem in Map Reduce, and it is accessed through iterations [43]: once the read operation has taken place from HDFS, the data is processed by the Map-Reduce phases and the output is written to HDFS [44]. The next iteration takes its input from these previous writes on the HDFS disk. Likewise, if more iteration processes are compiled in Map-Reduce [45], the results are stored in HDFS permanently. If users require particular data from them, they have to write queries in a Data Manipulation Language (DML) for their results. In this scenario, many iterative operations are not possible in Map Reduce, because in batch processing the input is taken only once. If many iterative (looping) operations [46–48] are run, it is not apt for low-latency data processing: every time the map-reduce model runs repetitive functions, it may not complete the task within time. Moreover, latency is also high while doing the data processing. Figure 7 explains the read and write operations of the data-sharing function in Map Reduce.

Fig. 7. Data Sharing in Map Reduce.

3.5. Map reduce word count example

The best example for Map Reduce is the Java-based Word Count program on the Hadoop cluster. Initially, three sentences are taken as input and split into different individual tasks by the input split. The next phase, mapping, takes care of the individual tasks and converts each input split into keys and values, i.e., the number of occurrences of each word is calculated. Based on alphabetical criteria, the keys are shuffled and sorted as the output of the mapper. The reducer collects those outputs and gives them as input to the combiner for alignment of the number of occurrences as an output. Based on this word count, all the files are handled by batch processing to perform Map Reduce operations. Figure 8 summarizes the word count example.

4. Map reduce versions (MRV)

The Map Reduce function is carried out in the Hadoop cluster by the job tracker and task tracker. The classic version, Map Reduce v1, works with these trackers, but the latest version, MRV2, runs on the YARN architecture, because it provides tracking of the Map Reduce job at every stage [51]. Schedulers and queues are used to report the status of a given task. MRV1 deals only with the output, whereas MRV2 gives the status of the entire job. Figures 9 and 10 illustrate the advantages and disadvantages of the MR versions.

4.1. HADOOP map reduce performance tuning

Map Reduce performance is determined by several factors of the Hadoop framework and its features. It can be affected in terms of speed, latency, throughput, and the time taken to complete the task. Several other factors that may arise during the transmission of data in the Hadoop cluster also affect map-reduce [52]. They are:

a. Performance
b. Programming model & Domain
c. Configuration and automation
d. Trends
e. Memory
Fig. 8. Word Count Example for Map Reduce.
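The word-count flow of Section 3.5 can be imitated in a few lines as a single-process sketch of the map, shuffle/sort, and reduce phases. This is illustrative only; Hadoop's real implementation is the distributed Java API, and the sample sentences here are invented.

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in an input split.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort phase: group the 1s by key, sorted alphabetically.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    # Reduce phase: sum the occurrence counts for one word.
    return (key, sum(values))

lines = ["deer bear river", "car car river", "deer car bear"]
pairs = [kv for line in lines for kv in mapper(line)]
counts = dict(reducer(k, vs) for k, vs in shuffle(pairs))
# counts == {"bear": 2, "car": 3, "deer": 2, "river": 2}
```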
4.1. Performance

Initialization of Hadoop and Map Reduce affects the performance due to the techniques used in the entire data processing system: Hadoop 1.x gives only the output and cannot give the time taken to complete the task; Hadoop 2.x overcomes this issue and tracks the status of the job throughout the task; finally, the latest Hadoop 3.x version provides the advanced MRV2 process for quick response over the network on the Hadoop cluster through its erasure coding techniques [52, 53]. So Hadoop framework and Map Reduce installation is the major issue for Map Reduce performance. Figure 11 gives the issues of performance in Map Reduce.

Fig. 11. Performance issue 1.

Scheduling of jobs in Map-reduce is an important concept in the Hadoop cluster. Jobs are continuously assigned to the Hadoop framework by clients, and the order in which jobs are taken for Map Reduce is a typical process, so the schedulers perform this work with the help of queues. Three main schedulers are available in Hadoop, namely FIFO, Capacity, and FAIR [54]. Coordination between the nodes on the Hadoop cluster, which disseminates the details of all nodes, is the main factor to consider in tuning the Map-reduce function. While a variety of jobs is accessed sequentially by the resource manager, the status of the jobs is tracked and sent to YARN for monitoring. Finally, any job that has to be killed or deleted during processing time is controlled by YARN through this coordination. An ordinary data processing model contains a single input system for processing, whereas here both inputs are merged together as a tagging method for easy access to the huge data sets. Figure 12 gives the issues of performance in Map Reduce.

4.2. Programming model and domain

In Map Reduce, writing the map and reduce functions with good programming is essential for users. Various programming languages are supported by Hadoop for performing Map-Reduce operations; every language is platform dependent or independent according to its characteristics. Some of the languages that support the Hadoop ecosystem are SQL, NoSQL, Java, Python, Scala, and JSON [55]. They have their own sets of properties for performing operations such as joins and cross products on the dataset, and Hadoop supports techniques for running iterations and incremental computations among the nodes for accessing distributed databases in parallel. However, many iteration operations will affect Map Reduce performance. Figures 13 and 14 denote the issues of programming models.

Fig. 13. Programming Model issue 1.

4.3. Configuration and automation

Self-tuning of the workload between the nodes can be balanced by a load balancer on Hadoop, and the data flow sharing among the nodes is controlled.

4.4. Trends

Data warehouse data are accessed by the database engine through Map-reduce. But the data size is very large, and extracting small data from that engine is difficult; the time taken to complete the process is very high. Instead of disk processing, doing the processing directly in memory will improve MR performance on the I/O side. Indexing [57] is the traditional database technique used to search for elements in the databases or files run on the nodes. It delivers the extracted data to the user very fast, and it hardly depends on the size of a file; the same techniques are used for each file. Memory caching [58] between the nodes is very important for improving performance in MR. It describes the status of every job condition and also the previous computation level. Caching helps to identify the location of the data on a node, specifically by the memory allocated to its jobs. Even if nodes or jobs are canceled due to any issue, the next job or node becomes active and starts the process over the network without waiting for manual intervention. The materials required for the MR process can be verified initially, before job allocation starts, by the resource manager.

4.5. Memory

The Map Reduce function fully depends on the number of maps and reducers used for every task in the
Table 6
Map Reduce Implementations

Implementation | Method / Advantage | Disadvantage
Google Map Reduce | Multiple data blocks on different nodes to avoid fault-tolerance problems | Batch-processing architecture is not suitable for real-time applications
Hadoop | High scalability | Cluster maintenance is difficult
GridGain | Subtask distribution and load balancing | Does not support non-Java applications
Mars | Massive thread parallelism in GPU | Not for atomic operations, which are expensive
Tiled-Map Reduce | Convergence and generalization | Cost is high
Phoenix | Multicore CPU | Scalability is less
Twister | Tools are used effectively | Not possible to break up huge data sets
Hadoop cluster. If it increases, the performance of the system immediately goes very slow in terms of the time taken to complete the task.

• Calculation of the number of maps
The number of maps assigned for every job by a client is calculated from the size of the input file [59] and the blocks allocated for accessing that data. The following formula denotes the number of maps required for performing Map Reduce operations:

Number of Maps = Total size of the input file / Total number of blocks   (1)

By default, a minimum of 10 to 100 maps per node is assigned for a job, and a maximum of 300 maps can be allocated for a Map Reduce job. For example, an input file of 10 TB with the 128 MB block size allocated by Hadoop 2.x means 10 TB / 128 MB ≈ 82,000 maps are assigned for completing that job.

• Skipping bad records
Bad records can be removed. For example, in the word count Map Reduce program written in Java, if only case-sensitive output is required, passing the -DwordCount.case.sensitive=true/false command at run time gives better performance than before [59], because the bad records can be eliminated using these commands.

• Task execution & environment
The task tracker in the data nodes keeps track of all information about the jobs and sends it to the YARN Resource Manager. But there is a limitation on these operations in terms of the memory allocated to a map and reduce for task execution. The command -Djava.library.path with -Xmx512M/-Xmx1024M executes the Map Reduce environment [60] within that memory limit successfully. Tables 6 and 7 provide details of Map Reduce implementation methods and their applications.
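The map-count arithmetic of Eq. (1) and the worked example above can be sketched directly. A hedged sketch: the function name is invented, and the 128 MB block size is the Hadoop 2.x default assumed in the text.

```python
# Number of maps = total input size / block size: one map task per HDFS block.

def number_of_maps(input_size_bytes, block_size_bytes=128 * 1024**2):
    # Ceiling division: a partly filled final block still needs a map task.
    return -(-input_size_bytes // block_size_bytes)

maps = number_of_maps(10 * 1024**4)   # 10 TB input file
# 10 TiB / 128 MiB = 81,920 map tasks, matching the text's approximate 82,000
```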
Table 7
Map Reduce Applications

Application | Pros | Cons
Distributed Grep | Data analysis is generic | Less response time
Word Count | Massive document collection of occurrences | Limited only
Tera Sort | Load balancing | Transparency
Inverted Index | Collection of unique posting lists | Lots of pairs in shuffling & sorting
Term Vector | Host analysis search | Sequential tasks
Random Forest | Scalability is high | Low
Extreme Learning Machine | Union and simplification | Uncertainty
Spark | Data fit in memory | Huge memory needed
Algorithms | Data-exhaustive applications | Time uncontrollable
DNA Fragment | Parallel algorithm | Large memory
Mobile sensor data | Extracting data is easy | Difficult to implement
Social Networks | Quick response | Need more techniques for analysis
PY
how data can deviate from the flow during run time. If these factors are rectified, even a big job running on the Hadoop cluster will give output with low latency. Fig. 15 below lists the factors for job optimization.

• Operator pipelining
It is mainly used in the Map Reduce concept for the aggregation of databases, to utilize the filtered data and perform operations like grouping, sorting, and converting [62, 63] output from one form to another by operators. Pipelining is used to connect two jobs simultaneously to complete the job within time. But the issue is an extended database lock, or tie, when reading/writing in response to the user request. So iterate operations are used at that particular time to improve performance during pipeline events.

• Approximate results
The result of map-reduce is approximate in terms of size, time, and accuracy. Even though the performance has to be increased during the running time, it cannot be predicted from the output. Any file can be taken as an input format, and it will produce an output of the map-reduce function. The output cannot be accurate or reliable in such cases.

• Indexing and sorting
Since Map-Reduce works with key-value pairs, it is very complicated for a resource manager to align the order of the jobs. It allocates the task to the data node, which may cause conflicts rapidly [57, 64]. So indexing techniques are used in this job execution, searching the elements based on the index key values stored in the index table. The table contains all key values of the independent tasks in the mapper task and will give exact data to the combiner to perform the merging option. But the issue is that merging, too, is complicated by arranging values in any order. So sorting is a function used in between these stages and produces the reducer value output effectively.

• Work sharing
Map Reduce is specially designed for handling multiple jobs in parallel. If multiple jobs are running simultaneously, it is recommended to share those jobs among individual maps [65] in the function. That work is done by a splitter in the map-reduce function. The time taken to complete the job is decreased because of this job-sharing process.

• Data reuse
Data that is used for the Map Reduce function from HDFS storage can be reused for next-level changes in the same input file. Reusability [66], in the form of inheritance, will reduce the number of lines of code in a program.

• Skew mitigation
Skew mitigation is the main issue in Map Reduce, solved by different techniques to avoid data transmission. Using skew-resilient operators, classical skew-mitigation problems were solved. By repartitioning, skew mitigation can be handled in a big data environment using three major methods. Minimizing the number of times any task is repartitioned can reduce repartitioning overhead. Then, minimizing repartitioning side effects removes mitigation ambiguity during the straggling time. At last, unnecessary recomputations are avoided to minimize the total transparency of skew mitigation [27, 61, 67].

• Data colocation
Collocating files of the same location on a similar set of nodes is a new concept based on the locator attribute in the file characteristics. When a new file is creating its location, the list of data nodes and the number of files in the same case can be identified, and all those input files are stored on the same set of nodes automatically [17, 62, 68]. It will improve map-reduce performance by avoiding duplication and repetition [69, 70] of files in a Hadoop cluster.
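One common repartitioning idea behind skew mitigation is "salting" a hot key so its records spread over several partitions instead of overloading one reducer. The salt scheme below is our own toy illustration, not a technique taken verbatim from the surveyed papers:

```python
# Toy illustration of skew mitigation by key salting: records for a single
# hot key are redistributed across several sub-keys (partitions).
import random
from collections import Counter

random.seed(42)
records = ["hot_key"] * 90 + ["rare_key"] * 10  # 90% of records share one key

def salted(key: str, buckets: int = 3) -> str:
    # Append a random salt only to the known-hot key.
    return f"{key}#{random.randrange(buckets)}" if key == "hot_key" else key

load = Counter(salted(k) for k in records)
# The 90 hot records are now spread over hot_key#0..hot_key#2,
# so no single partition has to absorb all of them.
print(load["rare_key"])                                             # 10
print(sum(v for k, v in load.items() if k.startswith("hot_key#")))  # 90
```

A reducer then aggregates the per-salt partial results in a second, much cheaper pass.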
5246 M.R. Sundarakumar et al. / Review of tuning performance on database scalability
Fig. 15. Data colocation.
Figure 15 describes the example of data colocation in the Hadoop cluster.

6. Map reduction using java and python

The Map Reduce function can be written in Java or any other higher-level language; the performance changes according to the features of the selected language. Table 8 narrates the differences between the Java and Python coding languages when map reduce is written in them.

• In-memory Processing: This technique is used to capture moving data, or processes inside and outside of the disk, without spending more time. So obviously it works faster than Hadoop, approximately 100 times better than Map Reduce on Hadoop, due to memory.
• Stream Processing: It supports stream processing, which means input and output data are accessed continuously. It is mainly used to access real-time application data processing.
• Latency: The Resilient Distributed Dataset (RDD) is used to cache the data in memory between the nodes of the cluster. RDD manages logical partitions for distributed data processing and thereby reduces latency.

II. Real-world scenarios of SPARK

Many companies create terabytes of data through human- and machine-generated applications. Apache
Table 8
Map Reduce written in Java and Python: differences

Feature | Java | Python
File size handling | <1 GB is easy | >1 GB is easy
Library files | All in JAR format | Separate library files
File extension | .java | .py
Method of calling | Main method | No main method
Data collection | Arrays, Index | List, set, dictionary, tuples
Object oriented | Required | Optional
Case sensitive | Required | Optional
Compilation | Easy on all platforms | Easy on Linux
Productivity | Less | More
Applications | Desktop, mobile, web | Analytics, mathematical calculations
Type of files | Batch processing, embedded applications | Real-time processing files also
Functions | Return 0 & 1 is used | Dict is used for return
Programming concepts | Dynamic less | Cannot push threads of a single processor to another
Syntax | Specific types | Simple only
Basic programming | C/C++ basics (OOPs) | Higher-end concepts like ML
Number of codes | High | Less code size
Input data format | Streaming with STDIN/STDOUT, binary not text | Both binary and text
Areas working | Architecture, tester, developer, administrator | Analytics, manipulation, retrieval, visual reports, AI, neural networks
Speed | 25 times greater than Python | Low due to interpreter
Execution time | High because of code length | Easy
Typing | Static | Dynamic
Verbose syntax | Low | Normal
Frameworks | Spring, Blade | Django, Flask
Gaming | jMonkeyEngine | Panda3D, Cocos
ML libraries | Weka, Mallet | TensorFlow, PyTorch
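As a concrete instance of Table 8's STDIN/STDOUT streaming row, a minimal word-count mapper and reducer in Python might look like this; it is a sketch in the Hadoop Streaming style, run here in-process as plain functions rather than on a cluster:

```python
# Minimal word-count mapper/reducer in the Hadoop Streaming style:
# the mapper emits "word<TAB>1" lines, Hadoop sorts them by key, and the
# reducer sums consecutive counts for each word.
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_pairs):
    for word, group in groupby(sorted_pairs, key=lambda p: p.split()[0]):
        total = sum(int(p.split()[1]) for p in group)
        yield f"{word}\t{total}"

shuffled = sorted(mapper(["big data", "big spark"]))  # stand-in for Hadoop's sort
print(list(reducer(shuffled)))  # ['big\t2', 'data\t1', 'spark\t1']
```

On a real cluster the same two functions would read from STDIN and write to STDOUT, with Hadoop Streaming supplying the sort between them.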
Spark is used to improve the company's business insights [72]. A few examples of companies using SPARK in real-world applications:

• E-commerce: To improve consumer satisfaction over competitive problems, a few industries are implementing SPARK to handle the situation. They are:
A. eBay: Discounts and/or offers for online purchases and any other purchase transactions can be developed with SPARK using real-time data. It will provide updated status and consistency of data at each second, so that the customer relationship stays very strong through their feedback.
B. Alibaba: Analysis of big data and extraction of image data can be handled by the Alibaba company using SPARK as an implementation tool. They are used on a large graph for deriving results.
• Healthcare: MyFitnessPal, which is used to promote a healthier lifestyle through diet, uses SPARK to scan through the food calorie data of about 100 million users to find the quality of the food system, using SPARK's in-memory processing techniques.
• Media and Entertainment: Netflix uses Apache Spark for video streaming, to control and monitor its users against the earlier shows that they have watched.

Fig. 17. Working of Spark.

Hadoop takes a replication of every job's output data onto the HDFS cluster disks. Spark, by contrast, is a distributed cluster framework for processing data in the memory of the nodes through its processing engine. In-memory analytics data processing is used in SPARK, so the output of each step is stored in between the node memories for clients. For this, it consumes a lot of memory for storage. One big advantage of SPARK is frequent access to real-time applications. Although it is used for online-generated data processing, streaming is mainly used, as plenty of data is generated online every second.

III. HADOOP AND SPARK SIMILARITIES

• Stand-alone, Mesos, and Cloud are the places where Spark can run on Hadoop.
• Machine Learning algorithms can be executed faster in memory using Spark's MLlib.
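Spark's in-memory caching of intermediate results (the RDD mechanism mentioned above) can be illustrated with a toy stand-in; this is plain Python mimicking the effect of `cache()`, not the PySpark API:

```python
# Toy illustration of why caching intermediate results in memory (as Spark's
# RDDs do) speeds up iterative jobs: without a cache the expensive stage is
# recomputed for every action; with it, the stage runs only once.
calls = {"n": 0}

def expensive_transform(x):
    calls["n"] += 1  # count how often the "stage" actually runs
    return x * x

data = [1, 2, 3]

# No caching: two "actions" recompute the transform for every element.
uncached = lambda: [expensive_transform(x) for x in data]
_ = sum(uncached())
_ = max(uncached())
print(calls["n"])  # 6  (3 elements x 2 actions)

# With caching: materialize once, reuse for both actions.
calls["n"] = 0
cached = [expensive_transform(x) for x in data]
_ = sum(cached)
_ = max(cached)
print(calls["n"])  # 3
```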
Fig. 18. Working difference between SPARK and Hadoop.

Fig. 19. Architecture of SPARK.

Through the cluster manager, the executor module in the worker node accesses the input data from HDFS immediately. R1, R2, and R3 are the partitions that collect the output of the mappers and store it accordingly. In the shuffle section, c1 is a core used to denote mapper 1, and c2, c3, and c4 denote the other mappers. So when shuffling happens in SPARK, the mapper output of these cores is collected into the partitions.
Table 9

Table 10
Experimental results of the multi-node cluster

Parameter | Hadoop record | SPARK record | FLINK record
Data size | 102.5 TB | 100 TB | >100 TB
Elapsed time | 72 min | 23 min | >23 min
Nodes | 2100 | 206 | 190
Cores | 50400 physical | 6592 virtualized | 6080 virtualized
Cluster throughput | 3150 GB/sec | 618 GB/sec | 570 GB/sec
Network | 10 Gbps | EC2 | >10 Gbps
Sort rate | 1.42 TB/min | 4.27 TB/min | 4.27 TB/min
Sort rate/node | 0.67 GB/min | 20.7 GB/min | 22.5 GB/min
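The sort-rate rows in Table 10 follow directly from the data size, elapsed time, and node count; a quick cross-check (our own arithmetic, with rounding differences against the table noted in the comments):

```python
# Cross-checking Table 10: sort rate = data size / elapsed time,
# and per-node rate = sort rate / number of nodes (in GB/min).
def sort_rate_tb_per_min(data_tb, minutes):
    return data_tb / minutes

def per_node_gb_per_min(data_tb, minutes, nodes):
    return data_tb * 1000 / minutes / nodes  # 1 TB taken as 1000 GB

print(round(sort_rate_tb_per_min(102.5, 72), 2))       # 1.42 TB/min (Hadoop)
print(round(per_node_gb_per_min(102.5, 72, 2100), 2))  # 0.68 GB/min per node (table lists 0.67)
print(round(sort_rate_tb_per_min(100, 23), 2))         # 4.35 TB/min (SPARK; table lists 4.27)
print(round(per_node_gb_per_min(100, 23, 206), 1))     # 21.1 GB/min per node (table lists 20.7)
```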
I. The authors took various techniques from many research papers on the topic of tuning the performance of databases as scalability increases; all the papers discuss data extraction techniques for huge repositories with low latency and high accuracy over large networks.

II. The authors wrote this review paper about Hadoop versions and their features for extracting data from repositories, and also about the SPARK tool's features and latest techniques. A detailed review is given on selecting the tool for extraction, with the advantages and disadvantages of each.

III. The authors have suggested ways to improve the performance of database extraction from repositories, along with the difficulties faced by previous methods. Though modern tools are used for data extraction, writing a map-reduce program in Hadoop with a recent algorithm is a challenging task. SPARK is an advanced tool, but the cost spent to use that tool is unimaginable for small-scale companies. Here the authors have given suggestions to improve the performance of both tools.

A question raised in the software industry is that real-world scenario problems can be solved only by big industries, or by those who are ready to invest more money. But other factors are also considered in the same scene by different industries. The main problem, extracting data from large datasets with fewer resources, is a challenging one. This paper deals with all the points for improving the data processing velocity of big data analytics with the famous frameworks Hadoop and SPARK. Henceforth, the data generated day by day in the real world can be handled by the latest algorithms, and analytics and processing of the huge volume are made possible by tuning the already existing methods or trends. Proper analysis and the capacity to find research problems are needed to implement innovative solutions for real-world problems. Finally, the user who wants a solution for their problem with big data analytics can take Hadoop and SPARK as the main frameworks that provide solutions, but they have to choose the best one according to their requirements. For example, a client may want to start a company with low investment.
Table 11
Hadoop vs SPARK

Feature | Hadoop | SPARK
File processing method | Batch processing | Batch/real-time/iterative/graph processing
Programming language | Java, Python | Scala
Data storage type | Scale-out | Data Lake or Pool
Programming model | Map reduce | In-memory processing
Job scheduler | External | Not required
Cost | Low | High
RAM usage | Less | Lots of RAM
Memory type | Single memory | Execution & storage memory separately
Data size | Up to GB is fine | PB is fine
Latency | High | Low
Data taken as input | Text, images, videos | RDD (Resilient Distributed Dataset)
Disk type | HDD (hard disk) | SSD (solid-state disk)
Network performance | Low | High
Speed rate | <3x | 3x with 1/10 of the nodes
Algorithm by default | Divide and conquer | ALS (Alternating Least Squares)
Data location details | Index table | Abstraction using MLlib
Data hiding | Low | High, using function calls
Dataset size | Small sets | Huge sets, >TB
Shuffle speed | Low | High
Storage of mapper output | Directly on disk | RAM to disk
Container usage | Released after every map | Released only after the entire job completes
Dynamic allocation | Not possible | Possible but hectic
Replication | 1, 3, 5 nodes | Pipelines
Delay | High, due to assigning a JVM for each task | Low, due to quick launch
Mechanism for message passing | Parsing and JAR files | Remote Procedure Call (RPC)
Time taken to complete a job | Minutes for a small data set | Hours for a big data set
Allocating memory | Erasure coding | DAG (Directed Acyclic Graph)
Data input method | Hadoop Streaming | SPARK Streaming
Data conversion formats | Text to binary | All forms
Job memory | Large | Low
Input memory | Less | High
Processing type | Parallel and distributed | Parallel and distributed
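Table 11's memory-allocation row contrasts erasure coding with the DAG approach; the storage-overhead motivation for erasure coding can be made concrete. The figures below use standard HDFS parameters (3x replication versus a Reed-Solomon RS(6,3) scheme) as an illustrative assumption:

```python
# Storage-overhead comparison behind the erasure-coding entry in Table 11:
# 3-way replication stores 3 full copies (200% extra space), while RS(6,3)
# erasure coding stores 6 data blocks + 3 parity blocks (50% extra space)
# and still tolerates the loss of any 3 blocks.
def replication_overhead(replicas: int) -> float:
    return (replicas - 1) * 100.0  # extra storage as % of the original data

def erasure_overhead(data_blocks: int, parity_blocks: int) -> float:
    return parity_blocks / data_blocks * 100.0

print(replication_overhead(3))  # 200.0 (% extra space for 3x replication)
print(erasure_overhead(6, 3))   # 50.0  (% extra space for RS(6,3))
```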
Hadoop and SPARK are the tools used for very high-speed data processing by various factors. How long have these tools ruled the world with their updated versions and techniques? New Apache tools like FLUME, FLINK, and Kafka [80] are also available for accessing both batch and real-time processing in big data analytics; only the techniques vary among the tools. The new FLUME tools are used to collect various logs and events from different resources and store them in HDFS with high throughput and low latency. Apache FLINK is used to access huge datasets by the micro-batch method, which runs the data in a single run time with closed-loop operations, so the time to complete tasks is very low and identifying the corrupted data part is also easy. Apache Kafka is another modern tool, used to handle feeds with high throughput and low latency in social media. Finally, plenty of tools are used in big data analytics for handling huge volumes of data sets with different mechanisms and approaches. The user has to take a decision very carefully in accessing and protecting their data in the big data analytics world. This paper has covered the challenges and limitations of big data analytic tools in all aspects and provides solutions to handle those problems in a systematic way.

References

[1] A. Katal, M. Wazid and R.H. Goudar, Big data: issues, challenges, tools and good practices. In 2013 Sixth International Conference on Contemporary Computing (IC3) (2013, August), (pp. 404–409). IEEE.
[2] N. Khan, I. Yaqoob, I.A.T. Hashem, Z. Inayat, M. Ali, W. Kamaleldin, ... and A. Gani, Big data: survey, technologies, opportunities, and challenges, The Scientific World Journal 2014 (2014).
[3] N. Elgendy and A. Elragal, Big data analytics: a literature review paper. In Industrial Conference on Data Mining (2014, July), (pp. 214–227). Springer, Cham.
[4] C.W. Tsai, C.F. Lai, H.C. Chao and A.V. Vasilakos, Big data analytics: a survey, Journal of Big Data 2(1) (2015), 21.
[5] J.F. Weets, M.K. Kakhani and A. Kumar, Limitations and challenges of HDFS and MapReduce. In 2015 International Conference on Green Computing and Internet of Things (ICGCIoT) (2015, October), (pp. 545–549). IEEE.
[6] W. Yu, Y. Wang, X. Que and C. Xu, Virtual shuffling for efficient data movement in MapReduce, IEEE Transactions on Computers 64(2) (2013), 556–568.
[7] D.P. Acharjya and K. Ahmed, A survey on big data analytics: challenges, open research issues and tools, International Journal of Advanced Computer Science and Applications 7(2) (2016), 511–518.
[8] S. Yu, Big privacy: Challenges and opportunities of privacy study in the age of big data, IEEE Access 4 (2016), 2751–2763.
[9] M.A. Wani and S. Jabin, Big data: issues, challenges, and techniques in business intelligence. In Big Data Analytics (2018), (pp. 613–628). Springer, Singapore.
[10] A. Oussous, F.Z. Benjelloun, A.A. Lahcen and S. Belfkih, Big Data technologies: A survey, Journal of King Saud University-Computer and Information Sciences 30(4) (2018), 431–448.
[11] N. Khan, M. Alsaqer, H. Shah, G. Badsha, A.A. Abbasi and S. Salehian, The 10 Vs, issues and challenges of big data. In Proceedings of the 2018 International Conference on Big Data and Education (2018, March), (pp. 52–56).
[12] S. Kaisler, F. Armour, J.A. Espinosa and W. Money, Big data: Issues and challenges moving forward. In 2013 46th Hawaii International Conference on System Sciences (2013, January), (pp. 995–1004). IEEE.
[13] S. Kaisler, F. Armour, W. Money and J.A. Espinosa, Big data issues and challenges. In Encyclopedia of Information Science and Technology, Third Edition (2015), (pp. 363–370). IGI Global.
[14] D. Che, M. Safran and Z. Peng, From big data to big data mining: challenges, issues, and opportunities. In International Conference on Database Systems for Advanced Applications (2013, April), (pp. 1–15). Springer, Berlin, Heidelberg.
[15] A. O'Driscoll, J. Daugelaite and R.D. Sleator, 'Big data', Hadoop and cloud computing in genomics, Journal of Biomedical Informatics 46(5) (2013), 774–781.
[16] Y. Demchenko, C. Ngo and P. Membrey, Architecture framework and components for the big data ecosystem, Journal of System and Network Engineering 4(7) (2013), 1–31.
[17] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, I.Z. Khalil and A. Bouras, A survey of clustering algorithms for big data: Taxonomy and empirical analysis, IEEE Transactions on Emerging Topics in Computing 2(3) (2014), 267–279.
[18] Y. Arfat, S. Usman, R. Mehmood and I. Katib, Big Data for Smart Infrastructure Design: Opportunities and Challenges. In Smart Infrastructure and Applications (2020), (pp. 491–518). Springer, Cham.
[19] S.U. Ahsaan, H. Kaur and S. Naaz, An Empirical Study of Big Data: Opportunities, Challenges and Technologies. In New Paradigm in Decision Science and Management (2020), (pp. 49–65). Springer, Singapore.
[20] A. Mohamed, M.K. Najafabadi, Y.B. Wah, E.A.K. Zaman and R. Maskat, The state of the art and taxonomy of big data analytics: view from new big data framework, Artificial Intelligence Review 53(2) (2020), 989–1037.
[21] Y. Arfat, S. Usman, R. Mehmood and I. Katib, Big Data Tools, Technologies, and Applications: A Survey. In Smart Infrastructure and Applications (2020), (pp. 453–490). Springer, Cham.
[22] P.L.S. Kumari, Big Data: Challenges and Solutions. In Security, Privacy, and Forensics Issues in Big Data (2020), (pp. 24–65). IGI Global.
[23] A. Jaiswal, V.K. Dwivedi and O.P. Yadav, Big Data and its Analyzing Tools: A Perspective. In 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS) (2020, March), (pp. 560–565). IEEE.
[24] A. Sharma, G. Singh and S. Rehman, A Review of Big Data Challenges and Preserving Privacy in Big Data. In Advances in Data and Information Sciences (2020), (pp. 57–65). Springer, Singapore.
[25] S. Riaz, M.U. Ashraf and A. Siddiq, A Comparative Study of Big Data Tools and Deployment Platforms. In 2020 International Conference on Engineering and Emerging Technologies (ICEET) (2020, February), (pp. 1–6). IEEE.
[26] N.K. Gupta and M.K. Rohil, Big Data Security Challenges and Preventive Solutions. In Data Management, Analytics and Innovation (2020), (pp. 285–299). Springer, Singapore.
[27] D.K. Tayal and K. Meena, A new MapReduce solution for associative classification to handle scalability and skewness in vertical data structure, Future Generation Computer Systems 103 (2020), 44–57.
[28] P. Abimbola, A. Sanga and S. Mongia, Hadoop Framework Ecosystem: Ant Solution to an Elephantic Data (2019). Available at SSRN 3463635.
[29] R. Kashyap, Big Data Analytics Challenges and Solutions. In Big Data Analytics for Intelligent Healthcare Management (2019), (pp. 19–41). Academic Press.
[30] S.U. Ahsaan, H. Kaur and S. Naaz, An Empirical Study of Big Data: Opportunities, Challenges and Technologies. In New Paradigm in Decision Science and Management (2020), (pp. 49–65). Springer, Singapore.
[31] P.L.S.K. Kaur and V. Bharti, A Survey on Big Data: Its Challenges and Solution from Vendors. In Big Data Processing Using Spark in Cloud (2019), (pp. 1–22). Springer, Singapore.
[32] P.L.S. Kumari, Big Data: Challenges and Solutions. In Security, Privacy, and Forensics Issues in Big Data (2020), (pp. 24–65). IGI Global.
[33] M.A. Wani and S. Jabin, Big data: issues, challenges, and techniques in business intelligence. In Big Data Analytics (2018), (pp. 613–628). Springer, Singapore.
[34] I. Anagnostopoulos, S. Zeadally and E. Exposito, Handling big data: research challenges and future directions, The Journal of Supercomputing 72(4) (2016), 1494–1516.
[35] G. Kapil, A. Agrawal and R.A. Khan, Big Data Security challenges: Hadoop Perspective, International Journal of Pure and Applied Mathematics 120(6) (2018), 11767–11784.
[36] M. Li, Z. Liu, X. Shi and H. Jin, ATCS: Auto-Tuning Configurations of Big Data Frameworks Based on Generative Adversarial Nets, IEEE Access 8 (2020), 50485–50496.
[37] E. Mohamed and Z. Hong, Hadoop-MapReduce job scheduling algorithms survey. In 2016 7th International Conference on Cloud Computing and Big Data (CCBD) (2016, November), (pp. 237–242). IEEE.
[38] J. Wang, X. Zhang, J. Yin, R. Wang, H. Wu and D. Han, Speed up big data analytics by unveiling the storage distribution of sub-datasets, IEEE Transactions on Big Data 4(2) (2016), 231–244.
[39] S.M. Nabavinejad, M. Goudarzi and S. Mozaffari, The memory challenge in reduce phase of MapReduce applications, IEEE Transactions on Big Data 2(4) (2016), 380–386.
[40] U. Sivarajah, M.M. Kamal, Z. Irani and V. Weerakkody, Critical analysis of Big Data challenges and analytical methods, Journal of Business Research 70 (2017), 263–286.
[41] S. Dolev, P. Florissi, E. Gudes, S. Sharma and I. Singer, A survey on geographically distributed big-data processing using MapReduce, IEEE Transactions on Big Data 5(1) (2017), 60–80.
[42] Y. Guo, J. Rao, D. Cheng and X. Zhou, iShuffle: Improving Hadoop performance with shuffle-on-write, IEEE Transactions on Parallel and Distributed Systems 28(6) (2016), 1649–1662.
[43] K. Wang, Q. Zhou, S. Guo and J. Luo, Cluster frameworks for efficient scheduling and resource allocation in data center networks: A survey, IEEE Communications Surveys & Tutorials 20(4) (2018), 3560–3580.
[44] S.S.R.P. Time, Cluster Frameworks for Efficient Scheduling and Resource Allocation in Data Center Networks: A Survey.
[45] M. Hajeer and D. Dasgupta, Handling big data using a data-aware HDFS and evolutionary clustering technique, IEEE Transactions on Big Data 5(2) (2017), 134–147.
[46] N.S. Dey and T. Gunasekhar, A comprehensive survey of load balancing strategies using Hadoop queue scheduling and virtual machine migration, IEEE Access 7 (2019), 92259–92284.
[47] Q. Chen, J. Yao, B. Li and Z. Xiao, PISCES: Optimizing Multi-Job Application Execution in MapReduce, IEEE Transactions on Cloud Computing 7(1) (2016), 273–286.
[48] R.H. Hariri, E.M. Fredericks and K.M. Bowers, Uncertainty in big data analytics: survey, opportunities, and challenges, Journal of Big Data 6(1) (2019), 44.
[49] G. Yu, X. Wang, K. Yu, W. Ni, J.A. Zhang and R.P. Liu, Survey: Sharding in blockchains, IEEE Access 8 (2020), 14155–14181.
[50] J. Luengo, D. García-Gil, S. Ramírez-Gallego, S. García and F. Herrera, Dimensionality Reduction for Big Data. In Big Data Preprocessing (2020), (pp. 53–79). Springer, Cham.
[51] J. Luengo, D. García-Gil, S. Ramírez-Gallego, S. García and F. Herrera, Imbalanced Data Preprocessing for Big Data. In Big Data Preprocessing (2020), (pp. 147–160). Springer, Cham.
[52] A. Chugh, V.K. Sharma and C. Jain, Big Data and Query Optimization Techniques. In Advances in Computing and Intelligent Systems (2020), (pp. 337–345). Springer, Singapore.
[53] S. Vengadeswaran and S.R. Balasundaram, CLUST: Grouping Aware Data Placement for Improving the Performance of Large-Scale Data Management System. In Proceedings of the 7th ACM IKDD CoDS and 25th COMAD (2020), (pp. 1–9).
[54] M. Naisuty, A.N. Hidayanto, N.C. Harahap, A. Rosyiq, A. Suhanto and G.M.S. Hartono, Data protection on Hadoop distributed file system by using encryption algorithms: a systematic literature review. In Journal of Physics: Conference Series (2020, January), (Vol. 1444, No. 1, p. 012012). IOP Publishing.
[55] M.H. Mohamed, M.H. Khafagy and M.H. Ibrahim, Recommender Systems Challenges and Solutions Survey. In 2019 International Conference on Innovative Trends in Computer Engineering (ITCE) (2019, February), (pp. 149–155). IEEE.
[56] I.A.T. Hashem, N.B. Anuar, A. Gani, I. Yaqoob, F. Xia and S.U. Khan, MapReduce: Review and open challenges, Scientometrics 109(1) (2016), 389–422.
[57] N.M. Elzein, M.A. Majid, I.A.T. Hashem, I. Yaqoob, F.A. Alaba and M. Imran, Managing big RDF data in clouds: Challenges, opportunities, and solutions, Sustainable Cities and Society 39 (2018), 375–386.
[58] S. Pouyanfar, Y. Yang, S.C. Chen, M.L. Shyu and S.S. Iyengar, Multimedia big data analytics: A survey, ACM Computing Surveys (CSUR) 51(1) (2018), 1–34.
[59] M.S. Al-kahtani and L. Karim, Designing an Efficient Distributed Algorithm for Big Data Analytics: Issues and Challenges, International Journal of Computer Science and Information Security (IJCSIS) 15(11) (2017).
[60] Z. Lv, H. Song, P. Basanta-Val, A. Steed and M. Jo, Next-generation big data analytics: State of the art, challenges, and future research topics, IEEE Transactions on Industrial Informatics 13(4) (2017), 1891–1899.
[61] P. Basanta-Val and M. García-Valls, A distributed real-time java-centric architecture for industrial systems, IEEE Transactions on Industrial Informatics 10(1) (2013), 27–34.
[62] P. Basanta-Val, N.C. Audsley, A.J. Wellings, I. Gray and N. Fernández-García, Architecting time-critical big-data systems, IEEE Transactions on Big Data 2(4) (2016), 310–324.
[63] Q. Liu, W. Cai, J. Shen, X. Liu and N. Linge, An adaptive approach to better load balancing in a consumer-centric cloud environment, IEEE Transactions on Consumer Electronics 62(3) (2016), 243–250.
[64] A. Montazerolghaem, M.H. Yaghmaee, A. Leon-Garcia, M. Naghibzadeh and F. Tashtarian, A load-balanced call admission controller for IMS cloud computing, IEEE Transactions on Network and Service Management 13(4) (2016), 806–822.
[65] J. Zhao, K. Yang, X. Wei, Y. Ding, L. Hu and G. Xu, A heuristic clustering-based task deployment approach for load balancing using Bayes theorem in a cloud environment, IEEE Transactions on Parallel and Distributed Systems 27(2) (2015), 305–316.
[66] A.K. Singh and J. Kumar, Secure and energy-aware load balancing framework for cloud data center networks, Electronics Letters 55(9) (2019), 540–541.
[67] D. Shen, J. Luo, F. Dong and J. Zhang, Virto: joint coflow scheduling and virtual machine placement in cloud data centers, Tsinghua Science and Technology 24(5) (2019), 630–644.
[68] M. Bhattacharya, R. Islam and J. Abawajy, Evolutionary optimization: a big data perspective, Journal of Network and Computer Applications 59 (2016), 416–426.
[69] Q. Chen, C. Liu and Z. Xiao, Improving MapReduce performance using a smart speculative execution strategy, IEEE Transactions on Computers 63(4) (2013), 954–967.
[70] H. Wang, Z. Xu, H. Fujita and S. Liu, Towards felicitous decision making: An overview on challenges and trends of Big Data, Information Sciences 367 (2016), 747–765.
[71] A.G. Shoro and T.R. Soomro, Big data analysis: Apache spark perspective, Global Journal of Computer Science and Technology (2015).
[72] M. Zaharia, R.S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, ... and A. Ghodsi, Apache Spark: a unified engine for big data processing, Communications of the ACM 59(11) (2016), 56–65.
[73] S. Salloum, R. Dautov, X. Chen, P.X. Peng and J.Z. Huang, Big data analytics on Apache Spark, International Journal of Data Science and Analytics 1(3–4) (2016), 145–164.
[74] M.P. Kumar and S. Pattern, Security Issues in Hadoop Associated With Big Data.
[75] W. Inoubli, S. Aridhi, H. Mezni, M. Maddouri and E. Nguifo, A comparative study on streaming frameworks for big data (2018, August).
[76] Z. Yu, Z. Bei and X. Qian, Datasize-aware high dimensional configurations auto-tuning of in-memory cluster computing. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (2018, March), (pp. 564–577).
[77] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauly, ... and I. Stoica, Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12) (2012), (pp. 15–28).
[78] M. Li, J. Tan, Y. Wang, L. Zhang and V. Salapura, SparkBench: a comprehensive benchmarking suite for in-memory data analytic platform Spark. In Proceedings of the 12th ACM International Conference on Computing Frontiers (2015, May), (pp. 1–8).
[79] D. Agrawal, A. Butt, K. Doshi, J.L. Larriba-Pey, M. Li, F.R. Reiss and Y. Xia, SparkBench: a Spark performance testing suite. In Technology Conference on Performance Evaluation and Benchmarking (2015, August), (pp. 26–44). Springer, Cham.
[80] A. Jaiswal, V.K. Dwivedi and O.P. Yadav, Big Data and its Analyzing Tools: A Perspective. In 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS) (2020, March), (pp. 560–565). IEEE.