International Journal of Engineering Research and Applications (IJERA)
ISSN: 2248-9622, Vol. 8, Issue 1 (Part I), January 2018, pp. 26-41
RESEARCH ARTICLE                                        OPEN ACCESS
A Close-Up View About Spark in Big Data Jurisdiction
Firoj Parwej*, Nikhat Akhtar**, Dr. Yusuf Perwej***
*(Research Scholar, Ph.D. (Computer Science & Engineering), Department of Computer Science & Engineering, Singhania University, Pacheri Bari, Jhunjhunu, Rajasthan, India)
**(Research Scholar, Ph.D. (Computer Science & Engineering), Department of Computer Science & Engineering, Babu Banarasi Das University, Lucknow, India)
***(Ph.D. (Computer Science & Engineering), M.Tech., Assistant Professor, Department of Information Technology, Al Baha University, Al Baha, Kingdom of Saudi Arabia (KSA))
Corresponding Author: Firoj Parwej
ABSTRACT
Big data is a name now used ubiquitously for the distributed paradigm on the web. As the name points out, it refers to collections of very large data sets, in petabytes, exabytes, and beyond, together with the related systems and the algorithms used to analyze this enormous data. Hadoop, as a big data processing technology, has proven to be the go-to solution for processing enormous data sets. MapReduce is a conspicuous solution for computations that need only one pass over the data, but it is not very efficient for use cases that require multi-pass computations and algorithms: the job output data of every stage has to be stored in the file system before the next stage can begin. Consequently, this method is slow, due to replication and disk input/output operations. Additionally, the Hadoop ecosystem does not have every component needed to finish a big data use case. Suppose we want to run an iterative job: we would have to stitch together a sequence of MapReduce jobs and execute them in sequence, where every such job has high latency and each depends upon the completion of the previous stage. Apache Spark is one of the most widely used open-source processing engines for big data, with rich language-integrated APIs and an extensive range of libraries. Apache Spark is a general framework for distributed computing that offers high performance for both batch and interactive processing. In this paper, we aim to demonstrate a close-up view of Apache Spark and its features, and of working with Spark using Hadoop. We discuss in a nutshell the Resilient Distributed Datasets (RDD), RDD operations, features, and limitations. Spark can be used along with MapReduce in the same Hadoop cluster, or can be used alone as a processing framework. Finally, this paper presents a comparative analysis between Spark, Hadoop, and MapReduce.
Keywords: Big Data, Spark, Resilient Distributed Datasets (RDD), MapReduce, Hadoop, Spark Ecosystem.
I. INTRODUCTION
We live in the information era, where almost everything is data. Day by day, the big world of the internet creates 2.6 quintillion bytes of data on a regular basis; according to the statistics, 90% of the data in existence has been generated within the last two years. This data comes from many sources: climate information [1] collected by sensors, Internet of Things (IoT) applications, various material from digital images, social media sites, and videos, and records of buying transactions. This data is called big data. Big data gets generated in multi-terabyte quantities [2]. It changes fast and comes in a multiformity of forms that are arduous to manage and process using an RDBMS or other traditional technologies. In today's scenario, 85% of the data being generated is unstructured and cannot be maintained by our traditional technologies. Earlier, the amount of data generated was not that frenetic. Presently, data generation is in petabytes, so it is not possible to archive the data again and again and retrieve it on demand; data scientists need to play with data now and then for predictive analysis, as distinct from the historical analysis that used to be done with traditional tools. In this scenario, big data solutions provide the tools, methodologies, and technologies that are used to capture, process, store, search, and analyze the data in seconds, to explore relationships and insights for innovation and competitive benefit that were previously unavailable. Analogous technologies are Apache Hadoop, Apache Spark, Apache Flink, etc. Apache Spark is a substitute for Hadoop MapReduce rather than a replacement of Hadoop. Apache Spark is considered a next-generation big data tool; it is a lightning-fast cluster computing engine which can be 100 times faster than Hadoop MapReduce [3]. Apache Spark is an open-source cluster computing framework for real-time processing, and it is one of the most prosperous projects in the Apache Software Foundation. Spark has distinctly developed into the market leader for big data processing [4]. At present, Spark is being adopted by major players such as Amazon, eBay, and Yahoo, and many organizations run Spark on clusters with thousands of nodes. Apache Spark provides high-level APIs in Java, Python, R, and Scala [5][6]. It can access data from HDFS, HBase, Hive, Cassandra, Tachyon, and any Hadoop data source.
II. INSUFFICIENCY WITH HADOOP AND MAPREDUCE
Hadoop as a big data processing technology has proven to be the go-to solution for processing huge data sets. MapReduce is a magnificent solution for computations that need only one pass over the data, but it is not very [2] efficient for use cases that need multi-pass computations and algorithms. Every level in the data processing workflow has one Map and one Reduce phase, and to leverage a MapReduce solution we need to alter our use case into the MapReduce pattern [3]. The job output data between every step has to be stored in the file system before the next level can begin. Consequently, this procedure is sluggish, due to replication and disk input/output operations. Additionally, the Hadoop ecosystem does not have every component needed to finish a big data use case. A MapReduce job is submitted for running in Hadoop, and once the job is finished, the output can be taken from the stipulated output location. Another issue comes when there are multiple MapReduce jobs to be completed in a chained fashion, in other words, when a big data processing task is accomplished by two MapReduce jobs in such a way that the output of the first MapReduce [7] job is the input of the second MapReduce job. In this circumstance, whatsoever the size of the output of the first MapReduce job may be, it has to be written to the disk before the second MapReduce job can utilize it as its input. In this situation, there is a definite and unnecessary write operation. In many batch data processing use cases, these I/O operations are not a big problem: if the outcome is highly reliable, the latency is tolerated. The main issue comes when doing real-time data processing. The large number of I/O operations involved in MapReduce jobs makes it improper for real-time data processing with the least possible reaction time. Iterative and interactive applications need quicker data sharing across parallel jobs. Data sharing is slow in MapReduce due to serialization, replication [2], and disk I/O. Regarding the storage system, most Hadoop applications spend more than 90% of their time doing HDFS read-write operations.
III. NECESSITY FOR APACHE SPARK
Before discussing Spark in detail, the first question that arises in our mind is: why Spark, when Hadoop is already there? To answer this question, we have to look at the scheme of batch and real-time processing. Hadoop is based on batch processing of big data: the data is stored over a period of time and is then processed, as shown in figure 1. In Spark, by contrast, processing can take place in real time, as shown in figure 2.
Figure 1. The Data Processing in MapReduce
This real-time processing power of Spark helps us to solve use cases of real-time analytics. Spark is also capable of batch processing up to 100 times faster than Hadoop MapReduce [7] on large data sets.
Figure 2. The Data Processing in Spark
Apache Spark is a fast general-purpose big data analytics engine, and it is very appropriate for any kind of big data analysis. Spark makes use of RDDs [8], which allow us to store data in memory and persist it as per the requirements. This permits a massive increase in batch processing job performance. Spark also permits us to cache data in memory, which is profitable in the case of iterative algorithms such as those used in machine learning. Spark uses a state-of-the-art Directed Acyclic Graph (DAG) data processing engine: for each Spark job, a DAG of tasks is created to be executed by the engine. The DAG in mathematical [9] parlance consists of a set of vertices and the directed edges connecting them, and the tasks are executed as per the DAG layout. The in-memory data processing, mingled with its DAG-based data processing engine, makes Spark very proficient. Spark permits us to perform stream processing with huge input data, dealing with only a chunk of data on the fly. This can also be used for online machine learning, and is highly convenient for use cases with a requirement for real-time analysis, which happens to be a practically ubiquitous requirement in the industry. There are many reasons to choose Spark; we discuss them in the sections below.
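As a small illustration of the caching behavior described above, the following sketch (in Spark's Scala shell, with a hypothetical HDFS path) loads a data set once, caches it in memory, and reuses it across two passes instead of re-reading it from disk for each job:

    // Minimal sketch in spark-shell, where sc is the pre-created SparkContext.
    val logs = sc.textFile("hdfs:///data/events.log").cache() // hypothetical path
    val errors = logs.filter(_.contains("ERROR")).count()     // first pass reads from disk
    val warnings = logs.filter(_.contains("WARN")).count()    // second pass is served from memory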
3.1. Ingenuity
Spark's abilities are accessible via a set of rich APIs, all designed especially for interacting swiftly and easily with data at scale. These APIs are well documented and structured in a way that makes it easy for data scientists and application developers to swiftly put Spark to work.
3.2. Deficiency of Memory Resources
Spark is a fast general-purpose engine because it retains all its current operations in memory; consequently, it requires a vast amount of memory. When the available memory is very limited, Apache Hadoop MapReduce may therefore be preferable, notwithstanding the large performance gap.
3.3. Swiftness
Spark is designed for swiftness, operating both in memory and on disk. In 2014, Spark was used to win the Daytona GraySort benchmarking challenge, processing 100 terabytes of data stored on solid-state drives in only 23 minutes; the former winner, which used Hadoop and a different cluster configuration, took 72 minutes. This win was the outcome of processing a static data set. Spark's performance can [10] be even greater when supporting interactive queries of data stored in memory, with claims that Spark can be 100 times faster than Hadoop MapReduce in these circumstances.
3.4. Compatibility
Spark supports many programming languages, including Java, Python, R, and Scala. Although it is most closely associated with Hadoop's underlying storage system, HDFS, Spark includes native support for tight integration with a number of leading storage solutions in the Hadoop ecosystem. Besides, the Apache Spark community is huge, active, and international. An increasing set of commercial providers, including Databricks, IBM, and all of the main Hadoop vendors, deliver comprehensive support for Spark-based solutions.
IV. ABOUT APACHE SPARK
Spark is a general-purpose data processing engine, suitable for use in a wide range of circumstances [11], and it is fast compared to many other data processing frameworks. Spark emerged at the University of California, Berkeley, and later became one of the top projects in Apache; version 1.0 of Apache Spark was released in May 2014, and Spark version 2.0 was released in July 2016. From the commencement, Spark was optimized to execute in memory, helping it process [6] data far more quickly than alternative approaches such as Hadoop MapReduce, which tends to write data to and from computer hard drives between every stage of processing. The processing of streaming data from sensors or financial systems, interactive queries across huge data sets, and machine learning tasks tend to be the workloads most frequently associated with Spark. The Spark programming paradigm is very strong and exposes a uniform programming model supporting application development in multiple programming languages, with extensive support for languages such as Java, Python, R, and Scala, and Spark can also be deployed on a variety of platforms. Spark runs on various operating systems, such as Windows, UNIX/Linux, and Mac. Spark can be deployed in a standalone mode on a single node having a supported operating system, and it can also be deployed in cluster mode on Hadoop YARN as well as Apache Mesos. Spark mostly makes use of the Hadoop data storage module, HDFS, alongside it, but it can also integrate equally well with other famous data storage subsystems such as Cassandra, MapR-DB, HBase, MongoDB, and Amazon's S3. On top of the core data processing engine, Spark comes with a strong stack of domain-specific libraries that use the core Spark libraries and provide different functionalities useful for different big data processing requirements.
V. WHAT IS APACHE SPARK USED FOR?
Spark is a data processing engine, an API capability which application programmers incorporate into their applications to expeditiously query, analyze, and transform data at scale. Spark's pliability makes it favorable for tackling a range of use cases, and it is competent of handling several petabytes [5] of data at a time, distributed across a cluster of thousands of cooperating physical or virtual servers.
5.1. Stream Processing
From log files to sensor data, application developers increasingly have to cope with streams of data. This data arrives in a regular stream, frequently from multiple sources simultaneously. While it is certainly feasible to permit these data streams to be stored on disk and analyzed retrospectively, it can frequently be sensible or important to process and act upon the data as it arrives; streams of vital data, for example financial transactions, must be handled in this way.
5.2. Machine Learning
As data volumes increase, machine learning approaches become more practicable and increasingly accurate. Spark's [12] capability to store data in memory and expeditiously run repeated queries makes it well suited to training machine learning algorithms. Executing broadly similar queries again and again significantly reduces the time required to [13] iterate through a set of possible solutions in order to discover the most efficient algorithms.
5.3. Interactive Analytics
Rather than executing pre-defined queries to create static dashboards of sales, production line productivity, or stock prices, business analysts and data scientists increasingly want to explore their data by asking a question, viewing the outcome, and then either altering the initial question slightly or drilling deeper into the outcome. This interactive query process needs systems such as Spark that are able to respond and adapt quickly.
5.4. Data Integration
Data produced by dissimilar systems across a business is rarely clean or consistent enough to be simply and effortlessly combined for reporting or analysis. Extract, transform, and load processes are time and again used to pull data from dissimilar systems, clean and standardize it, and then load it into a distinct system for analysis. Spark is increasingly being used to reduce the cost and time required for this process.
VI. APACHE SPARK APPLICATION ARCHITECTURE
Spark is an open-source distributed data processing engine for clusters, which provides a unified programming model and engine across various types of data processing workloads and [4] platforms. The Apache Spark application architecture consists of the following key software components, and it is necessary to understand every one of them to come to grips with the complexities of the framework, as shown in figure 3.
6.1. Apache Spark Driver
The Spark driver program is the main program of your Spark application. The driver is the process that executes the user code, which creates RDDs, performs transformations and actions, and also creates the SparkContext. The node where the Spark application process executes is called the driver node, and the process is called the driver process. When the Spark shell is launched, this signifies that we have created a driver program; when the driver terminates, the application is ended. The driver program partitions the Spark application into tasks and schedules them to execute on the executors. The task scheduler lives in the driver and distributes tasks among workers.
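The following is a minimal sketch of a complete driver program in Scala; the application name, input path, and local master URL are placeholders. It shows the responsibilities just described: creating the SparkContext, defining RDDs and transformations, and triggering tasks with an action.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCountDriver {
      def main(args: Array[String]): Unit = {
        // The driver builds the SparkContext and owns the task scheduler.
        val conf = new SparkConf().setAppName("WordCountDriver").setMaster("local[*]")
        val sc = new SparkContext(conf)
        val counts = sc.textFile("input.txt")      // RDD creation (placeholder path)
          .flatMap(_.split("\\s+"))                // transformations become tasks
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        counts.collect().foreach(println)          // action: results return to the driver
        sc.stop()
      }
    }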
Figure 3. Apache Spark Application Architecture
6.2. Apache Spark Tasks
A Spark task is a unit of work that is dispatched to one executor; a partition is a logical chunk of data distributed across a Spark cluster. The command is sent from the driver program to an executor by serializing your function object. The executor deserializes the command (this is part of your JAR that has previously been loaded) and executes it on a split [14]. In many situations Spark will be reading data out of distributed storage, and will partition the data in order to parallelize the processing across the cluster. For example, if you are reading data from HDFS, a partition is created for every HDFS split. The split matters because Spark executes one task for each split, which in turn implies that the number of splits matters.
6.3. Apache Spark Cluster Manager
A cluster manager, as the name discloses, manages a cluster. Spark depends on the cluster manager to launch executors, and in some situations even the drivers are launched through it. It is a pluggable component in Spark. Under the cluster manager, jobs and actions within a Spark application are scheduled by the Spark scheduler in FIFO fashion; alternatively, the scheduling can also be done in round-robin fashion. The resources used by a Spark application can be dynamically adjusted based on the workload: the application can free unutilized resources and request them again when there is a [15] demand. Spark has the capability to work with a multitude of cluster managers, including YARN, Mesos, and its own standalone cluster manager. A cluster manager consists of two long-running daemons, the first on the master node and the second on each of the worker nodes.
6.4. Apache Spark Worker
If you are familiar with Hadoop, a worker node is something similar to a slave node. The worker machines are the machines where the real work happens, in terms of execution within Spark executors. Each worker process reports the obtainable resources on its node to the master node. Normally, every node in a Spark cluster except the master runs a worker process. We commonly start one Spark worker daemon per worker node, which then starts and monitors executors for the applications.
6.5. Apache Spark Session
Normally, a session is an interaction between two or more entities. The Apache Spark session is the entry point for programming Spark with the Dataset and DataFrame APIs.
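A minimal sketch of obtaining a Spark session in Scala (Spark 2.x), with a placeholder application name and master URL; getOrCreate() returns an existing session if one is already active:

    import org.apache.spark.sql.SparkSession

    // Entry point for the Dataset and DataFrame APIs.
    val spark = SparkSession.builder()
      .appName("SessionExample")   // placeholder name
      .master("local[*]")          // placeholder master URL
      .getOrCreate()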
6.6. Apache Spark Executors
The master allocates the resources and uses the workers across the cluster to create executors for the driver; the driver can then use these executors to run its tasks. The individual tasks in a given Spark job execute in the Spark executors. Executors are launched only once, at the commencement of a Spark application, when a job execution starts on a worker node, and they then run for the whole lifetime of [4] the application. Moreover, if a Spark executor fails, the Spark application can continue working. This design also leads to application isolation and non-sharing of data between multiple applications. Executors are accountable for executing tasks and hold the data in memory or disk storage across them.
6.7. Apache SparkContext
The SparkContext is the entry point of the Spark session. It is your connection to the Spark cluster and can be used to create RDDs, broadcast variables, and accumulators on that cluster. It is preferable to have only one SparkContext active per JVM, and consequently you should call stop() on the active SparkContext before you create a new one. You might have noticed already that in local mode, whenever we start a Python or Scala shell, a SparkContext object is created automatically and the variable sc references the SparkContext object. We did not need to create the SparkContext, but instead started using it to create RDDs from text files.
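A short sketch of that shell workflow, assuming the pre-created sc variable and a README.md file in the working directory:

    val lines = sc.textFile("README.md")  // RDD created from a text file
    println(lines.count())                // action executed on the cluster
    sc.stop()                             // stop this context before creating a new one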
VII. APACHE SPARK ECOSYSTEM
Spark holds the assurance of faster data processing and convenient development. Apache Spark is considered a general-purpose system in the big data world; it is a general-purpose cluster computing system. It is made up of many libraries that help to perform different kinds of analytics on your data. It provides high-level APIs in Java, Python, Scala, and R. Spark also provides an optimized engine that supports general execution graphs. Apache Spark allows [5] entirely new use cases that increase the value of big data. It also has copious high-level tools for structured data processing, streaming, machine learning, and graph processing. Spark can execute either alone or on an existing cluster manager. Primarily, the Spark ecosystem comprises the components shown in figure 4.
Figure 4. The Apache Spark Ecosystem
7.1. Apache Spark Core Component
As its name says, the Spark core library is made up of all the core modules of Spark. This is the heart of Spark, and it is accountable for management functions such as task scheduling. The Spark core component is the foundation for parallel and distributed processing of huge datasets. It is responsible for all the basic I/O functionalities, networking with various storage systems, fault recovery, scheduling and monitoring the jobs on Spark clusters, task dispatching, and skillful memory management. All functionality provided by Spark is built on top of Spark core. Spark core makes use of a special data structure known [16] as the RDD (Resilient Distributed Dataset). Data sharing or reuse in distributed computing systems such as Hadoop MapReduce requires the data to be stored in intermediate stores like Amazon S3 or HDFS. The Apache Spark ecosystem is built on top of the core execution engine, which has extensible [17] APIs in various languages. It provides in-memory computing ability to deliver speed, a generalized execution model to support a wide diversity of applications, and Scala, Java, R, SQL, and Python APIs for the convenience of development.
7.1.1. Scala
The Spark framework itself is built in Scala, so programming in Scala for Spark can provide access to some of the latest and greatest features that might not be available in the other supported Spark programming languages.
7.1.2. Python
Python is a programming language widely used by data analysts and data scientists these days. There are many scientific and statistical data processing libraries available, as well as plotting and charting libraries, that can be used in Python programs. Python has wonderful libraries for data analysis, like Pandas and scikit-learn, but it is comparatively slower than Scala. Python is also widely used as a programming language to develop data processing applications in Spark.
7.1.3. R Language
The R programming language has a wealthy environment for machine learning and statistical analysis, which helps to raise developer productivity. R was developed by Ross Ihaka and Robert Gentleman. Nowadays, data scientists can use the R language along with Spark, through SparkR, for processing data that cannot be handled by a single machine. R is highly extensible, and for that reason external packages can be created; as soon as an external package is created, it has to be installed and loaded for any program to use it. A collection of such packages under a directory forms an R library. R also has a few built-in data types to hold numerical values, character values, and Boolean values. There are composite data structures as well, the most important ones being vectors, lists, matrices, and data frames. R has inherent support for many statistical functions and many scalar data types.
7.1.4. Spark SQL
Spark SQL is a library built on top of Spark. It exposes a SQL interface and the DataFrame API. If the structure of the data is known in advance and the data fits into the model of rows and columns, it does not matter where the data is coming from: Spark SQL can use all of it jointly and process it as if all the data came from a single source [14]. The most essential aspect to highlight here is the ability of Spark SQL to deal with data from a very wide variety of data sources. Once it is available as a DataFrame in Spark, Spark SQL can process data in a completely distributed way, combining the DataFrames coming from different data sources to process and query them as if the entire dataset came from a single source.
7.1.5. Java
Java is a general-purpose computer programming language: class-based, multithreaded, dynamic, distributed, object-oriented, platform-independent, architecturally neutral, portable, robust, and interpreted. Java's capabilities are not limited to any specific application domain; rather, it can be used across various application domains, and hence it is called a general-purpose programming language. Java has the unique "write once, run anywhere" property, meaning that compiled Java code can execute on all platforms that support Java without the requirement for recompilation. Java is a widely used programming language expressly designed for use in the distributed environment of the internet.
7.2. Apache Spark SQL Component
The Spark SQL component is a distributed framework for structured data processing; with it, Spark gets more information about the structure of both the data and the computation. The Spark SQL library helps to analyze structured data using the very famous SQL query language. The Spark SQL component acts as a library on top of Apache Spark that was built based on Shark. Spark developers can leverage the power of declarative queries and optimized storage by executing SQL-like queries on Spark data that is present in RDDs and other external sources. The consumer can perform extract, transform, and load functions on data coming from different formats such as JSON, Parquet, or Hive, and then run ad-hoc queries using Spark SQL. Spark SQL simplifies the process of extracting and merging different datasets, so that the datasets are ready to use for machine learning. Spark SQL works to access structured and semi-structured information [14]. It also enables powerful, interactive, analytical applications across both streaming and archival data. Spark SQL is a Spark module for structured data processing, and it therefore acts as a distributed SQL query engine.
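A hedged sketch of this usage in Scala, assuming a SparkSession named spark; the file name and column names are illustrative only:

    // Load semi-structured JSON into a DataFrame and query it with SQL.
    val people = spark.read.json("people.json")    // placeholder file
    people.createOrReplaceTempView("people")       // register for SQL access
    val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.show()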
7.3. Apache Spark Streaming
Spark Streaming mainly enables you to create analytical and interactive applications for live streaming data. The Spark Streaming library consists of modules that help users execute real-time stream processing on the arriving data; it helps to address the velocity part of the big data domain. Spark Streaming is a lightweight API that permits developers to perform batch processing and streaming of data, with convenience, in the same application. It makes use of a continuous stream of input data to process data [18] in real time. Spark Streaming leverages the fast scheduling capacity of the Apache Spark core to perform streaming analytics by ingesting data in mini-batches; transformations are then applied to those mini-batches of data. Micro-batching is a technique that permits a process or task to treat a stream as a sequence of small batches of data: Spark Streaming groups the live data into small batches and then delivers them to the batch system for processing. It also provides fault-tolerance characteristics. The data in Spark Streaming is ingested from [19] different data sources and live streams, such as IoT sensors, Amazon Kinesis, Twitter, Apache Kafka, Akka Actors, Apache Flume, etc., for building event-driven, fault-tolerant, and type-safe applications. Spark Streaming is most advantageous for online advertisements, finance, supply chain management, etc.
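The micro-batching described above can be sketched with the DStream API as follows; the one-second batch interval, host, and port are assumptions:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(1))       // batch interval = micro-batch size
    val lines = ssc.socketTextStream("localhost", 9999)  // placeholder live source
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()                                       // runs once per micro-batch
    ssc.start()
    ssc.awaitTermination()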
7.4. Apache Spark MLlib (Machine Learning Library)
Spark MLlib helps to apply different machine learning techniques on your data, leveraging the distributed and scalable capabilities of Spark. MLlib is a low-level machine learning library that can be called from the Python, Scala, and Java programming languages. MLlib is easy to use, scalable, compatible with different programming languages, and can be comfortably integrated with other tools. MLlib simplifies the deployment and development of scalable machine learning pipelines. It has an easy application programming interface for data scientists who are already familiar with data science programming tools such as Python and R. Data scientists can build machine learning models as a multi-step journey from data ingestion through trial [20] and error to production. MLlib contains machine learning libraries that implement various machine learning algorithms, for example clustering, different kinds of regression, classification, and collaborative filtering.
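As a small illustration, the sketch below trains a k-means clustering model with the RDD-based MLlib API; the input file (one space-separated point per line), k = 2, and 20 iterations are assumptions:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val points = sc.textFile("points.txt")  // placeholder input, e.g. "1.0 2.0" per line
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()                              // iterative training benefits from caching
    val model = KMeans.train(points, 2, 20) // k = 2 clusters, 20 iterations
    println(model.clusterCenters.mkString(", "))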
7.5. Apache Spark GraphX
For graphs and graphical computations, Spark has its own graph computation engine, called GraphX. The Spark GraphX library provides APIs for graph-based computations; with this library, the user can perform parallel computations on graph-based data. It is a network graph analytics engine and a graph data store. Spark GraphX introduces the Resilient Distributed Graph (RDG). The RDG associates records with the vertices and edges in a graph, and it also helps data scientists perform various graph operations through [21] various expressive computational primitives. These primitives help developers implement the Pregel and PageRank abstractions in approximately 25 lines of code or even less. GraphX also optimizes the way in which we can represent vertices and edges when they are primitive data types, and it supports fundamental operators such as subgraph, joinVertices, and aggregateMessages, as well as an optimized variant of the Pregel API. The GraphX component of Spark supports multiple use cases, such as social network analysis, fraud detection, and recommendation.
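A minimal GraphX sketch in Scala: a three-vertex toy graph (the vertices and edges are invented for illustration) on which the built-in PageRank runs until convergence:

    import org.apache.spark.graphx.{Edge, Graph}

    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
    val graph = Graph(vertices, edges)            // property graph over two RDDs
    val ranks = graph.pageRank(0.001).vertices    // built-in PageRank, tolerance 0.001
    ranks.collect().foreach(println)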
7.6. Apache SparkR
There are several people from the data science track who are conscious that, for statistical analysis, R is among the best. The SparkR library is used to execute R scripts or commands on a Spark cluster; it provides a distributed environment for R scripts to run. R also provides software facilities for data manipulation, graphical display, and calculation. For this reason, the main idea behind SparkR was to discover various techniques to integrate the usability of R with the scalability of Spark: it is an R package that gives a lightweight front-end for using Apache Spark from R [22]. SparkR DataFrames also inherit all the optimizations made to the computation engine in terms of code generation and memory management. SparkR DataFrames can execute on terabytes of data and on clusters [23] with thousands of nodes. Users can work from RStudio or the R shell and run R scripts that execute on the Spark cluster.
VIII. SPARK APPLICATION RUNS ON A HADOOP CLUSTER
In this section we discuss how a Spark application runs, as shown in figure 5. A Spark application runs as a set of independent processes, coordinated by the SparkContext object in the driver program. The task scheduler launches tasks via the cluster manager; the cluster manager assigns tasks to workers, one task per partition [23]. A task applies its unit of work to the elements in its partition and outputs a new partition. The partition can be read from an HDFS block, HBase, or another source, and cached on a worker node. The outcome is sent back to the driver application.
Figure 5. Spark Application Run
IX. THE RESILIENT DISTRIBUTED DATASET IN SPARK
Resilient Distributed Datasets (RDDs) are the basic data structure of Spark. An RDD is an immutable distributed collection of objects. Every dataset in an RDD is split into logical partitions, which may be computed on various nodes of the cluster. Why do we need the RDD, in view of the fact that MapReduce is widely adopted for processing and generating enormous data sets with a parallel, distributed algorithm on a cluster? MapReduce permits users to write parallel computations using a set of high-level operators, without having to worry about work distribution and fault tolerance, but it is slow due to replication, serialization, and disk I/O. For that reason there was a necessity [24] for a substitute programming model, called the RDD. There are three ways to create RDDs in Spark: first, from data in stable storage; second, from other RDDs; and third, by parallelizing an already existing collection in the driver program. A Spark RDD can also be cached and manually partitioned. Caching is advantageous when we use an RDD many times, and manual partitioning is essential to correctly balance partitions; normally, smaller partitions allow distributing [25] RDD data more equally among more executors. Spark keeps persistent RDDs in memory by default, but it can spill them to disk if there is not sufficient RAM. Users can also request other persistence strategies, such as storing the RDD only on disk or replicating it across machines, via the flags to persist.
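The three creation routes just named can be sketched in Scala as follows; the HDFS path is a placeholder:

    val fromCollection = sc.parallelize(1 to 1000)           // parallelize a driver-side collection
    val fromStorage = sc.textFile("hdfs:///data/input.txt")  // data in stable storage
    val derived = fromCollection.map(_ * 2)                  // new RDD from an existing RDD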
9.1. Why Do We Need RDD in Spark
Apache Spark lets you treat your input files almost like any other variable, which you cannot do in Hadoop MapReduce. RDDs are automatically distributed across the network by means of partitions. In iterative distributed computing, i.e. processing data over several jobs in computations such as PageRank, logistic regression, or k-means clustering, it is fairly common to reuse or share data among several jobs, or to run multiple ad-hoc queries over a shared data set. This makes it very significant to have a good data sharing architecture, so that we can perform rapid computations. There is a basic issue with data reuse or data sharing in existing distributed computing systems (such as MapReduce): you need to [24] store data in some intermediate stable distributed store, such as HDFS or Amazon S3. This makes the overall computation of jobs slower, since it involves several I/O operations, replications, and serializations in the process. RDDs try to solve this issue by enabling fault-tolerant distributed in-memory computation. The most important challenge in designing the RDD was defining a program interface that provides fault tolerance efficiently [25]. Spark exposes RDDs through a language-integrated API, in which each data set appears as an object and transformations are invoked as methods on these objects. Apache Spark evaluates RDDs lazily: they are only computed on demand, the first time they are used in an action, which saves a great deal of time, improves efficiency, and lets Spark pipeline the transformations.
9.2. Spark RDD Operations
There are two categories of operations that you can perform on an RDD: first, transformations, and second, actions.
9.2.1. Transformations
Spark RDD transformations are functions that take an RDD as input and produce one or many RDDs as output, as shown in figure 6. They do not change the input RDD; instead, they always produce one or more new RDDs by applying the computations they represent, e.g. reduceByKey(), map(), filter(), etc. Transformations are lazy operations on an RDD in Spark: they create one or many new RDDs, but these run only when an action occurs [24]. A transformation thus makes a new dataset from an existing one. Some transformations can be pipelined, which is an optimization method that Spark uses to improve the performance of computations, as in the sketch below.
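A minimal sketch of this laziness: none of the lines below touches the data, they only extend the lineage of new RDDs.

    val numbers = sc.parallelize(1 to 10)
    val evens = numbers.filter(_ % 2 == 0)  // new RDD, nothing computed yet
    val squares = evens.map(n => n * n)     // pipelined with the filter above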
9.2.2. Actions
An action in Spark returns the eventual outcome of the RDD computations. The execution uses a lineage graph to load the data into the original RDD, carries out all intermediate transformations, and returns the eventual outcome to the driver program or writes it out to the file system. Actions are RDD operations that produce non-RDD values; they materialize a value in a Spark program. An action is one of the ways to send outcomes from the executors to the driver. first(), take(), reduce(), collect(), and count() are some of the actions in Spark.
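Continuing the sketch above, the actions below force evaluation of the lineage and return plain values to the driver:

    println(squares.count())                  // 5
    println(squares.reduce(_ + _))            // 220
    println(squares.collect().mkString(", ")) // 4, 16, 36, 64, 100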
Figure 6. Resilient Distributed Datasets Transformations
9.3. Features of RDD in Spark
There are many characteristics of Apache Spark Resilient Distributed Datasets (RDDs), discussed below.
9.3.1. In-Memory Computation
Spark RDDs have a provision for in-memory computation: they store intermediate outcomes in distributed memory (RAM) instead of stable storage (such as disk).
9.3.2. Lazy Evaluation
All transformations in Apache Spark are lazy, in that they do not compute their outcome right away. Instead, they just remember the transformations applied to some base data set; Spark computes the transformations only when an action needs an outcome for the driver program.
9.3.3. Fault Tolerance
Spark RDDs are fault-tolerant, as they track data lineage information to rebuild missing data automatically on failure. Using lineage [24], each RDD remembers how it was created from other datasets and can recreate itself.
9.3.4. Immutability
The data is safe to share across processes. An RDD can also be created or retrieved at any time, which makes caching, sharing, and replication convenient. Immutability is thus a way to reach consistency in computations.
9.3.5. Partitioning
Partitions are the fundamental unit of parallelism in a Spark RDD. Each partition is one logical division of the data, and it is immutable; one creates a new partition through transformations on an existing partition.
9.3.6. Persistence
Users can state which RDDs they will reuse and choose a storage strategy for them, such as in-memory storage or storage on disk, as sketched below.
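A one-line sketch of choosing such a strategy, with a placeholder input path:

    import org.apache.spark.storage.StorageLevel

    val data = sc.textFile("big.txt")            // placeholder path
    data.persist(StorageLevel.MEMORY_AND_DISK)   // spill to disk when RAM is insufficient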
9.3.7. Coarse-Grained Operations
Coarse-grained operations apply to all elements in a dataset, through maps, filters, or group-by operations.
9.3.8. Location Stickiness
RDDs are capable of defining placement preferences for computing partitions. A placement preference refers to information about the location of an RDD. The DAG scheduler places the partitions in such a way that each task is as close to the data as possible, which speeds up computation.
9.4. Limitations of Spark RDD
There are several limitations of Spark Resilient Distributed Datasets (RDDs), discussed in the segments below.
9.4.1. No Built-in Optimization Engine
When working with structured data, RDDs cannot take advantage of Spark's advanced optimizers, including the Catalyst optimizer and the Tungsten execution engine; the programmer needs to optimize each RDD based on its attributes.
9.4.2. Handling of Structured Data
RDDs do not provide a schema view of the data and have no special provision for handling structured data. Dataset and DataFrame, by contrast, provide the schema view of data: a distributed collection of data organized into named columns.
9.4.3. Performance Limitation
Being in-memory JVM objects, RDDs involve the overhead of garbage collection and Java serialization, which become expensive as the data grows in size.
9.4.4. Storage Limitation
RDDs degrade when there is not sufficient memory to store them. A partition of an RDD that does not fit in RAM can be stored on disk; in that case, performance will be comparable to existing data-parallel systems.
9.4.5. Runtime Type Safety
There is no static typing or runtime type safety in RDDs; they do not let us check for errors at runtime. The Dataset API provides compile-time type safety to build complex data workflows: if you try to add an element of a different type to a typed collection, you get a compile-time error, which helps detect mistakes early and makes your code safer.
X. CLUSTER MANAGEMENT IN APACHE SPARK
Spark is an engine for big data processing, and it executes in distributed mode on a cluster. In the cluster there is one master and n workers. The cluster manager schedules and divides resources in the host machines that form the cluster; its main work is to divide resources across applications [26]. It works as an external service for obtaining resources on the cluster. Moreover, the cluster manager dispatches work to the cluster. Spark supports pluggable cluster management; the cluster manager in Spark handles starting the executor processes [27]. Apache Spark applications can execute under three different cluster managers.
10.1. Apache Spark Standalone Cluster Manager
Standalone mode is an easy cluster manager incorporated with Spark. It makes it simple to set up a cluster that Spark itself manages, and it can execute on Windows, Linux, or Mac OS X. In standalone mode, every application executes an executor on every node within the cluster. It has a master and a number of workers, each with a configured amount of memory and CPU cores. In Spark standalone cluster mode [26], Spark allots resources based on cores. Through the file system, we can attain manual recovery of the master. Spark supports authentication with the help of a shared secret across the whole cluster: the user configures every node with the shared secret. For communication protocols, the data is encrypted using SSL, while for block transfer the data uses SASL encryption.
10.2. Apache Mesos
Apache Mesos is a dedicated cluster and resource manager that provides rich resource scheduling capabilities. Mesos supports workloads in distributed environments through dynamic resource sharing and isolation, and it is beneficial for the deployment and management of applications in large-scale cluster territory. Apache Mesos clubs together the existing resources of the machine nodes in a cluster. Mesos has a fine-grained sharing option, so the Spark shell scales down its CPU allocation during the execution of [28] many commands, specifically when many users are running interactive shells. It is a resource management platform for Hadoop and big data clusters. The Mesos framework permits applications to request resources from the cluster.
10.3. Hadoop YARN
YARN comes with most of the Hadoop distributions and is the only cluster manager in Spark that supports security. YARN became a sub-project of Hadoop in the year 2012. The YARN cluster manager permits [29] dynamic sharing and central configuration of the same pool of cluster resources between different frameworks that execute on YARN. The number of executors to use can be chosen by the user, unlike in standalone mode. YARN is a superior choice when a big Hadoop cluster is already in use in production. The YARN data computation framework is an amalgamation of the ResourceManager and the NodeManager, and it can execute on Windows and Linux. The YARN ResourceManager manages resources among all the applications in the system.
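The choice among these cluster managers is expressed through the master URL when the application is configured; the hosts and ports below are placeholders:

    import org.apache.spark.SparkConf

    val conf = new SparkConf().setAppName("App")
    conf.setMaster("local[4]")                // single node with 4 cores
    // conf.setMaster("spark://master:7077")  // standalone cluster manager
    // conf.setMaster("mesos://master:5050")  // Apache Mesos
    // conf.setMaster("yarn")                 // Hadoop YARN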
XI. CHARACTERISTICS OF APACHE SPARK
Apache Spark is a lightning-fast, in-memory data processing engine. Spark is principally designed for data science, and its abstractions make it simple to use [30]. We now discuss the [31] various characteristics of Spark.
11.1. Rapid Processing
Using Apache Spark, we attain a high data processing speed: about 100x faster in memory and 10x faster on disk. This is made feasible by reducing the number of read-write operations to disk.
11.2. Dynamic in Nature
We can comfortably develop parallel applications, as Spark provides over 80 high-level operators.
11.3. High-Level Analytics
One of the best and most masterly characteristics of Apache Spark is its versatility: it supports machine learning (ML), graph algorithms, SQL queries, and streaming data, along with MapReduce.
11.4. In-Memory Computation in Spark
With in-memory processing, we can raise the processing speed: the data is cached, so we need not fetch it from the disk every time, and time is saved. Spark has a DAG execution engine [31] which facilitates in-memory computation and acyclic data flow, resulting in improved speed.
11.5. Reusability
Spark code can be reused for batch processing, for joining streams against historical data, or for running ad-hoc queries on stream state.
11.6. Fault Tolerance in Spark
Apache Spark provides fault tolerance through its RDD abstraction. Spark RDDs are designed to handle the failure of any worker node in the cluster; consequently, the loss of data is reduced to zero. The various ways to create RDDs in Apache Spark were discussed in section IX.
11.7. Real-Time Stream Processing
Spark has a facility for real-time stream processing. The earlier difficulty with Hadoop MapReduce was that it could handle and process data which is already present, but not real-time data. With Spark Streaming, we can solve this difficulty.
11.8. Lazy Evaluation in Apache Spark
All the transformations we make on Spark RDDs are lazy in nature; that is, they do not give the outcome right away, rather a new RDD is formed from the current one. Consequently, this increases the efficiency [30] of the system.
11.9. Support for Various Languages
Spark supports various languages, like Java, R, Scala, and Python. Consequently, it provides dynamicity and overcomes the limitation of Hadoop, which can build applications only in Java.
11.10. Active, Progressive and Expanding Spark Community
Programmers from over 50 companies were associated with the making of Apache Spark. The project was initiated in 2009 and is still expanding; there are now about 250 developers who have contributed to its expansion. It is among the most vital projects of the Apache community.
11.11. Support for Intricate Analysis
Spark comes with faithful tools for streaming data, interactive as well as declarative queries, and machine learning, which add on to map and reduce.
11.12. Integrated with Hadoop
Spark can execute autonomously and also on the Hadoop YARN cluster manager, so it can read existing Hadoop data; this makes Spark flexible.
11.13. Spark GraphX
Spark has GraphX, a component for graph and graph-parallel computation. It simplifies graph analytics tasks through its collection of graph algorithms and builders.
11.14. Economical
Spark is an economical solution for the big data problem, whereas in Hadoop a huge amount of storage and a huge data center are needed for replication.
11.15. Strong
Spark provides the pliability to implement both stream processing and batch processing of data at the same time, which permits organizations to simplify deployment, application development, and maintenance.
XII. DRAWBACKS OF APACHE SPARK
As we know, Apache Spark is the next-generation big data tool that is being extensively [32] used by industries, but there are a few drawbacks of Apache Spark.
12.1. No Support for True Real-Time Processing
In Spark Streaming, the arriving live stream of data is split into batches of a pre-defined interval, and every batch of data is treated as a Spark Resilient Distributed Dataset (RDD). These RDDs are then processed using operations such as map, reduce, and join, and the outcome of these operations comes back in batches. Therefore it is not true real-time processing; rather, what Spark offers is near-real-time processing of data.
12.2. Trouble with Small Files
Whenever we use Spark with Hadoop, we come across the issue of small files: HDFS favors a limited number of huge files rather than a huge number of small files. Another place where Spark lags behind is when we store the data gzipped in S3. This pattern is very pleasant [32], except when there are lots of small gzipped files. The work of Spark is then to fetch those files over the network and uncompress them, and a gzipped file can be uncompressed only if the whole file is held by one core. A large span of time is therefore spent burning cores unzipping files in sequence. In the resulting RDD, every file becomes a partition, for which reason there is a huge number of tiny partitions within the RDD. If we want efficiency in our processing, those RDDs should be repartitioned into some manageable format, which needs comprehensive shuffling over the network.
12.3. No Support for a File Management System
Apache Spark does not have its own file management system; in consequence, it depends on some other platform, such as Hadoop or another cloud-based platform, which is one of Spark's known issues.
12.4. High-Priced
In-memory capacity can become a bottleneck when we want cost-efficient processing of big data, as keeping data in memory is quite expensive: the memory utilization is very high, and it is not handled in a user-friendly fashion. Apache Spark needs lots of RAM to run in memory; in consequence, the cost of Spark is comparatively high.
12.5. Fewer Algorithms
Spark MLlib lags behind in terms of the number of available algorithms, for example Tanimoto distance.
12.6. Manual Optimization
A Spark job needs to be manually optimized and tuned to specific datasets. If we want partitioning and caching in Spark to be right, they should be controlled manually.
12.7. Iterative Processing
In Spark, the data iterates in batches, and every iteration is scheduled and executed separately.
12.8. Latency
Apache Spark has a higher reaction time as compared to Apache Flink.
12.9. Window Criteria
Spark does not support record-based window criteria; it only has time-based window criteria.
12.10. Consumes More Memory
Spark makes use of a lot of memory, and the problems around memory consumption are not handled in a user-friendly manner.
12.11. Back Pressure Handling
Back pressure is the buildup of data at an input/output point when the buffer is full and not able to receive additional incoming data; no data is transferred until the buffer is empty. Apache Spark is not capable of handling this pressure implicitly; it has to be done manually.
XIII. COMPARATIVE ANALYSIS BETWEEN SPARK, HADOOP, AND MAPREDUCE
In this section, we present a comparative analysis between Spark, Hadoop, and MapReduce, as shown in tables 1 and 2 below [33][34][35][36].
Table 1. The Comparative Analysis Between Spark and MapReduce
Table 2. The Comparative Analysis Between Spark and Hadoop
XIV. CONCLUSION
The majority of data analysts would otherwise
have to resort to using an agglomeration of other unrelated packages to get their work done, which makes things intricate. In this context, the Spark libraries are designed to all work jointly, on the same piece of data, which is more integrated and convenient to use. Apache Spark is an open-source, distributed processing system normally used for big data workloads. Spark can run in a standalone cluster mode that simply needs the Apache Spark framework and a JVM on every machine in your cluster. Apache Spark improves execution for rapid performance, makes use of in-memory caching, and supports common batch processing, graph databases, ad-hoc queries, machine learning, and streaming analytics. In this paper, we presented Spark concepts, the necessity for Apache Spark, the Spark ecosystem and its components, and we also highlighted the Spark application architecture. Afterwards, we also investigated the Resilient Distributed Datasets in Spark. This paper aims to provide a brief overview of this exciting area. Finally, Spark will enable developers to do real-time analysis of everything from trading data to web clicks, in an environment that is convenient to develop in, and with remarkable speed.
REFERENCES
[1]. Saman Sarraf, Mehdi Ostadhashem, "Big data application in functional magnetic resonance imaging using Apache Spark", 2016 Future Technologies Conference (FTC), San Francisco, CA, USA, pp. 281-284, 2016, DOI: 10.1109/FTC.2016.7821623.
[2]. Yusuf Perwej, "An Experiential Study of the Big Data", International Transaction of Electrical and Computer Engineers System (ITECES), USA, ISSN (Print): 2373-1273, ISSN (Online): 2373-1281, Vol. 4, No. 1, pp. 14-25, March 2017, DOI: 10.12691/iteces-4-1-3.
[3]. J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", in Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation, Volume 6, 2004, p. 10.
[4]. Apache Spark, "Apache Spark - lightning-fast cluster computing", 2016, accessed 19 February 2016. [Online]. Available: http://spark.apache.org
[5]. M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: cluster computing with working sets", in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud'10), USENIX Association, Berkeley, CA, 2010, pp. 10-10.
[6]. H. Karau, A. Konwinski, P. Wendell, and M. Zaharia, Learning Spark. Sebastopol, CA: O'Reilly Media, 2015.
[7]. Nikhat Akhtar, Firoj Parwej, Yusuf Perwej, "A Perusal of Big Data Classification and Hadoop Technology", International Transaction of Electrical and Computer Engineers System (ITECES), USA, ISSN (Print): 2373-1273, ISSN (Online): 2373-1281, Vol. 4, No. 1, pp. 26-38, May 2017, DOI: 10.12691/iteces-4-1-4.
[8]. N. Islam, S. Sharmin, M. Wasi-ur-Rahman, X. Lu, D. Shankar, D. K. Panda, "Performance characterization and acceleration of in-memory file systems for Hadoop and Spark applications on HPC clusters", in 2015 IEEE International Conference on Big Data (Big Data), October 29 - November 1, 2015, pp. 243-252.
[9]. X. Lin, P. Wang, and B. Wu, "Log analysis in cloud computing environment with Hadoop and Spark", in 2013 5th IEEE International Conference on Broadband Network & Multimedia Technology (IC-BNMT), November 2013.
[10]. L. Gu and H. Li, "Memory or time: performance evaluation for iterative operation on Hadoop and Spark", in 2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), November 13-15, 2013, pp. 721-727.
[11]. K. Wang and M. M. H. Khan, "Performance prediction for Apache Spark platform", in 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), August 24-26, 2015, pp. 166-173.
[12]. Tim Kraska, Ameet Talwalkar, John Duchi, Rean Griffith, Michael Franklin, and Michael Jordan, "MLbase: A Distributed Machine-learning System", in Conference on Innovative Data Systems Research, 2013.
[13]. Xiangrui Meng, Joseph Bradley, Evan Sparks, and Shivaram Venkataraman, "ML pipelines: A new high-level API for MLlib", https://databricks.com/?p=2473, 2015.
[14]. M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi et al., "Spark SQL: Relational data processing in Spark", Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, ACM, pp. 1383-1394, 2015.
[15]. N. Chaimov, A. Malony, S. Canon, C. Iancu, K. Z. Ibrahim, J. Srinivasan, "Scaling Spark on HPC Systems", Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, 2016.
[16]. "New directions for Apache Spark in 2015", http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015.
[17]. "Apache Spark - Lightning-Fast Cluster Computing", 2016. [Online]. Available: http://spark.apache.org.
[18]. J. Liu, Y. Liang, C. Fang, and N. Ansari, "Spark-based Large-scale Matrix Inversion for Big Data Processing", IEEE INFOCOM Workshop on Big Data Sciences, Technologies, and Applications (BDSTA), accepted, 2016.
[19]. Omar Backhoff, Eirini Ntoutsi, "Scalable Online-Offline Stream Clustering in Apache Spark", 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), Barcelona, Spain, 12-15 Dec. 2016, DOI: 10.1109/ICDMW.2016.0014.
[20]. David Siegal, Jia Guo, G. Agrawal, "Smart-MLlib: A High-Performance Machine-Learning Library", 2016 IEEE International Conference on Cluster Computing (CLUSTER), Taipei, Taiwan, 12-16 Sept. 2016, DOI: 10.1109/CLUSTER.2016.49.
[21]. J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, I. Stoica, "GraphX: Graph processing in a distributed dataflow framework", Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI'14), pp. 599-613, 2014.
[22]. S. Das, Y. Sismanis, K. S. Beyer, R. Gemulla, P. J. Haas, and J. McPherson, "Ricardo: integrating R and Hadoop", in SIGMOD 2010, pp. 987-998, ACM, 2010.
[23]. L. Yejas, D. Oscar, W. Zhuang, and A. Pannu, "Big R: Large-Scale Analytics on Hadoop Using R", in IEEE BigData 2014, pp. 570-577.
[24]. M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, I. Stoica, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", Proceedings of the USENIX Conference on Networked Systems Design and Implementation (NSDI'12), pp. 15-28, Apr. 2012.
[25]. Teng-Sheng Moh et al., "DBSCAN on Resilient Distributed Datasets", 2015 International Conference on High Performance Computing & Simulation (HPCS), Amsterdam, Netherlands, 20-24 July 2015.
[26]. Zixia Liu, Hong Zhang, Liqiang Wang, "Hierarchical Spark: A Multi-Cluster Big Data Computing Framework", 2017 IEEE 10th International Conference on Cloud Computing (CLOUD), Honolulu, CA, USA, Electronic ISBN: 978-1-5386-1993-3, 25-30 June 2017.
[27]. Hamid Mushtaq, Zaid Al-Ars, "Cluster-based Apache Spark implementation of the GATK DNA analysis pipeline", 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Washington, DC, USA, 9-12 Nov. 2015, DOI: 10.1109/BIBM.2015.7359893.
[28]. Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica, "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center", University of California, Berkeley, September 2010.
[29]. Yusuf Perwej, Bedine Kerim, Mohmed Sirelkhtem Adrees, Osama E. Sheta, "An Empirical Exploration of the Yarn in Big Data", International Journal of Applied Information Systems (IJAIS), ISSN: 2249-0868, Foundation of Computer Science FCS, New York, USA, Volume 12, No. 9, pp. 19-29, December 2017, DOI: 10.5120/ijais2017451730.
[30]. Nhan Nguyen, Mohammad Maifi Hasan Khan, Yusuf Albayram, Kewen Wang, "Understanding the Influence of Configuration Settings: An Execution Model-Driven Framework for Apache Spark Platform", 2017 IEEE 10th International Conference on Cloud Computing (CLOUD), pp. 802-807, 2017, ISSN 2159-6190.
[31]. Kewen Wang, M. M. Hasan Khan, "Performance Prediction for Apache Spark Platform", 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS) and 2015 IEEE 17th International Conference on High Performance Computing and Communications, New York, NY, USA, 24-26 Aug. 2015, DOI: 10.1109/HPCC-CSS-ICESS.2015.246.
[32]. Kai Hildebrandt, Fabian Panse, Niklas Wilcke, "Large-Scale Data Pollution with Apache Spark", IEEE Transactions on Big Data, pp. 1-1, Issue 99, Electronic ISSN: 2332-7790, 09 January 2017, DOI: 10.1109/TBDATA.2016.2637378.
[33]. Yassir Samadi, Mostapha Zbakh, Claude Tadonki, "Comparative study between Hadoop and Spark based on Hibench benchmarks", 2016 2nd International Conference on Cloud Computing Technologies and Applications (CloudTech), Marrakech, Morocco, 24-26 May 2016, DOI: 10.1109/CloudTech.2016.7847709.
[34]. Istvan Szegedi, "Apache Spark: a fast big data analytics engine". [Online]. Available: https://dzone.com/articles/apache-spark-fastbig-data.
[35]. Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, and Fatma Özcan, "Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics", Proceedings of the VLDB Endowment, Vol. 8, No. 13, 2015.
[36]. Polato Ivanilton, Ré Reginaldo, Goldman Alfredo, Kon Fabio, "A comprehensive view of Hadoop research - A systematic literature review", Journal of Network and Computer Applications, vol. 46, pp. 1-25, November 2014.