Unit 4

UNIT - 4 SPARK
STRUCTURE
4.0 Learning Objectives
4.1 Introduction
4.2 Data Analysis with Spark
4.3 Downloading Spark
4.4 Getting Started with Spark
4.5 Programming with RDDs
4.6 Machine Learning with MLlib
4.7 NoSQL
4.7.1 What is NoSQL?
4.7.2 Where is NoSQL Used?
4.7.3 Types of NoSQL databases
4.7.4 Why NoSQL?
4.7.5 Advantages of NoSQL
4.7.6 Use of NoSQL in Industry
4.8 Definition of SQL
4.9 SQL vs NoSQL
4.10 New SQL
4.11 Summary
4.12 Keywords
4.13 Learning Activity
4.14 Unit End Questions
4.15 References
175
4.0 LEARNING OBJECTIVES
After studying this unit, you will be able to:
 Describe the Data Analysis with Spark.
 Define the Programming with RDDs.
 Explain Machine Learning with MLlib.
 Elucidate the NoSQL.
 Describe the New SQL
4.1 INTRODUCTION
Industries are using Hadoop extensively to analyse their data sets. The reason is that Hadoop
framework is based on a simple programming model (Map Reduce), and it enables a
computing solution that is scalable, flexible, fault-tolerant and cost effective. Here, the main
concern is to maintain speed in processing large datasets in terms of waiting time between
queries and waiting time to run the program.
Spark was introduced by Apache Software Foundation for speeding up the Hadoop
computational computing software process.
As against a common belief, Spark is not a modified version of Hadoop and is not, really,
dependent on Hadoop because it has its own cluster management. Hadoop is just one of the
ways to implement Spark.
Spark uses Hadoop in two ways – one is storage and second is processing. Since Spark has its
own cluster management computation, it uses Hadoop for storage purpose only.
Apache Spark is a lightning-fast cluster computing technology, designed for fast

computation. It is based on Hadoop Map Reduce, and it extends the Map Reduce model to
efficiently use it for more types of computations, which includes interactive queries and
stream processing. The main feature of Spark is its in-memory cluster computing that
increases the processing speed of an application.
176
Spark is designed to cover a wide range of workloads such as batch applications, iterative
algorithms, interactive queries, and streaming. Apart from supporting all these workloads in a
respective system, it reduces the management burden of maintaining separate tools.
Spark is one of Hadoop’s sub projects developed in 2009 in UC Berkeley’s AMPLab by

Matei Zaharia. It was Open Sourced in 2010 under a BSD license. It was donated to Apache
software foundation in 2013, and now Apache Spark has become a top-level Apache project
from Feb-2014.
Big Data have increased immense interest in the past few years. Nowadays analysing Big
Data is very common constraint and such chuck really turns into a big challenge to analyse
the mass amount of data to get impact and different patterns of information on a convenient
way. Processing the big data information in a single machine or even to store these Big Data
has become another big challenge of the Big Data. The elucidation for the above constraints
is to give out data over large clusters so that Big Data to be analysed and for storing
information should be overcome. The article will explore perceptions of Big Data Analysis
using emerging tools of Big Data such as Apache Hadoop and Spark and its performance
4.2 DATA ANALYSIS WITH SPARK
Due to the advancement in technology, various industries or domains like transport, tourism,
hotel, and banks and so on have been digitized and generating large amount of data. People
are using the Internet to generate forms, reports, graphs, periodic or to do online shopping on
discounted rates. Social media (Facebook, Instagram, blogs, twitter etc.) or entertainment
industries are using computers to share pictures, audios, and videos. According to a general
survey posted on Wikipedia till April 2019, 56.1% of population has been accessing Internet
services. Government websites have also been generating massive amount of data by
uploading or downloading pictures or credentials like fingerprints, retina scan, forms, reports
of the citizens. The big data analytic techniques have been designed to store, process, and
analyse such a mixed mode data.
Enormous data for the most part incorporates datasets with sizes a far from the capacity of
normally utilized programming devices to catch, precise, oversee, and process information
inside a bearable slipped by time. Huge Data "estimate" is an always moving focus, starting
at 2012 extending from a couple of dozen terabytes to numerous petabytes of information.
177
Enormous Data is a lot of procedures and advancements that require new types of
combination to uncover huge, concealed qualities from huge datasets that are differing,
complex, and of a large-scale.
Data analysis is defined as a process of cleaning, transforming, and modelling data to

discover useful information for business decision-making. The purpose of Data Analysis is to
extract useful information from data and taking the decision based upon the data analysis.
A simple example of Data analysis is whenever we take any decision in our day-to-day life is
by thinking about what happened last time or what will happen by choosing that particular
decision. This is nothing but analysing our past or future and making decisions based on it.
For that, we gather memories of our past or dreams of our future. So that is nothing but data
analysis. Now same thing analyst does for business purposes, is called Data Analysis.
The Spark project contains multiple closely integrated components. At its core, Spark is a
“computational engine” that is responsible for scheduling, distributing, and monitoring
applications consisting of many computational tasks across many worker machines, or a
computing cluster. Because the core engine of Spark is both fast and general-purpose, it
powers multiple higher-level components specialized for various workloads, such as SQL or
machine learning. These components are designed to interoperate closely, letting you
combine them like libraries in a software project.
A philosophy of tight integration has several benefits. First, all libraries and higher-level
components in the stack benefit from improvements at the lower layers. For example, when
Spark’s core engine adds an optimization, SQL and machine learning libraries automatically
speed up as well. Second, the costs associated with running the stack are minimized, because
instead of running 5–10 independent software systems, an organization needs to run only one.
These costs include deployment, maintenance, testing, support, and others. This also means
that each time a new component is added to the Spark stack, every organization that uses
Spark will immediately be able to try this new component. This changes the cost of trying out
a new type of data analysis from downloading, deploying, and learning a new software
project to upgrading Spark.
Finally, one of the largest advantages of tight integration is the ability to build applications
that seamlessly combine different processing models. For example, in Spark you can write
one application that uses machine learning to classify data in real time as it is ingested from
streaming sources.
178
Simultaneously, analysts can query the resulting data, also in real time, via SQL (e.g., to join
the data with unstructured logfiles). In addition, more sophisticated data engineers and data
scientists can access the same data via the Python shell for ad hoc analysis. Others might
access the data in standalone batch applications. All the while, the IT team has to maintain
only one system.
Spark Core
Spark Core contains the basic functionality of Spark, including components for task
scheduling, memory management, fault recovery, interacting with storage systems, and more.
Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which
are Spark’s main programming abstraction. RDDs represent a collection of items distributed
across many compute nodes that can be manipulated in parallel. Spark Core provides many
APIs for building and manipulating these collections.
Spark SQL
Spark SQL is Spark’s package for working with structured data. It allows querying data via
SQL as well as the Apache Hive variant of SQL—called the Hive Query Language (HQL)—
and it supports many sources of data, including Hive tables, Parquet, and JSON. Beyond
providing a SQL interface to Spark, Spark SQL allows developers to intermix SQL queries
with the programmatic data manipulations supported by RDDs in Python, Java, and Scala, all
within a single application, thus combining SQL with complex analytics. This tight
integration with the rich computing environment provided by Spark makes Spark SQL unlike
any other open-source data warehouse tool. Spark SQL was added to Spark in version 1.0.
Shark was an older SQL-on-Spark project out of the University of California, Berkeley, that
modified Apache Hive to run on Spark. It has now been replaced by Spark SQL to provide
better integration with the Spark engine and language APIs.
Spark Streaming
Spark Streaming is a Spark component that enables processing of live streams of data.
Examples of data streams include logfiles generated by production web servers, or queues of
messages containing status updates posted by users of a web service. Spark Streaming
provides an API for manipulating data streams that closely matches the Spark Core’s RDD
API, making it easy for programmers to learn the project and move between applications that
manipulate data stored in memory, on disk, or arriving in real time.
179
Underneath its API, Spark Streaming was designed to provide the same degree of fault
tolerance, throughput, and scalability as Spark Core.
MLlib
Spark comes with a library containing common machine learning (ML) functionality, called
MLlib. MLlib provides multiple types of machines learning algorithms, including
classification, regression, clustering, and collaborative filtering, as well as supporting
functionality such as model evaluation and data import. It also provides some lower-level ML
primitives, including a generic gradient descent optimization algorithm. All of these methods
are designed to scale out across a cluster.
GraphX
GraphX is a library for manipulating graphs (e.g., a social network’s friend graph) and
performing graph-parallel computations. Like Spark Streaming and Spark SQL, GraphX
extends the Spark RDD API, allowing us to create a directed graph with arbitrary properties
attached to each vertex and edge. GraphX also provides various operators for manipulating
graphs (e.g., sub graph and map Vertices) and a library of common graph algorithms (e.g.,
PageRank and triangle counting).
Cluster Managers
Under the hood, Spark is designed to efficiently scale up from one-to-many thousands of
compute nodes. To achieve this while maximizing flexibility, Spark can run over a variety of
cluster managers, including Hadoop YARN, Apache Mesos, and a simple cluster manager
included in Spark itself called the Standalone Scheduler. If you are just installing Spark on an
empty set of machines, the Standalone Scheduler provides an easy way to get started; if you
already have a Hadoop YARN or Mesos cluster, however, Spark’s support for these cluster
managers allows your applications to also run on them.
Apache Spark has following features.
 Speed − Spark helps to run an application in Hadoop cluster, up to 100 times faster in
memory, and 10 times faster when running on disk. This is possible by reducing
number of read/write operations to disk. It stores the intermediate processing data in
memory.
180
 Supports multiple languages − Spark provides built-in APIs in Java, Scala, or Python.
Therefore, you can write applications in different languages. Spark comes up with 80
high-level operators for interactive querying.
 Advanced Analytics − Spark not only supports ‘Map’ and ‘reduce’. It also supports
SQL queries, Streaming data, Machine learning (ML), and Graph algorithms.
There are three ways of Spark deployment as explained below.
 Standalone − Spark Standalone deployment means Spark occupies the place on top of
HDFS (Hadoop Distributed File System) and space is allocated for HDFS, explicitly.
Here, Spark and Map Reduce will run side by side to cover all spark jobs on cluster.
 Hadoop Yarn − Hadoop Yarn deployment means, simply, spark runs on Yarn without
any pre-installation or root access required. It helps to integrate Spark into Hadoop
ecosystem or Hadoop stack. It allows other components to run on top of stack.
 Spark in Map Reduce (SIMR) − Spark in Map Reduce is used to launch spark job in
addition to standalone deployment. With SIMR, user can start Spark and uses its shell
without any administrative access.
4.3 DOWNLOADING SPARK
Apache Spark is an open-source framework that processes large volumes of stream data from
multiple sources. Spark is used in distributed computing with machine learning applications,
data analytics, and graph-parallel processing.
Fig 4.1 Installation of Spark on windows
181
Prerequisites
 A system running Windows 10
 A user account with administrator privileges (required to install software, modify file
permissions, and modify system PATH)
 Command Prompt or PowerShell
 A tool to extract .tar files, such as 7-Zip
Install Apache Spark on Windows
Installing Apache Spark on Windows 10 may seem complicated to novice users, but this
simple tutorial will have you up and running. If you already have Java 8 and Python 3
installed, you can skip the first two steps.
Step 1: Install Java 8
 Apache Spark requires Java 8. You can check to see if Java is installed using the
command prompt.
 Open the command line by clicking Start > type cmd > click Command Prompt.
 Type the following command in the command prompt:
Java -version
If Java is installed, it will respond with the following output:
Fig 4.2 Response if java is already installed
182
Your version may be different. The second digit is the Java version – in this case, Java 8.
If you don’t have Java installed:
 Open a browser window, and navigate to https://java.com/en/download/
Fig 4.3 Java download screen
 Click the Java Download button and save the file to a location of your choice.
 Once the download finishes double-click the file to install Java.
Note: At the time this article was written, the latest Java version is 1.8.0_251. Installing
a later version will still work. This process only needs the Java Runtime Environment
(JRE) – the full Development Kit (JDK) is not required. The download link to JDK is
https://www.oracle.com/java/technologies/javase-downloads.html.
Step 2: Install Python
 To install the Python package manager, navigate to https://www.python.org/ in your web

browser.
183
 Mouse over the Download menu option and click Python 3.8.3. 3.8.3 is the latest version
at the time of writing the article.
 Once the download finishes, run the file.
Fig 4.4 Python download screen
 Near the bottom of the first setup dialog box, check off Add Python 3.8 to PATH. Leave
the other box checked.
 Next, click Customize installation.
184
Fig 4.5 Python installation screen
 You can leave all boxes checked at this step, or you can uncheck the options you do not
want.
 Click Next.
 Select the box Install for all users and leave other boxes as they are.
 Under Customize install location, click Browse and navigate to the C drive. Add a new
folder and name its Python.
 Select that folder and click OK.

Fig 4.6 Python pre-installation options
 Click Install and let the installation complete.
 When the installation completes, click the disabled path length limit option at the bottom
and then click Close.
 If you have a command prompt open, restart it. Verify the installation by checking the
version of Python:
Python –version
 The output should print
Python 3.8.3.
Step 3: Download Apache Spark
 Open a browser and navigate to https://spark.apache.org/downloads.html.
 Under the Download Apache Spark heading, there are two drop-down menus. Use the
current non-preview version.
 In our case, in Choose a Spark release drop-down menu select 2.4.5 (Feb 05
2020).
 In the second drop-down Choose a package type, leave the selection Pre-built
for Apache Hadoop 2.7.
 Click the spark-2.4.5-bin-hadoop2.7.tgz link.
Fig 4.7 Apache spark download screen
 A page with a list of mirrors loads where you can see different servers to download
from. Pick any from the list and save the file to your Downloads folder.
Step 4: Verify Spark Software File
 Verify the integrity of your download by checking the checksum of the file. This ensures
you are working with unaltered, uncorrupted software.
 Navigate back to the Spark Download page and open the Checksum link, preferably in a
new tab.
 Next, open a command line and enter the following command:
certutil -hashfile c:\users\username\Downloads\spark-2.4.5-bin-hadoop2.7.tgz

SHA512
 Change the username to your username. The system displays a long alphanumeric code,
along with the message
Certutil: -hashfile completed successfully.
187
Fig 4.8 Spark software verification command screen
 Compare the code to the one you opened in a new browser tab. If they match, your
download file is uncorrupted.
Step 5: Install Apache Spark
Installing Apache Spark involves extracting the downloaded file to the desired location.
 Create a new folder named Spark in the root of your C: drive. From a command line,
enter the following:
cd \
mkdir Spark
 In Explorer, locate the Spark file you downloaded.
 Right-click the file and extract it to C:\Spark using the tool you have on your system
(e.g., 7-Zip).
 Now, your C:\Spark folder has a new folder spark-2.4.5-bin-hadoop2.7 with the
necessary files inside.
Step 6: Add winutils.exe File
Download the winutils.exe file for the underlying Hadoop version for the Spark installation
you downloaded.
 Navigate to this URL https://github.com/cdarlint/winutils and inside the bin folder,

locate winutils.exe, and click it.
188
Fig 4.9 Underlying Hadoop version for the spark installation
 Find the Download button on the right side to download the file.
 Now, create new folders Hadoop and bin on C: using Windows Explorer or the
Command Prompt.
 Copy the winutils.exe file from the Downloads folder to C:\hadoop\bin.
Step 7: Configure Environment Variables
Configuring environment variables in Windows adds the Spark and Hadoop locations to your
system PATH. It allows you to run the Spark shell directly from a command prompt window.
 Click Start and type environment.
 Select the result labelled Edit the system environment variables.
 A System Properties dialog box appears. In the lower-right corner, click Environment
Variables and then click New in the next window.
189
Fig 4.10 System properties screen
 For Variable Name type SPARK_HOME.
 For Variable Value type C:\Spark\spark-2.4.5-bin-hadoop2.7 and click OK. If you

changed the folder path, use that one instead.
Fig 4.11 Download Path Screen
 In the top box, click the Path entry, then click Edit. Be careful with editing the system
path. Avoid deleting any entries already on the list.
190
Fig 4.12 Edit path screen
 You should see a box with entries on the left. On the right, click New.
 The system highlights a new line. Enter the path to the Spark folder C:\Spark\spark-
2.4.5-bin-hadoop2.7\bin. We recommend using %SPARK_HOME%\bin to avoid
possible issues with the path.
Fig 4.13 Edit path entries
 Repeat this process for Hadoop and Java.
i. For Hadoop, the variable name is HADOOP_HOME and for the value use the
path of the folder you created earlier: C:\hadoop. Add C:\hadoop\bin to the
Path variable field, but we recommend using %HADOOP_HOME%\bin.
ii. For Java, the variable name is JAVA_HOME and for the value use the path to
your Java JDK directory (in our case it’s C:\Program Files\Java\jdk1.8.0_251).
 Click OK to close all open windows.
Note: Star by restarting the Command Prompt to apply changes. If that doesn't work, you
will need to reboot the system.
Step 8: Launch Spark
 Open a new command-prompt window using the right-click and run as administrator:
192
 To start Spark, enter:
C:\Spark\spark-2.4.5-bin-hadoop2.7\bin\spark-shell
If you set the environment path correctly, you can type spark-shell to launch Spark.
 The system should display several lines indicating the status of the application. You may
get a Java pop-up. Select Allow access to continue.
Finally, the Spark logo appears, and the prompt displays the Scala shell.
Fig 4.14 Spark launch screen
 Open a web browser and navigate to http://localhost:4040/.
 You can replace local host with the name of your system.
 You should see an Apache Spark shell Web UI. The example below shows the Executors
page.
193
Fig 4.15 Apache spark shell web UI
To exit Spark and close the Scala shell, press ctrl-d in the command-prompt window.
Note: If you installed Python, you could run Spark using Python with this command:
Pyspark
Exit using quit().
Test Spark
In this example, we will launch the Spark shell and use Scala to read the contents of a file.
You can use an existing file, such as the README file in the Spark directory, or you can
create your own. We created pnaptest with some text.
 Open a command-prompt window and navigate to the folder with the file you want to
use and launch the Spark shell.
 First, state a variable to use in the Spark context with the name of the file. Remember to
add the file extension if there is any.
val x =sc.textFile("pnaptest")
194
 The output shows an RDD is created. Then, we can view the file contents by using this
command to call an action:
x.take (11). foreach (println)
Fig 4.16 Spark testing demo
This command instructs Spark to print 11 lines from the file you specified. To perform an
action on this file (value x), add another value y, and do a map transformation.
 For example, you can print the characters in reverse with this command:
val y = x.map (_. reverse)
 The system creates a child RDD in relation to the first one. Then, specify how many
lines you want to print from the value y:
y. take (11).foreach (println)
195
Fig 4.17 Print lines demo screen
The output prints 11 lines of the pnaptest file in the reverse order.
When done, exit the shell using
ctrl-d.
4.4 GETTING STARTED WITH SPARK
This is a quick introduction to using Spark. We will first introduce the API through Spark’s
interactive shell (in Python), then show how to write applications in Java, and Python.
To follow along with this guide, first, download a packaged release of Spark from the Spark
website. Since we won’t be using HDFS, you can download a package for any version of
Hadoop.
Note that, before Spark 2.0, the main programming interface of Spark was the Resilient
Distributed Dataset (RDD). After Spark 2.0, RDDs are replaced by Dataset, which is strongly
typed like an RDD, but with richer optimizations under the hood. The RDD interface is still
supported, and you will read about it later. However, we highly recommend you switch to use
Dataset, which has better performance than RDD.
Setting Up SPARK
Spark is pretty simple to set up and get running on your machine. All you really need to do is
download one of the pre-built packages and so long as you have Java 6+ and Python 2.6+ you
can simply run the Spark binary on Windows, Mac OS X, and Linux. Ensure that the java
program is on your PATH or that the JAVA_HOME environment variable is set. Similarly,
python must also be in your PATH.
Assuming you already has Java and Python:
At this point Spark is installed and ready to use on your local machine in "standalone mode."
You can develop applications here and submit Spark jobs that will run in a multi-
process/multi-threaded mode, or you can configure this machine as a client to a cluster
(though this is not recommended as the driver plays an important role in Spark jobs and
should be in the same network as the rest of the cluster). Probably the most you will do with
Spark on your local machine beyond development is to use the spark-ec2 scripts to configure
an EC2 Spark cluster on Amazon's cloud.
Interactive Analysis with the Spark Shell
Spark’s shell provides a simple way to learn the API, as well as a powerful tool to analyse
data interactively. Start it by running the following in the Spark directory:
./bin/pyspark
Or if PySpark is installed with pip in your current environment:
pyspark
Spark’s primary abstraction is a distributed collection of items called a Dataset. Datasets can
be created from Hadoop Input Formats (such as HDFS files) or by transforming other
Datasets. Due to Python’s dynamic nature, we don’t need the Dataset to be strongly typed in
Python. As a result, all Datasets in Python are Dataset [Row], and we call it Data Frame to be
consistent with the data frame concept in Pandas and R. Let’s make a new Data Frame from
the text of the README file in the Spark source directory:
>>> textFile = spark.read.text("README.md")
You can get values from Data Frame directly, by calling some actions, or transform the Data
Frame to get a new one.
>>> textFile.count () # Number of rows in this Data Frame
126
>>> textFile.first () # First row in this Data Frame
Row(value=u'# Apache Spark')
Now let’s transform this Data Frame to a new one. We call filter to return a new Data Frame
with a subset of the lines in the file.
197
>>> linesWithSpark=textFile.filter(textFile.value.contains("Spark"))
We can chain together transformations and actions:
>>> textFile.filter(textFile.value.contains("Spark")).count() # How many lines contain

"Spark"?
15
Dataset Operations
Dataset actions and transformations can be used for more complex computations. Let’s say
we want to find the line with the most words:
>>> from pyspark.sql.functions import *
>>> textFile.select(size(split(textFile.value,
"\s+")).name("numWords")).agg(max(col("numWords"))).collect()
[Row(max(numWords)=15)]
This first maps a line to an integer value and aliases it as “numWords”, creating a new Data
Frame. agg is called on that Data Frame to find the largest word count. The arguments to
select and agg are both Column, we can use df.colName to get a column from a Data Frame.
We can also import pyspark.sql.functions, which provides a lot of convenient functions to
build a new Column from an old one.
One common data flow pattern is Map Reduce, as popularized by Hadoop. Spark can
implement Map Reduce flows easily:
>>> wordCounts = textFile.select(explode(split(textFile.value,

"\s+")).alias("word")).groupBy("word").count()
Here, we use the explode function in select, to transform a Dataset of lines to a Dataset of
words, and then combine groupBy and count to compute the per-word counts in the file as a
Data Frame of 2 columns: “word” and “count”. To collect the word counts in our shell, we
can call collect:
>>> wordCounts.collect()
[Row(word=u'online', count=1), Row(word=u'graphs', count=1), ...]
Apache Spark Examples
198
These examples give a quick overview of the Spark API. Spark is built on the concept of
distributed datasets, which contain arbitrary Java or Python objects. You create a dataset from
external data, then apply parallel operations to it. The building block of the Spark API is its
RDD API. In the RDD API, there are two types of operations: transformations, which define
a new dataset based on previous ones, and actions, which kick off a job to execute on a
cluster. On top of Spark’s RDD API, high level APIs are provided, e.g., Data Frame API and
Machine Learning API. These high-level APIs provide a concise way to conduct certain data
operations.
 Word Count
In this example, we use a few transformations to build a dataset of (String, Int) pairs
called counts and then save it to a file. (Using Python)
text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
 Text Search
In this example, we search through the error messages in a log file.
textFile = sc.textFile("hdfs://...")
# Creates a Data Frame having a single column named "line"
df = textFile.map(lambda r: Row(r)).toDF(["line"])
errors = df.filter(col("line").like("%ERROR%"))
# Counts all the errors
errors.count()
# Counts errors mentioning MySQL
errors.filter(col("line").like("%MySQL%")).count()
# Fetches the MySQL errors as an array of strings
errors.filter(col("line").like("%MySQL%")).collect()
199
4.5 PROGRAMMING WITH RDDs
RDD stands for “Resilient Distributed Dataset”. It is the fundamental data structure of
Apache Spark. RDD in Apache Spark is an immutable collection of objects which computes
on the different node of the cluster. RDD is a fault-tolerant collection of elements that can be
operated on in parallel.
Decomposing the Name RDD
Resilient, i.e., fault-tolerant with the help of RDD lineage graph (DAG) and so able to
recompute missing or damaged partitions due to node failures.
Distributed, since Data resides on multiple nodes.
Dataset represents records of the data you work with. The user can load the data set
externally which can be either JSON file, CSV file, text file or database via JDBC with no
specific data structure.
Hence, each and every dataset in RDD is logically partitioned across many servers so that
they can be computed on different nodes of the cluster. RDDs are fault tolerant i.e. It
possesses self-recovery in the case of failure.
There are two ways to create RDDs: parallelizing an existing collection in your driver
program, or referencing a dataset in an external storage system, such as a shared file system,
HDFS, HBase, or any data source offering a Hadoop Input Format. One can also operate
Spark RDDs in parallel with a low-level API that offers transformations and actions. We will
study these Spark RDD Operations later in this section.
Parallelized Collections
Parallelized collections are created by calling JavaSparkContext’s parallelize method on an

existing Collection in your driver program. The elements of the collection are copied to form
a distributed dataset that can be operated on in parallel. For example, here is how to create a
parallelized collection holding the numbers 1 to 5:
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);
Once created, the distributed dataset (distData) can be operated on in parallel. For example,
we might call distData.reduce((a, b) -> a + b) to add up the elements of the list. We describe
operations on distributed datasets later on.
200
One important parameter for parallel collections is the number of partitions to cut the dataset
into. Spark will run one task for each partition of the cluster. Typically, you want 2-4
partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions
automatically based on your cluster. However, you can also set it manually by passing it as a
second parameter to parallelize (e.g., sc.parallelize(data, 10)). Note: some places in the code
use the term slices (a synonym for partitions) to maintain backward compatibility.
External Datasets
Spark can create distributed datasets from any storage source supported by Hadoop, including
your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files,
Sequence Files, and any other Hadoop Input Format.
Text file RDDs can be created using Spark Context’s textFile method. This method takes a
URI for the file (either a local path on the machine, or a hdfs://, s3a://, etc URI) and reads it
as a collection of lines. Here is an example invocation:
JavaRDD<String> distFile = sc.textFile("data.txt");
Once created, distFile can be acted on by dataset operations. For example, we can add up the
sizes of all the lines using the map and reduce operations as follows:
distFile.map(s -> s.length()).reduce((a, b) -> a + b).
Reading Files with SPARK
 If using a path on the local file system, the file must also be accessible at the same path
on worker nodes. Either copy the file to all workers or use a network-mounted shared
file system.
 All of Spark’s file-based input methods, including textFile, support running on

directories, compressed files, and wildcards as well. For example, you can use
textFile("/my/directory"), textFile("/my/directory/*.txt"), and
textFile("/my/directory/*.gz").
 The textFile method also takes an optional second argument for controlling the number
of partitions of the file. By default, Spark creates one partition for each block of the file
(blocks being 128MB by default in HDFS), but you can also ask for a higher number of
partitions by passing a larger value. Note that you cannot have fewer partitions than
blocks.
201
Apart from text files, Spark’s Java API also supports several other data formats:
 JavaSparkContext.wholeTextFiles lets you read a directory containing multiple small

text files, and returns each of them as (filename, content) pairs. This is in contrast with
textFile, which would return one record per line in each file.
 For Sequence Files, use Spark Context’s sequence File [K, V] method where K and V
are the types of key and values in the file. These should be subclasses of Hadoop’s
Writable interface, like IntWritable and Text.
 For other Hadoop Input Formats, you can use the JavaSparkContext.hadoopRDD
method, which takes an arbitrary JobConf and input format class, key class, and value
class. Set these the same way you would for a Hadoop job with your input source. You
can also use JavaSparkContext.newAPIHadoopRDD for Input Formats based on the
“new” Map Reduce API (org.apache.hadoop.mapreduce).
 JavaRDD.saveAsObjectFile and JavaSparkContext.objectFile support saving an RDD in

a simple format consisting of serialized Java objects. While this is not as efficient as
specialized formats like Avro, it offers an easy way to save any RDD.
Caching and Partitioning
Spark RDD can also be cached and manually partitioned. Caching is beneficial when we use
RDD several times. And manual partitioning is important to correctly balance partitions.
Generally, smaller partitions allow distributing RDD data more equally, among more
executors. Hence, fewer partitions make the work easy.
Programmers can also call a persist method to indicate which RDDs they want to reuse in
future operations. Spark keeps persistent RDDs in memory by default, but it can spill them to
disk if there is not enough RAM. Users can also request other persistence strategies, such as
storing the RDD only on disk or replicating it across machines, through flags to persist.
In Spark mainly we do following three things
 Creating new RDD
 Transforming existing RDD
 Calling operations on RDDS and compute the result.
 An RDD in Spark is simply an immutable distributed collection of objects.
202
 Each RDD is split into multiple partitions, which may be computed on different nodes of
the Cluster
Operations on RDD
RDDs support two types of operations: transformations, which create a new dataset from an
existing one, and actions, which return a value to the driver program after running a
computation on the dataset. For example, map is a transformation that passes each dataset
element through a function and returns a new RDD representing the results. On the other
hand, reduce is an action that aggregates all the elements of the RDD using some function
and returns the final result to the driver program (although there is also a parallel
reduceByKey that returns a distributed dataset). Let us discuss about them in detail.
 Transformations: which creates a new RDD from existing RDD e.g., filter.
val hadoopexamLines = lines.filter(line => line.contains("HadoopExam"))
 Many transformations are element-wise; that is, they work on one element at a
time; but this is not true for all transformations.
val lines = sc.textFile("spark.txt")
val hadoopexamLines = lines.filter(line => line.contains("HadoopExam"))
 Note that the filter() operation does not mutate the existing lines. Instead, it
returns a
pointer to an entirely new RDD hadoopexamLines.
 Use lines RDD again
val SparkLines = lines.filter(line => line.contains("Spark"))
 Now use union transformation to create new RDD by combining both the
above RDD.
val resultsRDD = hadoopexamLines.union(SparkLines)
resultsRDD.collect() //
 RDDs also have a collect() function to retrieve the entire RDD
All transformations in Spark are lazy, in that they do not compute their results right away.
Instead, they just remember the transformations applied to some base dataset (e.g., a file).
203
The transformations are only computed when an action requires a result to be returned to the
driver program. This design enables Spark to run more efficiently. For example, we can
realize that a dataset created through map will be used in a reduce and return only the result
of the reduce to the driver, rather than the larger mapped dataset.
The following table lists some of the common transformations supported by Spark. Refer to
the RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for
details.
Transformation Meaning
map(func) Return a new distributed dataset formed by passing each

element of the source through a function func.
filter(func) Return a new dataset formed by selecting those elements of

the source on which func returns true.
flat Map(func) Similar to map, but each input item can be mapped to 0 or
more output items (so func should return a Seq rather than a
single item).
map Partitions(func) Similar to map, but runs separately on each partition (block)
of the RDD, so func must be of type Iterator<T> =>
Iterator<U> when running on an RDD of type T.
mapPartitionsWithIndex(func Similar to map Partitions, but also provides func with an

) integer value representing the index of the partition,
so func must be of type (Int, Iterator<T>) => Iterator<U>
when running on an RDD of type T.
sample(with Sample a fraction of the data, with or without replacement,

Replacement, fraction, seed) using a given random number generator seed.
union(other Dataset) Return a new dataset that contains the union of the elements
in the source dataset and the argument.
intersection(other Dataset) Return a new RDD that contains the intersection of
204
elements in the source dataset and the argument.
distinct([numPartitions])) Return a new dataset that contains the distinct elements of

the source dataset.
groupByKey([numPartitions] When called on a dataset of (K, V) pairs, returns a dataset

) of (K, Iterable<V>) pairs.
Note: If you are grouping in order to perform an
aggregation (such as a sum or average) over each key,
using reduceByKey or aggregateByKey will yield much
better performance.
Note: By default, the level of parallelism in the output
depends on the number of partitions of the parent RDD.
You can pass an optional numPartitions argument to set a
different number of tasks.
reduceByKey(func, When called on a dataset of (K, V) pairs, returns a dataset

[numPartitions]) of (K, V) pairs where the values for each key are aggregated
using the given reduce function func, which must be of type
(V,V) => V. Like in groupByKey, the number of reduce
tasks is configurable through an optional second argument.
aggregateByKey(zero When called on a dataset of (K, V) pairs, returns a dataset

Value)(seqOp, combOp, of (K, U) pairs where the values for each key are aggregated
[numPartitions]) using the given combine functions and a neutral "zero"
value. Allows an aggregated value type that is different than
the input value type, while avoiding unnecessary
allocations. Like in groupByKey, the number of reduce
tasks is configurable through an optional second argument.
sortByKey([ascending], When called on a dataset of (K, V) pairs where K

[numPartitions]) implements Ordered, returns a dataset of (K, V) pairs sorted
by keys in ascending or descending order, as specified in
205
the Boolean ascending argument.
join(other Dataset, When called on datasets of type (K, V) and (K, W), returns
[numPartitions]) a dataset of (K, (V, W)) pairs with all pairs of elements for
each key. Outer joins are supported
through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
cogroup(other Dataset, When called on datasets of type (K, V) and (K, W), returns
[numPartitions]) a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This
operation is also called group With.
Cartesian(other Dataset) When called on datasets of types T and U, returns a dataset

of (T, U) pairs (all pairs of elements).
pipe(command, [envVars]) Pipe each partition of the RDD through a shell command,
e.g., a Perl or bash script. RDD elements are written to the
process's stdin and lines output to its stdout are returned as
an RDD of strings.
coalesce(numPartitions) Decrease the number of partitions in the RDD to

numPartitions. Useful for running operations more
efficiently after filtering down a large dataset.
repartition(numPartitions) Reshuffle the data in the RDD randomly to create either

more or fewer partitions and balance it across them. This
always shuffles all data over the network.
repartitionAndSortWithinPart Repartition the RDD according to the given partitioner and,

itions(partitioner) within each resulting partition, sort records by their keys.
This is more efficient than calling repartition and then
sorting within each partition because it can push the sorting
down into the shuffle machinery.
Table 4.1 Common transformations supported by spark
206
 Actions: compute a result based on an RDD, and either return it to the driver program or
save it to an external storage system (e.g., HDFS, S3). kick off a computation
hadoopexamLines.first()
Note: Transformations return RDDs, whereas actions return some other data type.
 Actions force the evaluation of the transformations required for the RDD they
were called on, since they need to actually produce output.
println("Input had " + resultsRDD.count() + " concerning lines")
println("Here are 2 examples:")
resultsRDD.take(2).foreach(println)
 Keep in mind that your entire dataset must fit in memory on a single machine
to use collect () on it, so collect () shouldn’t be used on large datasets.
 You can save the contents of an RDD using the saveAsTextFile() action,
saveAsSequenceFile() or any of a number of actions for various built-in
formats.
By default, each transformed RDD may be recomputed each time you run an action on it.
However, you may also persist an RDD in memory using the persist (or cache) method, in
which case Spark will keep the elements around on the cluster for much faster access the next
time you query it. There is also support for persisting RDDs on disk or replicated across
multiple nodes. The following table lists some of the common actions supported by Spark.
Refer to the RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java)
for details.
Action Meaning
reduce(func) Aggregate the elements of the dataset using a

function func (which takes two arguments and returns one).
The function should be commutative and associative so that
it can be computed correctly in parallel.
collect() Return all the elements of the dataset as an array at the

driver program. This is usually useful after a filter or other
operation that returns a sufficiently small subset of the data.
207
Action Meaning
count() Return the number of elements in the dataset.
first() Return the first element of the dataset (similar to take(1)).
take(n) Return an array with the first n elements of the dataset.
take Sample(with Return an array with a random sample of num elements of

Replacement, num, [seed]) the dataset, with or without replacement, optionally pre-
specifying a random number generator seed.
take Ordered(n, [ordering]) Return the first n elements of the RDD using either their
natural order or a custom comparator.
saveAsTextFile(path) Write the elements of the dataset as a text file (or set of text
files) in a given directory in the local file system, HDFS or
any other Hadoop-supported file system. Spark will call to
String on each element to convert it to a line of text in the
file.
saveAsSequenceFile(path) Write the elements of the dataset as a Hadoop Sequence File

(Java and Scala) in a given path in the local file system, HDFS or any other
Hadoop-supported file system. This is available on RDDs of
key-value pairs that implement Hadoop's Writable interface.
In Scala, it is also available on types that are implicitly
convertible to Writable (Spark includes conversions for
basic types like Int, Double, String, etc).
saveAsObjectFile(path) Write the elements of the dataset in a simple format using

(Java and Scala) Java serialization, which can then be loaded
using SparkContext.objectFile().
countByKey() Only available on RDDs of type (K, V). Returns a hash map
of (K, Int) pairs with the count of each key.
foreach(func) Run a function func on each element of the dataset. This is
208
Action Meaning
usually done for side effects such as updating

an Accumulator or interacting with external storage systems.
Note: modifying variables other than Accumulators outside
of the foreach() may result in undefined behavior.
See Understanding closures for more details.
Table 4.2 Common actions supported by spark
The Spark RDD API also exposes asynchronous versions of some actions, like
foreachAsync for foreach, which immediately return a Future Action to the caller
instead of blocking on completion of the action. This can be used to manage or wait for the
asynchronous execution of the action.
Shuffle Operations
Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark’s
mechanism for re-distributing data so that it’s grouped differently across partitions. This
typically involves copying data across executors and machines, making the shuffle a complex
and costly operation.
To understand what happens during the shuffle, we can consider the example of the
reduceByKey operation. The reduceByKey operation generates a new RDD where all values
for a single key are combined into a tuple - the key and the result of executing a reduce
function against all values associated with that key. The challenge is that not all values for a
single key necessarily reside on the same partition, or even the same machine, but they must
be co-located to compute the result.
In Spark, data is generally not distributed across partitions to be in the necessary place for a
specific operation. During computations, a single task will operate on a single partition - thus,
to organize all the data for a single reduceByKey reduce task to execute, Spark needs to
perform an all-to-all operation. It must read from all partitions to find all the values for all
keys, and then bring together values across partitions to compute the final result for each key
- this is called the shuffle.
209
Although the set of elements in each partition of newly shuffled data will be deterministic,
and so is the ordering of partitions themselves, the ordering of these elements is not. If one
desires predictably ordered data following shuffle, then it’s possible to use:
 map Partitions to sort each partition using, for example: sorted
 repartitionAndSortWithinPartitions to efficiently sort partitions while simultaneously

repartitioning
 sortBy to make a globally ordered RDD
Operations which can cause a shuffle include repartition operations like repartition and
coalesce, ‘By Key operations (except for counting) like groupByKey and reduceByKey and
join operations like cogroup and join.
Printing Elements of an RDD
Another common idiom is attempting to print out the elements of an RDD using
rdd.foreach(println) or rdd.map(println). On a single machine, this will generate the expected
output and print all the RDD’s elements. However, in cluster mode, the output to stdout being
called by the executors is now writing to the executor’s stdout instead, not the one on the
driver, so stdout on the driver won’t show these! To print all elements on the driver, one can
use the collect() method to first bring the RDD to the driver node thus:
rdd.collect().foreach(println). This can cause the driver to run out of memory, though,
because collect () fetches the entire RDD to a single machine; if you only need to print a few
elements of the RDD, a safer approach is to use the take (): rdd.take (100).foreach (println).
4.6 MACHINE LEARNING WITH MLLIB
Machine Learning is part of a broader umbrella known as Artificial Intelligence. Machine

learning refers to the study of statistical models to solve specific problems with patterns and
inferences. These models are “trained” for the specific problem by the means of training data
drawn from the problem space. Machine learning is truly an inter-disciplinary area of study.
It requires knowledge of the business domain, statistics, probability, linear algebra, and
programming.
210
As this can clearly get overwhelming, it's best to approach this in an orderly fashion, what we
typically call a machine learning workflow:
Fig 4.18 Machine learning workflow
As we can see, every machine learning project should start with a clearly defined problem
statement. This should be followed by a series of steps related to data that can potentially
answer the problem.
Then we typically select a model looking at the nature of the problem. This is followed by a
series of model training and validation, which is known as model fine-tuning. Finally, we test
the model on previously unseen data and deploy it to production if satisfactory.
Machine learning is getting popular in solving real-world problems in almost every business
domain. It helps solve the problems using the data, which is often unstructured, noisy, and in
huge size. With the increase in data sizes and various sources of data, solving machine
learning problems using standard techniques pose a big challenge. Spark is a distributed
processing engine using the Map Reduce framework to solve problems related to big data and
processing of it.
Spark MLlib is a module on top of Spark Core that provides machine learning primitives as
APIs. Machine learning typically deals with a large amount of data for model training.
The base computing framework from Spark is a huge benefit. On top of this, MLlib provides
most of the popular machine learning and statistical algorithms. This greatly simplifies the
task of working on a large-scale machine learning project.
Spark framework has its own machine learning module called MLlib. In this article, I will use
pyspark and spark MLlib to demonstrate the use of machine learning using distributed
processing. Readers will be able to learn the below concept with real examples.
211
Setting up Spark in the Google Colaboratory
Apache Spark is a unified computing engine and a set of libraries for parallel data processing
on computer clusters. As of the time this writing, Spark is the most actively developed open-
source engine for this task; making it the de facto tool for any developer or data scientist
interested in big data. Spark supports multiple widely used programming languages (Python,
Java, Scala, and R), includes libraries for diverse tasks ranging from SQL to streaming and
machine learning, and runs anywhere from a laptop to a cluster of thousands of servers. This
makes it an easy system to start with and scale up to big data processing or incredibly large
scale.
Once, we have set up the spark in Google colab and made sure it is running with the correct
version i.e., 3.0.1 in this case, we can start exploring the machine learning API developed on
top of Spark. PySpark is a higher-level Python API to use spark with python. For this tutorial,
I assume the readers have a basic understanding of Machine Learning and SK-Learn for
model building and training. Spark MLlib used the same fit and predicts structure as in SK-
Learn.
4.7 NOSQL
NOSQL databases (commonly interpreted by developers as „not only SQL databases‟ and
not „no SQL‟) is an emerging alternative to the most widely used relational databases. As
the name suggests, it does not completely replace SQL but compliments it in such a way that
they can co-exist. In this paper we will be discussing the NOSQL data model, types of
NOSQL data stores, characteristics and features of each data store, query languages used in
NOSQL, advantages and disadvantages of NOSQL over RDBMS and the future prospects of
NOSQL.
4.7.1 What is NoSQL?
When people use the term “NoSQL database”, they typically use it to refer to any non-
relational database. Some say the term “NoSQL” stands for “non-SQL” while others say it
stands for “not only SQL.” Either way, most agree that NoSQL databases are databases that
store data in a format other than relational tables. A common misconception is that NoSQL
databases or non-relational databases don’t store relationship data well. NoSQL databases can
store relationship data—they just store it differently than relational databases do.
212
In fact, when compared with SQL databases, many find modelling relationship data in
NoSQL databases to be easier than in SQL databases, because related data doesn’t have to be
split between tables.
The problem with relational model is that it has some scalability issues that is performance
degrades rapidly as data volumes increases. This led to the development of a new data model
i.e. NOSQL. Though the concept of NOSQL was developed a long time ago, it was after
the introduction of database as a service (DBaaS) that it gained a prominent recognition.
Because of the high scalability provided by NOSQL, it was seen as a major competitor to the
relational database model. Unlike RDBMS, NOSQL databases are designed to easily scale
out as and when they grow. Most NOSQL systems have removed the multi-platform support
and some extra unnecessary features of RDBMS, making them much more lightweight and
efficient than their RDMS counterparts. The NOSQL data model does not guarantee ACID
properties (Atomicity, Consistency, Isolation and Durability) but instead it guarantees BASE
properties (Basically Available, Soft state, Eventual consistency). It is in compliance with the
CAP (Consistency, Availability, Partition tolerance) theorem
NoSQL data models allow related data to be nested within a single data structure.
NoSQL databases emerged in the late 2000s as the cost of storage dramatically decreased.
Gone were the days of needing to create a complex, difficult-to-manage data model simply
for the purposes of reducing data duplication. Developers (rather than storage) were
becoming the primary cost of software development, so NoSQL databases optimized for
developer productivity.
4.7.2 Where is NoSQL Used?
NoSQL databases were created to handle big data as part of their fundamental architecture.
Additional engineering is not required as it is when SQL databases are used to handle web-
scale applications. The path to data scalability is straightforward and well understood.
NoSQL databases are often based on a scale-out strategy, which makes scaling to large data
volumes much cheaper than when using the scale-up approach the SQL databases take.
The scale-out strategy used by most NoSQL databases provides a clear path to scaling the
amount of traffic a database can handle.
213
Scale-out architectures also provide benefits such as being able to upgrade a database or
change its structure with zero downtime. The scale-out architecture is one of the most
affordable ways to handle large volumes of traffic.
The scalability of NoSQL databases allows one database to serve both transactional and
analytical workloads from the same database. In SQL databases, usually, a separate data
warehouse is used to support analytics.
NoSQL databases were created during the cloud era and have adapted quickly to the
automation that is part of the cloud. Deploying databases at scale in a way that supports
microservices is often easier with NoSQL databases. NoSQL databases often have superior
integration with real-time streaming technologies.
Let us understand its uses in detail
Session Store
Managing session data using relational database is very difficult, especially in case where
applications are grown very much.
In such cases the right approach is to use a global session store, which manages session
information for every user who visits the site.
NOSQL is suitable for storing such web application session information very is large in size.
Since the session data is unstructured in form, so it is easy to store it in schema less
documents rather than in relation database record.
User Profile Store
To enable online transactions, user preferences, authentication of user and more, it is required
to store the user profile by web and mobile application.
In recent time users of web and mobile application are grown very rapidly. The relational
database could not handle such large volume of user profile data which growing rapidly, as it
is limited to single server.
Using NOSQL capacity can be easily increased by adding server, which makes scaling cost
effective
214
Content and Metadata Store
Many companies like publication houses require a place where they can store large amount of
data, which include articles, digital content, and e-books, in order to merge various tools for
learning in single platform
The applications which are content based, for such application metadata is very frequently
accessed data which need less response times.
For building applications based on content, use of NoSQL provide flexibility in faster access
to data and to store different types of contents
Mobile Applications
Since the smartphone users are increasing very rapidly, mobile applications face problems
related to growth and volume.
Using NoSQL database mobile application development can be started with small size and
can be easily expanded as the number of user increases, which is very difficult if you
consider relational databases.
Since NoSQL database store the data in schema-less for the application developer can update
the apps without having to do major modification in database.
The mobile app companies like Kobo and Playtika, uses NOSQL and serving millions of
users across the world.
Third-Party Data Aggregation
Frequently a business requires to access data produced by third party. For instance, a
consumer-packaged goods company may require getting sales data from stores as well as
shopper’s purchase history.
In such scenarios, NoSQL databases are suitable, since NoSQL databases can manage huge
amount of data which is generating at high speed from various data sources.
Internet of Things
Today, billions of devices are connected to internet, such as smartphones, tablets, home
appliances, systems installed in hospitals, cars, and warehouses. For such devices large
volume and variety of data is generated and keep on generating.
215
Relational databases are unable to store such data. The NOSQL permits organizations to
expand concurrent access to data from billions of devices and systems which are connected,
store huge amount of data and meet the required performance.
E-Commerce
E-commerce companies use NoSQL for store huge volume of data and large amount of
request from user.
Social Gaming
Data-intensive applications such as social games which can grow users to millions. Such a
growth in number of users as well as amount of data requires a database system which can
store such data and can be scaled to incorporate number of growing users NOSQL is suitable
for such applications. NOSQL has been used by some of the mobile gaming companies like,
electronic arts, Zynga and Ten Cent.
4.7.3 Types of NoSQL databases
NoSQL Databases are mainly categorized into four types: Key-value pair, Column-oriented,
Graph-based, and Document-oriented. Every category has its unique attributes and
limitations. None of the above-specified database is better to solve all the problems. Users
should select the database based on their product needs. The different types of NoSQL
Databases:
 Key-value Pair Based
 Column-oriented Graph
 Graphs based
 Document-oriented
Key Value Pair Based
The key-value data stores are pretty simplistic but are quite efficient and powerful model. It
has a simple application programming interface (API). A key value data store allows the user
to store data in a schema less manner. The data is usually some kind of data type of a
programming language or an object. The data consists of two parts, a string which represents
the key and the actual data which is to be referred as value thus creating a „key-value‟ pair.
216
These stores are similar to hash tables where the keys are used as indexes, thus making it
faster than RDBMS Thus the data model is simple: a map or a dictionary that allows the user
to request the values according to the key specified. The modern key value data stores prefer
high scalability over consistency. Hence ad-hoc querying and analytics features like joins
and aggregate operations have been omitted. High concurrency, fast lookups and options for
mass storage are provided by key-value stores. One of the weaknesses of key value data sore
is the lack of schema which makes it much more difficult to create custom views of the data.
Data is stored in key/value pairs. It is designed in such a way to handle lots of data and heavy
load.
Key-value pair storage databases store data as a hash table where each key is unique, and the
value can be a JSON, BLOB (Binary Large Objects), string, etc. For example, a key-value
pair may contain a key like "Website" associated with a value like "Guru99".
It is one of the most basic NoSQL database examples. This kind of NoSQL database is used
as a collection, dictionaries, associative arrays, etc. Key value stores help the developer to
store schema-less data. They work best for shopping cart contents.
Redis, Dynamo, Riak are some NoSQL examples of key-value store Databases. They are all
based on Amazon's Dynamo paper. Amazon DynamoDB is a newly released fully managed
NOSQL database service offered by Amazon that provides a fast, highly reliable and cost-
effective NOSQL database service designed for internet scale applications.
Column-Based
Amazon DynamoDB is a newly released fully managed NOSQL database service offered by
Amazon that provides a fast, highly reliable and cost-effective NOSQL database service
designed for internet scale applications. Column-oriented databases work on columns and are
based on Big Table paper by Google. Every column is treated separately. Values of single
column databases are stored contiguously. Column oriented databases are suitable for data
mining and analytic applications, where the storage method is ideal for the common
operations performed on the data.
They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN etc.
as the data is readily available in a column. Column-based NoSQL databases are widely used
to manage data warehouses, business intelligence, CRM, Library card catalogues,
HBase, Cassandra, HBase, Hyper table are NoSQL query examples of column-based
database.
217
Document-Oriented
Document Store Databases refers to databases that store their data in the form of documents.
Document stores offer great performance and horizontal scalability options. Documents
inside a document-oriented database are somewhat similar to records in relational databases,
but they are much more flexible since they are schema less. The documents are of standard
formats such as XML, PDF, JSON etc. In relational databases, a record inside the same
database will have same data fields and the unused data fields are kept empty, but in case of
document stores, each document may have similar as well as dissimilar data.
Documents in the database are addressed using a unique key that represents that document.
These keys may be a simple string or a string that refers to URI or path. Document stores are
slightly more complex as compared to key-value stores as they allow to encase the key-value
pairs in document also known as key-document pairs. Document oriented databases should
be used for applications in which data need not be stored in a table with uniform sized fields,
but instead the data has to be stored as a document having special characteristics. Document
stores serve well when the domain model can be split and partitioned across some documents.
Document stores should be avoided if the database will have a lot of relations and
normalization.
The document type is mostly used for Content Management System (CMS), blogging
platforms, real-time analytics & e-commerce applications. It should not use for complex
transactions which require multiple operations or queries against varying aggregate
structures.
Amazon Simple DB, Couch DB, Mongo DB, Riak, Lotus Notes, Mongo DB, are popular
Document originated DBMS systems.
Graph-Based
A graph type database stores entity as well the relations amongst those entities. The entity is
stored as a node with the relationship as edges. An edge gives a relationship between nodes.
Every node and edge has a unique identifier.
218
4.7.4 Why NoSQL?
The concept of NoSQL databases became popular with Internet giants like Google,
Facebook, Amazon, etc. who deal with huge volumes of data. The system response time
becomes slow when you use RDBMS for massive volumes of data.
To resolve this problem, we could "scale up" our systems by upgrading our existing
hardware. This process is expensive. The alternative for this issue is to distribute database
load on multiple hosts whenever the load increases. This method is known as "scaling out."
4.7.5 Advantages of NoSQL
Here are some of the advantages of NoSQL
 Can be used as Primary or Analytic Data Source
 Big Data Capability
 No Single Point of Failure
 Easy Replication
 No Need for Separate Caching Layer
 It provides fast performance and horizontal scalability.
 Can handle structured, semi-structured, and unstructured data with equal effect
 Object-oriented programming which is easy to use and flexible
 NoSQL databases don't need a dedicated high-performance server
 Support Key Developer Languages and Platforms
 Simple to implement than using RDBMS
 It can serve as the primary data source for online applications.
 Handles big data which manages data velocity, variety, volume, and complexity
 Excels at distributed database and multi-data centre operations
 Eliminates the need for a specific caching layer to store data
 Offers a flexible schema design which can easily be altered without downtime or service
disruption
219
Advantages of NOSQL over Relational
 Provides a wide range of data models to choose from
 Easily scalable
 Database administrators are not required
 Some of the NOSQL DBaaS providers like Riak and Cassandra are programmed to
handle hardware failures
 Faster, more efficient, and flexible
 Has evolved at a very high pace
Disadvantages of NOSQL over Relational
 Immature
 No standard query language
 Some NOSQL databases are not ACID compliant
 No standard interface
 Maintenance is difficult
Future Prospects for NoSQL
Although NOSQL has evolved at a very high pace, it still lags behind relational database in
terms of number of users. The main reason behind this is that the users are more familiar with
SQL while NOSQL databases lack a standard query language. If a standard query language
for NOSQL is introduced, it will surely be a game changer. There are a few DBaaS providers
over the cloud like Xeround which works on the hybrid database model, that is, they have the
familiar SQL in the frontend and NOSQL in the backend. These databases night is not as fast
as a pure NOSQL database, but they still provide features of both relational as well as
NOSQL databases to the user. Thus, a lot of disadvantages of both relational as well as
NOSQL databases may be covered up. With a few more advancements in this hybrid
architecture the future prospects for NOSQL databases in DBaaS are excellent.
220
4.7.6 Use of NoSQL in Industry
NOSQL is a technology widely used by different business today. Here are some uses of
NOSQL in different industries.
Internet of Things
Today, billions of devices are connected to internet, such as smartphones, tablets, home
appliances, systems installed in hospitals, cars, and warehouses. For such devices large
volume and variety of data is generated and keep on generating.
Relational databases are unable to store such data. The NOSQL permits organizations to
expand concurrent access to data from billions of devices and systems which are connected,
store huge amount of data and meet the required performance.
E-Commerce
E-commerce companies use NoSQL for store huge volume of data and large amount of
request from user.
Social Gaming
Data-intensive applications such as social games which can grow users to millions. Such a
growth in number of users as well as amount of data requires a database system which can
store such data and can be scaled to incorporate number of growing users NOSQL is suitable
for such applications.
NOSQL has been used by some of the mobile gaming companies like, electronic arts, Zynga
and TenCent.
Ad Targeting
Displaying ads or offers on the current web page is a decision with direct income to
determine what group of users to target, on web page where to display ads, the platforms
gather behavioural and demographic characteristics of users.
A NoSQL database enables ad companies to track user details and also place the very quickly
and increases the probability of clicks.
AOL, Media mind and PayPal are some of the ads targeting companies which use NoSQL.
221
4.8 DEFINITION OF SQL
SQL stands for Structured Query Language, which is the language used when communicating
with databases. A snippet of SQL typically looks something like this:
SELECT * FROM TABLE...
It is a domain-specific language used in programming and designed for managing data held in
a relational database management system (RDBMS), or for stream processing in a relational
data stream management system (RDSMS). It is particularly useful in handling structured
data, i.e., data incorporating relations among entities and variables.
SQL allows you to create, read, update, and delete—also known as CRUD operations—
through a universal language that is pretty much consistent across multiple underlying
relational database engines, such as MySQL, PostgreSQL, or Microsoft SQL Server.
When talking about databases, there are 4 key components that are important to consider:
 Structure
 Scale
 Storage
 Access
Let's take a look at how they relate to SQL.
 Structure
In a relational database engine, you typically interact with tables.
A table consists of rows and columns; the columns correspond to types, while rows
correspond to the individual entities that exist in the table
In a SQL table, you must have a primary key which corresponds to the unique identifier
that identifies a specific row on the table.
 Storage
In terms of storage, the pattern is concentrated. So, in a relational database engine, there's
typically one node that contains the entirety of your data; it's not partitioned or
segregated in any way unless you're using some advanced strategies.
222
 Scale
There are two approaches in terms of scale:
Horizontal scaling: This means adding more machines. When you add more machines to
a horizontally scaled RDS environment, you typically perform that by distributing your
data across multiple nodes.
Vertical scaling: If you have a machine hosting your database engine and you're not
getting enough performance based on the machine's physical limitations, the option here
is to build a better machine (more RAM, better CPU, and faster SSD) to host your
database engine.
 Access
In terms of access, it's typically raw SQL, so you'll be writing the CRUD syntax for your
queries. You'll need a direct database connection to the endpoint of the database, and
these days people are using ORM (Object Relational Mapper) to construct their queries.
These are abstractions that are used to add criteria to an object in a very programmatic
way, and they allow that to generate a SQL statement.
SQL offers two main advantages over older read–write APIs such as ISAM or VSAM.
Firstly, it introduced the concept of accessing many records with one single command.
Secondly, it eliminates the need to specify how to reach a record, e.g., with or without an
index.
Originally based upon relational algebra and tuple relational calculus, SQL consists of many
types of statements, which may be informally classed as sublanguages, commonly: a data
query language (DQL), a data definition language (DDL), a data control language (DCL), and
a data manipulation language (DML). The scope of SQL includes data query, data
manipulation (insert, update and delete), data definition (schema creation and modification),
and data access control. Although SQL is essentially a declarative language (4GL), it also
includes procedural elements.
SQL was one of the first commercial languages to use Edgar F. Codd’s relational model. The
model was described in his influential 1970 paper, "A Relational Model of Data for Large
Shared Data Banks". Despite not entirely adhering to the relational model as described by
Codd, it became the most widely used database language.
223
SQL became a standard of the American National Standards Institute (ANSI) in 1986, and of
the International Organization for Standardization (ISO) in 1987.Since then, the standard has
been revised to include a larger set of features. Despite the existence of standards, most SQL
code requires at least some changes before being ported to different database systems.
4.9 SQL VS NOSQL
When it comes to storing data, we generally have two options: SQL (relational databases) and
NoSQL (non-relational databases).
The idea for SQL was first introduced in 1970 by Edgar F. Codd in his model for relational
database management. This type of database stores data in rows and columns like a spread
sheet, assigning a specific key for each row.
NoSQL came along in the 1990s, with the term officially being coined in 1998 by Carlo
Strozzi. This type of database is not limited to the tabular schema of rows and columns found
in SQL database systems.
SQL is the best database to use for relational data, especially when the relationship between
data sets is well-defined and highly navigable. It is also best for assessing data integrity. If
you need flexible access to data, SQL allows for high-level ad-hoc queries, and, in most
cases, SQL databases are vertically scalable (i.e., increase a single server workload by
increasing RAM, CPU, SSD, etc.).
Some SQL databases support NoSQL-style workloads via special features (e.g., native
JavaScript Object Notation (JSON) data types). If you don’t need the horizontal scalability
found in NoSQL data stores, these databases are also good for many non-relational
workloads. This makes them useful for working with relational and unstructured data without
the complexity of different data stores.
Though NoSQL is simple, users must consider the implications of the data stores when
building applications. They must also consider write consistency, eventual consistency, and
the impact of shading on data access and storage. On the other hand, these concerns do not
apply to SQL databases, which make them simpler to build applications on. In addition, their
wide usage and versatility simplifies complex queries.
224
NoSQL is the best database to use for large amounts of data or ever-changing data sets. It is
also best to use when you have flexible data models or needs that don't fit into the relational
model. If you are working with large amounts of unstructured data, “document databases”
(e.g., CouchDB, MongoDB, Amazon Document DB) are a good fit. If you need quick access
to a key-value store without strong integrity guarantees, Redis is a great fit. In need of
complex or flexible search across a lot of data? Elastic search is a perfect fit.
Horizontal scalability is a core tenet of many NoSQL data stores. Unlike in SQL, their built-
in shading and high availability requirements ease horizontal scaling (i.e., “scaling out”).
Furthermore, NoSQL databases like Cassandra have no single points of failure, so
applications can easily react to underlying failures of individual members.
Conclusion and next steps
Selecting or suggesting a database is a key responsibility for most database experts, and
“SQL vs. NoSQL'' is a helpful rubric for informed decision-making. When considering either
database, it is also important to consider critical data needs and acceptable trade-offs
conducive to meeting performance and uptime goals.
IBM Cloud supports cloud-hosted versions of several SQL and NoSQL databases with its
cloud-native databases. For more guidance on selecting the best option for you, check out "A
Brief Overview of the Database Landscape" and "How to Choose a Database on IBM Cloud."
4.10 NEW SQL
Most programmers are familiar with SQL and the relational database management systems,
or RDBMSs, like MySQL or PostgreSQL. The basic principles for such architectures have
been around for decades. Around 2000s came NoSQL solutions, like MongoDB or
Cassandra, developed for distributed, scalable data needs.
But, for the past few years, there has been a new kid on the block: NewSQL.
NewSQL is a new approach to relational databases that wants to combine transactional ACID
(atomicity, consistency, isolation, durability) guarantees of good ol’ RDBMSs and the
horizontal scalability of NoSQL. It sounds like a perfect solution, the best of both worlds.
What took it so long to arrive?
225
Databases were born out of a need to separate code from data in the mid-1960s. These first
databases were designed with several considerations:
 The number of users querying the database is limited.
 The types of queries are unlimited – the developer can use any query they want.
 Hardware is quite expensive.
In those days of developers entering interactive queries to a terminal, as the only users with
access to the database, these considerations were relevant and valuable. Correctness and
consistency were the two important metrics, rather than today’s metrics of performance and
availability. Vertical scaling was the solution to growing data needs, and downtime needed
for the data to be moved in case of database migration or recovery was bearable.
Fast forwarding a couple of decades, the requirements from databases on the Internet and
cloud era are much more different. The scale of data is enormous, and commodity hardware
is much cheaper compared to the 20th-century costs.
As the scale of data grew and real-time interactions through Internet became widespread,
basic needs from databases started to be divided into the two main categories of OLAP and
OLTP, Online Analytical Processing and Online Transaction Processing, respectively.
OLAP databases are commonly known as data warehouses. They store a historical footprint
for statistical analysis purposes in business intelligence operations. OLAP databases are thus
focused on read-only workloads with ad-hoc queries for batch processing. The number of
users querying the database is considerably low, as usually, only the employees of a company
have access to the historical information.
OLTP databases correspond to the highly concurrent, transactional data processing,

characterized by short-lived and pre-defined queries enacted by real-time users. Searches a
regular user does on an e-commerce website and buying of items are basic examples of
transactional processing. While the users access a smaller subset of the data when compared
with OLAP users, the number of users is considerably higher, and the queries can include
both read and write operations. The important considerations in OLTP databases thus are
high availability, concurrency, and performance. For most websites, for any given time, there
are hundreds or thousands of users effectively querying the database concurrently. With this
scale in mind, the system needs to be highly available, as every minute of downtime can cost
the bigger companies thousands or even millions of dollars.
226
On websites, the queries made by the users are pre-defined; the users do not have access to
the terminal of the database to execute any query that they’d like. The queries are buried in
the application logic. This allows for optimizations towards high performance.
In the new database ecosystem where scalability is an important metric, and high availability
is essential for making profits, NoSQL databases were offered as a solution for achieving
easier scalability and better performance, opting for an AP design from the CAP theorem.
However, this meant giving up strong consistency and the transactional ACID properties
offered by RDMBSs in favour of eventual consistency in most NoSQL designs.
NoSQL databases use a different model than the relational, such as key-value, document,
wide-column, or graph. With these models, NoSQL databases are not normalized, and are
inherently scheme less by design. Most NoSQL databases support auto-sharding, allowing for
easy horizontal scaling without developer intervention.
NoSQL can be useful for applications such as social media, where eventual consistency is
acceptable – users do not notice if they see a non-consistent view of the database, and since
the data involves status updates, tweets, etc. strong consistency is not essential. However,
NoSQL databases are not easy to use for systems where consistency is critical, such as e-
commerce platforms.
NewSQL systems are born out of the desire to combine the scalability and high availability of
NoSQL alongside the relational model, transaction support, and SQL of traditional RDBMSs.
The one-size-fits-all solutions are at an end, and specialized databases for different workloads
like OLTP started to rise. Most NewSQL databases are born out of a complete redesign
focused heavily on OLTP or hybrid workloads.
Traditional RDMBS architecture was not designed with a distributed system in mind. Rather,
when the need arose, support for distributed designs was built as an afterthought on top of the
original design. Due to their normalized structure, rather than the aggregated form of NoSQL,
RDBMS had to introduce complicated concepts to both scale out and conserve its consistency
requirements. Manual sharding and master-slave architectures were developed to allow
horizontal scaling.
However, RDBMS loses much of its performance when scaling out, as joins become more
costly with moving data between different nodes for aggregation, and maintenance overhead
became time consuming. To preserve the performance, complex systems and products were
developed – but today, still, traditional RDBMSs are not regarded as inherently scalable.
227
NewSQL databases are built for the cloud era, with a distributed architecture in mind from
the start.
4.11 SUMMARY
 SQL which stands for Structured Query Language is a language to manage and
communicate with databases. For instance, it is used for database creation, deletion,
update rows by fetching rows, modifying rows, etc.
 SQL statements are used for tasks like updating data on a database or retrieving data
from a database.
 SQL is an ANSI (American National Standards Institute) standard language. There

are other versions of the SQL language like T-SQL by Microsoft, PSQL by Interbase/
Firebird, etc. Therefore, the most common relational database management systems
that use SQL are Sybase, Microsoft SQL Server, Oracle, Ingres, Access, etc. Despite
that most of these database systems use SQL, most of them have other additional
proprietary extensions that are made particularly for their system.
 John Tukey re-invigorated the practice of exploratory data analysis, and massively
promoted the phrase itself with his book of the same name. One of the simplest yet
most useful tools proposed by Tukey is the five number summaries. In tribute to its
usefulness, R has a single command to obtain this summary from any data set —
fivenum() — which is in base R.
 This consists very simply of the minimum, maximum, median, first quartile and third
quartile of a variable. Given the maximum and minimum have always been standard
aggregates, and there is no need to use tricky statistical distributions or matrix algebra
to prepare any of the values, it might be supposed that this summary should be easily
available from any SQL implementation.
 However, as recently as the early 2000s, it was actually quite rigmarole.
228
 Thankfully, while many things, such as popular music, have clearly deteriorated since
then (I reached the age that entitled me to drive during the Clinton administration,
which strongly correlates to the period when popular music sounded best to me), both
standard SQL and the biggest implantations have introduced useful new features since
then.
 IBM Research developed and defined SQL, and ANSI/ISO has refined SQL as the
standard language for relational database management systems. The SQL
implemented by Oracle Corporation for Oracle is 100% compliant at the Entry Level
with the ANSI/ISO 1992 standard SQL data language
 Oracle SQL includes many extensions to the ANSI/ISO standard SQL language, and
Oracle tools and applications provide additional commands. The Oracle tools
SQL*Plus and Server Manager allow you to execute any ANSI/ISO standard SQL
statement against an Oracle database, as well as additional commands or functions
that are available for those tools.
4.12 KEYWORDS
 SQL: SQL is a database computer language designed for the retrieval and
management of data in a relational database. SQL stands for Structured Query
Language. This tutorial will give you a quick start to SQL. It covers most of the topics
required for a basic understanding of SQL and to get a feel of how it works.
 NoSQL: When people use the term “NoSQL database”, they typically use it to refer
to any non-relational database. Some say the term “NoSQL” stands for “non-SQL”
while others say it stands for “not only SQL.” Either way, most agree that NoSQL
databases are databases that store data in a format other than relational tables. A
common misconception is that NoSQL databases or non-relational databases don’t
store relationship data well. NoSQL databases can store relationship data—they just
store it differently than relational databases do. In fact, when compared with SQL
databases, many find modelling relationship data in NoSQL databases to be easier
than in SQL databases, because related data doesn’t have to be split between tables.
NoSQL data models allow related data to be nested within a single data structure.
229
 RDD: RDD was the primary user-facing API in Spark since its inception. At the core,
an RDD is an immutable distributed collection of elements of your data, partitioned
across nodes in your cluster that can be operated in parallel with a low-level API that
offers transformations and actions. One of the most important capabilities in Spark is
persisting (or caching) a dataset in memory across operations. When you persist an
RDD, each node stores any partitions of it that it computes in memory and reuses
them in other actions on that dataset (or datasets derived from it). This allows future
actions to be much faster (often by more than 10x). Caching is a key tool for iterative
algorithms and fast interactive use. In addition, each persisted RDD can be stored
using a different storage level, allowing you, for example, to persist the dataset on
disk, persist it in memory but as serialized Java objects (to save space), replicate it
across nodes. These levels are set by passing a Storage Level object (Scala, Java, and
Python) to persist (). The cache () method is a shorthand for using the default storage
level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in
memory).
 Apache Spark: It is an open-source unified analytics engine for large-scale data

processing. Spark provides an interface for programming entire clusters with implicit
data parallelism and fault tolerance. Originally developed at the University of
California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache
Software Foundation, which has maintained it since.
 Machine Learning: Machine learning is a method of data analysis that automates

analytical model building. It is a branch of artificial intelligence based on the idea that
systems can learn from data, identify patterns, and make decisions with minimal
human intervention.
230
4.13 LEARNING ACTIVITY
1. Carry a new database with SQL and insert new data in the database.
___________________________________________________________________________
___________________________________________________________________________
2. Delete the data and create a new table in one database or even drop the table. Set
permissions for table, procedures, and views, and creating function, views and storing
procedures.
___________________________________________________________________________
___________________________________________________________________________
4.14 UNIT END QUESTIONS
A. Descriptive Questions
Short Questions
1. What is SQL?
2. Differentiate between SQL and NoSQL?
3. Where is no SQL used?
4. How to download data with Spark?
5. What are the uses of RDD?
Long Questions
1. Illustrate the steps of getting started through Spark
2. What are the advantages of NoSQL?
3. What are the types of NoSQL Databases?
4. What are the uses of NoSQL in industry?
5. Illustrate Machine Learning with RDD.
231
B. Multiple Choice Questions
1. What is the full form of SQL?
a. Structured Query Language
b. Structured Query List
c. Simple Query Language
d. None of these
2. Which is the subset of SQL commands used to manipulate Oracle Database structures
including Tables?
a. Data definition Language
b. Data manipulation language
c. Both of these
d. None of these
3. Which operator performs Pattern Matching?
a. Between operators
b. Exists Operator
c. Like operator
d. None of these
4. What operator tests column for absence of data?
a. Exists operator
b. Not operator.
c. Null operator.
d. None of these
232
5. In SQL which commands are used to change a table’s storage characteristics?
a. ALTER table
b. MODIFY table
c. CHANGE table
d. All of these
Answers
1-a, 2-a, 3-c, 4-c, 5-a
4.15 REFERENCES
Reference Books
 Nazari, E., Shahriari, M. H., & Tabesh, H. (2019). BigData Analysis in Healthcare:
Apache Hadoop, Apache spark and Apache Flink. Frontiers in Health Informatics, 8(1),
14.
 Big Data Analysis using Apache Hadoop and Spark. (2019). International Journal of
Recent Technology and Engineering, 8(2), 167–170.
 García-Gil, D., Ramírez-Gallego, S., García, S., & Herrera, F. (2017). A comparison on
scalability for batch big data processing on Apache Spark and Apache Flink. Big Data
Analytics, 2(1). https://doi.org/10.1186/s41044-016-0020-2
Textbooks
 Perrin, J. (2020). Spark in Action, Second Edition: Covers Apache Spark 3 with
Examples in Java, Python, and Scala (2nd ed.). Manning Publications.
 Touil, M. (2019). Big Data: Spark Hadoop and Their databases. Independently
published.
 Dasgupta, N. (2018). Practical Big Data Analytics: Hands-on techniques to implement

enterprise analytics and machine learning using Hadoop, Spark, NoSQL, and R (1st ed.).
Packt Publishing.
233
Websites
 https://towardsdatascience.com/machine-learning-with-spark-f1dbc1363986
 https://www.baeldung.com/spark-mlib-machine-learning
 https://spark.apache.org/docs/latest/rdd-programming-guide.html
 https://spark.apache.org/research.html
234

Unit 4

Uploaded by

Copyright:

Available Formats

Unit 4

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Unit 4

Uploaded by

Copyright:

Available Formats

UNIT - 4 SPARK

4.0 Learning Objectives

4.2 Data Analysis with Spark

4.3 Downloading Spark

4.4 Getting Started with Spark

4.5 Programming with RDDs

4.6 Machine Learning with MLlib

4.7.1 What is NoSQL?

4.7.2 Where is NoSQL Used?

4.7.3 Types of NoSQL databases

4.7.4 Why NoSQL?

4.7.5 Advantages of NoSQL

4.7.6 Use of NoSQL in Industry

4.8 Definition of SQL

4.9 SQL vs NoSQL

4.10 New SQL

4.13 Learning Activity

4.14 Unit End Questions

After studying this unit, you will be able to:

 Describe the Data Analysis with Spark.

 Define the Programming with RDDs.

 Explain Machine Learning with MLlib.

 Elucidate the NoSQL.

 Describe the New SQL

Apache Spark is a lightning-fast cluster computing technology, designed for fast

Spark is one of Hadoop’s sub projects developed in 2009 in UC Berkeley’s AMPLab by

4.2 DATA ANALYSIS WITH SPARK

Data analysis is defined as a process of cleaning, transforming, and modelling data to

Apache Spark has following features.

There are three ways of Spark deployment as explained below.

4.3 DOWNLOADING SPARK

Fig 4.1 Installation of Spark on windows

 A system running Windows 10

 Command Prompt or PowerShell

 A tool to extract .tar files, such as 7-Zip

Install Apache Spark on Windows

Step 1: Install Java 8

 Type the following command in the command prompt:

If Java is installed, it will respond with the following output:

Fig 4.2 Response if java is already installed

If you don’t have Java installed:

 Open a browser window, and navigate to https://java.com/en/download/

Fig 4.3 Java download screen

 Once the download finishes double-click the file to install Java.

Step 2: Install Python

 To install the Python package manager, navigate to https://www.python.org/ in your web

 Once the download finishes, run the file.

Fig 4.4 Python download screen

 Next, click Customize installation.

 Select that folder and click OK.

 Click Install and let the installation complete.

 The output should print

Step 3: Download Apache Spark

 Open a browser and navigate to https://spark.apache.org/downloads.html.

 Click the spark-2.4.5-bin-hadoop2.7.tgz link.

Fig 4.7 Apache spark download screen

Step 4: Verify Spark Software File

 Next, open a command line and enter the following command:

certutil -hashfile c:\users\username\Downloads\spark-2.4.5-bin-hadoop2.7.tgz