
BIGDATA


Before big data technologies were introduced, data was managed with general-purpose programming
languages and basic structured query languages.

However, these languages were not efficient enough to handle the data, because each organization's
information, data, and domain kept growing continuously.
That is why it became very important to handle such huge data and to introduce an efficient and
stable technology that takes care of all the requirements and needs of clients and large
organizations, and is responsible for data production and control.

What is Big Data Technology?


1. Big data technology is defined as a software utility.
2. It is primarily designed to analyze, process, and extract information from large data sets with
extremely complex structures.
3. Such data is very difficult for traditional data processing software to deal with.
4. Big data technologies are widely associated with other rapidly growing technologies such as
deep learning, machine learning, artificial intelligence (AI), and the Internet of Things (IoT),
which they massively augment.
5. In combination with these technologies, big data technologies focus on analyzing and handling
large amounts of real-time data and batch data.

Types of Big Data Technology (Short note on any one)


1. Operational Big Data Technologies
2. Analytical Big Data Technologies

Operational Big Data Technologies


1. This type of big data technology mainly covers the basic day-to-day data that people generate
and process.
2. Typically, operational big data includes daily data such as online transactions, social media
activity, and the data of a particular organization or firm, which is usually analyzed later
using software based on big data technologies.
3. This data can also be regarded as raw data that serves as the input for several analytical big
data technologies.
4. Some specific examples of operational big data technologies are listed below:
● Online ticket booking systems, e.g., for buses, trains, flights, and movies
● Online trading or shopping on e-commerce websites like Amazon, Flipkart, Walmart, etc.
● Online data on social media sites such as Facebook, Instagram, WhatsApp, etc.
● Employee data or executives' particulars in multinational companies

Analytical Big Data Technologies


1. Analytical big data is commonly referred to as an improved version of big data technologies.
2. This type of big data technology is somewhat more complex than operational big data.
3. Analytical big data is mainly used when performance criteria matter and important real-time
business decisions are made based on reports created by analyzing operational data.
4. This means that the actual investigation of big data that is important for business decisions
falls under this type of big data technology.
5. Some common examples involving analytical big data technologies are listed below:
● Stock market data
● Weather forecasting data and time series analysis
● Medical health records, through which doctors can personally monitor the health status of an
individual
● Space mission databases, where every piece of mission information is important

Top Big Data Technologies (short note)


We can categorize the leading big data technologies into the following four sections:

1. Data Storage
2. Data Mining
3. Data Analytics
4. Data Visualization

1. Data Storage
Big Data Technologies that come under Data Storage:
○ Hadoop: When it comes to handling big data, Hadoop is one of the leading technologies
that come into play. This technology is based on the MapReduce architecture and is mainly
used to process data in batches. The Hadoop framework was introduced to store and process
data in a distributed processing environment, running in parallel on commodity hardware
with a simple programming model.
Apart from this, Hadoop is also well suited to storing and analyzing data from various
machines at high speed and low cost. That is why Hadoop is known as one of the core
components of big data technologies. The Apache Software Foundation released Hadoop 1.0 in
December 2011. Hadoop is written in the Java programming language.

○ MongoDB: MongoDB is another important component of big data technologies in terms of
storage. Relational (RDBMS) properties do not apply to MongoDB because it is a NoSQL
database. Unlike traditional RDBMS databases that use structured query languages, MongoDB
stores schema-flexible documents.
The data storage structure in MongoDB also differs from that of traditional RDBMS
databases, which enables MongoDB to hold massive amounts of data. It is based on a simple
cross-platform, document-oriented design and stores JSON-like documents with optional
schemas. This ultimately supports the operational data storage needs seen in most
financial organizations. As a result, MongoDB is replacing traditional mainframes and
offering the flexibility to handle a wide range of high-volume data types in distributed
architectures.
MongoDB Inc. introduced MongoDB in February 2009. It is written in a combination of C++,
Python, JavaScript, and Go.

2. Data Mining
Presto: Presto is an open-source, distributed SQL query engine developed to run interactive
analytical queries against huge data sources. The size of data sources can vary from gigabytes to
petabytes. Presto supports querying data in Cassandra, Hive, relational databases, and proprietary
data storage systems. Presto is a Java-based query engine that was originally developed at
Facebook and open-sourced in 2013. Companies like Repro, Netflix, and Facebook use this big data
technology and make good use of it.

3. Data Analytics

○ Apache Kafka: Apache Kafka is a popular streaming platform. It is primarily known for its
three core capabilities: publishing and subscribing to streams of records, storing them
durably, and processing them as they arrive. It is referred to as a distributed streaming
platform and can also be described as an asynchronous messaging broker that can ingest and
process real-time streaming data. In that sense, it is similar to an enterprise messaging
system or message queue.
Besides this, Kafka provides a configurable retention period, and data is transmitted
through a producer-consumer mechanism. Kafka has received many enhancements to date and
includes additional features such as the schema registry, KTables, and KSQL. It is written
in Java; it was originally developed at LinkedIn and open-sourced through the Apache
Software Foundation in 2011. Some top companies using the Apache Kafka platform include
Twitter, Spotify, Netflix, Yahoo, and LinkedIn.
○ Spark: Apache Spark is one of the core technologies in the list of big data technologies
and is widely used by top companies. Spark is known for offering in-memory computing
capabilities that help enhance the overall speed of processing. It also provides a
generalized execution model to support more applications, as well as high-level APIs in
Java, Scala, and Python to ease development.
Spark also allows users to process and handle real-time streaming data using batching and
windowing techniques, and it builds Datasets and DataFrames on top of RDDs. Components
such as Spark MLlib, GraphX, and SparkR help with machine learning and data science
workloads. Spark is written in Java, Scala, Python, and R. It was originally developed at
UC Berkeley's AMPLab in 2009 and is now maintained by the Apache Software Foundation.
Companies like Amazon, Oracle, Cisco, Verizon Wireless, and Hortonworks use this big data
technology and make good use of it.
○ R Language: R is a programming language mainly used for statistical computing and
graphics. It is a free software environment used by leading data miners, practitioners,
and statisticians. The language is primarily useful for developing statistical software
and for data analytics.
R 1.0.0 was released in February 2000 by the R Foundation. It is implemented primarily in
C, Fortran, and R itself. Companies like Barclays, American Express, and Bank of America
use R for their data analytics needs.

4. Data Visualization

○ Tableau: Tableau is one of the fastest and most powerful data visualization tools used by
leading business intelligence industries. It helps analyze data at very high speed and
creates visualizations and insights in the form of dashboards and worksheets.
Tableau is developed and maintained by Tableau Software, founded in 2003 and now part of
Salesforce. It is written using multiple languages, such as Python, C, C++, and Java.
Comparable BI products in this space include IBM Cognos and Oracle Hyperion.
○ Plotly: As the name suggests, Plotly is best suited to plotting or creating graphs and
related components quickly and efficiently. It offers rich libraries and APIs for MATLAB,
Python, Julia, REST, Arduino, R, Node.js, etc., which help build interactively styled
graphs in tools such as Jupyter Notebook and PyCharm.
Plotly was introduced in 2012 by the Plotly company. It is based on JavaScript. Paladins
and Bitbank are among the companies making good use of Plotly.
******************************************************************************
What is Apache Hadoop?

● Apache Hadoop software is an open source framework that allows for the distributed
storage and processing of large datasets across clusters of computers using simple
programming models.
● Hadoop is designed to scale up from a single computer to thousands of clustered
computers, with each machine offering local computation and storage.
● In this way, Hadoop can efficiently store and process large datasets ranging in size from
gigabytes to petabytes of data.

Explain the primary Hadoop framework modules that work collectively to form the Hadoop ecosystem:

1. Hadoop Distributed File System (HDFS): As the primary component of the Hadoop
ecosystem, HDFS is a distributed file system in which individual Hadoop nodes operate
on data that resides in their local storage. This reduces network latency, providing high-
throughput access to application data. In addition, administrators don’t need to define
schemas up front.
2. Yet Another Resource Negotiator (YARN): YARN is a resource-management platform
responsible for managing compute resources in clusters and using them to schedule users’
applications. It performs scheduling and resource allocation across the Hadoop system.
3. MapReduce: MapReduce is a programming model for large-scale data processing. In the
MapReduce model, subsets of larger datasets and instructions for processing the subsets
are dispatched to multiple different nodes, where each subset is processed by a node in
parallel with other processing jobs. After processing, the results from the individual
subsets are combined into a smaller, more manageable dataset (a sketch of this model,
expressed in this document's Spark/Scala style, follows this list).
4. Hadoop Common: Hadoop Common includes the libraries and utilities used and shared
by other Hadoop modules.
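The MapReduce model in item 3 can be sketched using Spark's Scala API from later in this document;
this is an illustration of the map-shuffle-reduce pattern only, not Hadoop's native Java MapReduce
API, and the HDFS paths are hypothetical:

// "Map" phase: split lines into words and emit (word, 1) pairs;
// "Reduce" phase: combine the counts for each key.
val lines = sc.textFile("hdfs:///data/input.txt")
val counts = lines
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs:///data/word_counts")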

Beyond HDFS, YARN, and MapReduce, the entire Hadoop open source ecosystem continues to
grow and includes many tools and applications to help collect, store, process, analyze, and
manage big data. These include Apache Pig, Apache Hive, Apache HBase, and Apache Spark.
How does Hadoop work?

1. Hadoop allows for the distribution of datasets across a cluster of commodity hardware.
Processing is performed in parallel on multiple servers simultaneously.
2. Software clients input data into Hadoop. HDFS handles metadata and the distributed file
system. MapReduce then processes and converts the data. Finally, YARN divides the
jobs across the computing cluster.
3. All Hadoop modules are designed with a fundamental assumption that hardware failures
of individual machines or racks of machines are common and should be automatically
handled in software by the framework.

What are the benefits of Hadoop?

Scalability

Hadoop is important as one of the primary tools to store and process huge amounts of data
quickly. It does this by using a distributed computing model which enables the fast processing of
data that can be rapidly scaled by adding computing nodes.
Low cost

As an open source framework that can run on commodity hardware and has a large ecosystem of
tools, Hadoop is a low-cost option for the storage and management of big data.
Flexibility

Hadoop allows for flexibility in data storage, as data does not require preprocessing before
storage. This means that an organization can store as much data as it likes and then utilize it
later.
Resilience
As a distributed computing model, Hadoop allows for fault tolerance and system resilience,
meaning that if one of the hardware nodes fails, jobs are redirected to other nodes. Data stored on one
Hadoop cluster is replicated across other nodes within the system to fortify against the possibility
of hardware or software failure.

What are the challenges of Hadoop?

1. MapReduce complexity and limitations

As a file-intensive system, MapReduce can be a difficult tool to utilize for complex jobs, such as
interactive analytical tasks. MapReduce functions also need to be written in Java and can require
a steep learning curve. The MapReduce ecosystem is quite large, with many components for
different functions that can make it difficult to determine what tools to use.

2. Security

Data sensitivity and protection can be issues as Hadoop handles such large datasets. An
ecosystem of tools for authentication, encryption, auditing, and provisioning has emerged to help
developers secure data in Hadoop.

3. Governance and management

Hadoop does not have many robust tools for data management and governance, nor for data
quality and standardization.

4. Talent gap
Like many areas of programming, Hadoop has an acknowledged talent gap. Finding developers
with the combined requisite skills in Java to program MapReduce, operating systems, and
hardware can be difficult. In addition, MapReduce has a steep learning curve, making it hard to
get new programmers up to speed on its best practices and ecosystem.

What are Hadoop tools?

Hadoop has a large ecosystem of open source tools that can augment and extend the capabilities
of the core module. Some of the main software tools used with Hadoop include:

Apache Hive: A data warehouse that allows programmers to work with data in HDFS using a
query language called HiveQL, which is similar to SQL

Apache HBase: An open source non-relational distributed database often paired with Hadoop

Apache Pig: A tool used as an abstraction layer over MapReduce to analyze large sets of data
and enables functions like filter, sort, load, and join

Apache Impala: Open source, massively parallel processing SQL query engine often used with
Hadoop

Apache Sqoop: A command-line interface application for efficiently transferring bulk data
between relational databases and Hadoop

Apache ZooKeeper: An open source server that enables reliable distributed coordination in
Hadoop; a service for "maintaining configuration information, naming, providing distributed
synchronization, and providing group services"
Apache Oozie: A workflow scheduler for Hadoop jobs

What is Apache Hadoop used for?

Here are some common use cases for Apache Hadoop:

1. Analytics and big data

A wide variety of companies and organizations use Hadoop for research, production data
processing, and analytics that require processing terabytes or petabytes of big data, storing
diverse datasets, and data parallel processing.

2. Data storage and archiving

As Hadoop enables mass storage on commodity hardware, it is useful as a low-cost storage
option for all kinds of data, such as transactions, click streams, or sensor and machine data.

3. Data lakes

Since Hadoop can help store data without preprocessing, it can be used to complement data
lakes, where large amounts of unrefined data are stored.

4. Marketing analytics

Marketing departments often use Hadoop to store and analyze customer relationship
management (CRM) data.

5. Risk management

Banks, insurance companies, and other financial services companies use Hadoop to build risk
analysis and management models.
6. AI and machine learning

Hadoop ecosystems help with the processing of data and model training operations for machine
learning applications.
************************************************************************

Evolution of Apache Spark

1. Spark was developed by Matei Zaharia in 2009 as a research project at UC Berkeley's
AMPLab, which focused on big data analytics.
2. The fundamental motive and goal behind developing the framework was to overcome the
inefficiencies of MapReduce.
3. Even though MapReduce was a huge success and had wide acceptance, it could not be
applied to a wide range of problems.
4. MapReduce is not efficient for multi-pass applications that require low-latency data
sharing across multiple parallel operations. Many data analytics applications fall into
this category, including:
● Iterative algorithms, used in machine learning and graph processing
● Interactive business intelligence and data mining, where data from different sources is
loaded in memory and queried repeatedly
● Streaming applications that keep updating the existing data and need to maintain the
current state based on the latest data.

What are Features of Apache Spark?


● Apache Spark has many features which make it a great choice as a big data processing engine.
● Many of these features establish the advantages of Apache Spark over other big data
processing engines:
● Fault tolerance
● Dynamic in nature
● Lazy evaluation
● Real-time stream processing
● Speed
● Reusability
● Advanced analytics
● In-memory computing
● Supporting multiple languages
● Integrated with Hadoop
● Cost efficient
1. Fault Tolerance: Apache Spark is designed to handle worker node failures. It achieves this
fault tolerance by using the DAG and RDDs (Resilient Distributed Datasets). The DAG contains the
lineage of all the transformations and actions needed to complete a task, so in the event of a
worker node failure, the same results can be achieved by rerunning the steps from the existing
DAG.
2. Dynamic Nature: Spark offers over 80 high-level operators that make it easy to build parallel
apps.
3. Lazy Evaluation: Spark does not evaluate any transformation immediately. All transformations
are lazily evaluated: they are added to the DAG, and the final computation or result becomes
available only when an action is called. This gives Spark the ability to make optimization
decisions, as all the transformations become visible to the Spark engine before any action is
performed (a sketch illustrating lazy evaluation and in-memory caching follows this list).
4. Real-Time Stream Processing: Spark Streaming brings Apache Spark's language-integrated API to
stream processing, letting you write streaming jobs the same way you write batch jobs.
5. Speed: Spark enables applications running on Hadoop to run up to 100x faster in memory and up
to 10x faster on disk. Spark achieves this by minimizing disk read/write operations for
intermediate results: it keeps them in memory and performs disk operations only when essential.
It does so using the DAG, a query optimizer, and a highly optimized physical execution engine.
6. Reusability: Spark code can be used for batch processing, joining streaming data against
historical data, as well as running ad-hoc queries on streaming state.
7. Advanced Analytics: Apache Spark has rapidly become the de facto standard for big data
processing and data science across multiple industries. Spark provides both machine learning and
graph processing libraries, which companies across sectors leverage to tackle complex problems,
all using the power of Spark and highly scalable clustered computers. Databricks provides an
advanced analytics platform built on Spark.
8. In-Memory Computing: Unlike Hadoop MapReduce, Apache Spark is capable of processing tasks in
memory and is not required to write intermediate results back to disk. This gives Spark a massive
speed advantage. Over and above this, Spark can also cache intermediate results so that they can
be reused in the next iteration. This gives Spark an added performance boost for iterative and
repetitive processes, where results from one step can be used later or a common dataset is used
across multiple tasks.
9. Supporting Multiple Languages: Spark comes with built-in multi-language support. Most of its
APIs are available in Java, Scala, Python and R, and advanced data analytics features are
available for the R language. Spark also ships with Spark SQL, which offers SQL-like
functionality, so SQL developers find it very easy to use and the learning curve is greatly
reduced.
10. Integrated with Hadoop: Apache Spark integrates very well with the Hadoop file system, HDFS.
It supports multiple file formats like Parquet, JSON, CSV, ORC, Avro, etc. Hadoop can easily be
leveraged with Spark as an input data source or destination.
11. Cost Efficient: Apache Spark is open-source software, so it has no licensing fee; users only
have to worry about the hardware cost. Spark also reduces many other costs, as stream processing,
ML and graph processing come built in. Spark has no vendor lock-in, which makes it very easy for
organizations to pick and choose Spark features as per their use case.
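To make lazy evaluation and in-memory computing concrete, here is a minimal spark-shell sketch
(the HDFS path is hypothetical): transformations such as filter and map only record lineage in the
DAG, nothing runs until an action such as count is called, and cache() keeps the intermediate RDD
in memory for reuse.

// Nothing is computed yet: filter and map only record lineage in the DAG
val lines = sc.textFile("hdfs:///data/events.log")
val errors = lines.filter(_.contains("ERROR")).map(_.toLowerCase)

errors.cache()                                   // keep this RDD in memory once it is computed

val total = errors.count()                       // first action: triggers execution of the whole DAG
val distinctErrors = errors.distinct().count()   // reuses the cached RDD instead of re-reading the file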

Apache Hadoop:

Apache Hadoop is a platform that got its start as a Yahoo project in 2006,
which became a top-level Apache open-source project afterward. This
framework handles large datasets in a distributed fashion. The Hadoop
ecosystem is highly fault-tolerant and does not depend upon hardware to achieve high
availability. The framework is designed to detect and handle failures at the application layer.
It's a general-purpose form of distributed processing that has several components:

● Hadoop Distributed File System (HDFS): This stores files in a Hadoop-native format and
parallelizes them across a cluster. It manages the storage of large sets of data across a
Hadoop cluster. Hadoop can handle both structured and unstructured data.
● YARN: Yet Another Resource Negotiator. It is a scheduler that coordinates application
runtimes.
● MapReduce: The algorithm that actually processes the data in parallel and combines the
pieces into the desired result.
● Hadoop Common: Also known as Hadoop Core, it provides support to all other components.
It holds the set of common libraries and utilities that all other modules depend on.

Hadoop is built in Java and is accessible through many programming languages for writing
MapReduce code, including Python via a Thrift client. It is available either open source through
the Apache distribution, or through vendors such as Cloudera (the largest Hadoop vendor by size
and scope), MapR, or Hortonworks.

Advantages and Disadvantages of Hadoop –

Advantages of Hadoop:
1. Cost effective.
2. Processing operations are done at a faster speed.
3. Best applied when a company has diverse data to be processed.
4. Creates multiple copies of data (replication).
5. Saves time and can derive value from any form of data.

Disadvantages of Hadoop:
1. Can't perform well in small-data environments.
2. Built entirely on Java.
3. Lack of preventive measures.
4. Potential stability issues.
5. Not fit for small data.


What is Spark?

Apache Spark is an open-source tool. It is a newer project, initially developed in 2009 at the
AMPLab at UC Berkeley. It is focused on processing data in parallel across a cluster, but the
biggest difference is that it works in memory: it is designed to use RAM for caching and
processing the data. Spark performs different types of big data workloads, such as:

● Batch processing.
● Real-time stream processing.
● Machine learning.
● Graph computation.
● Interactive queries.

There are five main components of Apache Spark:

● Apache Spark Core: Responsible for functions like scheduling, input and output
operations, task dispatching, etc.
● Spark SQL: Used to work with structured data and run SQL-style queries over it.
● Spark Streaming: Enables the processing of live data streams.
● Machine Learning Library (MLlib): Aims to make machine learning scalable and more
accessible.
● GraphX: A set of APIs used for facilitating graph analytics tasks.
Advantages and Disadvantages of Spark –

Advantages of Spark:
1. Perfect for interactive processing, iterative processing and event stream processing.
2. Flexible and powerful.
3. Supports sophisticated analytics.
4. Executes batch processing jobs faster than MapReduce.
5. Runs on Hadoop alongside other tools in the Hadoop ecosystem.

Disadvantages of Spark:
1. Consumes a lot of memory.
2. Issues with small files.
3. Fewer built-in algorithms.
4. Higher latency compared to Apache Flink.


******************************************************************************
Explain the Spark Architecture. OR Any one as a short note.
Spark follows a master-slave architecture. Its cluster consists of a single master and
multiple slaves.

The Spark architecture depends upon two abstractions:

○ Resilient Distributed Dataset (RDD)


○ Directed Acyclic Graph (DAG)

1. Resilient Distributed Datasets (RDD)

The Resilient Distributed Datasets are the group of data items that can be stored in-memory on
worker nodes. Here,

○ Resilient: Restore the data on failure.


○ Distributed: Data is distributed among different nodes.
○ Dataset: Group of data.

What is RDD?
The RDD (Resilient Distributed Dataset) is the Spark's core abstraction. It is a collection of elements,
partitioned across the nodes of the cluster so that we can execute various parallel operations on it.

There are two ways to create RDDs:

○ Parallelizing an existing collection in the driver program


○ Referencing a dataset in an external storage system, such as a shared filesystem, HDFS,
HBase, or any data source offering a Hadoop InputFormat.

Parallelized Collections
To create a parallelized collection, call SparkContext's parallelize method on an existing
collection in the driver program. Each element of the collection is copied to form a distributed
dataset that can be operated on in parallel.

val info = Array(1, 2, 3, 4)


val distinfo = sc.parallelize(info)

Now we can operate on the distributed dataset (distinfo) in parallel, for example:
distinfo.reduce((a, b) => a + b).

External Datasets
In Spark, the distributed datasets can be created from any type of storage
sources supported by Hadoop such as HDFS, Cassandra, HBase and even our
local file system. Spark provides the support for text files, SequenceFiles, and
other types of Hadoop InputFormat.

SparkContext's textFile method can be used to create an RDD from a text file. This method takes a
URI for the file (either a local path on the machine or an hdfs:// path) and reads the data of
the file.

Now we can operate on the data with dataset operations; for example, we can add up the sizes of
all the lines using the map and reduce operations as follows: data.map(s => s.length).reduce((a,
b) => a + b).
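A minimal end-to-end sketch of this example (the HDFS path below is hypothetical):

// Build an RDD from a text file and sum the lengths of all its lines
val data = sc.textFile("hdfs:///data/sample.txt")
val totalChars = data.map(s => s.length).reduce((a, b) => a + b)
println(totalChars)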

2. Directed Acyclic Graph (DAG)

Directed Acyclic Graph is a finite directed graph that represents a sequence of computations on
data. Each node is an RDD partition, and each edge is a transformation on top of the data. Here,
"graph" refers to the navigation of these computations, while "directed" and "acyclic" refer to
how it is carried out.

Let's understand the Spark architecture.

Driver Program
The Driver Program is a process that runs the main() function of the application
and creates the SparkContext object. The purpose of SparkContext is to
coordinate the spark applications, running as independent sets of processes on a
cluster.

To run on a cluster, the SparkContext connects to one of several types of cluster
managers and then performs the following tasks:

○ It acquires executors on nodes in the cluster.


○ Then, it sends your application code to the executors. Here, the
application code can be defined by JAR or Python files passed to the
SparkContext.
○ At last, the SparkContext sends tasks to the executors to run.
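A minimal sketch of such a driver program (the application name and master URL below are
illustrative, not prescribed by the text):

import org.apache.spark.{SparkConf, SparkContext}

// The driver creates the SparkContext, which connects to a cluster manager,
// acquires executors, and ships tasks to them.
val conf = new SparkConf()
  .setAppName("DriverProgramSketch")
  .setMaster("spark://master-host:7077")
val sc = new SparkContext(conf)

val rdd = sc.parallelize(1 to 100)
println(rdd.sum())   // tasks run on executors; the result returns to the driver

sc.stop()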

Cluster Manager
○ The role of the cluster manager is to allocate resources across applications. Spark is
capable of running on a large number of clusters.
○ Several types of cluster managers are supported, such as Hadoop YARN, Apache Mesos and
the Standalone Scheduler.
○ Here, the Standalone Scheduler is Spark's own cluster manager, which makes it possible
to install Spark on an empty set of machines.

Worker Node

○ The worker node is a slave node


○ Its role is to run the application code in the cluster.

Executor

○ An executor is a process launched for an application on a worker node.


○ It runs tasks and keeps data in memory or disk storage across them.
○ It reads and writes data to external sources.
○ Every application has its own executors.

Task

○ A unit of work that will be sent to one executor.

******************************************************************************

Unit II:
Using the Spark Shell as a Scala shell

The Spark shell is a great way to work with Apache Spark interactively and run Scala code.

Here’s how you can get started:

Starting the Spark Shell

1. Open your terminal.


2. Navigate to your Spark installation directory. This is where you have Spark installed.
3. Run the Spark shell by executing:

./bin/spark-shell

Basic Usage

Once the Spark shell is running, you’ll see a prompt where you can start typing Scala commands.

Creating a Spark Session

A Spark session is typically already created in the Spark shell, but you can access it using:
val spark = SparkSession.builder.appName("MyApp").getOrCreate()

Creating a DataFrame

You can create a DataFrame from a collection, a CSV file, or other data sources. Here’s an
example of creating a DataFrame from a collection:

val data = Seq(("Alice", 29), ("Bob", 31), ("Cathy", 25))

val df = spark.createDataFrame(data).toDF("Name", "Age")

df.show()

Running SQL Queries

You can run SQL queries on DataFrames:

df.createOrReplaceTempView("people")

val results = spark.sql("SELECT Name FROM people WHERE Age > 30")

results.show()

Performing Operations

You can perform various operations like filtering, grouping, and aggregating:
df.filter($"Age" > 30).show()

df.groupBy("Age").count().show()

Exiting the Spark Shell

To exit the Spark shell, simply type:

:quit

_____________________________________________________

Here’s how to conduct number analysis and log analysis using Apache Spark, with the steps
tailored to each task in a Spark environment.

Number Analysis in Spark

Number analysis often involves statistical operations on datasets. Here’s a step-by-step guide:

Step 1: Create a DataFrame with Numbers


val numbers = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

val numberDF = spark.createDataFrame(numbers.map(Tuple1(_))).toDF("number")


Step 2: Perform Basic Statistics

You can use Spark SQL functions to compute basic statistics like mean, median, and standard
deviation.

import org.apache.spark.sql.functions._

// Calculate basic statistics

numberDF.describe().show()

// Calculate mean and standard deviation

val stats = numberDF.agg(
  avg("number").as("mean"),
  stddev("number").as("stddev")
)

stats.show()
Step 3: Calculate Median

Calculating the median in Spark requires a bit more effort since it doesn't have a built-in median
function. Here’s one way to do it:

val sortedDF = numberDF.orderBy("number")

val count = sortedDF.count()

// Collect the sorted values to the driver (only suitable for small datasets)
val sorted = sortedDF.collect().map(_.getInt(0))

val median = if (count % 2 == 0) {
  (sorted(count.toInt / 2 - 1) + sorted(count.toInt / 2)) / 2.0
} else {
  sorted(count.toInt / 2).toDouble
}

println(s"Median: $median")
Log Analysis in Spark

Log analysis involves reading log files, filtering, and aggregating data based on various criteria.

Step 1: Read a Log File

Assuming you have a CSV log file with columns such as timestamp, level, and message.

val logDF = spark.read.option("header", "true").csv("path/to/logfile.csv")

logDF.printSchema()

Step 2: Filter Log Entries

You can filter log entries by severity level (e.g., ERROR, WARN).

val errorLogs = logDF.filter($"level" === "ERROR")


errorLogs.show()

Step 3: Count Log Levels

To count occurrences of each log level:

val logCounts = logDF.groupBy("level").count()

logCounts.show()

Step 4: Extract Specific Information

For example, if you want to extract timestamps and messages for error logs:

val errorMessages = errorLogs.select("timestamp", "message")

errorMessages.show()

Step 5: Time-based Analysis

You can also analyze logs based on time (e.g., count errors per day).
import org.apache.spark.sql.functions._

logDF.withColumn("date", to_date($"timestamp"))
  .filter($"level" === "ERROR")                 // keep only error entries before aggregating
  .groupBy("date")
  .agg(count($"level").alias("error_count"))
  .show()

______________________________________________________

Creating a "Hello World" example in Apache Spark is a great way to get started. Here’s how you
can do it in the Spark Shell using Scala.

Step-by-Step "Hello World" in Spark

Step 1: Start the Spark Shell

Open your terminal and navigate to your Spark installation directory, then start the Spark Shell:

./bin/spark-shell
Step 2: Write the "Hello World" Code

Once the Spark Shell is up and running, you can execute the following Scala code to print "Hello
World" using Spark:

// Create a simple RDD (Resilient Distributed Dataset)

val helloRDD = spark.sparkContext.parallelize(Seq("Hello, World!"))

// Collect and print the results

helloRDD.collect().foreach(println)

Explanation

1. Create an RDD:
○ The parallelize method creates an RDD from a sequence containing "Hello,
World!".

2. Collect and Print:

○ The collect() method gathers the RDD's data back to the driver node, and
foreach(println) prints each element.

Step 3: Run the Code

When you run the code in the Spark Shell, you should see the output:

Hello, World!

_________________________________________________________________

What is Spark Streaming?

1. Apache Spark Streaming is a scalable fault-tolerant streaming processing system that


natively supports both batch and streaming workloads.
2. Spark Streaming is an extension of the core Spark API that allows data engineers and
data scientists to process real-time data from various sources including (but not limited
to) Kafka, and Amazon Kinesis.
3. This processed data can be pushed out to file systems, databases, and live dashboards.
4. Its key abstraction is a Discretized Stream or, in short, a DStream, which represents a
stream of data divided into small batches.
5. DStreams are built on RDDs, Spark’s core data abstraction.
6. This allows Spark Streaming to seamlessly integrate with any other Spark components
like MLlib and Spark SQL.
7. Spark Streaming is different from other systems that either have a processing engine
designed only for streaming, or have similar batch and streaming APIs but compile
internally to different engines.
8. Spark’s single execution engine and unified programming model for batch and streaming
lead to some unique benefits over other traditional streaming systems.

Four Major Aspects of Spark Streaming

● Fast recovery from failures and stragglers


● Better load balancing and resource usage
● Combining of streaming data with static datasets and interactive queries
● Native integration with advanced processing libraries (SQL, machine learning,
graph processing)

In Spark Streaming, the Application Programming Interface (API) provides a way to interact
with and manipulate streams of data. The API allows you to perform various operations on
DStreams (Discretized Streams), enabling real-time data processing and analysis.

Key Components of Spark Streaming API


1. StreamingContext

○ The entry point for using Spark Streaming. It is used to create DStreams and
manage streaming operations.
○ Example:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(spark.sparkContext, Seconds(1))

2. DStream

○ Represents a continuous stream of data. DStreams can be created from various


sources (e.g., Kafka, socket, files).
○ Common operations on DStreams include transformations and output operations.

Common DStream Operations

Transformations

These operations create a new DStream from an existing one.

● map

○ Applies a function to each element in the DStream.


○ Example

val mappedStream = lines.map(line => line.toUpperCase)


flatMap

● Similar to map, but can return multiple output elements for each input element.
● Example:

val words = lines.flatMap(line => line.split(" "))

filter

● Returns a new DStream containing only the elements that satisfy a given condition.
● Example:

val errors = lines.filter(line => line.contains("ERROR"))

reduceByKey
● Combines values with the same key using a specified reduce function.
● Example

val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)

window

● Creates a new DStream by applying a transformation over a sliding window of data.


● Example

val windowedWordCounts = words.window(Seconds(30), Seconds(10))

.map(word => (word, 1))

.reduceByKey(_ + _)

Output Operations

These operations are used to output the results of the transformations.

● print

○ Prints the first n elements of the DStream to the console.


○ Example:
wordCounts.print()

saveAsTextFiles

● Saves the output of the DStream to text files.


● Example

wordCounts.saveAsTextFiles("path/to/output")

foreachRDD

● Allows you to apply any operation on the RDDs generated from DStreams.
● Example

wordCounts.foreachRDD(rdd => {

// Perform operations on RDD, like saving to a database

rdd.foreach(record => saveToDatabase(record))

})
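Putting these pieces together, here is a minimal, self-contained streaming word count sketch; it
assumes a text source (for example, nc -lk 9999) feeding localhost on port 9999, and the batch
interval and port are illustrative:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Create a streaming context with a 5-second batch interval
val ssc = new StreamingContext(spark.sparkContext, Seconds(5))

// Build a DStream from a socket source and count words in each batch
val lines = ssc.socketTextStream("localhost", 9999)
val wordCounts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

wordCounts.print()

ssc.start()             // start receiving and processing data
ssc.awaitTermination()  // block until the streaming job is stopped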
*****************************************************************

Unit III:

What is Apache Spark SQL?

1. Spark SQL brings native support for SQL to Spark and streamlines the process of
querying data stored both in RDDs (Spark’s distributed datasets) and in external sources.
2. Spark SQL conveniently blurs the lines between RDDs and relational tables.
3. Unifying these powerful abstractions makes it easy for developers to intermix SQL
commands querying external data with complex analytics, all within a single application.
Concretely, Spark SQL will allow developers to:
● Import relational data from Parquet files and Hive tables
● Run SQL queries over imported data and existing RDDs
● Easily write RDDs out to Hive tables or Parquet files
4. Spark SQL also includes a cost-based optimizer, columnar storage, and code
generation to make queries fast.
5. At the same time, it scales to thousands of nodes and multi-hour queries using
the Spark engine, which provides full mid-query fault tolerance, without having to
worry about using a different engine for historical data.
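As a minimal sketch of the workflow described above (the file paths, table name, and column names
are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("SparkSQLSketch").getOrCreate()

// Import relational data from a Parquet file
val events = spark.read.parquet("hdfs:///data/events.parquet")
events.createOrReplaceTempView("events")

// Run a SQL query over the imported data
val daily = spark.sql("SELECT event_date, COUNT(*) AS n FROM events GROUP BY event_date")

// Write the result back out as Parquet
daily.write.mode("overwrite").parquet("hdfs:///data/daily_counts")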

____________________________________________________________________________

Spark SQL is a component of Apache Spark that enables users to run SQL queries on large
datasets. It provides a programming interface for working with structured and semi-structured
data, integrating SQL with the flexibility of Spark’s data processing capabilities.

Key Features of Spark SQL

1. Unified Data Processing:

○ Combines batch processing, streaming, and SQL query capabilities within the
same framework, allowing for seamless data operations.

2. DataFrame and Dataset APIs:

○ DataFrame: A distributed collection of data organized into named columns,


similar to a table in a relational database.
○ Dataset: A strongly typed version of a DataFrame, allowing for compile-time
type safety while still benefiting from Spark's optimization features. A short sketch
contrasting the two APIs appears after this list.

3. Support for Various Data Sources:

○ Can read data from a variety of formats and storage systems, including:

■ JSON
■ Parquet
■ ORC
■ JDBC (connecting to relational databases)
■ Hive tables
■ Text files

4. SQL Queries:

○ Users can run SQL queries using the SQL interface, making it easy to leverage
existing SQL knowledge.

5. Optimized Query Execution:

○ The Catalyst optimizer and Tungsten execution engine optimize query execution
for performance.

6. Integration with BI Tools:

○ Spark SQL integrates well with business intelligence (BI) tools such as Tableau
and Qlik, making it easier to visualize and analyze data.
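Here is a brief sketch contrasting the DataFrame and Dataset APIs mentioned above (the Person
case class and sample values are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("DatasetSketch").getOrCreate()
import spark.implicits._

// Hypothetical record type used only for illustration
case class Person(name: String, age: Long)

// DataFrame: untyped rows organized into named columns
val df = Seq(("Alice", 29L), ("Bob", 31L)).toDF("name", "age")

// Dataset: the same data with compile-time types
val ds = Seq(Person("Alice", 29L), Person("Bob", 31L)).toDS()

df.filter($"age" > 30).show()   // column name checked only at runtime
ds.filter(_.age > 30).show()    // field access checked at compile time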

Getting Started with Spark SQL

Here’s a step-by-step guide to using Spark SQL.

Step 1: Start Spark Shell with SQL Support

You can start the Spark shell with SQL capabilities:

./bin/spark-shell

Step 2: Create a Spark Session

The Spark session is the entry point for using Spark SQL.

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder

.appName("Spark SQL Example")

.getOrCreate()

Step 3: Load Data into a DataFrame

You can load data from various sources. Here’s an example of loading a JSON file:

val df = spark.read.json("path/to/your/data.json")

Step 4: Show DataFrame Contents

You can view the contents of the DataFrame:

df.show()

Step 5: Run SQL Queries


To run SQL queries, first register the DataFrame as a temporary view:

df.createOrReplaceTempView("people")

Now you can run SQL queries on this view:

val results = spark.sql("SELECT name, age FROM people WHERE age > 30")

results.show()

Step 6: Use DataFrame Operations

You can also use the DataFrame API to perform similar operations:

val filteredDF = df.filter($"age" > 30).select("name", "age")

filteredDF.show()
