BIGDATA
1. Data Storage
2. Data Mining
3. Data Analytics
4. Data Visualization
1. Data Storage
Big Data Technologies that come under Data Storage:
○ Hadoop: When it comes to handling big data, Hadoop is one of the leading technologies. It is based on the MapReduce architecture and is mainly used to process data in batches. The Hadoop framework was introduced to store and process data in a distributed processing environment, running in parallel on commodity hardware with a simple programming model.
Apart from this, Hadoop is well suited to storing and analyzing data from many machines at high speed and low cost, which is why it is regarded as one of the core big data technologies. Hadoop originated as a Yahoo project in 2006 and is maintained by the Apache Software Foundation, which released Hadoop 1.0 in December 2011. Hadoop is written in the Java programming language.
2. Data Mining
○ Presto: Presto is an open-source, distributed SQL query engine developed to run interactive analytical queries against data sources of all sizes, from gigabytes to petabytes. Presto can query data in Cassandra, Hive, relational databases, and proprietary data storage systems. It is a Java-based query engine that was developed at Facebook and open-sourced in 2013. Companies like Repro, Netflix, and Facebook use this big data technology and make good use of it.
3. Data Analytics
○ Apache Kafka: Apache Kafka is a popular streaming platform known for three core capabilities: publishing and subscribing to streams of records, storing those streams durably, and processing them as they occur. It is referred to as a distributed streaming platform and can also be described as an asynchronous messaging broker that ingests and processes real-time streaming data, much like an enterprise messaging system or message queue.
Besides this, Kafka provides a configurable retention period, and data is transmitted through a producer-consumer mechanism. Kafka has received many enhancements to date and now includes additional components such as the schema registry, KTables, and KSQL. It is written in Java and Scala, was originally developed at LinkedIn, and was open-sourced in 2011 before becoming an Apache project. Some top companies using the Apache Kafka platform include Twitter, Spotify, Netflix, Yahoo, and LinkedIn.
○ Spark: Apache Spark is one of the core technologies in the list of big data technologies and is widely used by top companies. Spark is known for its in-memory computing capabilities, which greatly improve overall processing speed. It also provides a generalized execution model to support a broad range of applications, along with high-level APIs in Java, Scala, Python, and R to ease development.
Spark also lets users process and handle real-time streaming data using micro-batching and windowing operations. Datasets and DataFrames are built on top of RDDs, and libraries such as Spark MLlib (machine learning), GraphX (graph processing), and SparkR sit on top of Spark Core to support machine learning and data science workloads. Spark is written in Scala, with APIs in Java, Python, and R. It was originally developed at UC Berkeley's AMPLab in 2009 and later donated to the Apache Software Foundation. Companies like Amazon, Oracle, Cisco, Verizon Wireless, and Hortonworks use this big data technology and make good use of it.
○ R Language: R is a programming language and free software environment for statistical computing and graphics, used by leading data miners, practitioners, and statisticians. The language is primarily used for developing statistical software and for data analytics.
R 1.0.0 was released in February 2000, and the project is supported by the R Foundation. R is implemented mainly in C and Fortran (and in R itself). Companies like Barclays, American Express, and Bank of America use R for their data analytics needs.
4. Data Visualization
○ Tableau: Tableau is one of the fastest and most powerful data visualization tools used in business intelligence. It helps analyze data at very high speed and presents visualizations and insights in the form of dashboards and worksheets.
Tableau is developed and maintained by Tableau Software (now part of Salesforce), which was founded in 2003 and went public in May 2013. It is written in multiple languages, including Python, C, C++, and Java. It sits alongside other enterprise BI platforms such as IBM Cognos and Oracle Hyperion.
○ Plotly: As the name suggests, Plotly is best suited for plotting and creating graphs and related components quickly and efficiently. It offers rich libraries and APIs for MATLAB, Python, Julia, R, Node.js, Arduino, and a REST API, and it supports building and styling interactive graphs from tools such as Jupyter Notebook and PyCharm.
Plotly was founded in 2012, and its core graphing library (plotly.js) is written in JavaScript. Paladins and Bitbank are among the companies making good use of Plotly.
******************************************************************************
What is Apache Hadoop?
● Apache Hadoop software is an open source framework that allows for the distributed
storage and processing of large datasets across clusters of computers using simple
programming models.
● Hadoop is designed to scale up from a single computer to thousands of clustered
computers, with each machine offering local computation and storage.
● In this way, Hadoop can efficiently store and process large datasets ranging in size from
gigabytes to petabytes of data.
The primary Hadoop framework modules, which work collectively to form the Hadoop ecosystem, are:
1. Hadoop Distributed File System (HDFS): As the primary component of the Hadoop
ecosystem, HDFS is a distributed file system in which individual Hadoop nodes operate
on data that resides in their local storage. This removes network latency, providing high-
throughput access to application data. In addition, administrators don’t need to define
schemas up front.
2. Yet Another Resource Negotiator (YARN): YARN is a resource-management platform
responsible for managing compute resources in clusters and using them to schedule users’
applications. It performs scheduling and resource allocation across the Hadoop system.
3. MapReduce: MapReduce is a programming model for large-scale data processing. In the MapReduce model, subsets of a larger dataset and the instructions for processing them are dispatched to multiple nodes, where each subset is processed by a node in parallel with other processing jobs. After processing, the results from the individual subsets are combined into a smaller, more manageable dataset (a toy sketch of this flow appears just after this list).
4. Hadoop Common: Hadoop Common includes the libraries and utilities used and shared
by other Hadoop modules.
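As a complement to item 3, here is a toy Scala sketch of the same map/shuffle/reduce pattern on a local collection. This is an illustration added to these notes (an assumption, not Hadoop's actual Java MapReduce API) and can be run in the Scala or Spark shell:
val lines = Seq("big data tools", "big data storage")
val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))             // map: emit (key, value) pairs
val shuffled = mapped.groupBy(_._1)                                         // shuffle: group pairs by key
val reduced = shuffled.map { case (w, pairs) => (w, pairs.map(_._2).sum) }  // reduce: combine values per key
reduced.foreach(println)                                                    // e.g. (big,2), (data,2), (tools,1), (storage,1)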
Beyond HDFS, YARN, and MapReduce, the entire Hadoop open source ecosystem continues to
grow and includes many tools and applications to help collect, store, process, analyze, and
manage big data. These include Apache Pig, Apache Hive, Apache HBase, and Apache Spark.
How does Hadoop work?
1. Hadoop allows for the distribution of datasets across a cluster of commodity hardware.
Processing is performed in parallel on multiple servers simultaneously.
2. Software clients input data into Hadoop. HDFS handles metadata and the distributed file
system. MapReduce then processes and converts the data. Finally, YARN divides the
jobs across the computing cluster.
3. All Hadoop modules are designed with a fundamental assumption that hardware failures
of individual machines or racks of machines are common and should be automatically
handled in software by the framework.
Benefits of Hadoop:
Scalability
Hadoop is important as one of the primary tools for storing and processing huge amounts of data quickly. It does this by using a distributed computing model that enables fast processing of data and can be rapidly scaled by adding computing nodes.
Low cost
As an open source framework that can run on commodity hardware and has a large ecosystem of
tools, Hadoop is a low-cost option for the storage and management of big data.
Flexibility
Hadoop allows for flexibility in data storage as data does not require preprocessing before
storing it which means that an organization can store as much data as they like and then utilize it
later.
Resilience
As a distributed computing model, Hadoop allows for fault tolerance and system resilience, meaning that if one of the hardware nodes fails, jobs are redirected to other nodes. Data stored on one node of a Hadoop cluster is also replicated across other nodes in the cluster to guard against hardware or software failure.
Challenges of Hadoop:
1. MapReduce complexity and limitations
As a file-intensive system, MapReduce can be a difficult tool to use for complex jobs, such as interactive analytical tasks. MapReduce functions also need to be written in Java and can involve a steep learning curve. The MapReduce ecosystem is quite large, with many components for different functions, which can make it difficult to determine which tools to use.
2. Security
Data sensitivity and protection can be issues as Hadoop handles such large datasets. An
ecosystem of tools for authentication, encryption, auditing, and provisioning has emerged to help
developers secure data in Hadoop.
3. Data governance and management
Hadoop does not have many robust tools for data management and governance, nor for data quality and standardization.
4. Talent gap
Like many areas of programming, Hadoop has an acknowledged talent gap. Finding developers
with the combined requisite skills in Java to program MapReduce, operating systems, and
hardware can be difficult. In addition, MapReduce has a steep learning curve, making it hard to
get new programmers up to speed on its best practices and ecosystem.
Hadoop has a large ecosystem of open source tools that can augment and extend the capabilities
of the core module. Some of the main software tools used with Hadoop include:
Apache Hive: A data warehouse that allows programmers to work with data in HDFS using a
query language called HiveQL, which is similar to SQL
Apache HBase: An open source non-relational distributed database often paired with Hadoop
Apache Pig: A tool used as an abstraction layer over MapReduce to analyze large sets of data
and enables functions like filter, sort, load, and join
Apache Impala: Open source, massively parallel processing SQL query engine often used with
Hadoop
Apache Sqoop: A command-line interface application for efficiently transferring bulk data
between relational databases and Hadoop
Apache ZooKeeper: An open source server that enables reliable distributed coordination in Hadoop; a service for "maintaining configuration information, naming, providing distributed synchronization, and providing group services"
Apache Oozie: A workflow scheduler for Hadoop jobs
A wide variety of companies and organizations use Hadoop for research, production data
processing, and analytics that require processing terabytes or petabytes of big data, storing
diverse datasets, and data parallel processing.
3. Data lakes
Since Hadoop can help store data without preprocessing, it can be used to complement data lakes, where large amounts of unrefined data are stored.
4. Marketing analytics
Marketing departments often use Hadoop to store and analyze customer relationship
management (CRM) data.
5. Risk management
Banks, insurance companies, and other financial services companies use Hadoop to build risk
analysis and management models.
6. AI and machine learning
Hadoop ecosystems help with the processing of data and model training operations for machine
learning applications.
************************************************************************
2. The fundamental motive and goal behind developing the framework was to overcome the
inefficiencies of MapReduce.
3. Even though MapReduce was a huge success and gained wide acceptance, it could not be
used efficiently for every class of workload.
4. MapReduce is not efficient for multi-pass applications that require low-latency data
sharing across multiple parallel operations. Many data analytics applications fall into this
category, and this gap motivated the development of the Spark engine.
● Many of the following features establish the advantages of Apache Spark over other Big Data
processing engines:
● Fault tolerance
● Dynamic In Nature
● Lazy Evaluation
● Speed
● Reusability
● Advanced Analytics
● In Memory Computing
● Cost efficient
1. Fault Tolerance: Apache Spark is designed to handle worker node failures. It achieves
this fault tolerance by using the DAG and RDDs (Resilient Distributed Datasets). The DAG
contains the lineage of all the transformations and actions needed to complete a task, so
in the event of a worker node failure the same results can be achieved by recomputing
the lost partitions from that lineage.
3. Lazy Evaluation: Spark does not evaluate any transformation immediately. All
transformations are lazily evaluated: they are added to the DAG, and the final
computation or result becomes available only when an action is called. This gives Spark
the ability to make optimization decisions, as all the transformations become visible to
the engine before execution.
4. Real-Time Stream Processing: Spark Streaming brings Apache Spark's language-
integrated API to stream processing, letting you write streaming jobs the same way you
write batch jobs.
5. Speed: Spark enables applications running on Hadoop to run up to 100x faster in memory
and up to 10x faster on disk. Spark achieves this by minimizing disk read/write
operations for intermediate results: it stores them in memory and performs disk operations
only when essential. Spark accomplishes this using the DAG, a query optimizer, and a
highly optimized physical execution engine.
6. Reusability: Spark code can be reused for batch processing, joining streaming data against
historical data, and running ad-hoc queries on streaming state.
7. Advanced Analytics: Spark supports machine learning and data science workloads across
multiple industries. It provides both machine learning and graph processing libraries,
which companies across sectors leverage to tackle complex problems, and all of this is
easily done using the power of Spark and its highly optimized libraries.
8. In-Memory Computing: Spark performs processing tasks in memory and is not required
to write intermediate results back to disk. This gives Spark processing massive speed.
Over and above this, Spark can also cache intermediate results so that they can be reused
in the next iteration. This gives Spark an added performance boost for iterative and
repetitive processes, where results from one step are used later or a common dataset is
used across multiple operations.
9. Support for Multiple Languages: Spark comes with built-in multi-language support, with
APIs available in Java, Scala, Python, and R, plus advanced features in R for data
analytics. Spark also ships with Spark SQL, which provides SQL-like functionality;
SQL developers therefore find it very easy to use, and the learning curve is small.
10. Integrated with Hadoop: Apache Spark integrates very well with Hadoop file system
HDFS. It offers support to multiple file formats like parquet, json, csv, ORC, Avro etc.
Hadoop can be easily leveraged with Spark as an input data source or destination.
11. Cost efficient: Apache Spark is open-source software, so it does not have any licensing
fee associated with it; users only have to worry about the hardware cost. Apache Spark
also reduces many other costs, as it comes with built-in stream processing, ML, and
graph processing. Spark has no vendor lock-in, which makes it very easy for organizations
to pick and choose Spark features as per their use case.
Apache Hadoop vs Apache Spark:
Apache Hadoop is a platform that got its start as a Yahoo project in 2006 and
became a top-level Apache open-source project afterward. This framework
handles large datasets in a distributed fashion. The Hadoop ecosystem is highly
fault-tolerant and does not depend on the hardware to achieve high availability;
instead, the framework is designed to detect and handle failures at the
application layer. It is a general-purpose form of distributed processing with
several components, namely HDFS, YARN, and MapReduce (described earlier).
Advantage of Hadoop:
1. Cost effective.
5. Saves time and can derive value from any form of data.
Disadvantage of Hadoop:
1. Does not perform well in small-data environments.
Apache Spark, by contrast, is a general-purpose, in-memory data processing engine that
supports workloads such as:
● Batch processing.
● Machine learning.
● Graph computation.
● Interactive queries.
● Processing of real-time data streams.
Among its components, GraphX provides a set of APIs that are used for facilitating graph
analytics tasks.
Advantages and Disadvantages of Spark-
Advantage of Spark:
1. Perfect for interactive processing, iterative processing, and event
stream processing
Disadvantage of Spark:
1. Consumes a lot of memory
Resilient Distributed Datasets (RDDs) are groups of data items that can be stored in memory on
worker nodes.
What is RDD?
The RDD (Resilient Distributed Dataset) is Spark's core abstraction. It is a collection of elements,
partitioned across the nodes of the cluster so that various parallel operations can be executed on it.
Parallelized Collections
To create a parallelized collection, call SparkContext's parallelize method on an existing collection in
the driver program. Each element of the collection is copied to form a distributed dataset that can be
operated on in parallel.
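The collection itself isn't shown in these notes; a minimal sketch, assuming the distinfo name referenced just below and the sc SparkContext available in the Spark shell:
val info = Array(1, 2, 3, 4)
val distinfo = sc.parallelize(info)   // distributed dataset built from the local collection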
Now we can operate on the distributed dataset (distinfo) in parallel, for example: distinfo.reduce((a, b) => a + b).
External Datasets
In Spark, distributed datasets can be created from any type of storage source
supported by Hadoop, such as HDFS, Cassandra, HBase, or even our local file
system. Spark provides support for text files, SequenceFiles, and other types of
Hadoop InputFormat.
SparkContext's textFile method can be used to create an RDD from a text file. This
method takes a URI for the file (either a local path on the machine or an hdfs:// URI)
and reads the data of the file.
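A minimal sketch, assuming a hypothetical file path and the data name used in the next sentence:
val data = sc.textFile("data.txt")   // the path could also be an hdfs:// URI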
Once loaded, the dataset can be operated on with dataset operations; for example, we
can add up the lengths of all the lines using the map and reduce operations as follows:
data.map(s => s.length).reduce((a, b) => a + b).
Driver Program
The Driver Program is a process that runs the main() function of the application
and creates the SparkContext object. The purpose of SparkContext is to
coordinate the spark applications, running as independent sets of processes on a
cluster.
Cluster Manager
○ The role of the cluster manager is to allocate resources across
applications. Spark is capable of running on a large number of clusters.
○ There are various types of cluster managers, such as Hadoop YARN,
Apache Mesos and the Standalone Scheduler.
○ Here, the Standalone Scheduler is Spark's built-in cluster manager,
which makes it possible to install Spark on an empty set of machines.
Worker Node
○ A worker node is any node in the cluster that can run application code and host executors.
Executor
○ An executor is a process launched for an application on a worker node; it runs tasks and keeps data in memory or on disk across them. Each application has its own executors.
Task
○ A task is a unit of work that is sent to one executor.
******************************************************************************
Unit II:
Using the Spark Shell as a Scala shell
The Spark shell is a great way to work interactively with Apache Spark and run Scala code. Start it from your Spark installation directory:
./bin/spark-shell
Basic Usage
Once the Spark shell is running, you’ll see a prompt where you can start typing Scala commands.
A Spark session is already created in the Spark shell and is available as the variable spark, but you can also obtain one explicitly using:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.appName("MyApp").getOrCreate()
Creating a DataFrame
You can create a DataFrame from a collection, a CSV file, or other data sources. Here’s an
example of creating a DataFrame from a collection:
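The creation code itself is not included in these notes; a minimal sketch, assuming a small in-memory collection with Name and Age columns chosen to match the queries below (spark.implicits._ is imported automatically in the Spark shell):
val df = Seq(("Alice", 34), ("Bob", 28), ("Cathy", 45)).toDF("Name", "Age")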
df.show()
Register the DataFrame as a temporary view so it can be queried with SQL:
df.createOrReplaceTempView("people")
val results = spark.sql("SELECT Name FROM people WHERE Age > 30")
results.show()
Performing Operations
You can perform various operations like filtering, grouping, and aggregating:
df.filter($"Age" > 30).show()
df.groupBy("Age").count().show()
To exit the shell, type:
:quit
_____________________________________________________
Here's how to conduct number analysis and log analysis using Apache Spark.
Number Analysis in Spark
Number analysis often involves statistical operations on datasets. Here's a step-by-step guide:
You can use Spark SQL functions to compute basic statistics like mean, median, and standard
deviation.
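The notes do not show how numberDF is created; a minimal hypothetical example built from a local sequence, with the column name "number" chosen to match the aggregations below:
val numberDF = Seq(10, 20, 30, 40, 50).toDF("number")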
import org.apache.spark.sql.functions._

numberDF.describe().show()

val stats = numberDF.agg(
  avg("number").as("mean"),
  stddev("number").as("stddev")
)
stats.show()
Step 3: Calculate Median
Calculating the median in Spark requires a bit more effort since it doesn't have a built-in median
function. Here’s one way to do it:
val sortedDF = numberDF.orderBy("number")
val count = sortedDF.count()
val values = sortedDF.collect().map(_.getInt(0))   // assumes the dataset is small enough to collect to the driver
val median = if (count % 2 == 1) values(count.toInt / 2)
             else (values(count.toInt / 2 - 1) + values(count.toInt / 2)) / 2.0
println(s"Median: $median")
Log Analysis in Spark
Log analysis involves reading log files, filtering, and aggregating data based on various criteria.
Assuming you have a CSV log file with columns such as timestamp, level, and message.
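The loading step is not shown in these notes; a minimal sketch, assuming a hypothetical CSV path and the logDF name used below:
val logDF = spark.read
  .option("header", "true")
  .csv("path/to/logs.csv")   // columns: timestamp, level, message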
logDF.printSchema()
You can filter log entries by severity level (e.g., ERROR, WARN).
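The filtering and counting code is not included in these notes; a minimal sketch, assuming the logDF defined above and that severity is stored in the level column (logCounts is the DataFrame shown below):
val errorLogs = logDF.filter($"level" === "ERROR")   // keep only ERROR entries
val logCounts = logDF.groupBy("level").count()       // number of entries per severity level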
logCounts.show()
For example, if you want to extract timestamps and messages for error logs:
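A sketch of the errorMessages DataFrame referenced below, under the same column-name assumptions:
val errorMessages = logDF
  .filter($"level" === "ERROR")
  .select("timestamp", "message")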
errorMessages.show()
You can also analyze logs based on time (e.g., count errors per day).
logDF.filter($"level" === "ERROR")            // count only ERROR entries, to match the error_count alias
  .withColumn("date", to_date($"timestamp"))
  .groupBy("date")
  .agg(count($"level").alias("error_count"))
  .show()
______________________________________________________
Creating a "Hello World" example in Apache Spark is a great way to get started. Here’s how you
can do it in the Spark Shell using Scala.
Open your terminal and navigate to your Spark installation directory, then start the Spark Shell:
./bin/spark-shell
Step 2: Write the "Hello World" Code
Once the Spark Shell is up and running, you can execute the following Scala code to print "Hello
World" using Spark:
helloRDD.collect().foreach(println)
Explanation
1. Create an RDD:
○ The parallelize method creates an RDD from a sequence containing "Hello,
World!".
2. Collect and print:
○ The collect() method gathers the RDD's data back to the driver node, and
foreach(println) prints each element.
When you run the code in the Spark Shell, you should see the output:
Hello, World!
_________________________________________________________________
In Spark Streaming, the Application Programming Interface (API) provides a way to interact
with and manipulate streams of data. The API allows you to perform various operations on
DStreams (Discretized Streams), enabling real-time data processing and analysis.
1. StreamingContext
○ The entry point for using Spark Streaming. It is used to create DStreams and
manage streaming operations.
○ Example:
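A minimal sketch, assuming the sc SparkContext available in the Spark shell and a hypothetical socket text source on localhost:9999 (the ssc and lines values are reused in the examples below):
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(10))        // 10-second batch interval
val lines = ssc.socketTextStream("localhost", 9999)    // DStream of text lines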
2. DStream
○ A DStream (Discretized Stream) is the basic abstraction in Spark Streaming: a continuous sequence of RDDs representing a stream of data.
Transformations
● map
● Applies a function to each element of the DStream and returns a new DStream of the results.
● flatMap
● Similar to map, but can return multiple output elements for each input element.
● Example:
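A sketch assuming the lines DStream created above (words is reused further below):
val upper = lines.map(_.toUpperCase)       // map: one output element per input line
val words = lines.flatMap(_.split(" "))    // flatMap: zero or more output elements per input line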
filter
● Returns a new DStream containing only the elements that satisfy a given condition.
● Example:
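A sketch, again assuming the lines DStream from above:
val errorLines = lines.filter(_.contains("ERROR"))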
reduceByKey
● Combines values with the same key using a specified reduce function.
● Example
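A sketch assuming the words DStream from the map/flatMap example; the resulting wordCounts DStream is the one used in the output operations below:
val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)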
window
● Returns a new DStream computed over a sliding window of batches of the source DStream.
● Example (a 30-second window sliding every 10 seconds, assuming the words DStream from above):
words.map(word => (word, 1)).window(Seconds(30), Seconds(10))
  .reduceByKey(_ + _)
Output Operations
saveAsTextFiles
● Saves each batch of the DStream's contents as text files.
● Example
wordCounts.saveAsTextFiles("path/to/output")
foreachRDD
● Allows you to apply any operation on the RDDs generated from DStreams.
● Example
wordCounts.foreachRDD(rdd => {
  // e.g., push each micro-batch to an external system; here it is simply printed (an assumption of these notes)
  rdd.foreach(println)
})
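To actually run the streaming computation, the context must be started after the output operations are registered; a minimal sketch using the ssc defined earlier:
wordCounts.print()       // print a few elements of each batch
ssc.start()              // start receiving and processing data
ssc.awaitTermination()   // wait for the streaming computation to finish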
*****************************************************************
Unit III:
1. Spark SQL brings native support for SQL to Spark and streamlines the process of
querying data stored both in RDDs (Spark’s distributed datasets) and in external sources.
2. Spark SQL conveniently blurs the lines between RDDs and relational tables.
3. Unifying these powerful abstractions makes it easy for developers to intermix SQL
commands querying external data with complex analytics, all within a single application.
Concretely, Spark SQL allows developers to do the following (a short sketch follows this list):
● Import relational data from Parquet files and Hive tables
● Run SQL queries over imported data and existing RDDs
● Easily write RDDs out to Hive tables or Parquet files
4. Spark SQL also includes a cost-based optimizer, columnar storage, and code
generation to make queries fast.
5. At the same time, it scales to thousands of nodes and multi-hour queries using
the Spark engine, which provides full mid-query fault tolerance, without having to
worry about using a different engine for historical data.
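A short sketch of those capabilities using the DataFrame API, assuming a hypothetical Parquet file with name and age columns and the spark session available in the Spark shell:
val people = spark.read.parquet("path/to/people.parquet")         // import relational data from Parquet
people.createOrReplaceTempView("people")
val adults = spark.sql("SELECT name FROM people WHERE age > 21")  // run SQL over the imported data
adults.write.parquet("path/to/adults.parquet")                    // write results back out as Parquet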
____________________________________________________________________________
Spark SQL is a component of Apache Spark that enables users to run SQL queries on large
datasets. It provides a programming interface for working with structured and semi-structured
data, integrating SQL with the flexibility of Spark’s data processing capabilities.
1. Unified Data Processing:
○ Combines batch processing, streaming, and SQL query capabilities within the
same framework, allowing for seamless data operations.
2. Multiple Data Sources:
○ Can read data from a variety of formats and storage systems, including:
■ JSON
■ Parquet
■ ORC
■ JDBC (connecting to relational databases)
■ Hive tables
■ Text files
4. SQL Queries:
○ Users can run SQL queries using the SQL interface, making it easy to leverage
existing SQL knowledge.
○ The Catalyst optimizer and Tungsten execution engine optimize query execution
for performance.
5. BI Tool Integration:
○ Spark SQL integrates well with business intelligence (BI) tools such as Tableau
and Qlik, making it easier to visualize and analyze data.
Start the Spark shell:
./bin/spark-shell
The Spark session is the entry point for using Spark SQL.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder
.getOrCreate()
You can load data from various sources. Here’s an example of loading a JSON file:
val df = spark.read.json("path/to/your/data.json")
df.show()
Register the DataFrame as a temporary view and query it with SQL:
df.createOrReplaceTempView("people")
val results = spark.sql("SELECT name, age FROM people WHERE age > 30")
results.show()
You can also use the DataFrame API to perform similar operations:
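The filteredDF used below is not defined earlier in these notes; a minimal sketch that mirrors the SQL query above (column names are the same assumptions):
val filteredDF = df.filter($"age" > 30).select("name", "age")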
filteredDF.show()