THE ANJUMAN-I-ISLAM’S

M. H. SABOO SIDDIK COLLEGE OF ENGINEERING

Department of Computer Science & Engineering (AI & ML)


CSC702
BIG DATA ANALYTICS

Subject I/c: Prof Arshi Khan


Syllabus
Module 1: Introduction to Big Data & Hadoop
⚫ 1.1 Introduction to Big Data
⚫ 1.2 Big Data characteristics, types of Big Data
⚫ 1.3 Traditional vs. Big Data business approach
⚫ 1.4 Case Study of Big Data Solutions
⚫ 1.5 Concept of Hadoop
⚫ 1.6 Core Hadoop Components; Hadoop Ecosystem
Introduction to Big Data

Big data is high-volume, high-velocity, and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.

Introduction to Big Data
Big Data is a collection of data that is huge in volume and growing exponentially with time. Its size and complexity are so great that no traditional data management tool can store or process it efficiently. In short, big data is simply data, but of huge size.

⚫ What is big data?

⚫ Big data is a combination of structured, semi-structured, and unstructured data collected by organizations that can be mined for information and used in machine learning projects, predictive modeling, and other advanced analytics applications.
Big data is often characterized by the three V's:

1. the large volume of data in many environments;
2. the wide variety of data types frequently stored in big data systems; and
3. the velocity at which much of the data is generated, collected, and processed.
Big Data Characteristics

⚫ Volume:
⚫ The amount of data matters. With big data, you'll have to process high volumes of low-density, unstructured data. This can be data of unknown value, such as Twitter data feeds, clickstreams on a web page or a mobile app, or readings from sensor-enabled equipment. For some organizations, this might be tens of terabytes of data; for others, it may be hundreds of petabytes.
⚫ Velocity:
⚫ Velocity is the fast rate at which data is received and (perhaps) acted on. Normally, the highest-velocity data streams directly into memory rather than being written to disk. Some internet-enabled smart products operate in real time or near real time and require real-time evaluation and action.
⚫ Variety:
⚫ Variety refers to the many types of data that are available. Traditional data types were structured and fit neatly in a relational database. With the rise of big data, data arrives in new unstructured forms. Unstructured and semi-structured data types, such as text, audio, and video, require additional preprocessing to derive meaning and support metadata.
The history of big data

⚫ The origins of large data sets go back to the 1960s and '70s, when the world of data was just getting started with the first data centers and the development of the relational database.

⚫ Around 2005, people began to realize just how much data users generated through Facebook, YouTube, and other online services.

⚫ Hadoop (an open-source framework created specifically to store and analyze big data sets) was developed that same year. NoSQL also began to gain popularity during this time.

⚫ The development of open-source frameworks such as Hadoop (and more recently, Spark) was essential for the growth of big data because they make big data easier to work with and cheaper to store.

⚫ With the advent of the Internet of Things (IoT), more objects and devices are connected to the internet, gathering data on customer usage patterns and product performance.

⚫ Cloud computing has expanded big data possibilities even further. The cloud offers elastic scalability, where developers can simply spin up ad hoc clusters to test a subset of data.
Big data benefits:

⚫ Big data makes it possible for you to gain more complete answers because you have more information.

⚫ More complete answers mean more confidence in the data, which means a completely different approach to tackling problems.
Big data use cases

⚫ Product development
⚫ Predictive maintenance
⚫ Customer experience
⚫ Fraud and compliance
⚫ Machine learning
⚫ Operational efficiency
⚫ Drive innovation
⚫ Product development:
⚫ Companies like Netflix and Procter & Gamble
use big data to anticipate customer demand.
⚫ They build predictive models for new products and
services by classifying key attributes of past and
current products or services and modeling the
relationship between those attributes and the
commercial success of the offerings.
⚫ In addition, P&G uses data and analytics from
focus groups, social media, test markets, and early
store rollouts to plan, produce, and launch new
products.
⚫ Predictive maintenance
⚫ Factors that can predict mechanical failures
may be deeply buried in structured data, such
as the year, make, and model of equipment, as
well as in unstructured data that covers
millions of log entries, sensor data, error
messages, and engine temperature.
⚫ By analyzing these indications of potential
issues before the problems happen,
organizations can deploy maintenance more
cost effectively and maximize parts and
equipment uptime.
⚫ Customer experience:
⚫ Big data enables you to gather data from social
media, web visits, call logs, and other sources to
improve the interaction experience and maximize the
value delivered. Start delivering personalized offers,
reduce customer churn, and handle issues proactively.

⚫ Fraud and compliance:
⚫ Security landscapes and compliance requirements are constantly evolving. Big data helps you identify patterns in data that indicate fraud and aggregate large volumes of information to make regulatory reporting much faster.
⚫ Machine learning:
⚫ We are now able to teach machines instead of
program them. The availability of big data to train
machine learning models makes that possible.

⚫ Operational efficiency:
⚫ With big data, you can analyze and assess
production, customer feedback and returns, and
other factors to reduce outages and anticipate
future demands. Big data can also be used to
improve decision-making in line with current
market demand.
⚫ Drive innovation:
⚫ Big data can help you innovate by studying interdependencies among humans, institutions, entities, and processes, and then determining new ways to use those insights. Use data insights to improve decisions about financial and planning considerations.
Data Categories
Structured data: Structured data follows a defined schema with all the required columns. It is in a tabular form and is stored in a relational database management system.
Semi-structured data: In semi-structured data, the schema is not strictly defined, e.g., JSON, XML, CSV, TSV, and email. OLTP (Online Transaction Processing) systems are built to work with semi-structured data. It is stored in relations, i.e., tables.
Unstructured data: All the unstructured files, such as log files, audio files, and image files, are included in unstructured data. Some organizations have a great deal of data available, but they do not know how to derive value from it since the data is raw.
Quasi-structured data: This data format contains textual data with inconsistent formats that can be structured only with effort, time, and some tools.
Example: Web server logs, i.e., a log file created and maintained by a server that contains a list of its activities.
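To make the categories concrete, here is a small Python sketch (the file contents, field names, and log format are made-up examples, not part of the syllabus) that handles one record of each kind:

```python
import csv
import io
import json
import re

# Structured: rows follow a fixed schema (every row has the same columns).
csv_text = "order_id,customer,amount\n1001,Asha,250.0\n"
for row in csv.DictReader(io.StringIO(csv_text)):
    print(row["order_id"], row["amount"])

# Semi-structured: JSON carries its own keys; fields may vary per record.
record = json.loads('{"order_id": 1002, "customer": "Ravi", "tags": ["priority"]}')
print(record.get("tags", []))

# Quasi-structured: a web server log line needs effort (here, a regex) to extract fields.
log_line = '127.0.0.1 - - [01/Jan/2024:10:00:00] "GET /index.html HTTP/1.1" 200'
match = re.search(r'"(\w+) (\S+) HTTP', log_line)
if match:
    print(match.group(1), match.group(2))   # GET /index.html
```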
Types of Big Data
⚫ Cons of Structured Data
⚫ Structured data has limited flexibility and is suitable only for certain specific use cases.

⚫ Structured data is stored in a data warehouse with rigid constraints and a definite schema. Any change in requirements would mean updating all of that structured data to meet the new needs.

⚫ Semi-structured Data
⚫ The data is not in the relational format, but it has some features, like key-value pairs, that help in discerning the different entities from each other.
⚫ Since semi-structured data doesn't need a structured query language, it is commonly called NoSQL data.
⚫ A data serialization language is used to exchange semi-structured data across systems that may even have varied underlying infrastructure.
⚫ This type of information typically comes from external sources such as social media platforms or other web-based data feeds.
Things you must know!
⚫ Software developers use serialization languages to write in-memory data to files so that it can be transmitted, stored, and parsed. The sender and the receiver don't need to know about each other's systems; as long as the same serialization language is used, the data can be understood by both systems comfortably. There are three predominantly used serialization languages.
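As a minimal illustration of that idea, the Python sketch below serializes an in-memory record to JSON text and parses it back; the record fields are invented for the example:

```python
import json

# In-memory data on the sending system
record = {"sensor_id": 17, "temperature_c": 41.5, "status": "OK"}

# Serialize: turn the in-memory object into portable text
payload = json.dumps(record)

# ... payload is transmitted or stored ...

# Deserialize on the receiving system, which only needs to understand JSON
received = json.loads(payload)
print(received["temperature_c"])   # 41.5
```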
Traditional vs. Big Data business approach

Volume: Traditional data typically refers to small to medium-sized datasets that can be easily stored and analyzed using traditional data processing technologies. In contrast, big data refers to extremely large datasets that cannot be easily managed or processed using traditional technologies.

Variety: Traditional data is typically structured, meaning it is organized in a predefined manner such as tables, columns, and rows. Big data, on the other hand, can be structured, unstructured, or semi-structured, meaning it may contain text, images, videos, or other types of data.

Velocity: Traditional data is usually static and updated on a periodic basis. In contrast, big data is constantly changing and updated in real time or near real time.

Complexity: Traditional data is relatively simple to manage and analyze. Big data, on the other hand, is complex and requires specialized tools and techniques to manage, process, and analyze.

Value: Traditional data typically has a lower potential value than big data because it is limited in scope and size. Big data, on the other hand, can provide valuable insights into customer behavior, market trends, and other business-critical information.
Case Study of Big Data Solutions.

⚫ 1) Starbucks:
⚫ With 90 million transactions a week in 25,000 stores worldwide, the coffee giant is in many ways on the cutting edge of using big data and artificial intelligence to help direct marketing, sales, and business decisions.
⚫ Through its popular loyalty card program and mobile application,
Starbucks owns individual purchase data from millions of customers.
Using this information and BI tools, the company predicts purchases and
sends individual offers of what customers will likely prefer via their app
and email. This system draws existing customers into its stores more
frequently and increases sales volumes.
⚫ The same intel that helps Starbucks suggest new products to try also
helps the company send personalized offers and discounts that go far
beyond a special birthday discount. Additionally, a customized email goes
out to any customer who hasn’t visited a Starbucks recently with enticing
offers—built from that individual’s purchase history—to re-engage them.
2) Netflix:
The online entertainment company's 148 million subscribers give it a massive BI advantage.
Netflix has digitized its interactions with its subscribers. It collects data from each of its users and, with the help of data analytics, understands the behavior of subscribers and their watching patterns. It then leverages that information to recommend movies and TV shows customized to the subscriber's choices and preferences.
As per Netflix, around 80% of viewer activity is triggered by personalized algorithmic recommendations. Where Netflix gains an edge over its peers is that by collecting different data points, it creates detailed profiles of its subscribers, which helps it engage with them better.
The recommendation system of Netflix contributes to more than 80% of the content streamed by its subscribers, which has helped Netflix earn a whopping one billion via customer retention. For this reason, Netflix doesn't have to invest too much in advertising and marketing.
Concept of Hadoop
⚫ Hadoop is an open-source software framework that is used for storing and processing large amounts of data in a distributed computing environment.
⚫ It is designed to handle big data and is based on the MapReduce programming model, which allows for the parallel processing of large datasets.
Main components
⚫ Hadoop has two main components:
⚫ HDFS (Hadoop Distributed File System): This is the storage component of Hadoop, which allows for the storage of large amounts of data across multiple machines. It is designed to work with commodity hardware, which makes it cost-effective.
⚫ YARN (Yet Another Resource Negotiator): This is the resource management component of Hadoop, which manages the allocation of resources (such as CPU and memory) for processing the data stored in HDFS.
⚫ Hadoop also includes several additional modules that provide further functionality, such as
⚫ Hive (a SQL-like query language),
⚫ Pig (a high-level platform for creating
MapReduce programs), and
⚫ HBase (a non-relational, distributed database).
⚫ Hadoop is commonly used in big data scenarios
such as data warehousing, business
intelligence, and machine learning.
⚫ It’s also used for data processing, data analysis,
and data mining.
History of Hadoop
The Apache Software Foundation is the developer of Hadoop, and its co-founders are Doug Cutting and Mike Cafarella.

⚫ Co-founder Doug Cutting named it after his son's toy elephant. In October 2003, the first paper, on the Google File System, was released.

⚫ In January 2006, MapReduce development started on Apache Nutch, which consisted of around 6,000 lines of code for MapReduce and around 5,000 lines of code for HDFS. In April 2006, Hadoop 0.1.0 was released.
⚫ It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

⚫ It is used by many organizations, including Yahoo, Facebook, and IBM, for a variety of purposes such as data warehousing, log processing, and research.

⚫ Hadoop has been widely adopted in the industry and has become a key technology for big data processing.
Key features

⚫ Distributed Storage
⚫ Scalability
⚫ Fault-Tolerance
⚫ High Availability
⚫ Data Locality
⚫ Flexible Data Processing
⚫ Data Replication
⚫ YARN
⚫ Data Integrity
⚫ Data Compression
⚫ Distributed Storage: Hadoop stores large data sets across multiple machines, allowing for the storage and processing of extremely large amounts of data.

⚫ Scalability: Hadoop can scale from a single server to thousands of machines, making it easy to add more capacity as needed.

⚫ Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it can continue to operate even in the presence of hardware failures.

⚫ Data Locality: Hadoop provides a data locality feature, where data is stored on the same node where it will be processed. This helps to reduce network traffic and improve performance.

⚫ High Availability: Hadoop provides a High Availability feature, which helps to make sure that the data is always available and is not lost.

⚫ Flexible Data Processing: Hadoop's MapReduce programming model allows for the processing of data in a distributed fashion, making it easy to implement a wide variety of data processing tasks.

⚫ Data Integrity: Hadoop provides a built-in checksum feature, which helps to ensure that the data stored is consistent and correct.

⚫ Data Replication: Hadoop provides a data replication feature, which replicates data across the cluster for fault tolerance.
⚫ Data Compression: Hadoop provides a built-in data compression feature, which helps to reduce storage space and improve performance.

⚫ YARN: A resource management platform that allows multiple data processing engines, such as real-time streaming, batch processing, and interactive SQL, to run and process data stored in HDFS.
Some common frameworks of Hadoop

⚫ Hive- It uses HiveQL for data structuring and for writing complicated MapReduce jobs over HDFS.
⚫ Drill- It consists of user-defined functions and is used for data exploration.
⚫ Storm- It allows real-time processing and streaming of data.
⚫ Spark- It contains a Machine Learning Library (MLlib) for providing enhanced machine learning and is widely used for data processing. It also supports Java, Python, and Scala.
⚫ Pig- It has Pig Latin, a SQL-like language, and performs data transformation of unstructured data.
⚫ Tez- It reduces the complexities of Hive and Pig and helps their code run faster.
Hadoop Ecosystem
⚫ The Hadoop Ecosystem is neither a programming language nor a service; it is a platform or framework which solves big data problems.
⚫ You can consider it as a suite which encompasses a number of services (ingesting, storing, analyzing, and maintaining) inside it.
HDFS

⚫ The Hadoop Distributed File System is the core component, or the backbone, of the Hadoop Ecosystem.
⚫ HDFS is the one which makes it possible to store different types of large data sets (i.e. structured, unstructured, and semi-structured data).
⚫ HDFS creates a level of abstraction over the resources, from where we can see the whole of HDFS as a single unit.
⚫ It helps us in storing our data across various nodes and maintaining the log file about the stored data (metadata).
⚫ HDFS has two core components, i.e. the NameNode and the DataNode.

⚫ The NameNode is the main node and it doesn't store the actual data. It contains metadata, just like a log file or, you could say, a table of contents. Therefore, it requires less storage and high computational resources.

⚫ On the other hand, all your data is stored on the DataNodes and hence they require more storage resources. These DataNodes are commodity hardware (like your laptops and desktops) in the distributed environment. That's the reason why Hadoop solutions are very cost effective.

⚫ You always communicate with the NameNode while writing data. It then internally sends a request to the client to store and replicate the data on various DataNodes.
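As a rough sketch of how a client touches HDFS in practice (assuming a configured Hadoop client with the standard `hdfs` command on the PATH; the paths and file name are placeholders), the usual shell commands can be driven from Python:

```python
import subprocess

# Create a directory in HDFS, copy a local file into it, and list the result.
# The NameNode resolves the path and metadata; the file blocks land on DataNodes.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/hadoop/input"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "example.txt", "/user/hadoop/input/"], check=True)
subprocess.run(["hdfs", "dfs", "-ls", "/user/hadoop/input"], check=True)
```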
YARN (Yet Another Resource Negotiator)
⚫ Consider YARN as the brain of your Hadoop Ecosystem. It handles the cluster of nodes and acts as Hadoop's resource management unit. YARN allocates resources such as CPU and memory to different applications.

⚫ It has two major components, i.e. the Resource Manager and the Node Manager.

⚫ The Resource Manager is again a main node in the processing department. It receives the processing requests and then passes parts of those requests to the corresponding NodeManagers, where the actual processing takes place.

⚫ Node Managers are installed on every DataNode. They are responsible for the execution of tasks on every single DataNode.
⚫ The ResourceManager has two components, i.e. the Scheduler and the ApplicationsManager.
⚫ Scheduler: Based on your application's resource requirements, the Scheduler runs scheduling algorithms and allocates the resources.
⚫ ApplicationMasters are the daemons which reside on the DataNodes and communicate with containers for the execution of tasks on each DataNode.
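A quick way to see the ResourceManager and NodeManagers at work is the `yarn` command-line tool; this is only a sketch, assuming a running cluster and the `yarn` CLI on the PATH:

```python
import subprocess

# Ask the ResourceManager for the applications it is currently tracking.
subprocess.run(["yarn", "application", "-list"], check=True)

# List the NodeManagers (one per worker node) registered with the ResourceManager.
subprocess.run(["yarn", "node", "-list"], check=True)
```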
MapReduce

⚫ Hadoop data processing is built on MapReduce, which processes large volumes of data in a parallel, distributed manner.
⚫ In the beginning, the input data is divided up to form the input splits.
⚫ The first phase is the Map phase, where the data in each split is passed to a mapping function to produce output values. In the shuffle and sort phase, the mapping phase's output is taken and grouped into blocks of similar data.
⚫ Finally, the output values from the shuffling phase are aggregated. It then returns a single output value.
⚫ MapReduce is a programming model that allows us to perform parallel and distributed processing on huge data sets.
⚫ MapReduce makes use of two functions, i.e. Map() and Reduce(), whose tasks are:
⚫ Map() performs sorting and filtering of data, thereby organizing it into groups. Map generates a key-value-pair-based result which is later processed by the Reduce() method.
⚫ Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
MapReduce Tutorial: A Word Count Example of MapReduce

⚫ Let us understand how MapReduce works by taking an example where I have a text file called example.txt whose contents are as follows:
⚫ Dear, Bear, River, Car, Car, River, Deer, Car and Bear
⚫ Now, suppose we have to perform a word count on example.txt using MapReduce. So, we will be finding the unique words and the number of occurrences of those unique words.
⚫ First, we divide the input into three splits as
shown in the figure. This will distribute the
work among all the map nodes.
⚫ Then, we tokenize the words in each of the
mappers and give a hardcoded value (1) to
each of the tokens or words.
⚫ Now, a list of key-value pairs will be created where the key is nothing but the individual word and the value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs – Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.
⚫ After the mapper phase, a partition process takes place
where sorting and shuffling happen so that all the tuples
with the same key are sent to the corresponding reducer.
⚫ So, after the sorting and shuffling phase, each reducer
will have a unique key and a list of values corresponding
to that very key. For example, Bear, [1,1]; Car,
[1,1,1].., etc.
⚫ Now, each Reducer counts the values which are present
in that list of values. As shown in the figure, reducer gets
a list of values which is [1,1] for the key Bear. Then, it
counts the number of ones in the very list and gives the
final output as – Bear, 2.
⚫ Finally, all the output key/value pairs are then collected
and written in the output file.
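The same flow can be simulated locally in a few lines of Python; this is only an illustration of the map, shuffle/sort, and reduce phases on the example sentence, not how Hadoop itself is invoked:

```python
from collections import defaultdict

# The three input splits from the example
splits = ["Dear Bear River", "Car Car River", "Deer Car Bear"]

# Map phase: each mapper tokenizes its split and emits (word, 1) pairs.
mapped = [(word, 1) for split in splits for word in split.split()]

# Shuffle and sort phase: group all values that share the same key.
grouped = defaultdict(list)
for word, one in mapped:
    grouped[word].append(one)            # e.g. "Bear" -> [1, 1]

# Reduce phase: each reducer sums the list of values for its key.
counts = {word: sum(ones) for word, ones in grouped.items()}
print(counts)   # {'Dear': 1, 'Bear': 2, 'River': 2, 'Car': 3, 'Deer': 1}
```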
PIG
Apache Pig was developed by Yahoo researchers,
targeted mainly towards non-programmers. It was
designed with the ability to analyze and process
large datasets without using complex Java codes. It
provides a high-level data processing language that
can perform numerous operations without getting
bogged down with too many technical concepts.

Pig consists of:
Pig Latin - This is the language for scripting.
Pig Latin Compiler - This converts Pig Latin code into executable code.
Pig also provides Extract, Transform, and Load (ETL) capabilities and a platform for building data flows. Did you know that ten lines of Pig Latin script equal approximately 200 lines of MapReduce code? Pig uses simple, time-efficient steps to analyze datasets.
Programmers write scripts in Pig Latin to analyze data using Pig. Grunt Shell is Pig's interactive shell, used to execute all Pig scripts.
If the Pig script is written in a script file, the Pig Server executes it. The parser checks the syntax of the Pig script, after which the output will be a DAG (Directed Acyclic Graph).

The DAG (logical plan) is passed to the logical optimizer.

The compiler converts the DAG into MapReduce jobs.

The MapReduce jobs are then run by the Execution Engine. The results are displayed using the "DUMP" statement and stored in HDFS using the "STORE" statement.
Hive
⚫ Hive uses SQL (Structured Query Language) to
facilitate the reading, writing, and management
of large datasets residing in distributed storage.

⚫ Hive was developed with a vision of incorporating the concepts of tables and columns with SQL, since users were comfortable with writing queries in SQL.
⚫ Apache Hive has two major components:
⚫ Hive Command Line
⚫ JDBC/ODBC driver
⚫The Java Database Connectivity (JDBC)
application is connected through JDBC
Driver, and the Open Database
Connectivity (ODBC) application is
connected through ODBC Driver.
⚫Commands are executed directly in CLI.
⚫Hive driver is responsible for all the queries
submitted, performing the three steps of
compilation, optimization, and execution
internally.
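For illustration, a client can submit a HiveQL query through this driver interface; the sketch below uses the third-party PyHive package against a HiveServer2 instance, where the host, port, username, and the web_logs table are all assumptions:

```python
from pyhive import hive  # third-party package; assumed installed

# Connect to an assumed HiveServer2 endpoint
conn = hive.Connection(host="localhost", port=10000, username="hadoop")
cursor = conn.cursor()

# The Hive driver compiles, optimizes, and executes this query on the cluster
cursor.execute("SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")
for page, hits in cursor.fetchall():
    print(page, hits)

conn.close()
```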
SPARK

Spark is a huge framework in and of itself: an open-source distributed computing engine for processing and analyzing vast volumes of real-time data. It runs up to 100 times faster than MapReduce. Spark provides in-memory computation of data and is used to process and analyze real-time streaming data such as stock market and banking data, among other things.
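A minimal PySpark sketch of the earlier word-count idea, using in-memory DataFrame operations; the HDFS path is hypothetical and a local Spark installation is assumed:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, lower, col

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read a text file (path is a placeholder) and split each line into words
lines = spark.read.text("hdfs:///user/hadoop/input/example.txt")
words = lines.select(explode(split(lower(col("value")), r"\W+")).alias("word"))

# Count occurrences of each word, keeping the data in memory where possible
counts = words.where(col("word") != "").groupBy("word").count()
counts.show()

spark.stop()
```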
Mahout
⚫Mahout is used to create scalable and
distributed machine learning algorithms
such as clustering, linear regression,
classification, and so on. It has a library
that contains built-in algorithms for
collaborative filtering, classification, and
clustering.
Apache Ambari
⚫ It is an open-source tool responsible for keeping track of running applications and their statuses. Ambari manages, monitors, and provisions Hadoop clusters.
⚫ It also provides a central management service to start, stop, and configure Hadoop services.
⚫ Apache Ambari follows a master/slave architecture.
⚫ The master node is accountable for keeping track of the state of the infrastructure. To do this, the master node uses a database server that can be configured during setup.
⚫ Most of the time, the Ambari server is located on the master node and is connected to the database.
⚫ Agents run on all the nodes that you want to manage under Ambari. This program occasionally sends heartbeats to the master node to show that it is alive. Using the Ambari Agent, the Ambari Server is able to execute many tasks.
Kafka
⚫ Kafka is a distributed streaming platform
designed to store and process streams of
records. It is written in Scala. It builds real-time
streaming data pipelines that reliably get data
between applications, and also builds real-time
applications that transform data into streams.
⚫ Kafka uses a messaging system for transferring data from one application to another. In this model, a sender, a message queue, and a receiver are involved in the data transfer.
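A sketch of that sender/receiver model using the third-party kafka-python package; the broker address and topic name are assumptions:

```python
from kafka import KafkaProducer, KafkaConsumer  # third-party kafka-python package

# Sender: publish a record to the "clickstream" topic on an assumed local broker
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clickstream", b'{"user": 42, "page": "/home"}')
producer.flush()

# Receiver: read records from the same topic, starting from the earliest offset
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,  # stop iterating if no message arrives for 10 s
)
for message in consumer:
    print(message.value)
    break  # stop after the first record for this demo
```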
Storm
⚫ Storm is an engine that processes real-time streaming data at a very high speed.
⚫ It is written in Clojure. Storm can handle over 1 million jobs on a node in a fraction of a second. It is integrated with Hadoop to harness higher throughputs.
Ranger

⚫ Ranger is a framework designed to enable, monitor, and manage data security across the Hadoop platform.

⚫ It provides centralized administration for managing all security-related tasks.

⚫ Ranger standardizes authorization across all Hadoop components and provides enhanced support for different authorization methods like role-based access control and attribute-based access control, to name a few.
Knox
⚫ Apache Knox is an application gateway used in
conjunction with Hadoop deployments,
interacting with REST APIs and UIs. The
gateway delivers three types of user-facing
services:
1. Proxying Services - This provides access to
Hadoop via proxying the HTTP request
2. Authentication Services - This gives
authentication for REST API access and
WebSSO flow for user interfaces
3. Client Services - This provides client
development either via scripting through DSL
or using the Knox shell classes
Apache Sqoop
⚫ Sqoop works as a front-end loader for Big Data. It is a front-end interface that enables moving bulk data between Hadoop and relational databases and into variously structured data marts.
⚫ Sqoop replaces the function of 'developing scripts' to import and export data. It mainly helps in moving data from an enterprise database to a Hadoop cluster.
Apache Flume

⚫ Flume collects, aggregates, and moves large sets of data from their origin and sends them to HDFS. It works as a fault-tolerant mechanism.

⚫ It helps in transmitting data from a source into a Hadoop environment. Flume enables its users to get data from multiple servers into Hadoop immediately.
Apache Zookeeper

⚫ Zookeeper is an open-source project designed to coordinate multiple services in the Hadoop ecosystem. Organizing and maintaining a service in a distributed environment is a complicated task.
⚫ Zookeeper solves this problem with its simple APIs and architecture. It allows developers to focus on core applications instead of concentrating on the distributed nature of the application.
⚫ Features of Zookeeper
⚫ Zookeeper is fast with workloads where reads of the data are more common than writes.
⚫ Zookeeper acts as a disciplined coordinator because it maintains a record of all transactions.
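A small sketch of coordination through Zookeeper using the third-party kazoo client; the ensemble address, znode path, and stored value are assumptions:

```python
from kazoo.client import KazooClient  # third-party kazoo package

# Connect to an assumed ZooKeeper ensemble
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Store a small piece of shared configuration at a znode
zk.ensure_path("/app/config")
zk.set("/app/config", b"batch_size=128")

# Any other service in the cluster can read the same znode
data, stat = zk.get("/app/config")
print(data.decode(), "version:", stat.version)

zk.stop()
```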
HBase

⚫ HBase is considered a Hadoop database because it is a scalable, distributed NoSQL database that runs on top of Hadoop.

⚫ Apache HBase is designed to store structured data in a table format which can have millions of columns and billions of rows. HBase gives real-time read and write access to data on HDFS.
⚫ HBase features
⚫ HBase is an open-source, NoSQL database.
⚫ It is modeled after Google's Bigtable, which is a distributed storage system designed to handle big data sets.
⚫ It has a unique feature to support all types of data. With this feature, it plays a crucial role in handling various types of data in Hadoop.
⚫ HBase is written in Java, and its applications can be written using the Avro, REST, and Thrift APIs.
Components of HBase

⚫ There are majorly two components in HBase: the HBase Master and the Region Server.
a) HBase Master:
⚫ It is not part of the actual data storage, but it manages load balancing activities across all Region Servers.
⚫ It controls the failovers.
⚫ It performs administration activities and provides an interface for creating, updating, and deleting tables.
⚫ It handles DDL operations.
⚫ It maintains and monitors the Hadoop cluster.
b) Region Server: It is a worker node. It handles read, write, and delete requests from clients. A Region Server runs on every node of the Hadoop cluster, and it runs on the HDFS DataNodes.
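A brief sketch of real-time reads and writes against HBase using the third-party happybase client, which talks to the HBase Thrift server; the host, table name, and column family are assumptions:

```python
import happybase  # third-party client for the HBase Thrift server

# Connect to an assumed Thrift server in front of HBase
connection = happybase.Connection("localhost")
table = connection.table("users")   # a table with a 'profile' column family is assumed

# Write one row (real-time put), then read it back (real-time get)
table.put(b"row-1001", {b"profile:name": b"Asha", b"profile:city": b"Mumbai"})
print(table.row(b"row-1001"))

connection.close()
```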
