
Unit-1

Introduction to Big Data


❑ Big Data
❑ Hadoop
❑ HDFS
❑ MapReduce
Big Data

⮚ Big data refers to data that is so large, fast or complex that it’s
difficult or impossible to process using traditional methods.

⮚ The act of accessing and storing large amounts of information for
analytics has been around for a long time, but the concept of big data
gained momentum in the early 2000s.

⮚ Big Data is “high-volume, high-velocity and/or high-variety information
assets that require new forms of processing for enhanced decision
making, insight discovery and process optimization” (Gartner, 2012).

⮚ “Data of a very large size, typically to the extent that its manipulation
and management present significant logistical challenges.”
Types of Big Data
⮚ Big data is classified in three ways: Structured Data, Unstructured
Data and Semi-Structured Data.

⮚ Structured data is the easiest to work with. It is highly organized, with
dimensions defined by set parameters. Structured data follows schemas:
essentially road maps to specific data points. These schemas outline
where each datum is and what it means. It’s all your quantitative data,
like age, billing, address, etc.

⮚ Unstructured data is all your unorganized data. The hardest part of
analyzing unstructured data is teaching an application to understand the
information it’s extracting. More often than not, this means translating it
into some form of structured data.
⮚ Semi-structured data toes the line between structured and unstructured.
Most of the time, this translates to unstructured data with metadata
attached to it. Examples of this metadata are a time, location or device ID
stamp, or an email address; it can also be a semantic tag attached to the
data later. Semi-structured data has no set schema, as illustrated below.
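
As a concrete illustration, a semi-structured record can be pictured as a free-text payload with structured metadata around it. The minimal Java sketch below prints one such record; the field names and values are purely hypothetical.

```java
public class SemiStructuredExample {
    public static void main(String[] args) {
        // Structured metadata: timestamp, device ID and a location tag.
        // The "body" field holds free text, i.e. the unstructured part.
        String record = "{"
                + "\"timestamp\": \"2024-01-15T09:30:00Z\", "
                + "\"deviceId\": \"sensor-42\", "
                + "\"location\": \"warehouse-3\", "   // semantic tag added later
                + "\"body\": \"Temperature spike observed near loading dock.\""
                + "}";
        System.out.println(record);
    }
}
```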
3 V’s of Big Data

⮚ The three V’s are Volume, Velocity and Variety, the dimensions named
in the Gartner definition above.
Hadoop
What is Hadoop?
⮚ Hadoop is an Apache open-source framework, written in Java, that
allows distributed processing of large datasets across clusters of
computers using simple programming models.

⮚ The Hadoop framework application works in an environment that
provides distributed storage and computation across clusters of
computers.

⮚ Hadoop is a framework that uses distributed storage and parallel
processing to store and manage Big Data.

⮚ Hadoop is designed to scale up from a single server to thousands of
machines, each offering local computation and storage.
Hadoop Applications
Advanced Analytics (Use Case)    Industry         Data Processing (Use Case)
Social Network Analysis          Web              Clickstream Sessionization
Content Optimization             Media            Clickstream Sessionization
Network Analytics                Telco            Mediation
Loyalty & Promotions Analysis    Retail           Data Factory
Fraud Analysis                   Financial        Trade Reconciliation
Entity Analysis                  Federal          SIGINT
Sequencing Analysis              Bioinformatics   Genome Mapping

Hadoop Core Principles
⮚ Scale-Out rather than Scale-Up

⮚ Bring code to data rather than data to code

⮚ Deal with failures – they are common

⮚ Abstract complexity of distributed and concurrent applications
Scale-Out rather than Scale-Up

1) Scale-Up: it is harder and more expensive
   i.   Add additional resources to an existing node (CPU, RAM)
   ii.  Moore’s Law can’t keep up with data growth
   iii. New units must be purchased if the required resources cannot be added
   iv.  Also known as scaling vertically

2) Scale-Out
   i.   Add more nodes/machines to an existing distributed application
   ii.  The software layer is designed for node additions or removals
   iii. Hadoop takes this approach: a set of nodes is bonded together as a
        single distributed system
   iv.  Very easy to scale down as well
Bring Code to Data rather than Data to Code

◆ Hadoop co-locates processors and storage
◆ Code is moved to the data (code size is tiny, usually in KBs)
◆ Processors execute the code and access the underlying local storage
Hadoop is designed to cope with node failures

⮚ If a node fails, the master will detect that failure and re-assign the work to
a different node on the system.

⮚ Restarting a task does not require communication with nodes working on
other portions of the data.

⮚ If a failed node restarts, it is automatically added back to the system and
assigned new tasks.

⮚ If a node appears to be running slowly, the master can redundantly
execute another instance of the same task (speculative execution).

⮚ Results from the first instance to finish will be used.
HDFS Replication

[Figure: a file split into five blocks, with each block replicated on three
different DataNodes across the cluster]

⮚ Block Size = 64 MB
⮚ Replication Factor = 3
⮚ Cost is $400-$500/TB
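
As a hedged sketch of how these parameters are controlled in practice, the standard Hadoop FileSystem API lets a client set the replication factor and block size (normally configured cluster-wide in hdfs-site.xml); the path below is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side defaults; normally set cluster-wide in hdfs-site.xml.
        conf.set("dfs.replication", "3");                 // three copies of every block
        conf.setLong("dfs.blocksize", 64L * 1024 * 1024); // 64 MB blocks, as in the figure

        FileSystem fs = FileSystem.get(conf);
        // The replication factor can also be changed per file after writing:
        fs.setReplication(new Path("/data/input.txt"), (short) 3);
        fs.close();
    }
}
```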
Hadoop Core Components

[Figure: the two core components, HDFS for distributed storage and
MapReduce for distributed processing]
What is File System (FS)?
⮚ The file management system is used by the operating system to access the
files and folders stored on a computer or on any external storage device.

⮚ A file system stores and organizes data and can be thought of as a type
of index for all the data contained in a storage device. These devices
can include hard drives, optical drives and flash drives.

⮚ Imagine the file management system as a big dictionary that contains
information about file names, locations and types.

⮚ File systems specify conventions for naming files, including the
maximum number of characters in a name, which characters can be
used, etc.

⮚ A file management system, however, is only capable of handling files
within a single machine.
What is a Distributed File System (DFS)?
⮚ A Distributed File System (DFS), as the name suggests, is a file system
that is distributed across multiple file servers or multiple locations.

⮚ It allows programs to access and store files as if they were local, so that
files can be reached from any computer on the network.

⮚ The main purpose of the Distributed File System (DFS) is to allow users
of physically distributed systems to share their data and resources by
using a common file system.

⮚ A collection of workstations and mainframes connected by a Local
Area Network (LAN) is a typical configuration of a Distributed File System.
How does a Distributed File System (DFS) work?

⮚ A distributed file system works as follows:

a) Distribution: distribute blocks of data sets across multiple nodes. Each
node has its own computing power, which gives the DFS the ability to
process data blocks in parallel.
b) Replication: the distributed file system also replicates data blocks on
different clusters by copying the same pieces of information onto multiple
nodes on different racks. This helps to achieve the following:
c) Fault Tolerance: recover a data block in case of node failure or rack
failure.
d) High Concurrency: make the same piece of data available to multiple
clients at the same time, using the computation power of each node to
process data blocks in parallel.
DFS Advantages

a) Scalability: you can scale up your infrastructure by adding more racks or
clusters to your system.
b) Fault Tolerance: data replication helps to achieve fault tolerance in the
following cases: a cluster is down, a rack is down, a rack is disconnected
from the network, or a job fails or restarts.
c) High Concurrency: utilize the compute power of each node to handle
multiple client requests (in parallel) at the same time.
DFS Disadvantages

a) In a Distributed File System, nodes and connections need to be secured,
so security is a concern.
b) There is a possibility of losing messages and data in the network while
they move from one node to another.
c) Database connection in the case of a Distributed File System is
complicated.
d) Handling of the database is also not easy in a Distributed File System
compared to a single-user system.
Hadoop Distributed File System (HDFS)
HDFS Basics
⮚ The Hadoop Distributed File System (HDFS) is based on the Google File
System (GFS)

⮚ The Hadoop Distributed File System is responsible for storing data on the
cluster.

⮚ Data files are split into blocks and distributed across multiple nodes in the
cluster.

⮚ Each block is replicated multiple times:
– The default is to replicate each block three times
– Replicas are stored on different nodes
– This ensures both reliability and availability

⮚ HDFS is a distributed file system that provides high-throughput access to
application data, as the sketch below shows.
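
To verify that a file’s blocks really are spread across DataNodes, a small sketch against the same Hadoop FileSystem API can list each block together with the hosts holding its replicas (the path is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/input.txt"));

        // One BlockLocation per block; getHosts() names the DataNodes
        // that hold this block's replicas (three by default).
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```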
HDFS Architecture
Hadoop Daemons

▪ Hadoop is comprised of five separate daemons:
▪ NameNode: Holds the metadata for HDFS
▪ Secondary NameNode
– Performs housekeeping functions for the NameNode
– Is not a backup or hot standby for the NameNode!
▪ DataNode: Stores actual HDFS data blocks
▪ JobTracker: Manages MapReduce jobs, distributes individual tasks
▪ TaskTracker: Responsible for instantiating and monitoring individual Map and
Reduce tasks
Functions of Namenode
⮚ It is the master daemon that maintains and manages the DataNodes
(slave nodes)

⮚ It records the metadata of all the files stored in the cluster, e.g.
The location of blocks stored, the size of the files, permissions,
hierarchy, etc. There are two files associated with the metadata:
●FsImage: Complete state of the file system namespace since the start
of the NameNode.
●EditLogs: All the recent modifications made to the file system with
respect to the most recent FsImage.

⮚ It records each change that takes place to the file system metadata.
Functions of Namenode (Continued..)

⮚ It regularly receives a Heartbeat and a block report from all the
DataNodes in the cluster to ensure that the DataNodes are live.

⮚ It keeps a record of all the blocks in HDFS and of the nodes on which
these blocks are located.

⮚ The NameNode is also responsible for maintaining the replication factor.

⮚ In case of a DataNode failure, the NameNode chooses new DataNodes
for new replicas, balances disk usage and manages the communication
traffic to the DataNodes.
Functions of Datanode
⮚ These are the slave daemons or processes that run on each slave machine.

⮚ The actual data is stored on DataNodes.

⮚ The DataNodes serve the low-level read and write requests from the
file system’s clients.

⮚ They send heartbeats to the NameNode periodically to report the overall
health of HDFS; by default, this frequency is set to 3 seconds (see the
sketch below).
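
The 3-second figure is the default of the dfs.heartbeat.interval property in hdfs-default.xml. A minimal sketch reading that setting from the client-side configuration (assuming hdfs-site.xml is on the classpath):

```java
import org.apache.hadoop.conf.Configuration;

public class HeartbeatIntervalExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Pull in the HDFS site settings explicitly (classpath resource).
        conf.addResource("hdfs-site.xml");
        // dfs.heartbeat.interval defaults to 3 (seconds) in hdfs-default.xml.
        long seconds = conf.getLong("dfs.heartbeat.interval", 3);
        System.out.println("DataNode heartbeat interval: " + seconds + "s");
    }
}
```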
Functions of Secondary NameNode

⮚ The Secondary NameNode constantly reads the file system state and
metadata from the RAM of the NameNode and writes it to the hard disk or
the file system.

⮚ It is responsible for combining the EditLogs with FsImage from the NameNode. 

⮚ It downloads the EditLogs from the NameNode at regular intervals and applies to
FsImage.

⮚ The new FsImage is copied back to the NameNode, and it is used the next
time the NameNode is started.
MapReduce (MR)
What is MapReduce?

■ MapReduce is a processing technique and a programming model for
distributed computing based on Java.

■ The MapReduce algorithm contains two important tasks, namely Map and Reduce.

■ Map takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs).

■ The Reduce task takes the output from a map as input and combines those data
tuples into a smaller set of tuples.

■ As the name MapReduce implies, the reduce task is always performed
after the map task.
■ MapReduce is the system used to process data in the Hadoop cluster.

■ Consists of two phases: Map, and then Reduce.

■ Each Map task operates on a discrete portion (one HDFS Block) of the overall
dataset.

■ The MapReduce system distributes the intermediate data to the nodes that
perform the Reduce phase.
Hadoop MapReduce WordCount Example

[Figure: step-by-step WordCount walkthrough showing input splitting,
mapping, shuffling/sorting and reducing; a code sketch follows below]
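
The classic WordCount job makes both phases concrete: the mapper emits a (word, 1) pair for every word it sees, and the reducer sums those counts per word. The sketch below follows the stock Hadoop example in spirit; input and output paths come from the command line and are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: break each input line into words and emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum all the 1s emitted for the same word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Run it with, for example, hadoop jar wordcount.jar WordCount /input /output (paths hypothetical). Reusing the reducer as a combiner pre-aggregates counts on each map node, which reduces the intermediate data shuffled across the network.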
Hadoop MapReduce Working
