Unit-1 Introduction To Big Data
DATA PROCESSING
Network Analytics Telco Mediation
2) Scale-Out
i. Add more nodes/machines to an existing distributed application.
ii. The software layer is designed for node additions or removals.
iii. Hadoop takes this approach: a set of nodes is bonded together as a single
distributed system.
iv. Very easy to scale down as well.
Bring Code to Data rather than Data to Code
⮚ If a node fails, the master will detect that failure and re-assign the work to
a different node on the system.
⮚ If a failed node restarts, it is automatically added back to the system and
assigned new tasks.
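
The failure handling described above can be sketched as a toy model. This is not Hadoop's scheduler: the `Master` class, its method names, and the "fewest tasks first" policy are all assumptions made for illustration.

```python
from collections import defaultdict


class Master:
    """Toy master that tracks which tasks each worker node holds (hypothetical)."""

    def __init__(self, nodes):
        self.alive = set(nodes)
        self.assignments = defaultdict(list)  # node -> list of task ids

    def assign(self, task):
        # Assumed policy: pick the live node with the fewest tasks (ties by name).
        node = min(sorted(self.alive), key=lambda n: len(self.assignments[n]))
        self.assignments[node].append(task)
        return node

    def node_failed(self, node):
        # Remove the node, then hand its tasks to the remaining live nodes.
        self.alive.discard(node)
        orphaned = self.assignments.pop(node, [])
        return {task: self.assign(task) for task in orphaned}

    def node_restarted(self, node):
        # A restarted node rejoins with no tasks and is eligible for new ones.
        self.alive.add(node)
        self.assignments[node] = []


master = Master(["node1", "node2", "node3"])
for task in ["t1", "t2", "t3", "t4"]:
    master.assign(task)

moved = master.node_failed("node1")   # tasks of node1 go to node2 and node3
master.node_restarted("node1")        # node1 rejoins, empty
next_node = master.assign("t5")       # the new task lands on the rejoined node
```

The point of the sketch is only that work is tied to the master's bookkeeping, not to a particular machine, so tasks can move when a node disappears.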
[Figure: HDFS; numbered data blocks 1-5]
Cost is $400-$500/TB
Hadoop Core Components
What is File System (FS)?
⮚ A file management system is used by the operating system to access the
files and folders stored on a computer or on external storage devices.
⮚ A file system stores and organizes data and can be thought of as a type
of index for all the data contained in a storage device. These devices
can include hard drives, optical drives and flash drives.
⮚ A distributed file system allows programs to access or store remote files as
they do local ones, allowing programmers to access files from any network or
computer.
⮚ Data files are split into blocks and distributed across multiple nodes in the
cluster.
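
A minimal sketch of splitting a file into blocks and spreading them over nodes follows. The function names, the tiny 4-byte block size, and the round-robin placement are assumptions for illustration; real HDFS uses much larger blocks and its own placement policy.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut the data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes):
    """Assign block indices to nodes round-robin (assumed placement policy)."""
    placement = {node: [] for node in nodes}
    for block_id in range(num_blocks):
        placement[nodes[block_id % len(nodes)]].append(block_id)
    return placement


data = b"x" * 10                                   # a 10-byte "file"
blocks = split_into_blocks(data, block_size=4)     # toy block size of 4 bytes
placement = place_blocks(len(blocks), ["node1", "node2"])
```

Joining the blocks in order gives back the original file, which is why the cluster must remember where each block lives.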
⮚ The NameNode records the metadata of all the files stored in the cluster, e.g.
the location of the stored blocks, the size of the files, permissions,
hierarchy, etc. There are two files associated with the metadata:
● FsImage: the complete state of the file system namespace since the start
of the NameNode.
● EditLogs: all the recent modifications made to the file system with
respect to the most recent FsImage.
⮚ It records each change that takes place to the file system metadata.
Functions of NameNode (Continued)
⮚ The Secondary NameNode constantly reads the file system metadata from the
RAM of the NameNode and writes it to the hard disk or the file system.
⮚ It downloads the EditLogs from the NameNode at regular intervals and applies
them to the FsImage.
⮚ The new FsImage is copied back to the NameNode, which uses it the next time
the NameNode is started.
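
The checkpoint step above (apply the EditLogs to the FsImage to get a new FsImage) can be sketched as follows. The dictionary-based namespace, the `apply_edits` helper, and the `create`/`delete` operations are simplifications assumed for illustration, not the real on-disk formats.

```python
def apply_edits(fsimage: dict, edit_log: list) -> dict:
    """Return a new namespace snapshot with every logged edit applied in order."""
    merged = dict(fsimage)                # leave the old snapshot untouched
    for op, path, value in edit_log:
        if op == "create":
            merged[path] = value          # value: file size in bytes (illustrative)
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


fsimage = {"/logs/a.txt": 100}
edit_log = [
    ("create", "/logs/b.txt", 250),
    ("delete", "/logs/a.txt", None),
    ("create", "/logs/c.txt", 75),
]

new_fsimage = apply_edits(fsimage, edit_log)
edit_log = []  # after the checkpoint, the log starts over against the new image
```

Because the merged image already contains every logged change, a restart only has to load the new FsImage instead of replaying a long log.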
MapReduce (MR)
What is MapReduce?
■ The MapReduce algorithm contains two important tasks, namely Map and Reduce.
■ Map takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs).
■ The Reduce task takes the output from a map as its input and combines those data
tuples into a smaller set of tuples.
■ As the sequence of the name MapReduce implies, the reduce task is always performed
after the map job.
■ MapReduce is the system used to process data in the Hadoop cluster.
■ Each Map task operates on a discrete portion (one HDFS block) of the overall
dataset.
■ The MapReduce system distributes the intermediate data to the nodes that perform
the Reduce phase.
MapReduce WordCount Example
Hadoop MapReduce WordCount Example
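
A minimal single-process sketch of WordCount in the Map, shuffle, and Reduce style is given here. It is plain Python, not the Hadoop API; the function names and the three sample input lines are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(line: str):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, as happens between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values):
    """Reduce: combine all counts for one word into a single total."""
    return key, sum(values)


# Assumed sample input; in Hadoop each Map task would read one HDFS block.
lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

intermediate = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
```

Here `counts` maps each lowercased word to its total: `car` appears three times and the other words twice each.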
Hadoop MapReduce Working