HADOOP
Presentation by:
Harshdeep Kaur
Roll No: 7704
Submitted To:
Mrs Anu Singla
Things included:
• History of data
• What is Big Data?
• Big data challenges
• Big data solutions
• Hadoop
• Hadoop architecture
• HDFS
• MapReduce
• How does Hadoop work
• Environment setup
• Who uses Hadoop
HISTORY OF DATA
• Today we all generate data
• Data is measured in TB, even PB
• Whatever we want to know today, we just look it up on the internet
• Even a kindergarten child's rhymes are on the internet
• Earlier we needed floppy disks to save our data; now we have moved to the cloud
• 90% of the data in the world today has been created in the last two years alone
Organization   Est. amount of data stored   Est. amount of data processed per day
eBay           200 PB                       100 PB
Google         1500 PB                      100 PB
Facebook       300 PB                       600 TB
Twitter        200 PB                       100 TB
A flood of data is coming from many sources.
What is Big Data?
Big data is a collection of large datasets that cannot be processed using
traditional computing techniques. It is not a single technique or tool;
rather, it involves many areas of business and technology.
According to IBM, 80% of the data captured today is unstructured.
It is gathered from:
• Posts to social media sites
• Digital pictures and videos
• Purchase transaction records
• Cell phone GPS signals
• Sensors used to gather climate information
• Black box data
• Social media data
• Power grid data
• Search engine data
Big Data Challenges
The major challenges associated with big data are as follows:
• Capturing data
• Curation
• Storage
• Searching
• Sharing
• Transfer
• Analysis
• Presentation
To meet these challenges, organizations normally take the help of enterprise servers.
BIG DATA SOLUTIONS
Traditional Enterprise Approach
In this approach, an enterprise has a single computer to store and process
big data. For storage, programmers take the help of their choice of
database vendors such as Oracle, IBM, etc.
Limitation
When it comes to dealing with huge amounts of growing data, pushing all of
it through a single database becomes a bottleneck.
Google’s Solution
Google solved this problem using an algorithm called MapReduce.
[Diagram: MapReduce dividing work across commodity hardware, from
single-CPU machines to higher-capacity servers]
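To make the idea concrete, here is a minimal sketch of the map-and-reduce
pattern in plain Java (one JVM, no cluster, no Hadoop APIs; the class name
and sample data are ours): map each line into words, then group and count
the equal words. Hadoop applies the same two phases across many machines.

import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceIdea {
    public static void main(String[] args) {
        String[] lines = { "big data", "big clusters", "data data" };

        Map<String, Long> counts = Arrays.stream(lines)
                // "map" phase: split each line into words
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                // "shuffle" and "reduce": group equal words and count each group
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        System.out.println(counts); // e.g. {big=2, data=3, clusters=1}
    }
}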
Hadoop
• Doug Cutting and Mike Cafarella developed an open-source project
called HADOOP.
• Hadoop is an Apache open-source framework written in Java.
• Hadoop allows distributed processing of large datasets across clusters
of computers using simple programming models.
• Hadoop is designed to scale up from a single server to thousands of
machines, each offering local computation and storage.
• Hadoop runs applications using the MapReduce algorithm.
Hadoop Architecture
Hadoop is designed and built on two independent frameworks:
Hadoop = HDFS + MapReduce
Hadoop has a master/slave architecture for both storage and processing.
The Hadoop Distributed File System (HDFS) was developed using a
distributed file system design. It runs on commodity hardware. Unlike
other distributed systems, HDFS is highly fault-tolerant and designed
for low-cost hardware.
MapReduce is a parallel programming model for writing distributed
applications that efficiently process large amounts of data on large
clusters (thousands of nodes) of commodity hardware.
COMPONENTS OF HDFS
Namenode
• The namenode contains the GNU/Linux operating system and the
namenode software.
• The system hosting the namenode acts as the master server, and it does
the following tasks:
• Manages the file system namespace.
• Regulates clients' access to files.
• Executes file system operations such as renaming, closing, and
opening files and directories (sketched below).
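A small sketch of what these namespace operations look like from a client,
using the standard org.apache.hadoop.fs.FileSystem API (assumes the Hadoop
client libraries on the classpath and a running cluster; the
hdfs://localhost:9000 address and paths are placeholders). Every call
below is metadata work served by the namenode:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamenodeOpsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // placeholder address
        FileSystem fs = FileSystem.get(conf);

        fs.mkdirs(new Path("/user/demo"));                          // new directory in the namespace
        fs.rename(new Path("/user/demo"), new Path("/user/demo2")); // rename, metadata only
        fs.delete(new Path("/user/demo2"), true);                   // recursive delete
        fs.close();
    }
}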
COMPONENTS OF HDFS
Datanode
• The datanode contains the GNU/Linux operating system and the datanode
software.
• For every node in a cluster, there will be a datanode. These nodes
manage the data storage of their system.
• Datanodes perform read-write operations on the file systems, as per
client requests (see the sketch below).
• They also perform operations such as block creation, deletion, and
replication according to the instructions of the namenode.
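By contrast, here is a sketch of the read-write path (same assumptions
and placeholder address as above): the bytes written and read here are
streamed to and from datanodes, while the namenode only records which
datanodes hold each block.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DatanodeIoSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // placeholder address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/hello.txt");
        try (FSDataOutputStream out = fs.create(file)) {   // bytes go to datanodes
            out.writeUTF("hello hdfs");
        }
        try (FSDataInputStream in = fs.open(file)) {       // bytes come from datanodes
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}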
MAPREDUCE
• MapReduce is a parallel programming model for writing distributed
applications that process large amounts of data efficiently (see the
word-count sketch below).
• It does so in a reliable, fault-tolerant manner. MapReduce programs
run on Hadoop.
• It uses a JobTracker and TaskTrackers.
[Diagram: the JobTracker distributing work to TaskTrackers]
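As an illustration of the model, a minimal word-count sketch against the
standard org.apache.hadoop.mapreduce API (assumes the Hadoop client
libraries on the classpath; class names are ours):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts collected for each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(word, new IntWritable(sum));
        }
    }
}

Between the two phases the framework shuffles the (word, 1) pairs so that
every pair with the same word reaches the same reducer.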
How Does Hadoop Work?
• Hadoop runs code across a cluster of computers.
• Data is initially divided into directories and files. Files are
divided into uniformly sized blocks of 64 MB or 128 MB (preferably
128 MB).
• These files are then distributed across various cluster nodes for
further processing.
• HDFS, being on top of the local file system, supervises the
processing.
• Blocks are replicated to handle hardware failure (see the sketch
below).
• Hadoop checks that the code was executed successfully.
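A sketch of how the block size and replication factor can be chosen per
file through a FileSystem.create() overload (placeholder address and path;
128 MB blocks with 3 replicas mirror the figures above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSettingsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");  // placeholder address
        FileSystem fs = FileSystem.get(conf);

        long blockSize = 128L * 1024 * 1024; // split the file into 128 MB blocks
        short replication = 3;               // keep 3 copies of every block
        int bufferSize = 4096;
        try (FSDataOutputStream out = fs.create(
                new Path("/tmp/big-file"), true, bufferSize, replication, blockSize)) {
            out.writeBytes("each 128 MB block of this file gets 3 replicas\n");
        }
        fs.close();
    }
}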
ENVIRONMENT SETUP
• Hadoop is supported on the Linux platform and its flavors.
• If you have an OS other than Linux, you can install VirtualBox and run
Linux inside it.
Pre-installation setup: we need to set up Linux using SSH (Secure Shell).
• Creating a user
• Installing Java
• Downloading Hadoop
Installing Hadoop
Hadoop Operation Modes: once you have downloaded Hadoop, you can operate
your Hadoop cluster in one of the three supported modes (see the
configuration sketch below):
• Local/Standalone Mode: after you download Hadoop, it is configured in
standalone mode by default and can be run as a single Java process.
• Pseudo-Distributed Mode: a distributed simulation on a single machine.
Each Hadoop daemon (HDFS, MapReduce, etc.) runs as a separate Java
process. This mode is useful for development.
• Fully Distributed Mode: fully distributed, with a minimum of two
machines as a cluster.
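The modes differ mostly in configuration. A minimal sketch (port 9000 is
a commonly used value, not a requirement): in standalone mode fs.defaultFS
points at the local file system, while in pseudo-distributed mode it
points at HDFS daemons on localhost.

import org.apache.hadoop.conf.Configuration;

public class ModeConfigSketch {
    public static void main(String[] args) {
        // Standalone: the out-of-the-box default; jobs run in one JVM
        // against the local file system.
        Configuration standalone = new Configuration();
        standalone.set("fs.defaultFS", "file:///");

        // Pseudo-distributed: every daemon is a separate Java process on
        // one machine; port 9000 is a common choice, not a requirement.
        Configuration pseudo = new Configuration();
        pseudo.set("fs.defaultFS", "hdfs://localhost:9000");

        System.out.println(standalone.get("fs.defaultFS")); // file:///
        System.out.println(pseudo.get("fs.defaultFS"));     // hdfs://localhost:9000
    }
}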
Installing Hadoop in standalone mode
• There are no daemons running; everything runs in a single JVM.
• Standalone mode is suitable for running MapReduce programs during
development, since it is easy to test and debug them.
Installing Hadoop in pseudo-distributed mode
Step 1: Setting up Hadoop
Step 2: Hadoop configuration
Verifying the Hadoop installation
• Step 1: Namenode setup
• Step 2: Verifying Hadoop DFS
• Step 3: Verifying the YARN script
• Step 4: Accessing Hadoop in a browser
• Step 5: Verifying all applications for the cluster
[Screenshot: the Hadoop web interface in a browser]
Who uses Hadoop
Amazon
Facebook
Last.fm
The New York Times
Google
IBM
Yahoo
Twitter
LinkedIn
…and the list goes on.
Queries
Thank you