HADOOP
Presentation by:
Harshdeep Kaur
Roll No: 7704
Submitted To:
Mrs Anu Singla
Things included:
• History of data
• What is Big Data?
• Big data challenges
• Big data solutions
• Hadoop
• Hadoop architecture
• HDFS
• MapReduce
• How does Hadoop work
• Environment setup
• Who uses Hadoop
HISTORY OF DATA
• Today we all generate data
• Data is measured in TB, even PB
• Whatever we want to know today, we just look it up on the internet
• Even a kindergarten child's rhymes are on the internet
• Earlier we needed floppy disks to save our data; now we have moved to the cloud
• 90% of the data in the world today has been created in the last two years alone
Organization   Est. amount of data stored   Est. amount of data processed per day
eBay           200 PB                       100 PB
Google         1500 PB                      100 PB
Facebook       300 PB                       600 TB
Twitter        200 PB                       100 TB
A flood of data is coming from many sources.
What is Big Data?
Big data is a collection of large datasets that cannot be processed using
traditional computing techniques. It is not a single technique or tool;
rather, it involves many areas of business and technology.
According to IBM, 80% of the data captured today is unstructured.
It is gathered from:
• Posts to social media sites
• Digital pictures and videos
• Purchase transaction records
• Cell phone GPS signals
• Sensors used to gather climate information
• Black box data
• Social media data
• Power grid data
• Search engine data
Big Data Challenges
The major challenges associated with big data are as follows:
• Capturing data
• Curation
• Storage
• Searching
• Sharing
• Transfer
• Analysis
• Presentation
To meet these challenges, organizations normally take the help of enterprise servers.
BIG DATA SOLUTIONS
Traditional Enterprise Approach
In this approach, an enterprise has a single computer to store and process
big data. For storage, programmers take the help of their choice of
database vendors such as Oracle, IBM, etc.
Limitation
When it comes to dealing with huge amounts of growing data, pushing all of
it through a single database becomes a bottleneck.
Google’s Solution
Google solved this problem using an algorithm called MapReduce.
[Diagram: MapReduce dividing work across commodity hardware, from
single-CPU machines to higher-capacity servers]
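To make the idea concrete, here is a minimal sketch of the map-and-reduce
pattern in plain Java (one JVM, no cluster, no Hadoop APIs; the class name
and sample data are ours): map each line into words, then group and count
the equal words. Hadoop applies the same two phases across many machines.

import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceIdea {
    public static void main(String[] args) {
        String[] lines = { "big data", "big clusters", "data data" };

        Map<String, Long> counts = Arrays.stream(lines)
                // "map" phase: split each line into words
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                // "shuffle" and "reduce": group equal words and count each group
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        System.out.println(counts); // e.g. {big=2, data=3, clusters=1}
    }
}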
Hadoop
• Doug Cutting and Mike Cafarella developed an open-source project
called HADOOP.
• Hadoop is an Apache open-source framework written in Java.
• Hadoop allows distributed processing of large datasets across clusters
of computers using simple programming models.
• Hadoop is designed to scale up from a single server to thousands of
machines, each offering local computation and storage.
• Hadoop runs applications using the MapReduce algorithm.
Hadoop Architecture
Hadoop is designed and built on two independent frameworks:
Hadoop = HDFS + MapReduce
Hadoop has a master/slave architecture for both storage and processing.
The Hadoop Distributed File System (HDFS) was developed using a
distributed file system design. It runs on commodity hardware. Unlike
other distributed systems, HDFS is highly fault-tolerant and designed
for low-cost hardware.
MapReduce is a parallel programming model for writing distributed
applications that efficiently process large amounts of data on large
clusters (thousands of nodes) of commodity hardware.
COMPONENTS OF HDFS
Namenode
• The namenode contains the GNU/Linux operating system and the
namenode software.
• The system hosting the namenode acts as the master server, and it does
the following tasks:
• Manages the file system namespace.
• Regulates clients' access to files.
• Executes file system operations such as renaming, closing, and
opening files and directories (sketched below).
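A small sketch of what these namespace operations look like from a client,
using the standard org.apache.hadoop.fs.FileSystem API (assumes the Hadoop
client libraries on the classpath and a running cluster; the
hdfs://localhost:9000 address and paths are placeholders). Every call
below is metadata work served by the namenode:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamenodeOpsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // placeholder address
        FileSystem fs = FileSystem.get(conf);

        fs.mkdirs(new Path("/user/demo"));                          // new directory in the namespace
        fs.rename(new Path("/user/demo"), new Path("/user/demo2")); // rename, metadata only
        fs.delete(new Path("/user/demo2"), true);                   // recursive delete
        fs.close();
    }
}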
COMPONENTS OF HDFS
Datanode
• The datanode contains the GNU/Linux operating system and the datanode
software.
• For every node in a cluster, there will be a datanode. These nodes
manage the data storage of their system.
• Datanodes perform read-write operations on the file systems, as per
client requests (see the sketch below).
• They also perform operations such as block creation, deletion, and
replication according to the instructions of the namenode.
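By contrast, here is a sketch of the read-write path (same assumptions
and placeholder address as above): the bytes written and read here are
streamed to and from datanodes, while the namenode only records which
datanodes hold each block.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DatanodeIoSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // placeholder address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/hello.txt");
        try (FSDataOutputStream out = fs.create(file)) {   // bytes go to datanodes
            out.writeUTF("hello hdfs");
        }
        try (FSDataInputStream in = fs.open(file)) {       // bytes come from datanodes
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}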
MAPREDUCE
• MapReduce is a parallel programming model for writing distributed
applications that process large amounts of data efficiently (see the
word-count sketch below).
• It does so in a reliable, fault-tolerant manner. MapReduce programs
run on Hadoop.
• It uses a JobTracker and TaskTrackers.
[Diagram: the JobTracker distributing work to TaskTrackers]
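As an illustration of the model, a minimal word-count sketch against the
standard org.apache.hadoop.mapreduce API (assumes the Hadoop client
libraries on the classpath; class names are ours):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts collected for each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(word, new IntWritable(sum));
        }
    }
}

Between the two phases the framework shuffles the (word, 1) pairs so that
every pair with the same word reaches the same reducer.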
How Does Hadoop Work?
• Hadoop runs code across a cluster of computers.
• Data is initially divided into directories and files. Files are
divided into uniformly sized blocks of 64 MB or 128 MB (preferably
128 MB).
• These files are then distributed across various cluster nodes for
further processing.
• HDFS, being on top of the local file system, supervises the
processing.
• Blocks are replicated to handle hardware failure (see the sketch
below).
• Hadoop checks that the code was executed successfully.
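A sketch of how the block size and replication factor can be chosen per
file through a FileSystem.create() overload (placeholder address and path;
128 MB blocks with 3 replicas mirror the figures above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSettingsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");  // placeholder address
        FileSystem fs = FileSystem.get(conf);

        long blockSize = 128L * 1024 * 1024; // split the file into 128 MB blocks
        short replication = 3;               // keep 3 copies of every block
        int bufferSize = 4096;
        try (FSDataOutputStream out = fs.create(
                new Path("/tmp/big-file"), true, bufferSize, replication, blockSize)) {
            out.writeBytes("each 128 MB block of this file gets 3 replicas\n");
        }
        fs.close();
    }
}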
ENVIRONMENT SETUP
• Hadoop is supported on the Linux platform and its flavors.
• If you have an OS other than Linux, you can install VirtualBox and run
Linux inside it.
Pre-installation setup: we need to set up Linux using SSH (Secure Shell).
• Creating a user
• Installing Java
• Downloading Hadoop
Installing Hadoop
Hadoop Operation Modes: once you have downloaded Hadoop, you can operate
your Hadoop cluster in one of the three supported modes (see the
configuration sketch below):
• Local/Standalone Mode: after you download Hadoop, it is configured in
standalone mode by default and can be run as a single Java process.
• Pseudo-Distributed Mode: a distributed simulation on a single machine.
Each Hadoop daemon (HDFS, MapReduce, etc.) runs as a separate Java
process. This mode is useful for development.
• Fully Distributed Mode: fully distributed, with a minimum of two
machines as a cluster.
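The modes differ mostly in configuration. A minimal sketch (port 9000 is
a commonly used value, not a requirement): in standalone mode fs.defaultFS
points at the local file system, while in pseudo-distributed mode it
points at HDFS daemons on localhost.

import org.apache.hadoop.conf.Configuration;

public class ModeConfigSketch {
    public static void main(String[] args) {
        // Standalone: the out-of-the-box default; jobs run in one JVM
        // against the local file system.
        Configuration standalone = new Configuration();
        standalone.set("fs.defaultFS", "file:///");

        // Pseudo-distributed: every daemon is a separate Java process on
        // one machine; port 9000 is a common choice, not a requirement.
        Configuration pseudo = new Configuration();
        pseudo.set("fs.defaultFS", "hdfs://localhost:9000");

        System.out.println(standalone.get("fs.defaultFS")); // file:///
        System.out.println(pseudo.get("fs.defaultFS"));     // hdfs://localhost:9000
    }
}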
Installing Hadoop in standalone mode
• There are no daemons running; everything runs in a single JVM.
• Standalone mode is suitable for running MapReduce programs during
development, since it is easy to test and debug them.
Installing Hadoop in pseudo-distributed mode
Step 1: Setting up Hadoop
Step 2: Hadoop configuration
Verifying the Hadoop installation
• Step 1: Namenode setup
• Step 2: Verifying Hadoop DFS
• Step 3: Verifying the YARN script
• Step 4: Accessing Hadoop in a browser
• Step 5: Verifying all applications for the cluster
[Screenshot: the Hadoop web interface in a browser]
Who uses Hadoop
Amazon
Facebook
Last.fm
The New York Times
Google
IBM
Yahoo
Twitter
LinkedIn
…and the list goes on.
Queries
Thank you