
1

Certified Big Data & Hadoop Training – DataFlair
Hadoop Tutorial

2

Agenda
• Introduction to Hadoop
• Hadoop nodes & daemons
• Hadoop Architecture
• Characteristics
• Hadoop Features

3

What is Hadoop?
The technology that powers Yahoo, Facebook, Twitter, Walmart, and others
Hadoop

4

What is Hadoop?
An open-source framework that allows distributed processing of large data sets across clusters of commodity hardware

5

What is Hadoop?
Open Source
• Source code is freely available
• It may be redistributed and modified

6

What is Hadoop?
Distributed Processing
• Data is processed in a distributed manner on multiple nodes / servers
• Multiple machines process the data independently

7

What is Hadoop?
Cluster
• Multiple machines connected together
• Nodes are connected via LAN

8

What is Hadoop?
Commodity Hardware
• Economical / affordable machines
• Typically lower-performance hardware

9

What is Hadoop?
• Open-source framework written in Java
• Inspired by Google's MapReduce programming model and its file system (GFS)
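
The MapReduce model mentioned above can be illustrated with a small word count. This is a minimal single-process sketch in plain Python, not Hadoop's Java API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

In a real cluster the map and reduce steps would run on different machines; here they run in one process only to show the data flow.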

10

Hadoop History
Timeline: 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009
• Doug Cutting started working on […]
• [Google] published GFS & MapReduce papers
• Doug Cutting added DFS & MapReduce in […]
• Development of [Hadoop] started as Lucene sub-project
• […] converted 4TB of image archives over 100 EC2 instances
• Hadoop became top-level project
• Hadoop defeated supercomputer
• […] launched Hive, SQL support for Hadoop
• Doug Cutting joined Cloudera

11

Hadoop Components
Hadoop consists of three key parts

12

Hadoop Nodes
• Master Node
• Slave Node

13

Hadoop Daemons
• Master Node: ResourceManager, NameNode
• Slave Node: NodeManager, DataNode

14

Basic Hadoop Architecture
[Figure: one unit of Work divided into many Sub Work units]
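
The idea of dividing work into sub-work can be sketched as follows. This is an illustrative Python example that uses local threads as stand-ins for cluster nodes; `split_work`, `process_sub_work`, and `run_job` are hypothetical names, and summing numbers is only a placeholder for real processing.

```python
from concurrent.futures import ThreadPoolExecutor


def split_work(items, parts):
    """Divide items into at most `parts` contiguous chunks of near-equal size."""
    if not items:
        return []
    size = -(-len(items) // parts)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_sub_work(chunk):
    """Process one chunk independently of the others (here: sum it)."""
    return sum(chunk)


def run_job(items, workers=4):
    """Split the work, process each chunk in parallel, then combine the results."""
    chunks = split_work(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(process_sub_work, chunks))
    return sum(partial_results)


if __name__ == "__main__":
    print(run_job(list(range(1, 101))))
```

Each chunk is processed without reference to the others, which is what allows the sub-work to be spread over many machines.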

15

Hadoop Characteristics

16

Open Source
• Source code is freely available
• Can be redistributed
• Can be modified
[Figure: Open Source: Free, Affordable, Community, Transparent, Interoperable, No vendor lock-in]

17

Distributed Processing
• Data is processed in a distributed manner across the cluster
• Multiple nodes in the cluster process data independently
[Figure: Centralized Processing vs. Distributed Processing]

18

Fault Tolerance
• Node failures are recovered automatically
• The framework handles failures of hardware as well as tasks
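
One way to picture automatic recovery is a task that is retried on another node when the first node fails. The sketch below is a simplified Python illustration, not Hadoop's actual scheduler logic; `NodeFailure`, `run_with_retry`, and the `run_on` callback are assumptions made for this example.

```python
class NodeFailure(Exception):
    """Raised by the (hypothetical) runner when a node cannot finish a task."""


def run_with_retry(task, nodes, run_on):
    """Try the task on each node in turn until one succeeds.

    Returns the task result together with the list of nodes that failed.
    """
    failed = []
    for node in nodes:
        try:
            return run_on(node, task), failed
        except NodeFailure:
            failed.append(node)  # remember the failure and move to the next node
    raise RuntimeError(f"task {task!r} failed on all nodes: {failed}")


if __name__ == "__main__":
    def flaky_runner(node, task):
        # Simulate a broken first node.
        if node == "node-1":
            raise NodeFailure(node)
        return f"{task} done on {node}"

    print(run_with_retry("task-1", ["node-1", "node-2"], flaky_runner))
```

The caller never handles the failure itself; the retry loop stands in for the framework.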

19

Reliability
• Data is reliably stored on the
cluster of machines despite
machine failures
• Failure of nodes doesn’t
cause data loss
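
Replication is what lets data survive node failures. The following Python sketch assumes a replication factor of 3 (the HDFS default) and a simple round-robin placement; real HDFS placement is rack-aware and more involved, and `place_replicas` and `surviving_blocks` are hypothetical helper names.

```python
def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for index, block in enumerate(block_ids):
        placement[block] = [
            nodes[(index + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


def surviving_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    }


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]
    placement = place_replicas(["b0", "b1", "b2", "b3"], nodes)
    print(placement)
    print(sorted(surviving_blocks(placement, ["n1", "n2"])))
```

With three replicas per block, any two nodes can fail without losing a block.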

20

High Availability
• Data is highly available and
accessible despite hardware
failure
• There will be no downtime for
end user application due to
data

21

Scalability
• Vertical Scalability – New
hardware can be added to the
nodes
• Horizontal Scalability – New
nodes can be added on the fly

22

Economic
• No need to purchase costly licenses
• No need to purchase costly hardware
Open Source + Commodity Hardware = Economic

23

Easy to Use
• Distributed computing challenges are handled by the framework
• Clients need to concentrate only on business logic

24

Data Locality
• Move computation to data instead of data to computation
• Data is processed on the nodes where it is stored
[Figure: Storage Servers holding Data and App Servers running the Algorithm, contrasted with Servers where each Data block has its own Algo copy]
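
Data locality can be sketched as a scheduling preference: run each task on a node that already stores the block, and fall back to a remote node only when none is free. This Python example is a simplified illustration, not YARN's scheduler; the `schedule` function and its inputs are hypothetical.

```python
def schedule(block_locations, free_nodes):
    """Pick a node for each block, preferring a free node that stores it.

    block_locations maps a block id to the list of nodes holding a replica.
    Returns a dict of block id -> (chosen node, "local" or "remote").
    """
    if not free_nodes:
        raise ValueError("no free nodes available")
    fallback = sorted(free_nodes)[0]  # deterministic choice for remote reads
    assignments = {}
    for block, holders in block_locations.items():
        local = [node for node in holders if node in free_nodes]
        if local:
            assignments[block] = (local[0], "local")
        else:
            assignments[block] = (fallback, "remote")
    return assignments


if __name__ == "__main__":
    locations = {"b0": ["n1", "n2"], "b1": ["n3"]}
    print(schedule(locations, {"n2", "n4"}))
```

A "local" assignment means the computation moves to the data; a "remote" one means the data would have to cross the network.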

25

Summary
• Every day we generate 2.3 trillion GBs of data
• Hadoop handles huge volumes of data efficiently
• Hadoop uses the power of distributed computing
• HDFS & YARN are the two main components of Hadoop
• It is highly fault-tolerant, reliable & available

26

Thank You
DataFlair
/c/DataFlairWS /DataFlairWS
