Hadoop Training in Bangalore

Presented By,
KELLY TECHNOLOGIES
WWW.KELLYTECHNO.COM

1. Introduction: Hadoop’s history and
advantages
2. Architecture in detail
3. Hadoop in industry

• Hadoop is an open-source framework written in Java
and developed under the Apache Software
Foundation.
• The framework is used to write software
applications that must process very large
amounts of data (it can handle multiple
terabytes).

Doug Cutting
2005: Doug Cutting and Michael J. Cafarella developed
Hadoop to support distribution for the Nutch search
engine project.
The project was funded by Yahoo.
2006: Yahoo gave the project to the Apache
Software Foundation.

• 2008 - Hadoop wins the Terabyte Sort Benchmark (sorted 1 terabyte of
data in 209 seconds, compared to the previous record of 297 seconds)
• 2009 - Avro and Chukwa became new members of the Hadoop
framework family
• 2010 - Hadoop's HBase, Hive and Pig subprojects completed, adding
more computational power to the Hadoop framework
• 2011 - ZooKeeper completed
• 2013 - Hadoop 1.1.2 and Hadoop 2.0.3 alpha released;
Ambari, Cassandra and Mahout added

• Hadoop:
• an open-source software framework that supports data-
intensive distributed applications, licensed under the
Apache v2 license.
• Goals / Requirements:
• Abstract and facilitate the storage and processing of
large and/or rapidly growing data sets
• Structured and unstructured data
• Simple programming models
• High scalability and availability
• Use commodity (cheap!) hardware with little redundancy
• Fault-tolerance
• Move computation rather than data

• Distributed, with some centralization
• Main nodes of the cluster are where most of the computational power
and storage of the system lie
• Main nodes run a TaskTracker to accept and reply to MapReduce
tasks, and also a DataNode to store the needed blocks as closely as
possible
• Central control node runs the NameNode to keep track of HDFS
directories & files, and the JobTracker to dispatch compute tasks to
the TaskTrackers
• Written in Java; also supports Python and Ruby
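
The idea of dispatching tasks to nodes that already hold the data ("move computation rather than data") can be sketched as below. This is only an illustrative sketch, not the JobTracker's actual scheduler: the function name pick_tracker and the rack_of mapping are hypothetical, and the three-level preference (same node, same rack, anywhere) is an assumption about how locality is ranked.

```python
def pick_tracker(block_locations, free_trackers, rack_of):
    """Pick a TaskTracker for a map task, preferring nodes close to the data.

    block_locations: nodes whose DataNode holds a replica of the input block
    free_trackers:   nodes whose TaskTracker has a free slot
    rack_of:         hypothetical mapping of node name -> rack name
    """
    # 1. Best case: a free node that already stores the block.
    for node in free_trackers:
        if node in block_locations:
            return node, "node-local"
    # 2. Next best: a free node in the same rack as one of the replicas.
    replica_racks = {rack_of[n] for n in block_locations}
    for node in free_trackers:
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    # 3. Otherwise any free node; the block must travel over the network.
    if free_trackers:
        return free_trackers[0], "off-rack"
    return None, "no free slot"


rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r3"}
print(pick_tracker(["n1", "n3"], ["n2", "n3"], rack_of))
```

In the usage line, n3 is chosen because it both has a free slot and stores a replica of the block.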

• Hadoop Distributed Filesystem
• Tailored to needs of MapReduce
• Targeted towards many reads of filestreams
• Writes are more costly
• High degree of data replication (3x by default)
• No need for RAID on normal nodes
• Large blocksize (64MB)
• Location awareness of DataNodes in network
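
To make the block size and replication figures above concrete, here is a small arithmetic sketch. The function name hdfs_footprint is hypothetical; the 64 MB block size and 3x replication are the values quoted on the slide and may differ on a real cluster.

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted on the slide
REPLICATION = 3      # default replication factor quoted on the slide


def hdfs_footprint(file_size_mb):
    """Return (block count, raw MB stored across the cluster) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # The last block only occupies as much disk as it actually holds,
    # so raw usage is the file size times the replication factor.
    raw_mb = file_size_mb * REPLICATION
    return blocks, raw_mb


print(hdfs_footprint(1000))  # (16, 3000)
```

A 1000 MB file is therefore split into 16 blocks and consumes about 3000 MB of raw disk across the cluster.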

NameNode:
• Stores metadata for the files, like the directory structure of a
typical FS.
• The server holding the NameNode instance is quite crucial,
as there is only one.
• Transaction log for file deletes/adds, etc. Does not use
transactions for whole blocks or file-streams, only metadata.
• Handles creation of more replica blocks when necessary
after a DataNode failure

DataNode:
• Stores the actual data in HDFS
• Can run on any underlying filesystem (ext3/4, NTFS, etc)
• Notifies NameNode of what blocks it has
• NameNode replicates blocks 2x in local rack, 1x elsewhere

MapReduce Engine:
• JobTracker & TaskTracker
• JobTracker splits the data into smaller tasks (“Map”) and
sends them to the TaskTracker process on each node
• TaskTracker reports back to the JobTracker node on job
progress, sends data (“Reduce”) or requests
new jobs
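
The Map and Reduce steps can be illustrated with a single-process word count. This is a sketch of the programming model only, assuming everything fits in memory; it does not use Hadoop's API, and the function names (map_phase, shuffle, reduce_phase) are chosen for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```

On a real cluster, each call to map_phase would run as a separate task near its input block, and the grouping step would move data between nodes.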

• None of these components are necessarily limited to using
HDFS
• Many other distributed file-systems with quite different
architectures work
• Many other software packages besides Hadoop's
MapReduce platform make use of HDFS

• Hadoop is in use at many organizations that handle big data:
o Yahoo!
o Facebook
o Amazon
o Netflix
o Etc…
• Some examples of scale:
o Yahoo!’s Search Webmap runs on a 10,000-core Linux
cluster and powers Yahoo! Web search
o FB’s Hadoop cluster hosts 100+ PB of data (July, 2012)
& growing at ½ PB/day (Nov, 2012)

Three main applications of Hadoop:
• Advertisement (mining user behavior to generate
recommendations)
• Searches (grouping related documents)
• Security (searching for uncommon patterns)

• Non-realtime large dataset computing:
o NY Times was dynamically generating PDFs of articles
from 1851-1922
o Wanted to pre-generate & statically serve articles to
improve performance
o Using Hadoop + MapReduce running on EC2 / S3,
converted 4TB of TIFFs into 11 million PDF articles in
24 hrs

• System requirements
o High write throughput
o Cheap, elastic storage
o Low latency
o High consistency (within a single data center is good enough)
o Disk-efficient sequential and random read performance

• Classic alternatives
o These requirements typically met using large MySQL cluster &
caching tiers using Memcached
o Content on HDFS could be loaded into MySQL or Memcached
if needed by web tier
• Problems with previous solutions
o MySQL has low random write throughput… BIG problem for
messaging!
o Difficult to scale MySQL clusters rapidly while maintaining
performance
o MySQL clusters have high management overhead, require
more expensive hardware

• Facebook’s solution
o Hadoop + HBase as foundations
o Improve & adapt HDFS and HBase to scale to FB’s workload
and operational considerations
▪ Major concern was availability: the NameNode is a SPOF &
failover times are at least 20 minutes
▪ Proprietary “AvatarNode”: eliminates the SPOF, makes HDFS
safe to deploy even with a 24/7 uptime requirement
▪ Performance improvements for realtime workload: RPC
timeouts, to rather fail fast and try a different DataNode

• Distributed File System
• Fault Tolerance
• Open Data Format
• Flexible Schema
• Queryable Database

• Need to process multi-petabyte datasets
• Data may not have a strict schema
• Expensive to build reliability into each
application
• Nodes fail every day
• Need common infrastructure

• Very large distributed file system
• Assumes commodity hardware
• Optimized for batch processing
• Runs on heterogeneous operating systems

• A block server
o Stores data in the local file system
o Stores metadata of a block (checksum)
o Serves data and metadata to clients
• Block report
o Periodically sends a report of all existing
blocks to the NameNode
• Facilitates pipelining of data
o Forwards data to other specified
DataNodes
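
Block reports are what let the NameNode notice missing replicas after a DataNode failure. The sketch below shows one way that bookkeeping could look; the function name under_replicated and the dictionary layout are hypothetical and are not taken from the Hadoop source.

```python
from collections import defaultdict


def under_replicated(block_reports, target=3):
    """Return {block_id: missing replica count} from DataNode block reports.

    block_reports: {datanode_name: set of block ids it reported}
    """
    holders = defaultdict(set)
    for node, blocks in block_reports.items():
        for block in blocks:
            holders[block].add(node)
    return {b: target - len(nodes)
            for b, nodes in holders.items() if len(nodes) < target}


reports = {
    "dn1": {"blk_1", "blk_2"},
    "dn2": {"blk_1", "blk_2"},
    "dn3": {"blk_1"},  # a fourth node that also held blk_2 has failed
}
print(under_replicated(reports))  # {'blk_2': 1}
```

Here blk_2 is reported by only two nodes, so one more replica would need to be created.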

• Replication strategy
o One replica on the local node
o Second replica on a remote rack
o Third replica on the same remote rack
o Additional replicas are randomly placed
• Clients read from the nearest replica
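
The placement rules for the first three replicas can be sketched as follows. This is a simplified illustration under stated assumptions: the writer is assumed to run a DataNode itself, the nodes_by_rack topology is hypothetical, and at least one remote rack is assumed to have two or more nodes.

```python
import random


def place_replicas(writer, nodes_by_rack, rng=random):
    """Choose three DataNodes following the placement rules listed above.

    writer:        node the client is writing from (assumed to run a DataNode)
    nodes_by_rack: hypothetical {rack_name: [node, ...]} topology
    """
    local_rack = next(r for r, ns in nodes_by_rack.items() if writer in ns)
    # Replicas 2 and 3 share one remote rack, so it needs at least two nodes.
    remote_racks = [r for r, ns in nodes_by_rack.items()
                    if r != local_rack and len(ns) >= 2]
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(nodes_by_rack[remote_rack], 2)
    return [writer, second, third]


topology = {"rack1": ["a", "b"], "rack2": ["c", "d", "e"], "rack3": ["f"]}
print(place_replicas("a", topology))
```

With this topology the second and third replicas always land on two different nodes of rack2, since rack3 has only one node.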

• Use checksums (CRC32) to validate data
• File creation
o Client computes a checksum per 512 bytes
o DataNode stores the checksum
• File access
o Client retrieves the data and checksum from the DataNode
o If validation fails, the client tries other replicas
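
The checksum scheme can be sketched with Python's standard zlib.crc32. This is an illustration of the idea only: the helper names checksums and read_verified are hypothetical, and replicas are modeled as plain byte strings rather than DataNode reads.

```python
import zlib

CHUNK = 512  # bytes per checksum, as stated on the slide


def checksums(data):
    """Compute one CRC32 per 512-byte chunk of data."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]


def read_verified(replicas, expected):
    """Return the first replica whose checksums match, else raise IOError."""
    for data in replicas:
        if checksums(data) == expected:
            return data
    raise IOError("all replicas failed checksum validation")


original = bytes(range(256)) * 5      # 1280 bytes, so 3 chunks
stored = checksums(original)          # kept alongside the block
corrupt = b"\xff" + original[1:]      # first replica has a damaged byte
recovered = read_verified([corrupt, original], stored)
print(recovered == original)  # True
```

The damaged replica fails validation on its first chunk, so the read falls through to the second replica.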

• Client retrieves a list of DataNodes on which to
place replicas of a block
• Client writes the block to the first DataNode
• The first DataNode forwards the data to the
next DataNode in the pipeline
• When all replicas are written, the client moves
on to write the next block in the file
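
The write pipeline described above can be simulated in a few lines. This is a sketch, not HDFS client code: choose_pipeline is a hypothetical stand-in for asking the NameNode for target DataNodes, and the storage dictionary stands in for the DataNodes' disks.

```python
def write_block(block, pipeline, storage):
    """Write one block to the first DataNode and forward it down the pipeline.

    pipeline: ordered list of DataNode names for this block
    storage:  hypothetical {datanode_name: [blocks]} standing in for disks
    """
    if not pipeline:
        return
    head, rest = pipeline[0], pipeline[1:]
    storage.setdefault(head, []).append(block)
    # The current DataNode, not the client, forwards to the next one.
    write_block(block, rest, storage)


def write_file(blocks, choose_pipeline):
    """Write blocks one at a time; each block gets its own pipeline."""
    storage = {}
    for index, block in enumerate(blocks):
        write_block(block, choose_pipeline(index), storage)
    return storage


pipelines = [["dn1", "dn2", "dn3"], ["dn2", "dn3", "dn4"]]
layout = write_file([b"b0", b"b1"], lambda i: pipelines[i])
print(layout)
```

Each block ends up on all three nodes of its own pipeline, and the second block is only started after the first has been fully written.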

• Log processing
• Web search indexing
• Ad-hoc queries

• MapReduce components
o JobClient
o JobTracker
o TaskTracker
o Child
• Job creation/execution process

THANK
YOU!!!
www.Kellytechno.com
