Apache Hadoop
#117
CONTENTS INCLUDE:
Introduction
Apache Hadoop
Hadoop Quick Reference
Hadoop Quick How-To
Staying Current
Hot Tips and more...
Apache Hadoop
By Eugene Ciurana and Masoud Kalali
INTRODUCTION
www.dzone.com
What Is MapReduce?
MapReduce refers to a framework that runs on a computational
cluster to mine large datasets. The name derives from the
application of map() and reduce() functions repurposed from
functional programming languages.
APACHE HADOOP
A parallelized operation performed on all splits yields
the same results as if it were executed against the larger
dataset before turning it into splits.
Implementations separate business logic from multiprocessing logic
MapReduce framework developers focus on process
dispatching, locking, and logic flow
[Figure: Hadoop components: MapReduce, Pig, ZooKeeper, HBase, HDFS, Hive, Chukwa]
App developers focus on implementing the business logic
without worrying about infrastructure or scalability issues
Implementation patterns
The Map(k1, v1) -> list(k2, v2) function is applied to every
item in the split. It produces a list of (k2, v2) pairs for each call.
The framework groups all the results with the same key
together in a new split.
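The map and grouping steps described above can be sketched in plain Java. This is only an illustration: ordinary collections stand in for the framework, and the class name MapGroupSketch, its methods, and the choice of (line number, line text) as the input pair are assumptions made for the example, not part of the Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: plain Java collections stand in for the framework.
class MapGroupSketch {

    // Map(k1, v1) -> list(k2, v2): k1 is a line number, v1 the line text,
    // and each output pair is (word, line number).
    static List<Map.Entry<String, Integer>> map(int lineNumber, String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, lineNumber));
            }
        }
        return pairs;
    }

    // Grouping step: collect every value that was emitted under the same key.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        all.addAll(map(1, "to be or not"));
        all.addAll(map(2, "to be"));
        // prints {be=[1, 2], not=[1], or=[1], to=[1, 2]}
        System.out.println(group(all));
    }
}
```

In a real cluster the map calls run in parallel on different splits and the framework performs the grouping; the sketch runs both steps sequentially in a single JVM.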
ZooKeeper - a distributed application management tool
for configuration, event synchronization, naming, and
group services used for managing the nodes in a Hadoop
computational network.
Components
Mode                 Usage
Local (default)      Development, test, debug
Pseudodistributed    Development, test, debug
Distributed          Staging, production
Essentials
HDFS - a scalable, high-performance distributed file
system. It stores its data blocks on top of the native file
system. HDFS is designed for consistency; commits aren't
considered complete until data is written to at least two
different configurable volumes. HDFS presents a single
view of multiple physical disks or file systems.
MapReduce - a Java-based job tracking, node
management, and application container for mappers and
reducers written in Java or in any scripting language that
supports STDIN and STDOUT for job interaction.
Frameworks
Chukwa - a data collection system for monitoring, displaying,
and analyzing logs from large distributed systems.
NameNode: manages HDFS and communicates with every
DataNode daemon in the cluster
JobTracker: dispatches jobs and assigns splits to
mappers or reducers as each stage completes
Hive - structured data warehousing infrastructure that
provides mechanisms for storage, data extraction,
transformation, and loading (ETL), and a SQL-like
language for querying and analysis.
TaskTracker: executes tasks sent by the JobTracker and
reports status
HBase - a column-oriented (NoSQL) database designed for
real-time storage, retrieval, and search of very large tables
(billions of rows/millions of columns) running atop HDFS.
DataNode: manages HDFS content in the node and
reports its status to the NameNode
These daemons execute in the three distinct processing
layers of a Hadoop cluster: master (Name Node), slaves (Data
Nodes), and user applications.
Utilities
Pig - a set of tools for programmatic flat-file data
analysis that provides a programming language, data
transformation, and parallelized processing.
Sqoop - a tool for moving data between relational
databases and Hadoop or Hive, in either direction,
using MapReduce tools and standard JDBC drivers.
DZone, Inc.
Manages lists of files, lists of blocks in each file, lists of
blocks per node, and file attributes and other metadata
hadoop-env.sh - environmental configuration,
JVM configuration, logging, master and slave
configuration files
Tracks HDFS file creation and deletion operations in an
activity log
Depending on system load, the NameNode and JobTracker
daemons may run on separate computers.
hdfs-site.xml - HDFS block size, Name and Data
node directories
mapred-site.xml - total MapReduce tasks,
JobTracker address
masters, slaves files - NameNode, JobTracker,
DataNodes, and TaskTrackers addresses, as appropriate
Send data blocks to other nodes as required by the
NameNode
User Applications
Dispatch mappers and reducers to the Name Node for
execution in the Hadoop cluster
http://namenode.server.name:50070/
http://jobtracker.server.name:50030/
Execute implementation contracts for mappers and reducers
written in Java or in scripting languages
Usage
hadoop [--config confdir] [COMMAND] [GENERIC_OPTIONS] [COMMAND_OPTIONS]
Hadoop Installation
Hadoop can parse generic options and run classes from the
command line. confdir can override the default
$HADOOP_HOME/conf directory.
Generic Options
Ensure that Java 6 is installed and that both ssh and sshd
are running on all nodes
-D <property=value>    Set a property
Get the most recent, stable release from
http://hadoop.apache.org/common/releases.html
-fs <local|namenode:port>    Specify a namenode
-jt <local|jobtracker:port>
$HADOOP_HOME/bin/hadoop
User Commands
Create an archive
distcp hdfs://node1:8020/dir_a hdfs://node2:8020/dir_b
queue -list
Administrator Commands
balancer -threshold 50
Fetch http://host/logLevel?log=name
datanode
jobtracker
namenode -format
namenode -regular
namenode -upgrade
namenode -finalize
KeyValueTextInputFormat - Each line represents a key
and value delimited by a separator; if the separator is
missing, the whole line is the key and the value is empty
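A minimal sketch of how a key/value line might be split, assuming a tab separator. The class KeyValueLineSketch and its parse method are hypothetical names for illustration and are not Hadoop classes. The sketch also assumes that a line without a separator becomes a key with an empty value, which is how Hadoop's key/value record reader is documented to behave; check the API documentation before relying on it.

```java
import java.util.Map;

// Illustrative only: not the Hadoop implementation.
class KeyValueLineSketch {

    // Splits the line at the first separator; everything after it is the value.
    static Map.Entry<String, String> parse(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            // Assumption: no separator means the whole line is the key.
            return Map.entry(line, "");
        }
        return Map.entry(line.substring(0, pos), line.substring(pos + 1));
    }

    public static void main(String[] args) {
        Map.Entry<String, String> kv = parse("doubt\thamlet@111141", '\t');
        System.out.println("key=" + kv.getKey() + " value=" + kv.getValue());
    }
}
```

Only the first separator splits the line, so a value may itself contain further separator characters.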
HDFS Shell
du /var/data1 hdfs://node/data2
lsr
cat hdfs://node/file
count hdfs://node/data
Permissions
expunge
cp, mv, rm
mkdir hdfs://node/path
setrep -R -w 3
Output Formats
The output formats have a 1:1 correspondence with the
input formats and types. The complete list is available from:
http://hadoop.apache.org/common/docs/current/api
Job Driver
job.waitForCompletion(true);
}
} // Driver
The number represents the line in which the text occurred. The
mapper and reducer/combiner implementations in this section
require the documentation from:
http://hadoop.apache.org/mapreduce/docs/current/api
The Mapper
The basic Java code implementation for the mapper has the
form:
The streaming API is intended for users with very limited Java
knowledge and interacts with any code that supports STDIN
and STDOUT streaming. Java is considered the best choice for
heavy-duty jobs. Development speed could be a reason for
using the streaming API instead. Some scripting languages may
work as well as or better than Java in specific problem domains.
This section shows how to implement the same mapper and
reducer using awk and compares its performance against
Java's.
The Mapper
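#!/usr/bin/gawk -f
# $1 holds the index entry; the remaining fields are the words.
{
    for (n = 2; n <= NF; n++) {
        # Strip punctuation and double dashes from the record.
        gsub("[,:;)(|!\\[\\]\\.\\?]|--", "");
        # Emit: word, tab, index entry.
        if (length($n) > 0) printf("%s\t%s\n", $n, $1);
    }
}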
The output is mapped with the key, a tab separator, then the
index occurrence.
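For comparison, the same mapping logic can be sketched in plain Java as a program that reads STDIN and writes STDOUT. This is an illustration under assumptions, not the Java mapper from this card: the class name StreamingMapperSketch is hypothetical, and the input is assumed to carry the index entry in the first whitespace-separated field, followed by the words.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mirrors the gawk mapper, without any Hadoop classes.
class StreamingMapperSketch {

    // Same characters the gawk script strips: , : ; ) ( | ! [ ] . ? and "--".
    private static final String PUNCTUATION = "[,:;)(|!\\[\\].?]|--";

    // Returns one "word<TAB>index" string per word on the line.
    static List<String> map(String line) {
        List<String> out = new ArrayList<>();
        String[] fields = line.replaceAll(PUNCTUATION, "").trim().split("\\s+");
        for (int n = 1; n < fields.length; n++) {
            if (!fields[n].isEmpty()) {
                out.add(fields[n] + "\t" + fields[0]);
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String pair : map(line)) {
                System.out.println(pair);
            }
        }
    }
}
```

As in the gawk version, removing "--" joins the words on either side of it, so the two implementations produce the same keys.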
The Reducer/Combiner
The combiner is an output handler for the mapper to reduce
the total data transferred over the network. It can be thought
of as a reducer on the local node.
The Reducer
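#!/usr/bin/gawk -f
# Append each index entry ($2) to the comma-separated list for its word ($1).
{ wordsList[$1] = ($1 in wordsList) ?
      sprintf("%s,%s", wordsList[$1], $2) : $2; }
# After all input is read, print one line per word.
END {
    for (key in wordsList)
        printf("%s\t%s\n", key, wordsList[key]);
}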
The output is a list of all entries for a given word, like in the
previous section:
doubt\thamlet@111141,romeoandjuliet@23445,henryv@426917
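The aggregation performed by the reducer can be sketched in plain Java as follows. The class name StreamingReducerSketch and the sample input lines are assumptions for illustration. A TreeMap is used so that the output order is predictable, whereas the gawk array iterates in no particular order.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: mirrors the gawk reducer, without any Hadoop classes.
class StreamingReducerSketch {

    // Input lines have the form "word<TAB>index"; the result maps each word
    // to a comma-separated list of every index entry seen for it.
    static Map<String, String> reduce(List<String> lines) {
        Map<String, String> wordsList = new TreeMap<>();
        for (String line : lines) {
            String[] kv = line.split("\t", 2);
            if (kv.length < 2) {
                continue; // skip lines without a tab separator
            }
            wordsList.merge(kv[0], kv[1], (old, next) -> old + "," + next);
        }
        return wordsList;
    }

    public static void main(String[] args) {
        List<String> input = List.of(
                "doubt\thamlet@111141",
                "be\thamlet@11142",
                "doubt\tromeoandjuliet@23445");
        reduce(input).forEach((word, entries) ->
                System.out.println(word + "\t" + entries));
    }
}
```

Because the function only concatenates values per key, the same logic could also serve as a combiner on the local node before data crosses the network.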
Job Driver
A complete index shows the line where each word occurs, and
the file/work where it occurred.
Performance Tradeoff
STAYING CURRENT
ABOUT THE AUTHOR
http://eugeneciurana.com/scalablesystems
RECOMMENDED BOOKS
books.dzone.com/books/hadoop-definitive-guide
DZone, Inc.
140 Preston Executive Dr.
Suite 100
Cary, NC 27513
ISBN-13: 978-1-934238-75-2
ISBN-10: 1-934238-75-9
888.678.0399
919.678.0300
Refcardz Feedback Welcome
refcardz@dzone.com
Sponsorship Opportunities
sales@dzone.com
$7.95
Copyright 2010 DZone, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical,
photocopying, or otherwise, without prior written permission of the publisher.
Version 1.0