Hadoop and Pig are tools for analyzing large datasets. Hadoop uses MapReduce and HDFS for distributed processing and storage. Pig provides a high-level language for expressing data analysis jobs that are compiled into MapReduce programs. Common tasks like joins, filters, and grouping are built into Pig for easier programming compared to lower-level MapReduce.
2. Outline of the presentation
Hadoop
Motivations, what it is, and high-level concepts
The ecosystem, the MapReduce model & framework, and HDFS
Programming with Hadoop
Pig
What is it? Motivations
Model & components
Integration with Cassandra
4. Traditional HPC systems
CPU-intensive computations
Relatively small amount of data
Tightly-coupled applications
Highly concurrent I/O requirements
Complex message passing paradigms such as MPI,
PVM…
Developers might need to spend some time
designing for failure
5. Challenges
Data and storage
Locality, computation close to the data
In large-scale systems, nodes fail
Mean time between failures: ~3 years for one node, so a 1000-node cluster loses roughly one node per day
Built-in fault-tolerance
Distributed programming is complex
Need a simple data-parallel programming model. Users would
structure the application in high-level functions, the system
distributes the data & jobs and handles communications and faults
6. Requirements
A simple data-parallel programming model, designed for
high scalability and resiliency
Scalability to large-scale data volumes
Automated fault-tolerance at application level rather
than relying on high-availability hardware
Simplified I/O and tasks monitoring
All based on cost-efficient commodity machines (cheap,
but unreliable), and commodity network
7. Hadoop’s core concepts
Data is spread across nodes in advance, persistent (in terms of
locality), and replicated
No inter-dependencies / shared-nothing architecture
Applications are written as two pieces of code: map and reduce
Developers do not have to worry about the
underlying issues of networking, job interdependencies,
scheduling, etc…
8. Where does it come from?
Hadoop originated from Apache Nutch, an open source
web search engine
After the GFS and MapReduce papers were published
in 2003 & 2004, the Nutch developers decided to
implement open-source versions
In February 2006, it became Hadoop, with a dedicated
team at Yahoo!
September 2007 - release 0.14.1
The latest release, 1.0.3, came out last week
Used by a large number of companies including Facebook,
LinkedIn, Twitter, and Hulu, among many others
9. The model
A map function processes a key/value pair to generate a set of
intermediate key/value pairs, dividing the problem into smaller parts
The reduce function merges all intermediate values associated with
the same intermediate key
The run-time system takes care of:
Partitioning the input data across nodes (blocks/chunks, typically
64 MB to 128 MB)
Scheduling the data and execution; each map operates on a single block
Managing node failures, replication, re-submissions…
10. Simple Word Count
# key: offset, value: line
def mapper(key, value):
    for word in value.split():
        output(word, 1)

# key: a word, values: iterator over counts
def reducer(key, values):
    output(key, sum(values))
11. The Combiner
A combiner is a local aggregation function for repeated keys
produced by the map
Works for associative functions like sum, count, max
Decreases the size of intermediate data / communications
map-side aggregation for word count:
def combiner(key, values):
    output(key, sum(values))
12. Some other basic examples…
Distributed Grep:
Map function emits a line if it matches a supplied pattern
Reduce function is an identity function that copies the supplied intermediate
data to the output
Count of URL accesses:
Map function processes logs of web page requests and outputs <URL, 1>
Reduce function adds together all values for the same URL, emitting <URL, total
count> pairs
Reverse Web-Link graph:
Map function outputs <tgt, src> for each link to a tgt in a page named src
Reduce concatenates the list of all src URLs associated with a given tgt URL and
emits the pair <tgt, list(src)>
Inverted Index:
Map function parses each document, emitting a sequence of <word, doc_ID>
Reduce accepts all pairs for a given word and emits a <word, list(doc_ID)> pair
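As a sketch in the same pseudocode style as the word count above, distributed grep might look like this (the pattern is assumed to be made available to each mapper, e.g. via the job configuration):

# key: offset, value: line — emit the line if it matches the supplied pattern
def mapper(key, line):
    if pattern in line:
        output(line, "")

# identity reduce: copy each matching line through to the output
def reducer(key, values):
    output(key, "")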
13. Hadoop Ecosystem
Core components. From http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/
14. Hadoop components
Hadoop consists of two core components
The MapReduce framework, and
The Hadoop Distributed File System
MapReduce layer
JobTracker
TaskTrackers
HDFS layer
Namenode
Secondary namenode
Datanode
Example of a typical physical distribution within a Hadoop cluster
15. HDFS
Scalable and fault-tolerant. Based on Google's GFS
Single namenode stores metadata (file names, block locations, etc.)
Files split into chunks, replicated across several datanodes (typically 3+). It is rack-aware
Optimised for large files and sequential streaming reads, rather than random access
Files written once, no append
[Diagram: the namenode maps File1 to blocks 1-4, which are replicated across the datanodes]
16. HDFS
HDFS API / HDFS FS Shell for command line*
> hadoop fs -copyFromLocal local_dir hdfs_dir
> hadoop fs -copyToLocal hdfs_dir local_dir
Tools
Flume: collects, aggregates and move log data from application
servers to HDFS
Sqoop: HDFS import and export to SQL
*http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html
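A few more everyday FS shell commands (paths are illustrative):
> hadoop fs -ls hdfs_dir
> hadoop fs -cat hdfs_file
> hadoop fs -rm hdfs_file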
17. MapReduce execution
In Hadoop, a Job (full program) is a set of tasks
Each task (mapper or reducer) is attempted at least once, or
multiple times if it crashes. Multiple attempts may also occur
in parallel
The tasks run inside a separate JVM on the tasktracker
All the class files are assembled into a jar file, which is
uploaded into HDFS before the jobtracker is notified
19. Getting Started…
Multiple choices - Vanilla Apache version, or one of the
numerous existing distros
hadoop.apache.org
www.cloudera.com [A set of VMs is also provided]
http://www.karmasphere.com/
…
Three ways to write jobs in Hadoop:
Java API
Hadoop Streaming (for Python, Perl, etc.)
Pipes API (C++)
20. Word Count in Java
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setMapperClass(MapClass.class);
conf.setCombinerClass(ReduceClass.class);
conf.setReducerClass(ReduceClass.class);
FileInputFormat.setInputPaths(conf, args[0]);
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
JobClient.runJob(conf);
}
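Packaged into a jar, the job is then submitted from the command line (the jar name is illustrative; the class name follows this example):
> hadoop jar wordcount.jar WordCount input_dir output_dir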
21. Word Count in Java – mapper
public class MapClass extends MapReduceBase
implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable ONE = new IntWritable(1);
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> out,
Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens()) {
out.collect(new Text(itr.nextToken()), ONE);
}
}
}
22. Word Count in Java – reducer
public class ReduceClass extends MapReduceBase
implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> out,
Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
out.collect(key, new IntWritable(sum));
}
}
23. Getting keys and values
[Diagram: Input file → InputFormat → Input splits → RecordReader → Mapper → Reducer → OutputFormat → RecordWriter → Output files]
24. Hadoop Streaming
Mapper.py:
#!/usr/bin/env python
import sys
for line in sys.stdin:
    for word in line.split():
        print "%s\t%s" % (word, 1)

Reducer.py:
#!/usr/bin/env python
import sys
counts = {}
for line in sys.stdin:
    word, count = line.split("\t", 1)
    counts[word] = counts.get(word, 0) + int(count)
for word, count in counts.items():
    print "%s\t%s" % (word.lower(), count)
You can locally test your code on the command line:
$> cat data | mapper | sort | reducer
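A sketch of launching such a streaming job (the streaming jar's path and name vary across Hadoop versions and distros):
> hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input input_dir -output output_dir \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py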
25. High-level tools
MapReduce is fairly low-level: must think about
keys, values, partitioning, etc.
Expressing parallel algorithms as a series of
MapReduce jobs can be hard
Common job building blocks are hard to capture
Different use cases require different tools!
26. Pig
Apache Pig is a platform that raises the level of abstraction for
processing large datasets. Its language, Pig Latin, is a simple
query algebra for expressing data transformations and applying
functions to records
[Diagram: Pig compiles scripts into MapReduce jobs and submits them to Hadoop / HDFS]
Started at Yahoo! Research, >60% of Hadoop jobs within
Yahoo! are Pig jobs
27. Motivations
MapReduce requires a Java programmer
Solution was to abstract it and create a system where users are
familiar with scripting languages
For all but very trivial applications, MapReduce requires
multiple stages, leading to long development cycles
Rapid prototyping. Increased productivity
In MapReduce users have to reinvent common functionality
(join, filter, etc.)
Pig provides them
28. Used for
Rapid prototyping of algorithms for processing large datasets
Log analysis
Ad hoc queries across various large datasets
Analytics (including through sampling)
PigMix provides a set of performance and scalability
benchmarks; Pig currently runs at about 1.1 times the execution time of raw MapReduce
29. Using Pig
Grunt, the Pig shell
Executing scripts directly
Embedding Pig in Java (using PigServer, similar to SQL
using JDBC), or Python
A range of tools including Eclipse plug-ins
PigPen, Pig Editor…
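A minimal sketch of the PigServer route (the script lines reuse the schema example from the slides below; error handling omitted):

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPig {
    public static void main(String[] args) throws Exception {
        // ExecType.LOCAL runs in-process; ExecType.MAPREDUCE submits to a cluster
        PigServer pig = new PigServer(ExecType.LOCAL);
        pig.registerQuery("records = LOAD 'input/data' AS (id:int, date:chararray);");
        pig.registerQuery("filtered = FILTER records BY id == 234;");
        pig.store("filtered", "output");  // triggers execution and stores the result
    }
}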
30. Execution modes
Pig has two execution types or modes: local mode and
Hadoop mode
Local mode
Pig runs in a single JVM and accesses the local filesystem.
Starting from v0.7 it uses the Hadoop job runner.
Hadoop mode
Pig runs on a Hadoop cluster (you need to tell Pig about the
version and point it to your namenode and jobtracker)
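From the command line, the mode is selected with the -x flag (script name illustrative):
> pig -x local script.pig
> pig script.pig
The second form defaults to Hadoop (MapReduce) mode.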
31. Running Pig
Pig resides on the user’s machine and can be independent
from the Hadoop cluster
Pig is written in Java and is portable
Pig compiles scripts into MapReduce jobs and submits them to the cluster
No need to install anything extra on the cluster
[Diagram: a Pig client, outside the cluster, submitting jobs]
32. How does it work
Pig defines a DAG: a step-by-step set of operations, each
performing a transformation
Pig defines a logical plan for these transformations, then:
• Parses, checks, & optimises
• Plans the execution
• Maps & Reduces
• Passes the jar to Hadoop
• Monitors the progress

A = LOAD 'file' AS (line);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) AS word;
C = GROUP B BY word;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO 'output';
33. Data types & expressions
Scalar types:
Int, Long, Float, Double, Chararray, Bytearray
Complex type representing nested structures:
Tuple: sequence of fields of any type
Bag: an unordered collection of tuples
Map: a set of key-value pairs. Keys must be atoms, values may
be any type
Expressions:
used in Pig as a part of a statement; field name, position ($),
arithmetic, conditional, comparison, Boolean, etc.
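For instance, a LOAD schema mixing scalar and complex types (field names are illustrative):
grunt> A = LOAD 'data' AS (name:chararray, visits:bag{visit:tuple(url:chararray, ts:long)}, props:map[]);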
34. Functions
Load / Store
Data loaders: PigStorage, BinStorage, BinaryStorage,
TextLoader, PigDump
Evaluation
Many built-in functions MAX, COUNT, SUM, DIFF, SIZE…
Filter
A special type of eval function used by the FILTER operator.
IsEmpty is a built-in function
Comparison
Function used in ORDER statement; ASC | DESC
35. Schemas
Schemas enable you to associate names and types with the
fields in a relation
Schemas are optional but recommended whenever
possible; type declarations result in better parse-time error
checking and more efficient code execution
They are defined using the AS keyword with operators
Schema definition for simple data types:
> records = LOAD 'input/data' AS (id:int, date:chararray);
36. Statements and aliases
Each statement, defining a data processing operator /
relation, produces a dataset with an alias
grunt> records = LOAD 'input/data' AS (id:int, date:chararray);
LOAD returns a relation of tuples, whose fields can be referenced
by position or by name
Very useful operators are DUMP, ILLUSTRATE, and DESCRIBE
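For example (output shape approximate):
grunt> DESCRIBE records;
records: {id: int, date: chararray}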
37. Filtering data
Filter is used to work with tuples, or rows, of data
Select data you want, or remove the data you are not
interested in
Filtering early in the processing pipeline minimises the
amount of data flowing through the system, which can
improve efficiency
grunt> filtered_records = FILTER records BY id == 234;
38. Foreach .. Generate
Foreach .. Generate acts on the columns of every row in a
relation
grunt> ids = FOREACH records GENERATE id;
Positional reference. This statement has the same output
grunt> ids = FOREACH records GENERATE $0;
The elements of ‘ids’ however are not named ‘id’ unless
you add ‘AS id’ at the end of your statement
grunt> ids = FOREACH records GENERATE $0 AS id;
39. Grouping and joining
Group .. by collects tuples sharing a grouping key into an
output bag with the same schema
Join performs an inner equijoin of two or more relations
based on common field values.
You can also perform outer joins using keywords left,
right and full
Cogroup is similar to Group, using multiple relations, and
creates a nested set of output tuples
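A few illustrative one-liners (relations and fields are hypothetical):
grunt> grouped = GROUP records BY id;
grunt> joined = JOIN users BY name, pages BY user;
grunt> left_joined = JOIN users BY name LEFT OUTER, pages BY user;
grunt> cogrouped = COGROUP users BY name, pages BY user;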
40. Ordering, combining, splitting…
Order imposes an order on the output, sorting a relation
by one or more fields
The Limit statement limits the number of results
Split partitions a relation into two or more relations
The Sample operator selects a random data sample with
the stated sample size
The Union operator merges the contents of two or more
relations
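Again as illustrative one-liners over hypothetical relations:
grunt> sorted = ORDER records BY date DESC;
grunt> top10 = LIMIT sorted 10;
grunt> SPLIT records INTO recent IF id > 100, older IF id <= 100;
grunt> sampled = SAMPLE records 0.1;
grunt> merged = UNION records, more_records;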
41. Stream
The Stream operator lets you transform the data in a
relation using an external program or script
grunt> C = STREAM A THROUGH `cut -f 2`;
extracts the second field of A using cut
Scripts are shipped to the cluster using DEFINE .. SHIP:
grunt> DEFINE script `script.py` SHIP ('script.py');
grunt> D = STREAM C THROUGH script AS (…);
42. User defined functions
Pig supports user-defined functions (UDFs), backed by an active community
UDFs can encapsulate the user's processing logic for filtering,
comparison, evaluation, grouping, or storage
filter functions for instance are all subclasses of
FilterFunc, which itself is a subclass of EvalFunc
PiggyBank: the Pig community sharing their UDFs
DataFu: LinkedIn's collection of Pig UDFs
43. A simple eval UDF example
package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class UPPER extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row", e);
        }
    }
}
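Using it from Grunt (jar name illustrative):
grunt> REGISTER myudfs.jar;
grunt> B = FOREACH A GENERATE myudfs.UPPER(name);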
44. An Example
Let's find the top 5 most visited pages by users aged 18-25.
Input: user data file, and page view data file.
Flow: Load Users & Load Pages → Filter by age → Join on name → Group on url → Count clicks → Order by clicks → Take top 5
45. A simple script
Users = LOAD 'users' AS (name, age);
Filtered = FILTER Users BY age >= 18 AND age <= 25;
Pages = LOAD 'pages' AS (user, url);
Joined = JOIN Filtered BY name, Pages BY user;
Grouped = GROUP Joined BY url;
Summed = FOREACH Grouped GENERATE group, COUNT(Joined) AS clicks;
Sorted = ORDER Summed BY clicks DESC;
Top5 = LIMIT Sorted 5;
STORE Top5 INTO 'top5sites';
46. In MapReduce!
[Slide shows the same example written directly against the MapReduce API: roughly 170 lines of Java, with mapper/reducer classes (LoadPages, LoadAndFilterUsers, Join, LoadJoined, ReduceUrls, LoadClicks, LimitClicks) wired together and sequenced through JobConf and JobControl.]
47. Ease of Translation
Load Users → Users = LOAD …
Load Pages → Pages = LOAD …
Filter by age → Filtered = FILTER …
Join on name → Joined = JOIN …
Group on url → Grouped = GROUP …
Count clicks → Summed = … COUNT() …
Order by clicks → Sorted = ORDER …
Take top 5 → Top5 = LIMIT …
48. The Hadoop/Pig/Cassandra stack
Cassandra has gained some significant integration points
with Hadoop and its analytics tools
To achieve Hadoop's data locality, Cassandra nodes must be part
of the Hadoop cluster by running a tasktracker process. The
namenode and jobtracker, however, can reside outside of the
Cassandra cluster
[Diagram: a three-node Cassandra/Hadoop cluster with external namenode / jobtracker]
49. Hadoop jobs
Cassandra has a Java source package for Hadoop integration
org.apache.cassandra.hadoop
ColumnFamilyInputFormat extends InputFormat
ColumnFamilyOutputFormat extends OutputFormat
ConfigHelper: a helper class to configure Cassandra-specific
information
Hadoop output streaming was introduced in 0.7 but removed
from 0.8
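A sketch of a job driver reading a column family as input; the keyspace/column family names are illustrative, and the exact set of ConfigHelper setters varies across Cassandra versions:

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraInputSketch {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "read-from-cassandra");
        // Feed rows of Keyspace1/Standard1 (illustrative names) to the mappers
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "Keyspace1", "Standard1");
        // Connection details (initial address, rpc port, partitioner) are also set
        // via ConfigHelper; mapper/reducer and output format are set as usual.
    }
}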
50. Pig alongside Cassandra
The Pig integration CassandraStorage() (a LoadFunc
implementation) allows Pig to Load/Store data from/to
Cassandra
grunt> rows = LOAD 'cassandra://Keyspace/cf' USING CassandraStorage();
The pig_cassandra script, shipped with the Cassandra source,
performs the necessary initialisation (Pig environment
variables still need to be set)
Pygmalion is a set of scripts and UDFs to facilitate the use of Pig
alongside Cassandra
51. Workflow
A workflow system provides an infrastructure to set up &
manage a sequence of interdependent jobs / set of jobs
The Hadoop ecosystem includes a set of workflow tools that
run applications built from MapReduce processes or the
high-level languages
Cascading (http://www.cascading.org/). A Java library defining data
processing workflows and rendering them to MapReduce jobs
Oozie (http://yahoo.github.com/oozie/)
52. Some links
http://hadoop.apache.org
http://pig.apache.org/
https://cwiki.apache.org/confluence/display/PIG/Index
PiggyBank: https://cwiki.apache.org/confluence/display/PIG/PiggyBank
DataFu: https://github.com/linkedin/datafu
Pygmalion: https://github.com/jeromatron/pygmalion
http://code.google.com/edu/parallel/mapreduce-tutorial.html
Video tutorials from Cloudera: http://www.cloudera.com/hadoop-training
Interesting papers:
http://bit.ly/rskJho - Original MapReduce paper
http://bit.ly/KvFXxT - Pig paper: ‘Building a High-Level Dataflow System on top of
MapReduce: The Pig Experience’
53. A simple data flow
Load checkins data → Keep only the two ids → Group by user/loc id & order → Limit to top 50 → Top 50 users / locations
[same script, different group key]
54. Another data flow
Load checkins data → Split_date → Group by date / group by weeks using Stream → Count the tuples → All the checkins, over weeks