Hadoop and Pig are tools for analyzing large datasets. Hadoop uses MapReduce and HDFS for distributed processing and storage. Pig provides a high-level language for expressing data analysis jobs that are compiled into MapReduce programs. Common tasks like joins, filters, and grouping are built into Pig for easier programming compared to lower-level MapReduce.
2. Outline of the presentation
Hadoop
Motivations, what it is, and high-level concepts
The ecosystem, the MapReduce model & framework, and HDFS
Programming with Hadoop
Pig
What is it? Motivations
Model & components
Integration with Cassandra
4. Traditional HPC systems
CPU-intensive computations
Relatively small amount of data
Tightly-coupled applications
Highly concurrent I/O requirements
Complex message passing paradigms such as MPI,
PVM…
Developers might need to spend some time
designing for failure
5. Challenges
Data and storage
Locality, computation close to the data
In large-scale systems, nodes fail
Mean time between failures: ~3 years for one node, so a 1000-node cluster loses roughly one node per day
Built-in fault-tolerance
Distributed programming is complex
Need a simple data-parallel programming model. Users would
structure the application in high-level functions, the system
distributes the data & jobs and handles communications and faults
6. Requirements
A simple data-parallel programming model, designed for
high scalability and resiliency
Scalability to large-scale data volumes
Automated fault-tolerance at application level rather
than relying on high-availability hardware
Simplified I/O and tasks monitoring
All based on cost-efficient commodity machines (cheap,
but unreliable), and commodity network
7. Hadoop’s core concepts
Data is spread across nodes in advance, persistent (in terms of
locality), and replicated
No inter-dependencies / shared-nothing architecture
Applications are written as two pieces of code: map and reduce
Developers do not have to worry about the
underlying issues of networking, job interdependencies,
scheduling, etc…
8. Where does it come from?
Hadoop originated from Apache Nutch, an open source
web search engine
After the GFS and MapReduce papers were published
in 2003 & 2004, the Nutch developers decided to
implement open-source versions
In February 2006, it became Hadoop, with a dedicated
team at Yahoo!
September 2007 - release 0.14.1
The latest release, 1.0.3, came out last week
Used by a large number of companies including Facebook,
LinkedIn, Twitter, and Hulu, among many others
9. The model
A map function processes a key/value pair to generate a set of
intermediate key/value pairs, dividing the problem into smaller parts
The reduce function merges all intermediate values associated with
the same intermediate key
The run-time system takes care of:
Partitioning the input data across nodes (blocks/chunks, typically
64 MB to 128 MB)
Scheduling the data and execution; each map operates on a single block
Managing node failures, replication, re-submissions…
10. Simple Word Count
# key: offset, value: line
def mapper(key, value):
    for word in value.split():
        output(word, 1)

# key: a word, values: iterator over counts
def reducer(key, values):
    output(key, sum(values))
11. The Combiner
A combiner is a local aggregation function for repeated keys
produced by the map
Works for associative functions like sum, count, max
Decreases the size of intermediate data / communications
map-side aggregation for word count:
def combiner(key, values):
    output(key, sum(values))
12. Some other basic examples…
Distributed Grep:
Map function emits a line if it matches a supplied pattern
Reduce function is an identity function that copies the supplied intermediate
data to the output
Count of URL accesses:
Map function processes logs of web page requests and outputs <URL, 1>
Reduce function adds together all values for the same URL, emitting <URL, total
count> pairs
Reverse Web-Link graph:
Map function outputs <tgt, src> for each link to a tgt in a page named src
Reduce concatenates the list of all src URLs associated with a given tgt URL and
emits the pair <tgt, list(src)>
Inverted Index:
Map function parses each document, emitting a sequence of <word, doc_ID>
Reduce accepts all pairs for a given word and emits a <word, list(doc_ID)> pair
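As a sketch in the same pseudocode style as the word count above, distributed grep might look like this (the pattern is assumed to be made available to each mapper, e.g. via the job configuration):

# key: offset, value: line — emit the line if it matches the supplied pattern
def mapper(key, line):
    if pattern in line:
        output(line, "")

# identity reduce: copy each matching line through to the output
def reducer(key, values):
    output(key, "")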
13. Hadoop Ecosystem
Core components. From http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/
14. Hadoop components
Hadoop consists of two core components
The MapReduce framework, and
The Hadoop Distributed File System
MapReduce layer
JobTracker
TaskTrackers
HDFS layer
Namenode
Secondary namenode
Datanode
Example of a typical physical distribution within a Hadoop cluster
15. HDFS
Scalable and fault-tolerant. Based on Google's GFS
Single namenode stores metadata (file names, block locations, etc.)
Files split into chunks, replicated across several datanodes (typically 3+). It is rack-aware
Optimised for large files and sequential streaming reads, rather than random access
Files written once, no append
[Diagram: the namenode maps File1 to blocks 1-4, which are replicated across the datanodes]
16. HDFS
HDFS API / HDFS FS Shell for command line*
> hadoop fs -copyFromLocal local_dir hdfs_dir
> hadoop fs -copyToLocal hdfs_dir local_dir
Tools
Flume: collects, aggregates and move log data from application
servers to HDFS
Sqoop: HDFS import and export to SQL
*http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html
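A few more everyday FS shell commands (paths are illustrative):
> hadoop fs -ls hdfs_dir
> hadoop fs -cat hdfs_file
> hadoop fs -rm hdfs_file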
17. MapReduce execution
In Hadoop, a Job (full program) is a set of tasks
Each task (mapper or reducer) is attempted at least once, or
multiple times if it crashes. Multiple attempts may also occur
in parallel
The tasks run inside a separate JVM on the tasktracker
All the class files are assembled into a jar file, which is
uploaded into HDFS before the jobtracker is notified
19. Getting Started…
Multiple choices - Vanilla Apache version, or one of the
numerous existing distros
hadoop.apache.org
www.cloudera.com [A set of VMs is also provided]
http://www.karmasphere.com/
…
Three ways to write jobs in Hadoop:
Java API
Hadoop Streaming (for Python, Perl, etc.)
Pipes API (C++)
20. Word Count in Java
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setMapperClass(MapClass.class);
conf.setCombinerClass(ReduceClass.class);
conf.setReducerClass(ReduceClass.class);
FileInputFormat.setInputPaths(conf, args[0]);
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
JobClient.runJob(conf);
}
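Packaged into a jar, the job is then submitted from the command line (the jar name is illustrative; the class name follows this example):
> hadoop jar wordcount.jar WordCount input_dir output_dir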
21. Word Count in Java – mapper
public class MapClass extends MapReduceBase
implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable ONE = new IntWritable(1);
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> out,
Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens()) {
out.collect(new Text(itr.nextToken()), ONE);
}
}
}
22. Word Count in Java – reducer
public class ReduceClass extends MapReduceBase
implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> out,
Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
out.collect(key, new IntWritable(sum));
}
}
23. Getting keys and values
[Diagram: Input file → InputFormat → Input splits → RecordReader → Mapper → Reducer → OutputFormat → RecordWriter → Output files]
24. Hadoop Streaming
Mapper.py:
#!/usr/bin/env python
import sys
for line in sys.stdin:
    for word in line.split():
        print "%s\t%s" % (word, 1)

Reducer.py:
#!/usr/bin/env python
import sys
counts = {}
for line in sys.stdin:
    word, count = line.split("\t", 1)
    counts[word] = counts.get(word, 0) + int(count)
for word, count in counts.items():
    print "%s\t%s" % (word.lower(), count)
You can locally test your code on the command line:
$> cat data | mapper | sort | reducer
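A sketch of launching such a streaming job (the streaming jar's path and name vary across Hadoop versions and distros):
> hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input input_dir -output output_dir \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py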
25. High-level tools
MapReduce is fairly low-level: must think about
keys, values, partitioning, etc.
Expressing parallel algorithms as a series of
MapReduce jobs can be hard
Common job building blocks are hard to capture
Different use cases require different tools!
26. Pig
Apache Pig is a platform that raises the level of abstraction for
processing large datasets. Its language, Pig Latin, is a simple
query algebra for expressing data transformations and applying
functions to records
[Diagram: Pig compiles scripts into MapReduce jobs and submits them to Hadoop / HDFS]
Started at Yahoo! Research, >60% of Hadoop jobs within
Yahoo! are Pig jobs
27. Motivations
MapReduce requires a Java programmer
Solution was to abstract it and create a system where users are
familiar with scripting languages
For all but very trivial applications, MapReduce requires
multiple stages, leading to long development cycles
Rapid prototyping. Increased productivity
In MapReduce users have to reinvent common functionality
(join, filter, etc.)
Pig provides them
28. Used for
Rapid prototyping of algorithms for processing large datasets
Log analysis
Ad hoc queries across various large datasets
Analytics (including through sampling)
PigMix provides a set of performance and scalability
benchmarks; Pig currently runs at about 1.1 times the execution time of raw MapReduce
29. Using Pig
Grunt, the Pig shell
Executing scripts directly
Embedding Pig in Java (using PigServer, similar to SQL
using JDBC), or Python
A range of tools including Eclipse plug-ins
PigPen, Pig Editor…
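A minimal sketch of the PigServer route (the script lines reuse the schema example from the slides below; error handling omitted):

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPig {
    public static void main(String[] args) throws Exception {
        // ExecType.LOCAL runs in-process; ExecType.MAPREDUCE submits to a cluster
        PigServer pig = new PigServer(ExecType.LOCAL);
        pig.registerQuery("records = LOAD 'input/data' AS (id:int, date:chararray);");
        pig.registerQuery("filtered = FILTER records BY id == 234;");
        pig.store("filtered", "output");  // triggers execution and stores the result
    }
}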
30. Execution modes
Pig has two execution types or modes: local mode and
Hadoop mode
Local mode
Pig runs in a single JVM and accesses the local filesystem.
Starting from v0.7 it uses the Hadoop job runner.
Hadoop mode
Pig runs on a Hadoop cluster (you need to tell Pig about the
version and point it to your namenode and jobtracker)
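From the command line, the mode is selected with the -x flag (script name illustrative):
> pig -x local script.pig
> pig script.pig
The second form defaults to Hadoop (MapReduce) mode.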
31. Running Pig
Pig resides on the user’s machine and can be independent
from the Hadoop cluster
Pig is written in Java and is portable
Pig compiles scripts into MapReduce jobs and submits them to the cluster
No need to install anything extra on the cluster
[Diagram: a Pig client, outside the cluster, submitting jobs]
32. How does it work
Pig defines a DAG: a step-by-step set of operations, each
performing a transformation
Pig defines a logical plan for these transformations, then:
• Parses, checks, & optimises
• Plans the execution
• Maps & Reduces
• Passes the jar to Hadoop
• Monitors the progress

A = LOAD 'file' AS (line);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) AS word;
C = GROUP B BY word;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO 'output';
33. Data types & expressions
Scalar types:
Int, Long, Float, Double, Chararray, Bytearray
Complex type representing nested structures:
Tuple: sequence of fields of any type
Bag: an unordered collection of tuples
Map: a set of key-value pairs. Keys must be atoms, values may
be any type
Expressions:
used in Pig as a part of a statement; field name, position ($),
arithmetic, conditional, comparison, Boolean, etc.
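For instance, a LOAD schema mixing scalar and complex types (field names are illustrative):
grunt> A = LOAD 'data' AS (name:chararray, visits:bag{visit:tuple(url:chararray, ts:long)}, props:map[]);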
34. Functions
Load / Store
Data loaders: PigStorage, BinStorage, BinaryStorage,
TextLoader, PigDump
Evaluation
Many built-in functions MAX, COUNT, SUM, DIFF, SIZE…
Filter
A special type of eval function used by the FILTER operator.
IsEmpty is a built-in function
Comparison
Function used in ORDER statement; ASC | DESC
35. Schemas
Schemas enable you to associate names and types with the
fields in a relation
Schemas are optional but recommended whenever
possible; type declarations result in better parse-time error
checking and more efficient code execution
They are defined using the AS keyword with operators
Schema definition for simple data types:
> records = LOAD 'input/data' AS (id:int, date:chararray);
36. Statements and aliases
Each statement, defining a data processing operator /
relation, produces a dataset with an alias
grunt> records = LOAD 'input/data' AS (id:int, date:chararray);
LOAD returns a relation of tuples, whose fields can be referenced
by position or by name
Very useful operators are DUMP, ILLUSTRATE, and DESCRIBE
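For example (output shape approximate):
grunt> DESCRIBE records;
records: {id: int, date: chararray}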
37. Filtering data
Filter is used to work with tuples, or rows, of data
Select data you want, or remove the data you are not
interested in
Filtering early in the processing pipeline minimises the
amount of data flowing through the system, which can
improve efficiency
grunt> filtered_records = FILTER records BY id == 234;
38. Foreach .. Generate
Foreach .. Generate acts on the columns of every row in a
relation
grunt> ids = FOREACH records GENERATE id;
Positional reference. This statement has the same output
grunt> ids = FOREACH records GENERATE $0;
The elements of ‘ids’ however are not named ‘id’ unless
you add ‘AS id’ at the end of your statement
grunt> ids = FOREACH records GENERATE $0 AS id;
39. Grouping and joining
Group .. by collects tuples sharing a grouping key into an
output bag with the same schema
Join performs an inner equijoin of two or more relations
based on common field values.
You can also perform outer joins using keywords left,
right and full
Cogroup is similar to Group, using multiple relations, and
creates a nested set of output tuples
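A few illustrative one-liners (relations and fields are hypothetical):
grunt> grouped = GROUP records BY id;
grunt> joined = JOIN users BY name, pages BY user;
grunt> left_joined = JOIN users BY name LEFT OUTER, pages BY user;
grunt> cogrouped = COGROUP users BY name, pages BY user;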
40. Ordering, combining, splitting…
Order imposes an order on the output, sorting a relation
by one or more fields
The Limit statement limits the number of results
Split partitions a relation into two or more relations
The Sample operator selects a random data sample with
the stated sample size
The Union operator merges the contents of two or more
relations
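Again as illustrative one-liners over hypothetical relations:
grunt> sorted = ORDER records BY date DESC;
grunt> top10 = LIMIT sorted 10;
grunt> SPLIT records INTO recent IF id > 100, older IF id <= 100;
grunt> sampled = SAMPLE records 0.1;
grunt> merged = UNION records, more_records;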
41. Stream
The Stream operator lets you transform the data in a
relation using an external program or script
grunt> C = STREAM A THROUGH `cut -f 2`;
extracts the second field of A using cut
Scripts are shipped to the cluster using DEFINE .. SHIP:
grunt> DEFINE script `script.py` SHIP ('script.py');
grunt> D = STREAM C THROUGH script AS (…);
42. User defined functions
Pig supports user-defined functions (UDFs), backed by an active community
UDFs can encapsulate the user's processing logic for filtering,
comparison, evaluation, grouping, or storage
filter functions for instance are all subclasses of
FilterFunc, which itself is a subclass of EvalFunc
PiggyBank: the Pig community sharing their UDFs
DataFu: LinkedIn's collection of Pig UDFs
43. A simple eval UDF example
package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class UPPER extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row", e);
        }
    }
}
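Using it from Grunt (jar name illustrative):
grunt> REGISTER myudfs.jar;
grunt> B = FOREACH A GENERATE myudfs.UPPER(name);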
44. An Example
Let's find the top 5 most visited pages by users aged 18-25.
Input: user data file, and page view data file.
Flow: Load Users & Load Pages → Filter by age → Join on name → Group on url → Count clicks → Order by clicks → Take top 5
45. A simple script
Users = LOAD 'users' AS (name, age);
Filtered = FILTER Users BY age >= 18 AND age <= 25;
Pages = LOAD 'pages' AS (user, url);
Joined = JOIN Filtered BY name, Pages BY user;
Grouped = GROUP Joined BY url;
Summed = FOREACH Grouped GENERATE group, COUNT(Joined) AS clicks;
Sorted = ORDER Summed BY clicks DESC;
Top5 = LIMIT Sorted 5;
STORE Top5 INTO 'top5sites';
46. In MapReduce!
[Slide shows the same example written directly against the MapReduce API: roughly 170 lines of Java, with mapper/reducer classes (LoadPages, LoadAndFilterUsers, Join, LoadJoined, ReduceUrls, LoadClicks, LimitClicks) wired together and sequenced through JobConf and JobControl.]
47. Ease of Translation
Load Users → Users = LOAD …
Load Pages → Pages = LOAD …
Filter by age → Filtered = FILTER …
Join on name → Joined = JOIN …
Group on url → Grouped = GROUP …
Count clicks → Summed = … COUNT() …
Order by clicks → Sorted = ORDER …
Take top 5 → Top5 = LIMIT …
48. The Hadoop/Pig/Cassandra stack
Cassandra has gained some significant integration points
with Hadoop and its analytics tools
To achieve Hadoop's data locality, Cassandra nodes must be part
of the Hadoop cluster by running a tasktracker process. The
namenode and jobtracker, however, can reside outside of the
Cassandra cluster
[Diagram: a three-node Cassandra/Hadoop cluster with external namenode / jobtracker]
49. Hadoop jobs
Cassandra has a Java source package for Hadoop integration
org.apache.cassandra.hadoop
ColumnFamilyInputFormat extends InputFormat
ColumnFamilyOutputFormat extends OutputFormat
ConfigHelper: a helper class to configure Cassandra-specific
information
Hadoop output streaming was introduced in 0.7 but removed
from 0.8
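A sketch of a job driver reading a column family as input; the keyspace/column family names are illustrative, and the exact set of ConfigHelper setters varies across Cassandra versions:

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraInputSketch {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "read-from-cassandra");
        // Feed rows of Keyspace1/Standard1 (illustrative names) to the mappers
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "Keyspace1", "Standard1");
        // Connection details (initial address, rpc port, partitioner) are also set
        // via ConfigHelper; mapper/reducer and output format are set as usual.
    }
}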
50. Pig alongside Cassandra
The Pig integration CassandraStorage() (a LoadFunc
implementation) allows Pig to Load/Store data from/to
Cassandra
grunt> rows = LOAD 'cassandra://Keyspace/cf' USING CassandraStorage();
The pig_cassandra script, shipped with the Cassandra source,
performs the necessary initialisation (Pig environment
variables still need to be set)
Pygmalion is a set of scripts and UDFs to facilitate the use of Pig
alongside Cassandra
51. Workflow
A workflow system provides an infrastructure to set up &
manage a sequence of interdependent jobs / set of jobs
The Hadoop ecosystem includes a set of workflow tools that
run applications built from MapReduce processes or the
high-level languages
Cascading (http://www.cascading.org/). A Java library defining data
processing workflows and rendering them to MapReduce jobs
Oozie (http://yahoo.github.com/oozie/)
52. Some links
http://hadoop.apache.org
http://pig.apache.org/
https://cwiki.apache.org/confluence/display/PIG/Index
PiggyBank: https://cwiki.apache.org/confluence/display/PIG/PiggyBank
DataFu: https://github.com/linkedin/datafu
Pygmalion: https://github.com/jeromatron/pygmalion
http://code.google.com/edu/parallel/mapreduce-tutorial.html
Video tutorials from Cloudera: http://www.cloudera.com/hadoop-training
Interesting papers:
http://bit.ly/rskJho - Original MapReduce paper
http://bit.ly/KvFXxT - Pig paper: ‘Building a High-Level Dataflow System on top of
MapReduce: The Pig Experience’
53. A simple data flow
Load checkins data → Keep only the two ids → Group by user/loc id & order → Limit to top 50 → Top 50 users / locations
[same script, different group key]
54. Another data flow
Load checkins data → Split_date → Group by date / group by weeks using Stream → Count the tuples → All the checkins, over weeks