Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
SlideShare a Scribd company logo
The analytics stack
  Hadoop & Pig
Outline of the presentation

 Hadoop
  Motivations. What is it? And high-level concepts
  The Ecosystem. The MapReduce model & framework
   and HDFS
  Programming with Hadoop
 Pig
  What is it? Motivations
  Model & components
 Integration with Cassandra
                             2
Please interrupt and ask questions!




                  3
Traditional HPC systems

 CPU-intensive computations
   Relatively small amount of data
   Tightly-coupled applications
   Highly concurrent I/O requirements
   Complex message passing paradigms such as MPI,
    PVM…
   Developers might need to spend some time
    designing for failure
                        4
Challenges

 Data and storage
   Locality, computation close to the data

 In large-scale systems, nodes fail
   Mean time between failures: 1 node / 3 years, 1000 nodes / 1 day
   Built-in fault-tolerance

 Distributed programming is complex
   Need a simple data-parallel programming model. Users would
     structure the application in high-level functions, the system
     distributes the data & jobs and handles communications and faults

                                   5
What requirements

 A simple data-parallel programming model, designed for
  high scalability and resiliency
      Scalability to large-scale data volumes
      Automated fault-tolerance at application level rather
       than relying on high-availability hardware
      Simplified I/O and tasks monitoring
      All based on cost-efficient commodity machines (cheap,
       but unreliable), and commodity network


                              6
Hadoop’s core concepts

 Data spread in advance, persistent (in terms of
  locality), and replicated
 No inter-dependencies / shared nothing architecture
 Applications written in two pieces of code
   And developers do not have to worry about the
    underlying issues in networking, jobs interdependencies,
    scheduling, etc…


                             7
Where does it come from?

 Hadoop originated from Apache Nutch, an open source
  web search engine
 After the publications of the GFS and MapReduce papers,
  in 2003 & 2004, the Nutch developers decided to
  implement open source versions
 In February 2006, it became Hadoop, with a dedicated
  team at Yahoo!
 September 2007 - release 0.14.1
 Last release 1.0.3 out last week
 Used by a large number of companies including Facebook,
  LinkedIn, Twitter, hulu, among many others..
                            8
The model

 A map function processes a key/value pair to generate a set of
  intermediate key/value pairs
   Divides the problem into smaller ‘intermediate key/value’ pairs
 The reduce function merge all intermediate values associated with
  the same intermediate key
 Run-time system takes care of:
   Partitioning the input data across nodes (blocks/chunks typically of
    64Mb to 128Mb)
   Scheduling the data and execution. Maps operate on a single block.
   Manages node failures, replication, re-submissions..

                                   9
Simple Word Count
♯key: offset, value: line
def mapper():
   for line in open(“doc”):
       for word in line.split():
        output(word, 1)

♯key: a word, value: iterator over counts
def reducer():
    output(key, sum(value))


                              10
The Combiner

 A combiner is a local aggregation function for repeated keys
  produced by the map
 Works for associative functions like sum, count, max

 Decreases the size of intermediate data / communications

 map-side aggregation for word count:
         def combiner():
                 output(key, sum(values))


                               11
Some other basic examples…
 Distributed Grep:
   Map function emits a line if it matches a supplied pattern
   Reduce function is an identity function that copies the supplied intermediate
    data to the output
 Count of URL accesses:
   Map function processes logs of web page requests and outputs <URL, 1>
   Reduce function adds together all values for the same URL, emitting <URL, total
    count> pairs
 Reverse Web-Link graph:
   Map function outputs <tgt, src> for each link to a tgt in a page named src
   Reduce concatenates the list of all src URLS associated with a given tgt URL and
    emits the pair: <tgt, list(src)>
 Inverted Index:
   Map function parses each document, emitting a sequence of <word, doc_ID>
   Reduce accepts all pairs for a given word and emits a <word, list(doc_ID)> pair

                                         12
Hadoop Ecosystem




    Core      13

 components        from http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/
Hadoop components

 Hadoop consists of two core components
   The MapReduce framework, and
   The Hadoop Distributed File System

 MapReduce layer
   JobTracker
   TaskTrackers
 HDFS layer
   Namenode
   Secondary namenode
   Datanode
                                    Example of a typical physical distribution within a
                               14   Hadoop cluster
HDFS

 Scalable and fault-tolerant. Based on               Namenode
  Google’s GFS                                                    File1
                                                                   1
 Single namenode stores metadata (file                            2
                                                                   3
  names, block locations, etc.).                                   4

 Files split into chunks, replicated across
  several datanodes (typically 3+). It is rack-
  aware
 Optimised for large files, sequential
                                                  1    2    1     3
  streaming reads, rather than random             2    1    4     2
                                                  4    3    3     4
 Files written once, no append
                                                      Datanodes
                                       15
HDFS

  HDFS API / HDFS FS Shell for command line*
    > hadoop fs –copyFromLocal local_dir hdfs_dir
    > hadoop fs –copToLocal hdfs_dir local_dir


  Tools
     Flume: collects, aggregates and move log data from application
      servers to HDFS
     Sqoop: HDFS import and export to SQL

*http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html
                                            16
MapReduce execution

 In Hadoop, a Job (full program) is a set of tasks
 Each task (mapper or reducer) is attempted at least once, or
  multiple times if it crashes. Multiple attempts may also occur
  in parallel
 The tasks run inside a separate JVM on the tasktracker

 All the class files are assembled into a jar file, which will be
  uploaded into HDFS, before notifying the tasktracker


                                  17
MapReduce execution

MapReduce Job
                                                Master




     Split 0            Worker
     Split 1                                                                 Worker
                 read            Local write
     Split 2            Worker                                 Remote read
                                                                             Worker
     Split 3
     Split 4            Worker                                                        Output files

                                               Intermediate
   Input files                                 files locally

                                               18
Getting Started…

 Multiple choices - Vanilla Apache version, or one of the
  numerous existing distros
     hadoop.apache.org
     www.cloudera.com [A set of VMs is also provided]
     http://www.karmasphere.com/
     …

 Three ways to write jobs in Hadoop:
   Java API
   Hadoop Streaming (for Python, Perl, etc.)
   Pipes API (C++)

                                         19
Word Count in Java
public static void main(String[] args) throws Exception {
   JobConf conf = new JobConf(WordCount.class);
   conf.setJobName("wordcount");

    conf.setMapperClass(MapClass.class);
    conf.setCombinerClass(ReduceClass.class);
    conf.setReducerClass(ReduceClass.class);

    FileInputFormat.setInputPaths(conf, args[0]);
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
}                                 20
Word Count in Java – mapper

public class MapClass extends MapReduceBase
   implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable ONE = new IntWritable(1);

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> out,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        out.collect(new text(itr.nextToken()), ONE);
      }
    }
}
                              21
Word Count in Java – reducer


public class ReduceClass extends MapReduceBase
   implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> out,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      out.collect(key, new IntWritable(sum));
    }
}
                             22
Getting keys and values


                              Input file



                                                                                 Reducer        Reducer
Input Format




                Input split                 Input split




                                                               Output Format
                                                                               RecordWriter   RecordWriter
               RecordReader                RecordReader



                                                                                Output file    Output file
                 Mapper                      Mapper




                                                          23
Hadoop Streaming
   Mapper.py:     #!/usr/bin/env python

                  import sys
                  for line in sys.stdin:
                    for word in line.split():
                          print "%st%s" % (word, 1)

 Reducer.py:      #!/usr/bin/env python
                  import sys
                  dict={}
                  for line in sys.stdin:
                    word, count = line.split("t", 1)
                    dict[word] = dict.get(word, 0) + int(count)
                  counts = dict.items()
                  for word, count in counts:
                    print "%st%s" % (word.lower(), count)
You can locally test your code on the command line:
          $> cat data | mapper | sort | reducer
                                   24
High-level tools

 MapReduce is fairly low-level: must think about
  keys, values, partitioning, etc.
 How to express parallel algorithms by a series of
  MapReduce jobs
   Can be hard to capture common job building blocks

 Different use cases require different tools!

                             25
Pig

 Apache Pig is a platform raising a level of abstraction for
  processing large datasets. Its language, Pig Latin is a simple
  query algebra expressing data transformations and applying
  functions to records


        Pig
                           MapReduce jobs            Hadoop / HDFS
  job submission



 Started at Yahoo! Research, >60% of Hadoop jobs within
  Yahoo! are Pig jobs
                                26
Motivations

 MapReduce requires a Java programmer
   Solution was to abstract it and create a system where users are
    familiar with scripting languages

 Other than very trivial applications, MapReduce requires
  multiple stages, leading to long development cycles
   Rapid prototyping. Increased productivity

 In MapReduce users have to reinvent common functionality
  (join, filter, etc.)
   Pig provides them
                                 27
Used for

   Rapid prototyping of algorithms for processing large datasets
   Log analysis
   Ad hoc queries across various large datasets
   Analytics (including through sampling)

 Pig Mix provides a set of performance and scalability
  benchmarks. Currently 1.1 times MapReduce speed.


                                28
Using Pig

 Grunt, the Pig shell

 Executing scripts directly

 Embedding Pig in Java (using PigServer, similar to SQL
  using JDBC), or Python

 A range of tools including Eclipse plug-ins
   PigPen, Pig Editor…


                               29
Execution modes

 Pig has two execution types or modes: local mode and
  Hadoop mode

 Local
   Pig runs in a single JVM and accesses the local filesystem.
    Starting form v0.7 it uses the Hadoop job runner.
 Hadoop mode
   Pig runs on a Hadoop cluster (you need to tell Pig about the
    version and point it to your Namenode and Jobtracker


                                 30
Running Pig

 Pig resides on the user’s machine and can be independent
  from the Hadoop cluster
 Pig is written in Java and is portable
   Compiles into map reduce jobs and submit them to the cluster
 No need to install anything extra on the cluster




  Pig client
                                31
How does it work

  Pig defines a DAG. A step-by-step set of operations, each
   performing a transformation
  Pig defines a logical plan for these transformations:

A = LOAD ’file' as (line);
                                        • Parses, checks, & optimises
B = FOREACH A GENERATE
                                        • Plan the execution
FLATTEN(TOKENIZE(line)) AS word;
                                            • Maps & Reduces
C = GROUP B BY word;
                                        • Passes the jar to Hadoop
D = FOREACH C GENERATE group,
                                        • Monitor the progress
COUNT(words);
STORE D INTO ‘output’


                                   32
Data types & expressions

 Scalar type:
   Int, Long, Float, Double, Chararray, Bytearray
 Complex type representing nested structures:
   Tuple: sequence of fields of any type
   Bag: an unordered collection of tuples
   Map: a set of key-value pairs. Keys must be atoms, values may
    be any type
 Expressions:
   used in Pig as a part of a statement; field name, position ($),
    arithmetic, conditional, comparison, Boolean, etc.
                                 33
Functions

 Load / Store
   Data loaders; PigStorage, BinStorage, BinaryStorage,
    TextLoader, PigDump
 Evaluation
   Many built-in functions MAX, COUNT, SUM, DIFF, SIZE…
 Filter
   A special type of eval function used by the FILTER operator.
    IsEmpty is a built-in function
 Comparison
   Function used in ORDER statement; ASC | DESC

                                34
Schemas

 Schemas enable you to associate names and types of the
  fields in the relation
 Schemas are optional but recommended whenever
  possible; type declarations result in better parse-time error
  checking and more efficient code execution

 They are defined using the AS keyword with operators
   Schema definition for simple data types:
  > records = LOAD 'input/data' AS (id:int, date:chararray);


                                35
Statements and aliases

 Each statement, defining a data processing operator /
  relation, produces a dataset with an alias

grunt> records = LOAD 'input/data' AS (id:int, date:chararray);


 LOAD returns a tuple, which elements can be referenced by
  position or by name

 Very useful operators are DUMP, ILLUSTRATE, and DESCRIBE

                                36
Filtering data

 Filter is user to work with tuples and rows of data
 Select data you want, or remove the data you are not
  interested in

 Filtering early in the processing pipeline minimises the
  amount of data flowing through the system, which can
  improve efficiency

grunt> filtered_records = FILTER records BY id == 234;


                             37
Foreach .. Generate

 Foreach .. Generate acts on columns on every row in a
  relation
grunt> ids = FOREACH records GENERATE id;


 Positional reference. This statement has the same output
grunt> ids = FOREACH records GENERATE $0;


 The elements of ‘ids’ however are not named ‘id’ unless
  you add ‘AS id’ at the end of your statement
grunt> ids = FOREACH records GENERATE $0 AS id;
                              38
Grouping and joining

 Group .. by makes an output bag containing grouped
  fields with the same schema using a grouping key

 Join performs inner, equijoin of two or more relations
  based on common field values.

 You can also perform outer joins using keywords left,
  right and full

 Cogroup is similar to Group, using multiple relations, and
  creates a nested set of output tuples
                              39
Ordering, combining, splitting…

 Order imposes an order on the output to sort a relation
  by one or more fields
 The Limit statement limits the number of results
 Split partitions a relation into two or more relations
 the Sample operator selects a random data sample with
  the stated sample size
 the Union operator to merge the contents of two or more
  relations

                              40
Stream

 The Stream operator allows to transform data in a
  relation using an external program or script

grunt> C = STREAM A THROUGH `cut -f 2`;
   Extract the second field of A using cut

 The scripts are shipped to the cluster using
  grunt> DEFINE script `script.py` SHIP (‘script.py’);
  grunt> D = STREAM C THROUGH script AS (…);


                              41
User defined functions

 Support and a community of user-defined functions (UDFs)

 UDFs can encapsulate users processing logic in filtering,
  comparison, evaluation, grouping, or storage
   filter functions for instance are all subclasses of
    FilterFunc, which itself is a subclass of EvalFunc

 PiggyBank: the Pig community sharing their UDFs
 DataFu: Linkedin's collection of Pig UDFs


                                42
A simple eval UDF example

package myudfs;

import …

public class UPPER extends EvalFunc<String>
{
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0)
       return null;
    try{
       String str = (String) input.get(0);
       return str.toUpperCase();
    }catch(Exception e){
       throw WrappedIOException.wrap("Caught exception processing input row ", e);
    }
  }
}

                                          43
An Example

                                     Load Users                     Load Pages

Let’s find the top 5 most visited    Filter by age
  pages by users aged 18 – 25.
  Input: user data file, and page                     Join on name

  view data file.                                     Group on url

                                                      Count clicks

                                                     Order by clicks

                                                       Take top 5
                                44
A simple script

Users    = LOAD ‘users’ AS (name, age);
Filtered = FILTER Users BY
                  age >= 18 and age <= 25;
Pages    = LOAD ‘pages’ AS (user, url);
Joined   = JOIN Filtered BY name, Pages by user;
Grouped = GROUP Joined BY url;
Summed   = FOREACH Grouped GENERATE group,
                   count(Joined) AS clicks;
Sorted   = ORDER Summed BY clicks desc;
Top5     = LIMIT Sorted 5;

STORE Top5 INTO ‘top5sites’;

                         45
i
i
i
i
    m
    m
    m
    m
        p
        p
        p
        p

i m p o r t
            o
            o
            o
            o
                r
                r
                r
                r
                    t
                    t
                    t
                    t
                         j
                         j
                         j
                         j
                              a
                              a
                              a
                              a
                                  v
                                  v
                                  v
                                  v
                                      a
                                      a
                                      a
                                      a
                                          .
                                          .
                                          .
                                          .
                                              i
                                              u
                                              u
                                              u
                                                  o
                                                  t
                                                  t
                                                  t
                                                      .
                                                      i
                                                      i
                                                      i
                                                          I
                                                          l
                                                          l
                                                          l
                                                              O
                                                              .
                                                              .
                                                              .

              o r g . a p a c h e . h a d o o p . f s . P a t h ;
                                                                  E
                                                                  A
                                                                  I
                                                                  L
                                                                      x
                                                                      r
                                                                      t
                                                                      i
                                                                          c
                                                                          r
                                                                          e
                                                                          s
                                                                              e
                                                                              a
                                                                              r
                                                                              t
                                                                                  p t i o n ;
                                                                                  y L i s t ;
                                                                                  a t o r ;
                                                                                  ;                     / /
                                                                                                      f o r
                                                                                                              D o   t h e
                                                                                                              ( S t r i n g
                                                                                                                f o r
                                                                                                                           In MapReduce!
                                                                                                                            c r o s s
                                                                                                                              s 1   :
                                                                                                                        ( S t r i n g
                                                                                                                                        p r o d u c t
                                                                                                                                        f i r s t )
                                                                                                                                        s 2   :
                                                                                                                                                      {
                                                                                                                                                        a n

                                                                                                                                                  s e c o n
                                                                                                                                                                 }
                                                                                                                                                                         r e p o r t e r . s e t S t a t u s ( " O K " ) ;


                                                                                                                                                                                                                     d

                                                                                                                                                                                                                     d   )
                                                                                                                                                                                                                           c o l l e c t
                                                                                                                                                                                                                                           l p . s e t O u t p u t K e y C
                                                                                                                                                                                                                                           l p . s e t O u t p u t V a l u
                                                                                                                                                                                                                                           l p . s e t M a p p e r C l a s
                                                                                                                                                                                                                                           F i l e I n p u t F o r m a t .
                                                                                                                                                                                                                                           t h e   v a l u e s
                                                                                                                                                                                                                           P a t hu s e r / g a t e s / p a g e s " ) ) ;
                                                                                                                                                                                                                             {
                                                                                                                                                                                                                                   ( " /
                                                                                                                                                                                                                                           F i l e O u t p u t F o r m a t
                                                                                                                                                                                                                                                                             l
                                                                                                                                                                                                                                                                             e
                                                                                                                                                                                                                                                                             s
                                                                                                                                                                                                                                                                             a

                                                                                                                                                                                                                                                                             .
                                                                                                                                                                                                                                                                                 a
                                                                                                                                                                                                                                                                                 C
                                                                                                                                                                                                                                                                                 (
                                                                                                                                                                                                                                                                                 d

                                                                                                                                                                                                                                                                                 s
                                                                                                                                                                                                                                                                                     s
                                                                                                                                                                                                                                                                                     l
                                                                                                                                                                                                                                                                                     L
                                                                                                                                                                                                                                                                                     d
                                                                                                                                                                                                                                                                                         s
                                                                                                                                                                                                                                                                                         a
                                                                                                                                                                                                                                                                                         o
                                                                                                                                                                                                                                                                                         I
                                                                                                                                                                                                                                                                                             (
                                                                                                                                                                                                                                                                                             s
                                                                                                                                                                                                                                                                                             a
                                                                                                                                                                                                                                                                                             n

                                                                                                                                                                                                                                                                                     e t O u t
                                                                                                                                                                                                                                                                                                 T
                                                                                                                                                                                                                                                                                                 s
                                                                                                                                                                                                                                                                                                 d
                                                                                                                                                                                                                                                                                                 p
                                                                                                                                                                                                                                                                                                     e
                                                                                                                                                                                                                                                                                                     (
                                                                                                                                                                                                                                                                                                     P
                                                                                                                                                                                                                                                                                                     u
                                                                                                                                                                                                                                                                                                         x
                                                                                                                                                                                                                                                                                                         T
                                                                                                                                                                                                                                                                                                         a
                                                                                                                                                                                                                                                                                                         t
                                                                                                                                                                                                                                                                                                             t
                                                                                                                                                                                                                                                                                                             e
                                                                                                                                                                                                                                                                                                             g
                                                                                                                                                                                                                                                                                                             P
                                                                                                                                                                                                                                                                                                                 .
                                                                                                                                                                                                                                                                                                                 x
                                                                                                                                                                                                                                                                                                                 e
                                                                                                                                                                                                                                                                                                                 a
                                                                                                                                                                                                                                                                                                                     c
                                                                                                                                                                                                                                                                                                                     t
                                                                                                                                                                                                                                                                                                                     s
                                                                                                                                                                                                                                                                                                                     t

                                                                                                                                                                                                                                                                                                         p u t P a t h
                                                                                                                                                                                                                                                                                                                         l
                                                                                                                                                                                                                                                                                                                         .
                                                                                                                                                                                                                                                                                                                         .
                                                                                                                                                                                                                                                                                                                         h
                                                                                                                                                                                                                                                                                                                             a
                                                                                                                                                                                                                                                                                                                             c
                                                                                                                                                                                                                                                                                                                             c
                                                                                                                                                                                                                                                                                                                             (
                                                                                                                                                                                                                                                                                                                                 s
                                                                                                                                                                                                                                                                                                                                 l
                                                                                                                                                                                                                                                                                                                                 l
                                                                                                                                                                                                                                                                                                                                 l


i m p o r t   o r g . a p a c h e . h a d o o p . i o . L o n g W r i t a b l e ;                                       S t r i n g   o u t v a l   =   k e                                                          y     +   " , "   +   s 1   +n e w "P a t h ( " / u
                                                                                                                                                                                                                                                     " ,     +   s 2 ;       s   e   r / g a t           e s / t m p /
i m p o r t   o r g . a p a c h e . h a d o o p . i o . T e x t ;                                                       o c . c o l l e c t ( n u l l ,   n                                                          e   w   T e x t ( o ul p . s e t N u m R e d u c e T
                                                                                                                                                                                                                                           t v a l ) ) ;                     a   s   k s ( 0 )           ;
i m p o r t   o r g . a p a c h e . h a d o o p . i o . W r i t a b l e ;                                               r e p o r t e r . s e t S t a t u s                                                          (   " O K " ) ;       J o b   l o a d P a g e s   =     n   e   w   J o b           ( l p ) ;
i p o r t
 m           o r g . a p a c h e . h a d o o p . i o . W r i t a b l e C o m p a r a b l e ;        }
i m p o r t   o r g . a p a c h e . h a d o o p . m a p r e d . F i l e I n p u t F o r m a t ;         }                                                                                                                                 J o b C o n f   l f u   =   n e w   J o b C o n f ( M R E x
i m p o r t   o r g . a p a c h e . h a d o o p . m a p r e d . F i l e O u t p u t F o r m a t};                                                                                                                                      e t J o b N a m e ( " L o a d
                                                                                                                                                                                                                                          l f u . s                    a n d   F i l t e r   U s e r s
i m p o r t   o r g . a p a c h e . h a d o o p . m a p r e d . J o b C o n f ;         }                                                                                                                                                 l f u . s e t I n p u t F o r m a t ( T e x t I n p u t F o
i m p o r t   o r g . a p a c h e . h a d o o p . m a p r e d . K e y V a l u e T e x tp u b l i c os t a t i c
                                                                                        I n p u t F   r m a t ;     c l a s s   L o a d J o i n e d   e x t                                                           e n d s   M a p R el f u . s e t O u t p u t K e y C l a s s ( T e x t . c l a
                                                                                                                                                                                                                                          d u c e B a s e
i m p o r t po r g . a h a d o o p . m a p r e d . M a p p e r ;
              a c h e .                                                                         i m p l e m e n t s   M a p p e r < T e x t ,   T e x t ,                                                             T e x t ,   L o n gl f u . s e t O u t p u t V a l u e C l a s s ( T e x t . c
                                                                                                                                                                                                                                          W r i t a b l e >   {
i m p o r t   o r g . a p a c h e . h a d o o p . m a p r e d . M a p R e d u c e B a s e ;                                                                                                                                               l f u . s e t M a p p e r C l a s s ( L o a d A n d F i l t
i m p o r t   o r g . a p a c h e . h a d o o p . m a p r e d . O u t p u t C o l l e c t o r ;p u b l i c    v o i d   m a p (                                                                                                           F i l e I nI n p u t P a t h ( l f u ,
                                                                                                                                                                                                                                                      p u t F o r m a t . a d d     n e w
i m p o r t   o r g . a p a c h e . h a d o o p . m a p r e d . R e c o r d R e a d e r ;                       T e x t   k ,                                                                                             P a t h ( " / u s e r / g a t e s / u s e r s " ) ) ;
i m p o r t   o r g . a p a c h e . h a d o o p . m a p r e d . R e d u c e r ;                                 T e x t   v a l ,                                                                                                         F i l e O u t p u t F o r m a t . s e t O u t p u t P a t h
i m p o r t   o r g . a p a c h e . h a d o o p . m a p r e d . R e p o r t e r ;                           c t o r < T e x t , lL o n g W r i t a b l e >
                                                                                                                O u t p u t C o   l e                                                                                   o c ,                     n e w   P a t h ( " / u s e r / g a t e s / t m p /
i m p t
   o r     o r g . a p a c h e . h a d o o p . m a p r e d . S e q u e n c e F i l e I n p u t F o r m a t ;    R e p o r t e r   r e p o r t e r )   t h r                                                           o w s   I O E x c el f u . s e t N u m R e d u c e T a s k s ( 0 ) ;
                                                                                                                                                                                                                                          p t i o n   {
i m p o r t   o r g . a p a c h e . h a d o o p . m a p r e d . S e q u e n c e F i l e O u t p u t F o/ / aF i n d
                                                                                                        r m   t ;       t h e   u r l                                                                                                     J o b   l o a d U s e r s   =   n e w   J o b ( l f u ) ;
i m p o r t   o r g . a p a c h e . h a d o o p . m a p r e d . T e x t I n p u t F o r m a t ;         S t r i n g   l i n e   =   v a l . t o S t r i n g                                                           ( ) ;
i m p o r t   o r g . a p a c h e . h a d o o p . m a p r e d . j o b c o n t r o l . J o b ;           i n t   f i r s t C o m m a   =   l i n e . i n d e                                                           x O f ( ' , ' ) ;   J o b C o n f   j oM R E x a m p l e . c l a s s ) ;
                                                                                                                                                                                                                                                              i n   =   n e w   J o b C o n f (
i m p o r t   o r g . a p a c h e . h a d o o p . m a p r e d . j o b c o n t r o l . J o b C
                                                o n t r o l ;                                           i n t   s e c o n d C o m m a C= ml i n e . i n d
                                                                                                                                        o   m a ) ;                                                                   e x O f ( ' , ' ,   j o i n . s e t J o b N a m e ( " J o i n
                                                                                                                                                                                                                                          f i r s t                                   U s e r s   a n
i m p o r t   o r g . a p a c h e . h a d o o p . m a p r e d . l i b . I d e n t i t y M a p p e r ;   S t r i n g   k e y   =   l i n e . s u b s t r i n                                                           g ( f i r s t C o mj o i n . s e t I n p u t F o r m a t ( K e y V a l u e T e
                                                                                                                                                                                                                                          m a ,   s e c o n d C o m m a ) ;
                                                                                                        / /   d r o p   t h e   r e s t   o f   t h e   r e                                                           c o r d ,   I   d oj o i n . s e t O u t p u t K e y C l a s s ( T e x t . c l
                                                                                                                                                                                                                                          n ' t   n e e d   i t   a n y m o r e ,
p u b l i c   c l a s s   M R E x a m p l e   {                                                         / /   j u s t   p a s s   a   1   f o r   t h e   c                                                           o m b i n e r / r ej o i n . s e t O u t p u t V a l u e C l a s s ( T e x t .
                                                                                                                                                                                                                                          d u c e r   t o   s u m   i n s t e a d .
        p u b l i c   s t a t i c   c l a s s   L o a d P a g e s   e x t e n d s   M a p R e d u c e BT e x t
                                                                                                        a s e     o u t K e y   =   n e w   T e x t ( k e y                                                           ) ;                 j o i n . s e t M a p p e r C l a s s ( I d e n t i t y M a
                                                                                                                                                                                                                                                                  p e r . c l a s s ) ;
                i m p l e m e n t s   M a p p e r < L o n g W r i t a b l e ,   T e x t ,   T e x t ,o c . c o l l e c t ( o u t K e y ,
                                                                                                        T e x t >   {                       n e w   L o n g                                                           W r i t a b l e ( 1j o i n . s e t R e d u c e r C l a s s ( J o i n . c l a s
                                                                                                                                                                                                                                          L ) ) ;
                                                                                                }                                                                                                                                         F i l e I n p u t F o r m a t . a d d I n p u t P a t h ( j
                p u b l i c   v o i d   m a p ( L o n g W r i t a b l e   k ,   T e x t} v a l ,                                                                                                                          P a t h ( " / u s e r / g a t e s / t m p / i n d e x e d _ p a g e s " ) )
                             O u t p u t C o l l e c t o r < T e x t ,   T e x t >   o c , b l i c
                                                                                        p u           s t a t i c   c l a s s   R e d u c e U r l s   e x t                                                           e n d s   M a p R eF i l e I n p u t F o r m a t . a d d I n p u t P a t h ( j
                                                                                                                                                                                                                                          d u c e B a s e
                                R e p o r t e r   r e p o r t e r )   t h r o w s   I O E x c ei m p l e m e n t s
                                                                                                p t i o n   {         R e d u c e r < T e x t ,   L o n g W                                                           r iP a t h ( " / u s e r / g a t e s / t m p / f i l t e r e d _ u s e r s " )
                                                                                                                                                                                                                          t a b l e ,   W r i t a b l e C o m p a r a b l e ,
                        / /   P u l l   t h e   k e y   o u t                   W r i t a b l e >   {                                                                                                                                     F i l e O ut O u t p u t P a t h ( j o i n ,
                                                                                                                                                                                                                                                      t p u t F o r m a t . s e           n e w
                        S t r i n g   l i n e   =   v a l . t o S t r i n g ( ) ;                                                                                                                                         P a t h ( " / u s e r / g a t e s / t m p / j o i n e d " ) ) ;
                        i n t   f i r s t C o m m a   =   l i n e . i n d e x O f ( ' , ' ) ;   p u b l i c   v o i d   r e d u c e (                                                                                                     j o i n . s e t N u m R e d u c e T a s k s ( 5 0 ) ;
                        S t r i ns t r i n g ( 0 , if i r s t C o m m a ) ;
                                  g   k e y   =   l   n e .   u b                                      y ,      T e x t   k e                                                                                                             J o b   j o i n J o b   =   n e w   J o b ( j o i n ) ;
                        S t r i n g   v a l u e   =   l i n e . s u b s t r i n g ( f i r s t C o m m a   +   1I t e r a t o r < L o n g W r i t a b l e >
                                                                                                                ) ;                                                                                                     i t e r ,         j o i n J o b . a d d D e p e n d i n g J o b ( l o a d P a
                        T e x t   o u t K e y   =   n e w   T e x t ( k e y ) ;                                 O u t p u t C o l l e c t o r < W r i t a b                                                           l e C o m p a r a bj o i n J o b . a d d D e p e n d i n g J o b ( l o a d U s
                                                                                                                                                                                                                                          l e ,   W r i t a b l e >   o c ,
                        / /   P r e p e n d   a n   i n d e x   t o   t h e   v a l u e   s o   w e   k n o w   R e p o r t e r lr e p o r t e r )
                                                                                                                w h i c h   f i   e                   t h r                                                           o w s   I O E x c e p t i o n   {
                        / /   i t   c a m e   f r o m .                                                 / /   A d d   u p   a l l   t h e   v a l u e s   w                                                           e   s e e           J o b C o n f   g r o u p a= pn e w cJ o b C o n f ( M R
                                                                                                                                                                                                                                                                  x   m   l   .   l a s s ) ;
                        T e x t   o u t V a l v= ln e w ;T e x t ( " 1
                                      "   +     a   u   )                                                                                                                                                                                 g r o u p . s e t J o b N a m e ( " G r o u p   U R L s " )
                        o c . c o l l e c t ( o u t K e y ,   o u t V a l ) ;                           l o n g   s u m   =   0 ;                                                                                                         g r o u p . s e t I n p u t F o r m a t ( K e y V a l u e T
                }                                                                             i l e   (w h e r . h a s N e x t ( ) )
                                                                                                        i t                             {                                                                                                 g r o u p . s e t O u t p u t K e y C l a s s ( T e x t . c
        }                                                                                                       s u m   + =   i t e r . n e x t ( ) . g e t                                                           ( ) ;               g r o u p . s e t O u t p u t V a l u e C l a s s ( L o n g
        p u b l i c   s t a t i c   c l a s s   L o a d A n d F i l t e r U s e r s   e x t e n d s   M a p R er e p o r t e r . s e t S t a t u s ( " O K
                                                                                                                d u c e B a s e                                                                                       " ) ;               g r o u p . s e t O u t p u t F o r m a t ( S e q u e n c e
                                                                                                                                                                                                                                                                   l e O u t p u t F o r m a t . c l a
                i m p l e m e n t s   M a p p e r < L o n g W r i t a b l e ,   T e x t ,   T e x t ,   } e x t >
                                                                                                        T           {                                                                                                                     g r o u p . s e t M a p p e r C l a s s ( L o a d J o i n e
                                                                                                                                                                                                                                          g r o u p . s e t C o m b i n e r C l a s s ( R e d u c e U
                              p u b l i c   v o i d   m a p ( L o n g W r i t                                         a b l e   k     ,   T e x t   v a l ,       o c . c o l l e c t ( k e y ,   n e w   L o n g W r i t a b l e ( s u mg r o u p . s e t R e d u c e r C l a s s ( R e d u c e U r
                                                                                                                                                                                                                                          ) ) ;
                                  O u t p u t C o l l e c t o r < T e x t ,                                           T e x t >       o c ,               }                                                                               F i l e I n p u t F o r m a t . a d d I n p u t P a t h ( g
                                              R e p o r t e r   r e p o r t e                                         r )   t h r     o w s   I O} x c e p t i o n
                                                                                                                                                  E                   {                                                   P a t h ( " / u s e r / g a t e s / t m p / j o i n e d " ) ) ;
                                      / /   P u l l   t h e   k e y   o u t                                                                       p u b l i c   s t a t i c   c l a s s   L o a d C l i c k s   e x t e n d s   M a p R e d u c e B a s e o r m a t . s e t O u t p u t P a t h ( g r
                                                                                                                                                                                                                                     F i l e O u t p u t F
                                      S t r i n g   l i n e   =   v a l . t o                                         S t r i n g     ( ) ;        m p l e m e n t s
                                                                                                                                                          i            M a p p e r < W r i t a b l e C o m p a r a b l e , aW r i t a b l e , /L o n g W r i t a b l e , u p e d " ) ) ;
                                                                                                                                                                                                                          P   t h ( " / u s e r   g a t e s / t m p / g r o
                                      i n t   f i r s t C o m m a   =   l i n                                         e . i n d e     x OT e x t > ){
                                                                                                                                          f ( ' , '   ;                                                                                   g r o u p . s e t N u m R e d u c e T a s k s ( 5 0 ) ;
                                      S t r i n g   v a lf i r s t C o m m a
                                                          u e   =   l i n e .                                         s+ b1 ) ;
                                                                                                                        u   s t r     i n g (                                                                                             J o b   g r o u p J o b   =   n e w   J o b ( g r o u p ) ;
                                      i n t   a g e   =   I n t e g e r . p a                                         r s e I n t     ( v a l u e ) ;     p u b l i c   v o i d   m a p (                                                 g r o u p J o b . a d d D e p e n d i n g J o b ( j o i n J
                                      i f   ( a g e   <   1 8   | |   a g e                                           >   2 5 )       r e t u r n ;                       W r i t a b l e C o m p a r a b l e   k e y ,
                                      S t r i n g   k e y   =   l i n e . s u                                         b s t r i n     g ( 0 ,   f i r s t C o m m a ) ;   W r i t a b l e   v a l ,                                       J o b C o n f   t o p 1 0 0   =   n e w   J o b C o n f ( M
                                      T e x t   o u t K e y   =   n e w   T e                                         x t ( k e y     ) ;                                 O u t p u t C o l l e c t o r < L o n g W r i t a b l e ,   Tt o p 1 0 0 . s e t J o b N a m e ( " T o p
                                                                                                                                                                                                                                        e x t >   o c ,                               1 0 0   s i t e
                                      / /   P r e p e n d   a n ei n d e x
                                                                    k n o w                                           t o it h e
                                                                                                                      w h   c h       fv a l u e
                                                                                                                                        i l e       s o   w               R et h r o w s
                                                                                                                                                                              p o   t e r   I O E x c e p t i o n
                                                                                                                                                                                            r e p o r t e r )       {                     t o p 1 0 0 . s e t I n p u t F o r m a t ( S e q u e n c e
                                      / /   i t   c a m e   f r o m .                                                                                             o c . c o l l e c t ( ( L o n g W r i t a b l e ) v a l ,   ( T e x t )t o p 1 0 0 . s e t O u t p u t K e y C l a s s ( L o n g W
                                                                                                                                                                                                                                          k e y ) ;
                                      T e x t   o u t V a l   =   n e w   T e                                         x t ( " 2 "       +   v a l u e ) ;}                                                                                t o p 1 0 0 . s e t O u t p u t V a l u e C l a s s ( T e x
                                      o c . c o l l e c t ( o u t K e y ,   o                                         u t V a l )     ;           }                                                                                       t o p 1 0 0 . s e t O u t p u t F oo r m a t . c l a s s ) ;
                                                                                                                                                                                                                                                                              r m a t ( S e q u e n c
                              }                                                                                                                   p u b l i c   s t a t i c   c l a s s   L i m i t C l i c k s   e x t e n d s   M a p Rt o p 1 0 0 . s e t M a p p e r C l a s s ( L o a d C l i c
                                                                                                                                                                                                                                          e d u c e B a s e
                }                                                                                                                                         i m p l e m e n t s   R e d u c e r < L o n g W r i t a b l e ,   T e x t ,   Lt o p 1 0 0 . s e t C o m b i n e r C l a s s ( L i m i t C
                                                                                                                                                                                                                                          o n g W r i t a b l e ,   T e x t >   {
                p u b l i c   s t a t i c   c l a s s   J o i n   e x t e n d s   M                                                   a p R e d u c e B a s e                                                                             t o p 1 0 0 . s e t R e d u c e r C l a s s ( L i m i t C l
                        i m p l e m e n t s   R e d u c e r < T e x t ,   T e x t ,                                                     T e x t ,   T e xi n t {c o u n t
                                                                                                                                                          t >                 =   0 ;                                                     F i l e I n p u t F o r m a t . a d d I n p u t P a t h ( t
                                                                                                                                                          p u b l i c e d u c e (
                                                                                                                                                          v o i d   r                                                     P a t h ( " / u s e r / g a t e s / t m p / g r o u p e d " ) ) ;
                              p u b l i c                 v o     i   d   r e         d   u c e ( T       e   x t   k e y    ,                                    L o n g W r i t a b l e   k e y ,                                F i l e O u t p u t F o r m a t . s e t O u t p u t P a t h ( t o p
                                                            I     t   e r a t         o   r < T e x       t   >   i t e r    ,                                    I t e r a t o r < T e x t >   i t e r ,                 P a t h ( " / u s e r / g a t e s / t o p 1 0 0 s i t e s f o r u s e r s 1
                                                            O     u   t p u t         C   o l l e c       t   o r < T e x    t ,      T e x t >   o c ,           O u t p u t C o l l e c t o r < L o n g W r i t a b l e ,   T e x t >   t o p 1 0 0 . s e t N u m R e d u c e T a s k s ( 1 ) ;
                                                                                                                                                                                                                                          o c ,
                                                            R     e   p o r t         e   r   r e p       o   r t e r )      t h r    o w s   I O E x c e p t i oR e p o r t e r
                                                                                                                                                                  n   {             r e p o r t e r )   t h r o w s   I O E x c e p t i oJ o b
                                                                                                                                                                                                                                          n   {   l i m i t   =   n e w   J o b ( t o p 1 0 0 ) ;
                                              / /         F o     r     e a c         h     v a l u       e   ,   f i g u    r e      o u t   w h i c h   f i l e   i t ' s   f r o m   a n d                                             l i m i t . a d d D e p e n d i n g J o b ( g r o u p J o b
s t o r e               i t                                                                                                                                       / /   O n l y   o u t p u t   t h e   f i r s t   1 0 0   r e c o r d s
                                              / /   a c c o r d i n g l y .                                                                                       w< i1 0 0 (& & ui t e r . h a s N e x t ( ) )
                                                                                                                                                                    h   l e     c o   n t                             {                   J o b C o n t r o l   j c   =   n e1 0 0 os i t e s rf o r
                                                                                                                                                                                                                                                                              w   J   b C o n t   o l
                                              L i s t < S t r i n g >   f i r s t   =   n e w   A                                     r r a y L i s t < S t r i n g > ( )o c . c o l l e c t ( k e y ,
                                                                                                                                                                          ;                               i t e r . n e x1 8 )t o
                                                                                                                                                                                                                          t (   ) ;   2 5 " ) ;
                                              L i s t < S t r i n g >   s e c o n d   =   n e w                                       A r r a y L i s t < S t r i n g > (c o u n t + + ;
                                                                                                                                                                          ) ;                                                             j c . a d d J o b ( l o a d P a g e s ) ;
                                                                                                                                                                  }                                                                       j c . a d d J o b ( l o a d U s e r s ) ;
                        w h i l e                                     ( i t       e   r .   h a   s   N   e x t   (   )   )   {                           }                                                                               j c . a d d J o b ( j o i n J o b ) ;
                                T                                 e   x t         t     =     i   t   e   r . n   e   x   t ( ) ;                 }                                                                                       j c . a d d J o b ( g r o u p J o b ) ;
                                S                                 t   S t r
                                                                      r i n       i
                                                                                  g   n g
                                                                                        v   (
                                                                                            a )
                                                                                              l   ;
                                                                                                  u   e     =     t   .   t o                     p u b l i c   s t a t i c   v o i d   m a i n ( S t r i n g [ ]   a r g s )   t h r o wj c . a d d J o b ( l i m i t ) ;
                                                                                                                                                                                                                                          s   I O E x c e p t i o n   {
                                i                                 f     ( v       a   l u   e .   c   h   a r A   t   (   0 )   = =     ' 1 ' )           J o b C o n f   l p   =   n e w   J o b C o n f ( M R E x a m p l e . c l a s sj c . r u n ( ) ;
                                                                                                                                                                                                                                          ) ;
f i r s t . a d d ( v a l u e . s                                 u   b s t       r   i n   g (   1   )   ) ;                                          t J o b N a m e ( " L o a d
                                                                                                                                                          l p . s e                  P a g e s " ) ;                              }
                                e                                 l   s e         s   e c   o n   d   .   a d d   ( v a l u e .       s u b s t r i n g (l p . s e t I n p u t F o r m a t ( T e x t I n p u t F o r m a} . c l a s s ) ;
                                                                                                                                                          1 ) ) ;                                                         t
                                                                                                                                                                          46
Ease of Translation

Load Users                     Load Pages
                                                 Users = LOAD …
Filter by age                                    Filtered = FILTER …
                                                 Pages = LOAD …
                 Join on name                    Joined = JOIN …
                 Group on url                    Grouped = GROUP …
                                                 Summed = … COUNT()…
                 Count clicks                    Sorted = ORDER …
                Order by clicks
                                                 Top5 = LIMIT …

                  Take top 5

                                            47
The Hadoop/Pig/Cassandra stack
 Cassandra has gained some significant integration points
  with Hadoop and its analytics tools
 In order to achieve Hadoop’s data locality, Cassandra nodes
  must be part of the Hadoop cluster by running a tasktracker
  process. So the namenode and jobtracker can reside outside
  of the Cassandra cluster

                                              A three- node
                                              Cassandra/Hadoop
                                              cluster with external
                                              namenode / jobtracker



                              48
Hadoop jobs

 Cassandra has a Java source package for Hadoop integration
  org.apache.cassandra.hadoop

 ColumnFamilyInputFormat extends InputFormat
 ColumnFamilyOutputFormat extends OutputFormat
 ConfigHelper a helper class to configure Cassandra-specific
  information

 Hadoop output streaming was introduced in 0.7 but removed
  from 0.8

                              49
Pig alongside Cassandra

 The Pig integration CassandraStorage() (a LoadFunc
  implementation) allows Pig to Load/Store data from/to
  Cassandra
  grunt> LOAD 'cassandra://Keyspace/cf' USING CassandraStorage();

 The pig_cassandra script, shipped with Cassandra source,
  performs the necessary initialisation (Pig environments
  variables still needs to be set)

 Pygmalion is a set of scripts and UDFs to facilitate the use of Pig
  alongside Cassandra

                                 50
Workflow

 A workflow system provides an infrastructure to set up &
  manage a sequence of interdependent jobs / set of jobs

 The hadoop ecosystem includes a set of workflow tools to
  run applications over MapReduce processes or High-level
  languages
   Cascading (http://www.cascading.org/). A java library defining data
    processing workflows and rendering them to MapReduce jobs
   Oozie (http://yahoo.github.com/oozie/)

                                  51
Some links

 http://hadoop.apache.org
 http://pig.apache.org/
 https://cwiki.apache.org/confluence/display/PIG/Index
 PiggyBank: https://cwiki.apache.org/confluence/display/PIG/PiggyBank
 DataFu: https://github.com/linkedin/datafu
 Pygmalion: https://github.com/jeromatron/pygmalion
 http://code.google.com/edu/parallel/mapreduce-tutorial.html
 Video tutorials from Cloudera: http://www.cloudera.com/hadoop-training
 Interesting papers:
   http://bit.ly/rskJho - Original MapReduce paper
   http://bit.ly/KvFXxT - Pig paper: ‘Building a High-Level Dataflow System on top of
    MapReduce: The Pig Experience’

                                          52
A simple data flow

    Load checkins data



   Keep only the two ids
                                    Top 50 users / locations
                                    [same script, different group key]
Group by user/loc id & Order



      Limit to top 50


                               53
Another data flow

Load checkins data



    Split_date

                     All the checkins, over weeks
  Group by date



                          Group by weeks using
 Count the tuples
                                Stream


                     54

More Related Content

Hadoop pig

  • 1. The analytics stack Hadoop & Pig
  • 2. Outline of the presentation  Hadoop  Motivations. What is it? And high-level concepts  The Ecosystem. The MapReduce model & framework and HDFS  Programming with Hadoop  Pig  What is it? Motivations  Model & components  Integration with Cassandra 2
  • 3. Please interrupt and ask questions! 3
  • 4. Traditional HPC systems  CPU-intensive computations  Relatively small amount of data  Tightly-coupled applications  Highly concurrent I/O requirements  Complex message passing paradigms such as MPI, PVM…  Developers might need to spend some time designing for failure 4
  • 5. Challenges  Data and storage  Locality, computation close to the data  In large-scale systems, nodes fail  Mean time between failures: 1 node / 3 years, 1000 nodes / 1 day  Built-in fault-tolerance  Distributed programming is complex  Need a simple data-parallel programming model. Users would structure the application in high-level functions, the system distributes the data & jobs and handles communications and faults 5
  • 6. What requirements  A simple data-parallel programming model, designed for high scalability and resiliency  Scalability to large-scale data volumes  Automated fault-tolerance at application level rather than relying on high-availability hardware  Simplified I/O and tasks monitoring  All based on cost-efficient commodity machines (cheap, but unreliable), and commodity network 6
  • 7. Hadoop’s core concepts  Data spread in advance, persistent (in terms of locality), and replicated  No inter-dependencies / shared nothing architecture  Applications written in two pieces of code  And developers do not have to worry about the underlying issues in networking, jobs interdependencies, scheduling, etc… 7
  • 8. Where does it come from?  Hadoop originated from Apache Nutch, an open source web search engine  After the publications of the GFS and MapReduce papers, in 2003 & 2004, the Nutch developers decided to implement open source versions  In February 2006, it became Hadoop, with a dedicated team at Yahoo!  September 2007 - release 0.14.1  Last release 1.0.3 out last week  Used by a large number of companies including Facebook, LinkedIn, Twitter, hulu, among many others.. 8
  • 9. The model  A map function processes a key/value pair to generate a set of intermediate key/value pairs  Divides the problem into smaller ‘intermediate key/value’ pairs  The reduce function merge all intermediate values associated with the same intermediate key  Run-time system takes care of:  Partitioning the input data across nodes (blocks/chunks typically of 64Mb to 128Mb)  Scheduling the data and execution. Maps operate on a single block.  Manages node failures, replication, re-submissions.. 9
  • 10. Simple Word Count ♯key: offset, value: line def mapper(): for line in open(“doc”): for word in line.split(): output(word, 1) ♯key: a word, value: iterator over counts def reducer(): output(key, sum(value)) 10
  • 11. The Combiner  A combiner is a local aggregation function for repeated keys produced by the map  Works for associative functions like sum, count, max  Decreases the size of intermediate data / communications  map-side aggregation for word count: def combiner(): output(key, sum(values)) 11
  • 12. Some other basic examples…  Distributed Grep:  Map function emits a line if it matches a supplied pattern  Reduce function is an identity function that copies the supplied intermediate data to the output  Count of URL accesses:  Map function processes logs of web page requests and outputs <URL, 1>  Reduce function adds together all values for the same URL, emitting <URL, total count> pairs  Reverse Web-Link graph:  Map function outputs <tgt, src> for each link to a tgt in a page named src  Reduce concatenates the list of all src URLS associated with a given tgt URL and emits the pair: <tgt, list(src)>  Inverted Index:  Map function parses each document, emitting a sequence of <word, doc_ID>  Reduce accepts all pairs for a given word and emits a <word, list(doc_ID)> pair 12
  • 13. Hadoop Ecosystem Core 13 components from http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/
  • 14. Hadoop components  Hadoop consists of two core components  The MapReduce framework, and  The Hadoop Distributed File System  MapReduce layer  JobTracker  TaskTrackers  HDFS layer  Namenode  Secondary namenode  Datanode Example of a typical physical distribution within a 14 Hadoop cluster
  • 15. HDFS  Scalable and fault-tolerant. Based on Namenode Google’s GFS File1 1  Single namenode stores metadata (file 2 3 names, block locations, etc.). 4  Files split into chunks, replicated across several datanodes (typically 3+). It is rack- aware  Optimised for large files, sequential 1 2 1 3 streaming reads, rather than random 2 1 4 2 4 3 3 4  Files written once, no append Datanodes 15
  • 16. HDFS  HDFS API / HDFS FS Shell for command line* > hadoop fs –copyFromLocal local_dir hdfs_dir > hadoop fs –copToLocal hdfs_dir local_dir  Tools  Flume: collects, aggregates and move log data from application servers to HDFS  Sqoop: HDFS import and export to SQL *http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html 16
  • 17. MapReduce execution  In Hadoop, a Job (full program) is a set of tasks  Each task (mapper or reducer) is attempted at least once, or multiple times if it crashes. Multiple attempts may also occur in parallel  The tasks run inside a separate JVM on the tasktracker  All the class files are assembled into a jar file, which will be uploaded into HDFS, before notifying the tasktracker 17
  • 18. MapReduce execution MapReduce Job Master Split 0 Worker Split 1 Worker read Local write Split 2 Worker Remote read Worker Split 3 Split 4 Worker Output files Intermediate Input files files locally 18
  • 19. Getting Started…  Multiple choices - Vanilla Apache version, or one of the numerous existing distros  hadoop.apache.org  www.cloudera.com [A set of VMs is also provided]  http://www.karmasphere.com/  …  Three ways to write jobs in Hadoop:  Java API  Hadoop Streaming (for Python, Perl, etc.)  Pipes API (C++) 19
  • 20. Word Count in Java public static void main(String[] args) throws Exception { JobConf conf = new JobConf(WordCount.class); conf.setJobName("wordcount"); conf.setMapperClass(MapClass.class); conf.setCombinerClass(ReduceClass.class); conf.setReducerClass(ReduceClass.class); FileInputFormat.setInputPaths(conf, args[0]); FileOutputFormat.setOutputPath(conf, new Path(args[1])); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); JobClient.runJob(conf); } 20
  • 21. Word Count in Java – mapper public class MapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable ONE = new IntWritable(1); public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> out, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer itr = new StringTokenizer(line); while (itr.hasMoreTokens()) { out.collect(new text(itr.nextToken()), ONE); } } } 21
  • 22. Word Count in Java – reducer public class ReduceClass extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> out, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } out.collect(key, new IntWritable(sum)); } } 22
  • 23. Getting keys and values Input file Reducer Reducer Input Format Input split Input split Output Format RecordWriter RecordWriter RecordReader RecordReader Output file Output file Mapper Mapper 23
  • 24. Hadoop Streaming Mapper.py: #!/usr/bin/env python import sys for line in sys.stdin: for word in line.split(): print "%st%s" % (word, 1) Reducer.py: #!/usr/bin/env python import sys dict={} for line in sys.stdin: word, count = line.split("t", 1) dict[word] = dict.get(word, 0) + int(count) counts = dict.items() for word, count in counts: print "%st%s" % (word.lower(), count) You can locally test your code on the command line: $> cat data | mapper | sort | reducer 24
  • 25. High-level tools  MapReduce is fairly low-level: must think about keys, values, partitioning, etc.  How to express parallel algorithms by a series of MapReduce jobs  Can be hard to capture common job building blocks  Different use cases require different tools! 25
  • 26. Pig  Apache Pig is a platform raising a level of abstraction for processing large datasets. Its language, Pig Latin is a simple query algebra expressing data transformations and applying functions to records Pig MapReduce jobs Hadoop / HDFS job submission  Started at Yahoo! Research, >60% of Hadoop jobs within Yahoo! are Pig jobs 26
  • 27. Motivations  MapReduce requires a Java programmer  Solution was to abstract it and create a system where users are familiar with scripting languages  Other than very trivial applications, MapReduce requires multiple stages, leading to long development cycles  Rapid prototyping. Increased productivity  In MapReduce users have to reinvent common functionality (join, filter, etc.)  Pig provides them 27
  • 28. Used for  Rapid prototyping of algorithms for processing large datasets  Log analysis  Ad hoc queries across various large datasets  Analytics (including through sampling)  Pig Mix provides a set of performance and scalability benchmarks. Currently 1.1 times MapReduce speed. 28
  • 29. Using Pig  Grunt, the Pig shell  Executing scripts directly  Embedding Pig in Java (using PigServer, similar to SQL using JDBC), or Python  A range of tools including Eclipse plug-ins  PigPen, Pig Editor… 29
  • 30. Execution modes  Pig has two execution types or modes: local mode and Hadoop mode  Local  Pig runs in a single JVM and accesses the local filesystem. Starting form v0.7 it uses the Hadoop job runner.  Hadoop mode  Pig runs on a Hadoop cluster (you need to tell Pig about the version and point it to your Namenode and Jobtracker 30
  • 31. Running Pig  Pig resides on the user’s machine and can be independent from the Hadoop cluster  Pig is written in Java and is portable  Compiles into map reduce jobs and submit them to the cluster  No need to install anything extra on the cluster Pig client 31
  • 32. How does it work  Pig defines a DAG. A step-by-step set of operations, each performing a transformation  Pig defines a logical plan for these transformations: A = LOAD ’file' as (line); • Parses, checks, & optimises B = FOREACH A GENERATE • Plan the execution FLATTEN(TOKENIZE(line)) AS word; • Maps & Reduces C = GROUP B BY word; • Passes the jar to Hadoop D = FOREACH C GENERATE group, • Monitor the progress COUNT(words); STORE D INTO ‘output’ 32
  • 33. Data types & expressions  Scalar type:  Int, Long, Float, Double, Chararray, Bytearray  Complex type representing nested structures:  Tuple: sequence of fields of any type  Bag: an unordered collection of tuples  Map: a set of key-value pairs. Keys must be atoms, values may be any type  Expressions:  used in Pig as a part of a statement; field name, position ($), arithmetic, conditional, comparison, Boolean, etc. 33
  • 34. Functions  Load / Store  Data loaders; PigStorage, BinStorage, BinaryStorage, TextLoader, PigDump  Evaluation  Many built-in functions MAX, COUNT, SUM, DIFF, SIZE…  Filter  A special type of eval function used by the FILTER operator. IsEmpty is a built-in function  Comparison  Function used in ORDER statement; ASC | DESC 34
  • 35. Schemas  Schemas enable you to associate names and types of the fields in the relation  Schemas are optional but recommended whenever possible; type declarations result in better parse-time error checking and more efficient code execution  They are defined using the AS keyword with operators  Schema definition for simple data types: > records = LOAD 'input/data' AS (id:int, date:chararray); 35
  • 36. Statements and aliases  Each statement, defining a data processing operator / relation, produces a dataset with an alias grunt> records = LOAD 'input/data' AS (id:int, date:chararray);  LOAD returns a tuple, which elements can be referenced by position or by name  Very useful operators are DUMP, ILLUSTRATE, and DESCRIBE 36
  • 37. Filtering data  Filter is user to work with tuples and rows of data  Select data you want, or remove the data you are not interested in  Filtering early in the processing pipeline minimises the amount of data flowing through the system, which can improve efficiency grunt> filtered_records = FILTER records BY id == 234; 37
  • 38. Foreach .. Generate  Foreach .. Generate acts on columns on every row in a relation grunt> ids = FOREACH records GENERATE id;  Positional reference. This statement has the same output grunt> ids = FOREACH records GENERATE $0;  The elements of ‘ids’ however are not named ‘id’ unless you add ‘AS id’ at the end of your statement grunt> ids = FOREACH records GENERATE $0 AS id; 38
  • 39. Grouping and joining  Group .. by makes an output bag containing grouped fields with the same schema using a grouping key  Join performs inner, equijoin of two or more relations based on common field values.  You can also perform outer joins using keywords left, right and full  Cogroup is similar to Group, using multiple relations, and creates a nested set of output tuples 39
  • 40. Ordering, combining, splitting…  Order imposes an order on the output to sort a relation by one or more fields  The Limit statement limits the number of results  Split partitions a relation into two or more relations  the Sample operator selects a random data sample with the stated sample size  the Union operator to merge the contents of two or more relations 40
  • 41. Stream  The Stream operator allows to transform data in a relation using an external program or script grunt> C = STREAM A THROUGH `cut -f 2`;  Extract the second field of A using cut  The scripts are shipped to the cluster using grunt> DEFINE script `script.py` SHIP (‘script.py’); grunt> D = STREAM C THROUGH script AS (…); 41
  • 42. User defined functions  Support and a community of user-defined functions (UDFs)  UDFs can encapsulate users processing logic in filtering, comparison, evaluation, grouping, or storage  filter functions for instance are all subclasses of FilterFunc, which itself is a subclass of EvalFunc  PiggyBank: the Pig community sharing their UDFs  DataFu: Linkedin's collection of Pig UDFs 42
  • 43. A simple eval UDF example package myudfs; import … public class UPPER extends EvalFunc<String> { public String exec(Tuple input) throws IOException { if (input == null || input.size() == 0) return null; try{ String str = (String) input.get(0); return str.toUpperCase(); }catch(Exception e){ throw WrappedIOException.wrap("Caught exception processing input row ", e); } } } 43
  • 44. An Example Load Users Load Pages Let’s find the top 5 most visited Filter by age pages by users aged 18 – 25. Input: user data file, and page Join on name view data file. Group on url Count clicks Order by clicks Take top 5 44
  • 45. A simple script Users = LOAD ‘users’ AS (name, age); Filtered = FILTER Users BY age >= 18 and age <= 25; Pages = LOAD ‘pages’ AS (user, url); Joined = JOIN Filtered BY name, Pages by user; Grouped = GROUP Joined BY url; Summed = FOREACH Grouped GENERATE group, count(Joined) AS clicks; Sorted = ORDER Summed BY clicks desc; Top5 = LIMIT Sorted 5; STORE Top5 INTO ‘top5sites’; 45
  • 46. i i i i m m m m p p p p i m p o r t o o o o r r r r t t t t j j j j a a a a v v v v a a a a . . . . i u u u o t t t . i i i I l l l O . . . o r g . a p a c h e . h a d o o p . f s . P a t h ; E A I L x r t i c r e s e a r t p t i o n ; y L i s t ; a t o r ; ; / / f o r D o t h e ( S t r i n g f o r In MapReduce! c r o s s s 1 : ( S t r i n g p r o d u c t f i r s t ) s 2 : { a n s e c o n } r e p o r t e r . s e t S t a t u s ( " O K " ) ; d d ) c o l l e c t l p . s e t O u t p u t K e y C l p . s e t O u t p u t V a l u l p . s e t M a p p e r C l a s F i l e I n p u t F o r m a t . t h e v a l u e s P a t hu s e r / g a t e s / p a g e s " ) ) ; { ( " / F i l e O u t p u t F o r m a t l e s a . a C ( d s s l L d s a o I ( s a n e t O u t T s d p e ( P u x T a t t e g P . x e a c t s t p u t P a t h l . . h a c c ( s l l l i m p o r t o r g . a p a c h e . h a d o o p . i o . L o n g W r i t a b l e ; S t r i n g o u t v a l = k e y + " , " + s 1 +n e w "P a t h ( " / u " , + s 2 ; s e r / g a t e s / t m p / i m p o r t o r g . a p a c h e . h a d o o p . i o . T e x t ; o c . c o l l e c t ( n u l l , n e w T e x t ( o ul p . s e t N u m R e d u c e T t v a l ) ) ; a s k s ( 0 ) ; i m p o r t o r g . a p a c h e . h a d o o p . i o . W r i t a b l e ; r e p o r t e r . s e t S t a t u s ( " O K " ) ; J o b l o a d P a g e s = n e w J o b ( l p ) ; i p o r t m o r g . a p a c h e . h a d o o p . i o . W r i t a b l e C o m p a r a b l e ; } i m p o r t o r g . a p a c h e . h a d o o p . m a p r e d . F i l e I n p u t F o r m a t ; } J o b C o n f l f u = n e w J o b C o n f ( M R E x i m p o r t o r g . a p a c h e . h a d o o p . m a p r e d . F i l e O u t p u t F o r m a t}; e t J o b N a m e ( " L o a d l f u . s a n d F i l t e r U s e r s i m p o r t o r g . a p a c h e . h a d o o p . m a p r e d . J o b C o n f ; } l f u . s e t I n p u t F o r m a t ( T e x t I n p u t F o i m p o r t o r g . a p a c h e . h a d o o p . m a p r e d . K e y V a l u e T e x tp u b l i c os t a t i c I n p u t F r m a t ; c l a s s L o a d J o i n e d e x t e n d s M a p R el f u . s e t O u t p u t K e y C l a s s ( T e x t . c l a d u c e B a s e i m p o r t po r g . a h a d o o p . m a p r e d . M a p p e r ; a c h e . i m p l e m e n t s M a p p e r < T e x t , T e x t , T e x t , L o n gl f u . s e t O u t p u t V a l u e C l a s s ( T e x t . c W r i t a b l e > { i m p o r t o r g . a p a c h e . h a d o o p . m a p r e d . M a p R e d u c e B a s e ; l f u . s e t M a p p e r C l a s s ( L o a d A n d F i l t i m p o r t o r g . a p a c h e . h a d o o p . m a p r e d . O u t p u t C o l l e c t o r ;p u b l i c v o i d m a p ( F i l e I nI n p u t P a t h ( l f u , p u t F o r m a t . a d d n e w i m p o r t o r g . a p a c h e . h a d o o p . m a p r e d . R e c o r d R e a d e r ; T e x t k , P a t h ( " / u s e r / g a t e s / u s e r s " ) ) ; i m p o r t o r g . a p a c h e . h a d o o p . m a p r e d . R e d u c e r ; T e x t v a l , F i l e O u t p u t F o r m a t . s e t O u t p u t P a t h i m p o r t o r g . a p a c h e . h a d o o p . m a p r e d . R e p o r t e r ; c t o r < T e x t , lL o n g W r i t a b l e > O u t p u t C o l e o c , n e w P a t h ( " / u s e r / g a t e s / t m p / i m p t o r o r g . a p a c h e . h a d o o p . m a p r e d . S e q u e n c e F i l e I n p u t F o r m a t ; R e p o r t e r r e p o r t e r ) t h r o w s I O E x c el f u . s e t N u m R e d u c e T a s k s ( 0 ) ; p t i o n { i m p o r t o r g . a p a c h e . h a d o o p . m a p r e d . S e q u e n c e F i l e O u t p u t F o/ / aF i n d r m t ; t h e u r l J o b l o a d U s e r s = n e w J o b ( l f u ) ; i m p o r t o r g . a p a c h e . h a d o o p . m a p r e d . T e x t I n p u t F o r m a t ; S t r i n g l i n e = v a l . t o S t r i n g ( ) ; i m p o r t o r g . a p a c h e . h a d o o p . m a p r e d . j o b c o n t r o l . J o b ; i n t f i r s t C o m m a = l i n e . i n d e x O f ( ' , ' ) ; J o b C o n f j oM R E x a m p l e . c l a s s ) ; i n = n e w J o b C o n f ( i m p o r t o r g . a p a c h e . h a d o o p . m a p r e d . j o b c o n t r o l . J o b C o n t r o l ; i n t s e c o n d C o m m a C= ml i n e . i n d o m a ) ; e x O f ( ' , ' , j o i n . s e t J o b N a m e ( " J o i n f i r s t U s e r s a n i m p o r t o r g . a p a c h e . h a d o o p . m a p r e d . l i b . I d e n t i t y M a p p e r ; S t r i n g k e y = l i n e . s u b s t r i n g ( f i r s t C o mj o i n . s e t I n p u t F o r m a t ( K e y V a l u e T e m a , s e c o n d C o m m a ) ; / / d r o p t h e r e s t o f t h e r e c o r d , I d oj o i n . s e t O u t p u t K e y C l a s s ( T e x t . c l n ' t n e e d i t a n y m o r e , p u b l i c c l a s s M R E x a m p l e { / / j u s t p a s s a 1 f o r t h e c o m b i n e r / r ej o i n . s e t O u t p u t V a l u e C l a s s ( T e x t . d u c e r t o s u m i n s t e a d . p u b l i c s t a t i c c l a s s L o a d P a g e s e x t e n d s M a p R e d u c e BT e x t a s e o u t K e y = n e w T e x t ( k e y ) ; j o i n . s e t M a p p e r C l a s s ( I d e n t i t y M a p e r . c l a s s ) ; i m p l e m e n t s M a p p e r < L o n g W r i t a b l e , T e x t , T e x t ,o c . c o l l e c t ( o u t K e y , T e x t > { n e w L o n g W r i t a b l e ( 1j o i n . s e t R e d u c e r C l a s s ( J o i n . c l a s L ) ) ; } F i l e I n p u t F o r m a t . a d d I n p u t P a t h ( j p u b l i c v o i d m a p ( L o n g W r i t a b l e k , T e x t} v a l , P a t h ( " / u s e r / g a t e s / t m p / i n d e x e d _ p a g e s " ) ) O u t p u t C o l l e c t o r < T e x t , T e x t > o c , b l i c p u s t a t i c c l a s s R e d u c e U r l s e x t e n d s M a p R eF i l e I n p u t F o r m a t . a d d I n p u t P a t h ( j d u c e B a s e R e p o r t e r r e p o r t e r ) t h r o w s I O E x c ei m p l e m e n t s p t i o n { R e d u c e r < T e x t , L o n g W r iP a t h ( " / u s e r / g a t e s / t m p / f i l t e r e d _ u s e r s " ) t a b l e , W r i t a b l e C o m p a r a b l e , / / P u l l t h e k e y o u t W r i t a b l e > { F i l e O ut O u t p u t P a t h ( j o i n , t p u t F o r m a t . s e n e w S t r i n g l i n e = v a l . t o S t r i n g ( ) ; P a t h ( " / u s e r / g a t e s / t m p / j o i n e d " ) ) ; i n t f i r s t C o m m a = l i n e . i n d e x O f ( ' , ' ) ; p u b l i c v o i d r e d u c e ( j o i n . s e t N u m R e d u c e T a s k s ( 5 0 ) ; S t r i ns t r i n g ( 0 , if i r s t C o m m a ) ; g k e y = l n e . u b y , T e x t k e J o b j o i n J o b = n e w J o b ( j o i n ) ; S t r i n g v a l u e = l i n e . s u b s t r i n g ( f i r s t C o m m a + 1I t e r a t o r < L o n g W r i t a b l e > ) ; i t e r , j o i n J o b . a d d D e p e n d i n g J o b ( l o a d P a T e x t o u t K e y = n e w T e x t ( k e y ) ; O u t p u t C o l l e c t o r < W r i t a b l e C o m p a r a bj o i n J o b . a d d D e p e n d i n g J o b ( l o a d U s l e , W r i t a b l e > o c , / / P r e p e n d a n i n d e x t o t h e v a l u e s o w e k n o w R e p o r t e r lr e p o r t e r ) w h i c h f i e t h r o w s I O E x c e p t i o n { / / i t c a m e f r o m . / / A d d u p a l l t h e v a l u e s w e s e e J o b C o n f g r o u p a= pn e w cJ o b C o n f ( M R x m l . l a s s ) ; T e x t o u t V a l v= ln e w ;T e x t ( " 1 " + a u ) g r o u p . s e t J o b N a m e ( " G r o u p U R L s " ) o c . c o l l e c t ( o u t K e y , o u t V a l ) ; l o n g s u m = 0 ; g r o u p . s e t I n p u t F o r m a t ( K e y V a l u e T } i l e (w h e r . h a s N e x t ( ) ) i t { g r o u p . s e t O u t p u t K e y C l a s s ( T e x t . c } s u m + = i t e r . n e x t ( ) . g e t ( ) ; g r o u p . s e t O u t p u t V a l u e C l a s s ( L o n g p u b l i c s t a t i c c l a s s L o a d A n d F i l t e r U s e r s e x t e n d s M a p R er e p o r t e r . s e t S t a t u s ( " O K d u c e B a s e " ) ; g r o u p . s e t O u t p u t F o r m a t ( S e q u e n c e l e O u t p u t F o r m a t . c l a i m p l e m e n t s M a p p e r < L o n g W r i t a b l e , T e x t , T e x t , } e x t > T { g r o u p . s e t M a p p e r C l a s s ( L o a d J o i n e g r o u p . s e t C o m b i n e r C l a s s ( R e d u c e U p u b l i c v o i d m a p ( L o n g W r i t a b l e k , T e x t v a l , o c . c o l l e c t ( k e y , n e w L o n g W r i t a b l e ( s u mg r o u p . s e t R e d u c e r C l a s s ( R e d u c e U r ) ) ; O u t p u t C o l l e c t o r < T e x t , T e x t > o c , } F i l e I n p u t F o r m a t . a d d I n p u t P a t h ( g R e p o r t e r r e p o r t e r ) t h r o w s I O} x c e p t i o n E { P a t h ( " / u s e r / g a t e s / t m p / j o i n e d " ) ) ; / / P u l l t h e k e y o u t p u b l i c s t a t i c c l a s s L o a d C l i c k s e x t e n d s M a p R e d u c e B a s e o r m a t . s e t O u t p u t P a t h ( g r F i l e O u t p u t F S t r i n g l i n e = v a l . t o S t r i n g ( ) ; m p l e m e n t s i M a p p e r < W r i t a b l e C o m p a r a b l e , aW r i t a b l e , /L o n g W r i t a b l e , u p e d " ) ) ; P t h ( " / u s e r g a t e s / t m p / g r o i n t f i r s t C o m m a = l i n e . i n d e x OT e x t > ){ f ( ' , ' ; g r o u p . s e t N u m R e d u c e T a s k s ( 5 0 ) ; S t r i n g v a lf i r s t C o m m a u e = l i n e . s+ b1 ) ; u s t r i n g ( J o b g r o u p J o b = n e w J o b ( g r o u p ) ; i n t a g e = I n t e g e r . p a r s e I n t ( v a l u e ) ; p u b l i c v o i d m a p ( g r o u p J o b . a d d D e p e n d i n g J o b ( j o i n J i f ( a g e < 1 8 | | a g e > 2 5 ) r e t u r n ; W r i t a b l e C o m p a r a b l e k e y , S t r i n g k e y = l i n e . s u b s t r i n g ( 0 , f i r s t C o m m a ) ; W r i t a b l e v a l , J o b C o n f t o p 1 0 0 = n e w J o b C o n f ( M T e x t o u t K e y = n e w T e x t ( k e y ) ; O u t p u t C o l l e c t o r < L o n g W r i t a b l e , Tt o p 1 0 0 . s e t J o b N a m e ( " T o p e x t > o c , 1 0 0 s i t e / / P r e p e n d a n ei n d e x k n o w t o it h e w h c h fv a l u e i l e s o w R et h r o w s p o t e r I O E x c e p t i o n r e p o r t e r ) { t o p 1 0 0 . s e t I n p u t F o r m a t ( S e q u e n c e / / i t c a m e f r o m . o c . c o l l e c t ( ( L o n g W r i t a b l e ) v a l , ( T e x t )t o p 1 0 0 . s e t O u t p u t K e y C l a s s ( L o n g W k e y ) ; T e x t o u t V a l = n e w T e x t ( " 2 " + v a l u e ) ;} t o p 1 0 0 . s e t O u t p u t V a l u e C l a s s ( T e x o c . c o l l e c t ( o u t K e y , o u t V a l ) ; } t o p 1 0 0 . s e t O u t p u t F oo r m a t . c l a s s ) ; r m a t ( S e q u e n c } p u b l i c s t a t i c c l a s s L i m i t C l i c k s e x t e n d s M a p Rt o p 1 0 0 . s e t M a p p e r C l a s s ( L o a d C l i c e d u c e B a s e } i m p l e m e n t s R e d u c e r < L o n g W r i t a b l e , T e x t , Lt o p 1 0 0 . s e t C o m b i n e r C l a s s ( L i m i t C o n g W r i t a b l e , T e x t > { p u b l i c s t a t i c c l a s s J o i n e x t e n d s M a p R e d u c e B a s e t o p 1 0 0 . s e t R e d u c e r C l a s s ( L i m i t C l i m p l e m e n t s R e d u c e r < T e x t , T e x t , T e x t , T e xi n t {c o u n t t > = 0 ; F i l e I n p u t F o r m a t . a d d I n p u t P a t h ( t p u b l i c e d u c e ( v o i d r P a t h ( " / u s e r / g a t e s / t m p / g r o u p e d " ) ) ; p u b l i c v o i d r e d u c e ( T e x t k e y , L o n g W r i t a b l e k e y , F i l e O u t p u t F o r m a t . s e t O u t p u t P a t h ( t o p I t e r a t o r < T e x t > i t e r , I t e r a t o r < T e x t > i t e r , P a t h ( " / u s e r / g a t e s / t o p 1 0 0 s i t e s f o r u s e r s 1 O u t p u t C o l l e c t o r < T e x t , T e x t > o c , O u t p u t C o l l e c t o r < L o n g W r i t a b l e , T e x t > t o p 1 0 0 . s e t N u m R e d u c e T a s k s ( 1 ) ; o c , R e p o r t e r r e p o r t e r ) t h r o w s I O E x c e p t i oR e p o r t e r n { r e p o r t e r ) t h r o w s I O E x c e p t i oJ o b n { l i m i t = n e w J o b ( t o p 1 0 0 ) ; / / F o r e a c h v a l u e , f i g u r e o u t w h i c h f i l e i t ' s f r o m a n d l i m i t . a d d D e p e n d i n g J o b ( g r o u p J o b s t o r e i t / / O n l y o u t p u t t h e f i r s t 1 0 0 r e c o r d s / / a c c o r d i n g l y . w< i1 0 0 (& & ui t e r . h a s N e x t ( ) ) h l e c o n t { J o b C o n t r o l j c = n e1 0 0 os i t e s rf o r w J b C o n t o l L i s t < S t r i n g > f i r s t = n e w A r r a y L i s t < S t r i n g > ( )o c . c o l l e c t ( k e y , ; i t e r . n e x1 8 )t o t ( ) ; 2 5 " ) ; L i s t < S t r i n g > s e c o n d = n e w A r r a y L i s t < S t r i n g > (c o u n t + + ; ) ; j c . a d d J o b ( l o a d P a g e s ) ; } j c . a d d J o b ( l o a d U s e r s ) ; w h i l e ( i t e r . h a s N e x t ( ) ) { } j c . a d d J o b ( j o i n J o b ) ; T e x t t = i t e r . n e x t ( ) ; } j c . a d d J o b ( g r o u p J o b ) ; S t S t r r i n i g n g v ( a ) l ; u e = t . t o p u b l i c s t a t i c v o i d m a i n ( S t r i n g [ ] a r g s ) t h r o wj c . a d d J o b ( l i m i t ) ; s I O E x c e p t i o n { i f ( v a l u e . c h a r A t ( 0 ) = = ' 1 ' ) J o b C o n f l p = n e w J o b C o n f ( M R E x a m p l e . c l a s sj c . r u n ( ) ; ) ; f i r s t . a d d ( v a l u e . s u b s t r i n g ( 1 ) ) ; t J o b N a m e ( " L o a d l p . s e P a g e s " ) ; } e l s e s e c o n d . a d d ( v a l u e . s u b s t r i n g (l p . s e t I n p u t F o r m a t ( T e x t I n p u t F o r m a} . c l a s s ) ; 1 ) ) ; t 46
  • 47. Ease of Translation Load Users Load Pages Users = LOAD … Filter by age Filtered = FILTER … Pages = LOAD … Join on name Joined = JOIN … Group on url Grouped = GROUP … Summed = … COUNT()… Count clicks Sorted = ORDER … Order by clicks Top5 = LIMIT … Take top 5 47
  • 48. The Hadoop/Pig/Cassandra stack  Cassandra has gained some significant integration points with Hadoop and its analytics tools  In order to achieve Hadoop’s data locality, Cassandra nodes must be part of the Hadoop cluster by running a tasktracker process. So the namenode and jobtracker can reside outside of the Cassandra cluster A three- node Cassandra/Hadoop cluster with external namenode / jobtracker 48
  • 49. Hadoop jobs  Cassandra has a Java source package for Hadoop integration org.apache.cassandra.hadoop  ColumnFamilyInputFormat extends InputFormat  ColumnFamilyOutputFormat extends OutputFormat  ConfigHelper a helper class to configure Cassandra-specific information  Hadoop output streaming was introduced in 0.7 but removed from 0.8 49
  • 50. Pig alongside Cassandra  The Pig integration CassandraStorage() (a LoadFunc implementation) allows Pig to Load/Store data from/to Cassandra grunt> LOAD 'cassandra://Keyspace/cf' USING CassandraStorage();  The pig_cassandra script, shipped with Cassandra source, performs the necessary initialisation (Pig environments variables still needs to be set)  Pygmalion is a set of scripts and UDFs to facilitate the use of Pig alongside Cassandra 50
  • 51. Workflow  A workflow system provides an infrastructure to set up & manage a sequence of interdependent jobs / set of jobs  The hadoop ecosystem includes a set of workflow tools to run applications over MapReduce processes or High-level languages  Cascading (http://www.cascading.org/). A java library defining data processing workflows and rendering them to MapReduce jobs  Oozie (http://yahoo.github.com/oozie/) 51
  • 52. Some links  http://hadoop.apache.org  http://pig.apache.org/  https://cwiki.apache.org/confluence/display/PIG/Index  PiggyBank: https://cwiki.apache.org/confluence/display/PIG/PiggyBank  DataFu: https://github.com/linkedin/datafu  Pygmalion: https://github.com/jeromatron/pygmalion  http://code.google.com/edu/parallel/mapreduce-tutorial.html  Video tutorials from Cloudera: http://www.cloudera.com/hadoop-training  Interesting papers:  http://bit.ly/rskJho - Original MapReduce paper  http://bit.ly/KvFXxT - Pig paper: ‘Building a High-Level Dataflow System on top of MapReduce: The Pig Experience’ 52
  • 53. A simple data flow Load checkins data Keep only the two ids Top 50 users / locations [same script, different group key] Group by user/loc id & Order Limit to top 50 53
  • 54. Another data flow Load checkins data Split_date All the checkins, over weeks Group by date Group by weeks using Count the tuples Stream 54