Lecture 5 - MapReduce
MapReduce Patterns
Lecture Outlines
• MapReduce Patterns
Review
• Analytics patterns.
• Big Data Patterns
Keywords
Review
Analytics Patterns
• Analytics Patterns
• Alpha
• Beta
• Gamma
• Delta
• Analytics Architectural Components & Styles
• Load Leveling with Queues
• Load Balancing with Multiple Consumers
• Leader Election
• Sharding
• Consistency, Availability & Partition Tolerance (CAP)
• Bloom Filter
• Materialized Views
• Lambda Architecture
• Scheduler-Agent-Supervisor
• Pipes & Filters
• Web Service
• Consensus in Distributed Systems
MapReduce Patterns
Numerical
Summarization
• Numerical summarization patterns are used to compute various
statistics such as counts, maximum, minimum, mean, etc.
• These statistics help in presenting the data in a summarized form.
• For example, computing the total number of likes for a particular post,
computing the average monthly rainfall or finding the average number of visitors
per month on a website.
• For the examples in this section, we will use mock data similar to the
data collected by a web analytics service that shows various statistics
for page visits for a website.
• Each page has some tracking code which sends the visitor’s IP address along
with a timestamp to the web analytics service.
• Each visit to a page is logged as one row in the log.
• The log file contains the following columns: Date (YYYY-MM-DD), Time
(HH:MM:SS), URL, IP, Visit-Length.
Count
• To compute count, the mapper function produces key-value pairs
where the key is the field to group-by and value is either ‘1’ or any
related items required to compute count.
• The reducer function receives the key-value pairs grouped by the same
key and adds up the values for each group to compute count.
• Let us look at an example of computing the total number of times each
page is visited in the year 2014, from the web analytics service logs.
• The figure shows the data and the key-value pairs at each step of the
MapReduce job for computing the count.
• The mapper function in this example parses each line of the
input and produces key-value pairs where the key is the URL and the value
is ‘1’.
• The reducer receives the list of values grouped by the key and sums up
the values to compute the count.
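• As a rough illustration of the count pattern, the following minimal Python sketch simulates the map, group and reduce steps locally. The whitespace-separated log layout, the sample rows and the run_job helper are assumptions made for illustration; they are not part of any real MapReduce framework.

```python
from collections import defaultdict

# Mock log rows (assumed whitespace-separated): Date Time URL IP Visit-Length
LOG = [
    "2014-01-05 10:15:00 http://example.com/index.html 10.0.0.1 30",
    "2014-02-11 11:20:10 http://example.com/contact.html 10.0.0.2 12",
    "2014-03-02 09:05:45 http://example.com/index.html 10.0.0.3 48",
    "2013-12-30 23:59:59 http://example.com/index.html 10.0.0.1 20",
]

def mapper(line):
    date, _time, url, _ip, _visit_length = line.split()
    if date.startswith("2014"):  # keep only visits made in 2014
        yield url, 1             # key = URL, value = 1

def reducer(url, counts):
    yield url, sum(counts)       # add up the 1s for each URL

def run_job(records, mapper, reducer):
    """Hypothetical helper: map every record, group values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [pair for key, values in groups.items() for pair in reducer(key, values)]

result = dict(run_job(LOG, mapper, reducer))
print(result)
```

With these sample rows, index.html is counted twice and contact.html once; the 2013 row is dropped by the mapper.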
Max/Min
• To compute maximum or minimum, the mapper function produces
key-value pairs where the key is the field to group-by and value
contains related items required to compute maximum or minimum.
• The reducer function receives the list of values grouped by the same
key and finds the minimum or maximum value.
• Let us look at an example of computing the most visited page in each
month of the year 2014, from the web analytics service logs.
• The figure shows the data and the key-value pairs at each step of the MapReduce job for finding the most
visited page.
• The MapReduce execution framework may treat this as multiple MapReduce jobs chained
together.
• The mapper function in this example parses each line of the input and produces key-value pairs
where the key is a tuple comprising the month and URL, and the value is ‘1’.
• The first reducer receives the list of values grouped by the key and sums up the values to count the visits
for each page in each month.
• The first reducer produces the month as the key and a tuple comprising the page visit count and page URL as
the value.
• The second reducer receives a list of (visit count, URL) pairs grouped by the same month, and
finds the pair with the maximum visit count.
• The second reducer produces the month as the key and a tuple comprising the maximum page visit count
and page URL as the value.
• In this example, a two-step job is required because the page visit counts must be computed
before the maximum count can be found.
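• The two-step job described above can be sketched in plain Python as follows. This is only an illustrative simulation: the whitespace-separated log layout, the sample rows and the shuffle helper are assumptions, and a real framework would run the steps as chained jobs.

```python
from collections import defaultdict

# Mock log rows (assumed whitespace-separated): Date Time URL IP Visit-Length
LOG = [
    "2014-01-05 10:15:00 http://example.com/index.html 10.0.0.1 30",
    "2014-01-09 11:20:10 http://example.com/index.html 10.0.0.2 12",
    "2014-01-21 09:05:45 http://example.com/about.html 10.0.0.3 48",
    "2014-02-03 14:00:00 http://example.com/contact.html 10.0.0.4 20",
]

def shuffle(pairs):
    """Hypothetical helper: group values by key, as the framework does between steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def mapper(line):
    date, _time, url, _ip, _visit_length = line.split()
    year, month, _day = date.split("-")
    if year == "2014":
        yield (month, url), 1            # key = (month, URL), value = 1

def reducer_count(key, counts):
    month, url = key
    yield month, (sum(counts), url)      # re-key by month for the second step

def reducer_max(month, count_url_pairs):
    # Tuples compare by visit count first; ties fall back to URL order.
    yield month, max(count_url_pairs)

step1 = [kv for line in LOG for kv in mapper(line)]
step2 = [kv for key, vals in shuffle(step1) for kv in reducer_count(key, vals)]
most_visited = dict(kv for key, vals in shuffle(step2) for kv in reducer_max(key, vals))
print(most_visited)
```

Replacing max with min in reducer_max would give the least visited page per month instead.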
Average
• To compute the average, the mapper function produces key-value
pairs where the key is the field to group-by and value contains related
items required to compute the average.
• The reducer function receives the list of values grouped by the same
key and finds the average value.
• Let us look at an example of computing the average visit length for
each page.
• The figure shows the data and the key-value pairs at each step of the
MapReduce job.
• The mapper function in this example parses each line of the input
and produces key-value pairs where the key is the URL and the value is the
visit length.
• The reducer receives the list of values grouped by the key (which is the
URL) and finds the average of these values.
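• A minimal Python sketch of the average pattern is given below. The whitespace-separated log layout, the sample rows and the run_job helper are assumptions made for illustration only.

```python
from collections import defaultdict

# Mock log rows (assumed whitespace-separated): Date Time URL IP Visit-Length
LOG = [
    "2014-01-05 10:15:00 http://example.com/index.html 10.0.0.1 30",
    "2014-02-11 11:20:10 http://example.com/contact.html 10.0.0.2 12",
    "2014-03-02 09:05:45 http://example.com/index.html 10.0.0.3 50",
]

def mapper(line):
    _date, _time, url, _ip, visit_length = line.split()
    yield url, int(visit_length)         # key = URL, value = visit length

def reducer(url, visit_lengths):
    yield url, sum(visit_lengths) / len(visit_lengths)

def run_job(records, mapper, reducer):
    """Hypothetical helper: map every record, group values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [pair for key, values in groups.items() for pair in reducer(key, values)]

averages = dict(run_job(LOG, mapper, reducer))
print(averages)
```

If partial aggregation were added before the reducer, it would need to pass (sum, count) pairs rather than partial averages, because an average of averages is not in general the overall average.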
Top-N
• To find the top-N records, the mapper function produces key-value
pairs where the key is the field to group by and value contains related
items required to compute top-N.
• The reducer function receives the list of values grouped by the same
key, sorts the values and produces the top-N values for each key.
• In an alternative approach, each mapper emits its local top-N records
and the reducer then finds the global top-N.
• Let us look at an example of computing the top 3 visited pages in each
month of the year 2014.
• The figure shows the data and the key-value pairs at each step of the MapReduce
job.
• The mapper function in this example parses each line of the input and
produces key-value pairs where the key is the URL and the value is ‘1’.
• The first reducer receives the list of values grouped by the key and sums up the
values to count the visits for each page.
• The first reducer produces None as the key and a tuple comprising the page visit
count and page URL as the value.
• The second reducer receives a list of (visit count, URL) pairs all grouped
together (as the key is None).
• The second reducer sorts the pairs by visit count and produces the top 3 visit counts along with
the page URLs.
• In this example, a two-step job is required because the page visit counts must be
computed before the top 3 visited pages can be found.
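• The steps above can be sketched in plain Python as follows. The sketch follows the steps as described (counts are taken over all records, with None as the single key in the second step). The log layout, the generated sample rows and the shuffle helper are assumptions made for illustration.

```python
from collections import defaultdict

# Mock log rows (assumed whitespace-separated): Date Time URL IP Visit-Length
PAGES = ["index"] * 4 + ["about"] * 3 + ["contact"] * 2 + ["faq"]
LOG = [f"2014-03-01 10:00:00 http://example.com/{page}.html 10.0.0.1 10"
       for page in PAGES]

N = 3  # how many top records to keep

def shuffle(pairs):
    """Hypothetical helper: group values by key, as the framework does between steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def mapper(line):
    _date, _time, url, _ip, _visit_length = line.split()
    yield url, 1

def reducer_count(url, counts):
    yield None, (sum(counts), url)       # None key sends every pair to one group

def reducer_top_n(_key, count_url_pairs):
    # Sort by visit count (highest first) and keep the first N pairs.
    for pair in sorted(count_url_pairs, reverse=True)[:N]:
        yield None, pair

step1 = [kv for line in LOG for kv in mapper(line)]
step2 = [kv for key, vals in shuffle(step1) for kv in reducer_count(key, vals)]
top_pages = [pair for key, vals in shuffle(step2)
             for _, pair in reducer_top_n(key, vals)]
print(top_pages)
```

Because every pair shares the key None, a single reducer sees all the counts; this is acceptable here since the first step has already reduced the data to one pair per page.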
Filter
• The filtering pattern is used to select a subset of the records based
on a filtering criterion.
• The records themselves are not changed or processed.
• Filtering is useful when you want to get a subset of the data for further
processing.
• Filtering requires only a Map task. Each mapper filters its local
records using the filtering criteria in the map function.
• Let us look at an example of filtering all page visits for the page
‘contact.html’ in the month of Dec 2014, in the web analytics service
log.
• The figure shows the data and the key-value pairs at each step of the
MapReduce job.
• The mapper function in this example parses each line of the
input, extracts the month, year and page URL, and produces a key-value
pair only if the month and year are Dec 2014 and the page URL is
‘http://example.com/contact.html’.
• The key is the URL, and the value is a tuple containing the rest of the
parsed fields.
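• A minimal Python sketch of this map-only filter is shown below. The whitespace-separated log layout and the sample rows are assumptions made for illustration.

```python
# Mock log rows (assumed whitespace-separated): Date Time URL IP Visit-Length
LOG = [
    "2014-12-03 08:30:00 http://example.com/contact.html 10.0.0.1 25",
    "2014-12-10 09:00:00 http://example.com/index.html 10.0.0.2 40",
    "2014-11-28 17:45:10 http://example.com/contact.html 10.0.0.3 15",
    "2014-12-19 12:10:30 http://example.com/contact.html 10.0.0.4 33",
]

TARGET_URL = "http://example.com/contact.html"

def mapper(line):
    date, time, url, ip, visit_length = line.split()
    year, month, _day = date.split("-")
    # Emit the record only if it matches the filtering criteria.
    if (year, month) == ("2014", "12") and url == TARGET_URL:
        yield url, (date, time, ip, visit_length)

# Map-only job: there is no reduce step, the mapper output is the result.
filtered = [kv for line in LOG for kv in mapper(line)]
for key, value in filtered:
    print(key, value)
```

Only the two Dec 2014 visits to contact.html pass the filter; the other rows produce no output at all.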
Distinct
• The Distinct pattern is used to filter out duplicate records or produce
distinct values of sub-fields in the dataset.
• Finding distinct records is simple with MapReduce as the records with
the same key are grouped together in the reduce phase.
• The mapper function produces key-value pairs where the key is the field
for which we want to find distinct values (this may be a sub-field in a record or the
entire record) and the value is None.
• The reducer function receives key-value pairs grouped by the same key
and produces the key with None as the value.
• The figure shows the data and the key-value pairs at each step of the
MapReduce job.
• The mapper function in this example parses each line of the input
and produces key-value pairs where the key is the IP address and the value
is None.
• The reducer receives the list of values (all None) grouped by the key
(unique IP addresses) and produces the key with None as the value.
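• The distinct pattern can be sketched in plain Python as follows. The whitespace-separated log layout, the sample rows and the run_job helper are assumptions made for illustration.

```python
from collections import defaultdict

# Mock log rows (assumed whitespace-separated): Date Time URL IP Visit-Length
LOG = [
    "2014-01-05 10:15:00 http://example.com/index.html 10.0.0.1 30",
    "2014-02-11 11:20:10 http://example.com/contact.html 10.0.0.2 12",
    "2014-03-02 09:05:45 http://example.com/index.html 10.0.0.1 48",
    "2014-03-07 16:40:00 http://example.com/about.html 10.0.0.3 22",
]

def mapper(line):
    _date, _time, _url, ip, _visit_length = line.split()
    yield ip, None                       # key = IP address, value unused

def reducer(ip, _values):
    yield ip, None                       # one output per distinct key

def run_job(records, mapper, reducer):
    """Hypothetical helper: map every record, group values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [pair for key, values in groups.items() for pair in reducer(key, values)]

distinct_ips = [ip for ip, _ in run_job(LOG, mapper, reducer)]
print(distinct_ips)
```

The de-duplication is done entirely by the grouping step: the repeated address 10.0.0.1 reaches the reducer as a single key with two None values.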
Binning
• The Binning pattern is used to partition records into bins or categories.
• Binning is useful when you want to partition your dataset into
different bins (based on partitioning criteria) and further process
records in each bin separately or process records in certain bins only.
• Binning requires only a Map task.
• In the mapper function, each record is checked using a list of criteria
and assigned to a particular bin.
• The mapper function produces key-value pairs where the key is the bin
and value is the record. No processing is done for the record in this
pattern.
• Let us look at an example of partitioning records by the quarter (Q1-
Q4) in the web analytics service log.
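• A minimal Python sketch of binning by quarter is given below. The whitespace-separated log layout and the sample rows are assumptions; the dictionary that collects the bins is only there to display the mapper output and is not a reduce step.

```python
from collections import defaultdict

# Mock log rows (assumed whitespace-separated): Date Time URL IP Visit-Length
LOG = [
    "2014-01-05 10:15:00 http://example.com/index.html 10.0.0.1 30",
    "2014-05-11 11:20:10 http://example.com/contact.html 10.0.0.2 12",
    "2014-06-02 09:05:45 http://example.com/index.html 10.0.0.3 48",
    "2014-11-20 18:30:00 http://example.com/about.html 10.0.0.4 22",
]

def mapper(line):
    date = line.split()[0]
    month = int(date.split("-")[1])
    quarter = f"Q{(month - 1) // 3 + 1}"  # months 1-3 -> Q1, 4-6 -> Q2, ...
    yield quarter, line                   # key = bin, value = unchanged record

# Map-only job: collect the mapper output per bin for display.
bins = defaultdict(list)
for line in LOG:
    for quarter, record in mapper(line):
        bins[quarter].append(record)

for quarter in sorted(bins):
    print(quarter, len(bins[quarter]))
```

With these sample rows Q1 and Q4 receive one record each, Q2 receives two, and Q3 stays empty.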
Inverted Index
• An inverted index is an index data structure which stores the mapping from
the content (such as words in a document or on a webpage) to the location of
the content (such as document filename or a page URL).
• Search engines use inverted indexes to enable faster searching of documents
or pages containing some specific content.
• To generate an inverted index, the mapper function produces key-value pairs
where key contains the fields for the index (such as each word in the
document), and the value is a unique identifier of the document.
• The reducer function receives the list of values (such as document IDs)
grouped by the same key (word) and produces a key and the list of values.
• Let us look at an example of an inverted index for multiple books.
• Let us assume we have all the content from all the books combined
into one large file with two fields separated by a pipe symbol (‘|’).
• The first field is the filename and the second field contains all the
content in the file.
• The figure shows the data and the key-value pairs at each step of the
MapReduce job.
• The mapper function in this example parses each line of the input
and produces key-value pairs where the key is each word in the line
and the value is the filename.
• The reducer receives the list of values (filenames) grouped by the key
(word) and produces the word and the list of filenames in which the
word occurs.
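• The inverted index job can be sketched in plain Python as follows. The filenames, the book contents and the run_job helper are made up for illustration; lower-casing the words and removing duplicate filenames in the reducer are additional assumptions not stated above.

```python
from collections import defaultdict

# Mock input: one line per book, "filename|content"
BOOKS = [
    "book1.txt|the quick brown fox",
    "book2.txt|the lazy dog",
    "book3.txt|quick thinking dog",
]

def mapper(line):
    filename, content = line.split("|", 1)
    for word in content.split():
        yield word.lower(), filename     # key = word, value = filename

def reducer(word, filenames):
    # Assumption: list each filename once, in sorted order.
    yield word, sorted(set(filenames))

def run_job(records, mapper, reducer):
    """Hypothetical helper: map every record, group values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [pair for key, values in groups.items() for pair in reducer(key, values)]

index = dict(run_job(BOOKS, mapper, reducer))
print(index["quick"])
```

A lookup such as index["quick"] then returns the files containing that word without scanning the book contents again.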
Sorting
• The sorting pattern is used to sort the records based on a particular field.
• Let us look at an example of sorting records in the web analytics service log by
the visit length.
• The figure shows the data and the key-value pairs at each step of the MapReduce
job.
• The mapper function in this example parses each line of the input and
produces key-value pairs where the key is None and the value is a tuple
comprising the visit length and the rest of the fields in the record (as a nested
tuple).
• The reducer receives the list of values all grouped together (as the key is
None).
• The reducer uses the sorted function, which sorts the list of tuples by the first
element of each tuple (which is the visit length).
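• The sorting job can be sketched in plain Python as follows. The whitespace-separated log layout and the sample rows are assumptions; the visit length is converted to an integer so that the sort is numeric rather than alphabetical.

```python
# Mock log rows (assumed whitespace-separated): Date Time URL IP Visit-Length
LOG = [
    "2014-01-05 10:15:00 http://example.com/index.html 10.0.0.1 30",
    "2014-02-11 11:20:10 http://example.com/contact.html 10.0.0.2 12",
    "2014-03-02 09:05:45 http://example.com/index.html 10.0.0.3 48",
]

def mapper(line):
    date, time, url, ip, visit_length = line.split()
    # key = None, value = (visit length, rest of the record as a nested tuple)
    yield None, (int(visit_length), (date, time, url, ip))

def reducer(_key, values):
    # sorted() compares tuples element by element, so visit length comes first.
    for value in sorted(values):
        yield None, value

# All values share the key None, so one reducer call receives every record.
mapped = [value for line in LOG for _, value in mapper(line)]
sorted_records = [value for _, value in reducer(None, mapped)]
for visit_length, rest in sorted_records:
    print(visit_length, rest)
```

Since every record is sent to a single key, one reducer handles the whole dataset; that keeps the sketch simple but limits parallelism in the reduce phase.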
Joins
• When datasets contain multiple files, joins are used to combine the records in the
files for further processing.
• Joins combine two or more datasets or records in multiple files, based on a field
(called the join attribute or foreign key).
• Let us take the example of joining two tables A and B.
• Figure 3.20 shows various types of joins.
• An Inner Join returns rows from both tables that have the same value in the
matching columns (the foreign key).
• The output contains the columns of both tables for the rows with matching foreign keys.
• Any unmatched records from either table are not included in the output.
• A Full Outer Join is another type of join which includes all the matched and
unmatched records from both tables.
• Full Outer Join returns all the rows from both tables and returns NULL values in
the columns of a table where no row matches.
• In a Left Outer Join, the unmatched records in the table on the left side
of the join are included along with all the matched records from
both tables.
• Left Outer Join returns all rows from the table on the left side of the join and
returns NULL in columns of the table on the right where no row matches.
• In a Right Outer Join, the unmatched records in the table on the right
side of the join are included along with all the matched records
from both tables.
• Right Outer Join returns all rows from the table on the right side of the join and
returns NULL in columns of the table on the left where no row matches.
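• The four join types can be compared on two tiny tables with the Python sketch below. It is an illustration only: the tables A and B are modelled as dictionaries keyed by the join attribute, each key is assumed to occur at most once per table, None stands in for NULL, and the join function is a hypothetical helper.

```python
def join(left, right, how="inner"):
    """Join two dicts (join key -> row value); None marks a missing side."""
    if how == "inner":
        keys = left.keys() & right.keys()   # keys present in both tables
    elif how == "left":
        keys = left.keys()                  # every key of the left table
    elif how == "right":
        keys = right.keys()                 # every key of the right table
    elif how == "full":
        keys = left.keys() | right.keys()   # keys present in either table
    else:
        raise ValueError(f"unknown join type: {how}")
    return {key: (left.get(key), right.get(key)) for key in sorted(keys)}

A = {1: "a1", 2: "a2"}   # table A: key 1 has no match in B
B = {2: "b2", 3: "b3"}   # table B: key 3 has no match in A

for how in ("inner", "left", "right", "full"):
    print(how, join(A, B, how))
```

Only key 2 appears in the inner join; the left join adds key 1 with None on the right, the right join adds key 3 with None on the left, and the full outer join contains all three keys.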
• Let us look at examples of joins using MapReduce. For these examples,
we will use two datasets containing records of employees and
departments in an organization.
• The employees dataset contains fields such as Employee ID, Employee
Name, Department ID, Joining Date and Salary.
• The departments dataset contains fields such as Department ID,
Department Name and Number of Employees.
• The first field in the employees dataset contains the word ‘Employee’
followed by the rest of the fields containing employee details.
• Similarly, the first field in the departments dataset contains the word
‘Department’, followed by the rest of the fields containing department
details.
• Let us look at an example of joining the employee and department datasets by
the Department ID field.
• The figure shows the data and the key-value pairs at each step of the MapReduce
job.
• The mapper function in this example parses each line of the input and
produces key-value pairs where the key is the Department ID and the value is the
complete record.
• The reducer receives the list of values all grouped by the Department ID.
• In the reducer, we check the first field of each value: if the field is
‘Employee’, we add the value to an employees list, and if it is
‘Department’, we add the value to a departments list.
• Next, we iterate over both lists and perform the join.
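• The reduce-side join described above can be sketched in plain Python as follows. The comma-separated record layout, the sample employees and departments and the run_job helper are assumptions made for illustration; the sketch performs an inner join and keeps only a few output fields.

```python
from collections import defaultdict

# Mock records (assumed comma-separated), tagged by their first field:
# Employee,<Employee ID>,<Name>,<Department ID>,<Joining Date>,<Salary>
# Department,<Department ID>,<Department Name>,<Number of Employees>
RECORDS = [
    "Employee,E1,Alice,D1,2014-01-15,5000",
    "Employee,E2,Bob,D2,2013-06-01,4200",
    "Employee,E3,Carol,D1,2012-09-30,6100",
    "Department,D1,Engineering,2",
    "Department,D2,Sales,1",
]

def mapper(line):
    fields = tuple(line.split(","))
    # Department ID is field 3 of an employee record, field 1 of a department record.
    dept_id = fields[3] if fields[0] == "Employee" else fields[1]
    yield dept_id, fields                # key = Department ID, value = full record

def reducer(dept_id, values):
    employees, departments = [], []
    for value in values:                 # split the group by record type
        if value[0] == "Employee":
            employees.append(value)
        elif value[0] == "Department":
            departments.append(value)
    for emp in employees:                # inner join: pair each employee
        for dept in departments:         # with each matching department
            yield dept_id, (emp[1], emp[2], dept[2])

def run_job(records, mapper, reducer):
    """Hypothetical helper: map every record, group values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [pair for key, values in groups.items() for pair in reducer(key, values)]

joined = run_job(RECORDS, mapper, reducer)
for dept_id, row in joined:
    print(dept_id, row)
```

Because the nested loops produce nothing when either list is empty, unmatched employees or departments are dropped, which is the inner join behaviour; an outer join would additionally emit rows padded with None for the empty side.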
Next lecture
• NoSQL Databases
Assignment
• Download any dataset and apply one of the MapReduce patterns to it.
Deadline
Previous Deadline
(Assignment 2)