BDA Techmax (Searchable)

SYLLABUS

Course Code : CSDLO7032
Course/Subject Name : Big Data Analytics
Credits : 4

Course Objectives (CO) :

1. To provide an overview of an exciting growing field of big data analytics.

2. To introduce programming skills to build simple solutions using big data technologies such as MapReduce and scripting for NoSQL, and the ability to write parallel algorithms for multiprocessor execution.

3. To teach the fundamental techniques and principles in achieving big data analytics with scalability and streaming capability.

4. To enable students to have skills that will help them solve complex real-world problems for decision support.

5. To provide an indication of the current research approaches that are likely to provide a basis for tomorrow's solutions.

Course Outcomes : Students should be able to -

1. Understand the key issues in big data management and its associated applications for business decisions and strategy.

2. Develop problem solving and critical thinking skills in fundamental enabling techniques like Hadoop, MapReduce and NoSQL in big data analytics.

3. Collect, manage, store, query and analyze various forms of Big Data.

4. Interpret business models and scientific computing paradigms, and apply software tools for big data analytics.

5. Adapt adequate perspectives of big data analytics in various applications like recommender systems, social media applications etc.

6. Solve complex real-world problems in various applications like recommender systems, social media applications, health and medical systems, etc.

Pre-requisites : Some prior knowledge about Java programming, basics of SQL, data mining and machine learning methods would be beneficial.

Module Detailed Contents Hrs.

Module 01 : Introduction to Big Data and Hadoop (06 Hrs.)
1.1 Introduction to Big Data.
1.2 Big Data characteristics, Types of Big Data.
1.3 Traditional vs. Big Data business approach.
1.4 Case Study of Big Data Solutions.
1.5 Concept of Hadoop.
1.6 Core Hadoop Components; Hadoop Ecosystem. (Refer Chapters 1 and 2)

Module 02 : Hadoop HDFS and MapReduce (10 Hrs.)
2.1 Distributed File Systems : Physical Organization of Compute Nodes, Large-Scale File-System Organization.
2.2 MapReduce : The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners, Details of MapReduce Execution, Coping With Node Failures.
2.3 Algorithms Using MapReduce : Matrix-Vector Multiplication by MapReduce, Relational-Algebra Operations, Computing Selections by MapReduce, Computing Projections by MapReduce, Union, Intersection, and Difference by MapReduce.
2.4 Hadoop Limitations. (Refer Chapter 3)

Module 03 : NoSQL (07 Hrs.)
3.1 Introduction to NoSQL, NoSQL Business Drivers.
3.2 NoSQL Data Architecture Patterns : Key-value stores, Graph stores, Column family (Bigtable) stores, Document stores, Variations of NoSQL architectural patterns, NoSQL Case Study.
3.3 NoSQL solution for big data, Understanding the types of big data problems; Analyzing big data with a shared-nothing architecture; Choosing distribution models : master-slave versus peer-to-peer; NoSQL systems to handle big data problems. (Refer Chapter 4)

Module 04 : Mining Data Streams (12 Hrs.)
4.1 The Stream Data Model : A Data-Stream-Management System, Examples of Stream Sources, Stream Queries, Issues in Stream Processing.
4.2 Sampling Data techniques in a Stream.
4.3 Filtering Streams : Bloom Filter with Analysis.
4.4 Counting Distinct Elements in a Stream, Count-Distinct Problem, Flajolet-Martin Algorithm, Combining Estimates, Space Requirements.
4.5 Counting Frequent Items in a Stream, Sampling Methods for Streams, Frequent Itemsets in Decaying Windows.
4.6 Counting Ones in a Window : The Cost of Exact Counts, The Datar-Gionis-Indyk-Motwani Algorithm, Query Answering in the DGIM Algorithm, Decaying Windows. (Refer Chapter 5)

Module 05 : Finding Similar Items and Clustering (08 Hrs.)
5.1 Distance Measures : Definition of a Distance Measure, Euclidean Distances, Jaccard Distance, Cosine Distance, Edit Distance, Hamming Distance.
5.2 CURE Algorithm, Stream-Computing, A Stream-Clustering Algorithm, Initializing & Merging Buckets, Answering Queries. (Refer Chapters 6 and 7)

Module 06 : Real-Time Big Data Models (10 Hrs.)
6.1 PageRank Overview, Efficient computation of PageRank : PageRank Iteration Using MapReduce, Use of Combiners to Consolidate the Result Vector.
6.2 A Model for Recommendation Systems, Content-Based Recommendations, Collaborative Filtering.
6.3 Social Networks as Graphs, Clustering of Social-Network Graphs, Direct Discovery of Communities in a social graph. (Refer Chapters 8, 9 and 10)
Big Data Analytics (MU) - Table of Contents

Chapter 1 : Introduction to Big Data
1.1 Introduction to Big Data Management
1.2 Big Data
1.3 Big Data Characteristics - Four Important V of Big Data
1.4 Types of Big Data
1.5 Big Data vs. Traditional Data Business Approach
1.6 Tools used for Big Data
1.7 Data Infrastructure Requirements
1.8 Case Studies of Big Data Solutions

Chapter 2 : Introduction to Hadoop
2.1 Hadoop
2.1.1 Hadoop - Features
2.1.2 Hadoop and Traditional RDBMS
2.2 Hadoop System Principles
2.3 Hadoop Physical Architecture
2.4 Hadoop Core Components
2.4.1 HDFS (Hadoop Distributed File System)
2.4.2 MapReduce
2.4.3 Hadoop - Limitation
2.5 Hadoop - Ecosystem
2.6 ZooKeeper
2.7 HBase
2.7.1 Comparison of HDFS and HBase
2.7.2 Comparison of RDBMS and HBase
2.7.3 HBase Architecture
2.7.4 Region Splitting Methods
2.7.5 Region Assignment and Load Balancing
2.7.6 HBase Data Model
2.8 HIVE
2.8.1 Architecture of HIVE
2.8.2 Working of HIVE
2.8.3 HIVE Data Models

Chapter 3 : Hadoop HDFS and MapReduce
3.1 Distributed File Systems
3.1.1 Physical Organization of Compute Nodes
3.1.2 Large-Scale File-System Organization
3.2 MapReduce
3.2.1 The Map Tasks
3.2.2 Grouping by Key
3.2.3 The Reduce Tasks
3.2.4 Combiners
3.2.5 Details of MapReduce Execution
3.2.6 Coping with Node Failures
3.3 Algorithms using MapReduce
3.3.1 Matrix-Vector Multiplication by MapReduce
3.3.2 Relational-Algebra Operations
3.3.3 Computing Selections by MapReduce
3.3.4 Computing Projections by MapReduce
3.3.5 Union, Intersection and Difference by MapReduce
3.4 Hadoop Limitations

Chapter 4 : NoSQL
4.1 NoSQL (What is NoSQL?)
4.2 NoSQL Basic Concepts
4.3 Case Study NoSQL (SQL vs NoSQL)
4.4 Business Drivers of NoSQL
4.5 NoSQL Database Types
4.6 Benefits of NoSQL
4.7 Introduction to Big Data Management
4.8 Big Data
4.8.1 Tools Used for Big Data
4.8.2 Understanding Types of Big Data Problems
4.9 Four Ways of NoSQL to Operate Big Data Problems
4.10 Analyzing Big Data with a Shared-Nothing Architecture
4.10.1 Shared Memory System
4.10.2 Shared Disk System
4.10.3 Shared Nothing Disk System
4.10.4 Hierarchical System
4.11 Choosing Distribution Models : Master-Slave versus Peer-to-Peer
4.11.1 Big Data NoSQL Solutions
4.11.1(A) Cassandra
4.11.1(B) Dynamo DB

Chapter 5 : Mining Data Streams
5.1 The Stream Data Model
5.1.1 A Data-Stream-Management System
5.1.2 Examples of Stream Sources
5.1.3 Stream Queries
5.1.4 Issues in Stream Processing
5.2 Sampling Data Techniques in a Stream
5.3 Filtering Streams
5.3.1 Bloom Filter with Analysis
5.4 Counting Distinct Elements in a Stream
5.4.1 Count-Distinct Problem
5.4.2 The Flajolet-Martin Algorithm
5.4.3 Combining Estimates
5.4.4 Space Requirements
5.5 Counting Frequent Items in a Stream
5.5.1 Sampling Methods for Streams
5.5.2 Frequent Itemsets in Decaying Windows
5.6 Counting Ones in a Window
5.6.1 The Cost of Exact Counts
5.6.2 The DGIM Algorithm (Datar - Gionis - Indyk - Motwani)
5.6.3 Query Answering in the DGIM Algorithm
5.6.4 Decaying Windows

Chapter 6 : Finding Similar Items
6.1 Distance Measures
6.1.1 Definition of a Distance Measure
6.1.2 Euclidean Distances
6.1.3 Jaccard Distance
6.1.4 Cosine Distance
6.1.5 Edit Distance
6.1.6 Hamming Distance

Chapter 7 : Clustering
7.1 Introduction
7.2 CURE Algorithm
7.2.1 Overview of CURE (Cluster Using REpresentatives)
7.2.2 Hierarchical Clustering Algorithm
7.2.2(A) Random Sampling and Partitioning Sample
7.2.2(B) Eliminate Outliers and Data Labelling
7.3 Stream Computing
7.3.1 A Stream-Clustering Algorithm
7.4 Initializing and Merging Buckets
7.5 Answering Queries

Chapter 8 : Link Analysis
8.1 Page Rank Definition
8.1.1 Importance of Page Ranks
8.1.2 Links in Page Ranking
8.1.3 Structure of the Web
8.1.4 Using Page Rank in a Search Engine
8.2 Efficient Computation of Page Rank
8.2.1 Representation of Transition Matrix
8.2.2 Iterating Page Rank with MapReduce
8.2.3 Use of Combiners to Aggregate the Result Vector
8.3 Link Spam
8.3.1 Spam Farm Architecture
8.3.2 Spam Farm Analysis
8.3.3 Dealing with Link Spam
8.4 Hubs and Authorities
8.4.1 Formalizing Hubs and Authority

Chapter 9 : Recommendation Systems
9.1 Recommendation System
9.1.1 The Utility Matrix
9.1.2 Applications of Recommendation Systems
9.1.3 Taxonomy for Application Recommendation System
9.2 Content Based Recommendation
9.2.1 Item Profile
9.2.2 Discovering Features of Documents
9.2.3 Obtaining Item Features from Tags
9.2.4 Representing Item Profile
9.2.5 User Profiles
9.2.6 Recommending Items to Users based on Content
9.2.7 Classification Algorithm
9.3 Collaborative Filtering
9.3.1 Measuring Similarity
9.3.2 Jaccard Distance
9.3.3 Cosine Distance
9.3.4 Rounding the Data
9.3.5 Normalizing Rating
9.4 Pros and Cons in Recommendation System
9.4.1 Collaborative Filtering
9.4.2 Content-based Filtering

Chapter 10 : Mining Social Network Graph
10.1 Introduction
10.2 Social Network as Graphs
10.2.1 Parameters Used in Graph (Social Network)
10.2.2 Varieties of Social Network
10.2.2(A) Collaborative Network
10.2.2(B) Email Network
10.2.2(C) Telephone Network
10.3 Clustering of Social Network Graphs
10.3.1 Distance Measure for Social-Network Graphs
10.3.2 Applying Standard Cluster Method
10.3.3 Betweenness
10.3.4 The Girvan-Newman Algorithm
10.3.5 Using Betweenness to Find Communities
10.4 Direct Discovery of Communities
10.4.1 Bipartite Graph
10.4.2 Complete Bipartite Graph
10.5 Simrank
10.5.1 Random Walker on Social Network
10.5.2 Random Walks with Restart
10.6 Counting Triangles using MapReduce
Chapter 1 : Introduction to Big Data

Syllabus

Introduction to Big Data, Big Data characteristics, Types of Big Data, Traditional vs. Big Data business approach, Case Study of Big Data Solutions.

1.1 Introduction to Big Data Management


- We all are surrounded by huge data. People upload/download videos, audios and images from a variety of devices. Sending text messages, multimedia messages, updating Facebook, WhatsApp or Twitter status, comments, online shopping, online advertising etc. generates huge data.

- As a result, machines have to generate and keep huge data too. Due to this exponential growth of data, data analysis becomes a very much required task for day to day operations.

- The term 'Big Data' means huge volume, high velocity and a variety of data.

- This big data is increasing tremendously day by day.

- Traditional data management systems and existing tools are facing difficulties in processing such Big Data.

- R is one of the main computing tools used in statistical education and research. It is also widely used for data analysis and numerical computing in other fields of scientific research.

Fig. 1.1.1 : Big data analysis
1.2 Big Data

- We all are surrounded by huge data. People upload/download videos, audios and images from a variety of devices.

- Sending text messages, multimedia messages, updating Facebook, WhatsApp or Twitter status, comments, online shopping, online advertising etc. all generate huge amounts of data.

- As a result, machines have to generate and keep huge data too. Due to this exponential growth of data, the analysis of that data becomes challenging and difficult.

- The term 'Big Data' means huge volume, high velocity and a variety of data. This big data is increasing tremendously day by day. Traditional data management systems and existing tools are facing difficulties in processing such Big Data.

- Big data is one of the most important technologies in the modern world. It is really critical to store and manage it. Big data is a collection of large datasets that cannot be processed using traditional computing techniques.

- Big Data includes huge volume, high velocity and an extensible variety of data. The data in it may be structured data, semi-structured data or unstructured data. Big data also involves various tools, techniques and frameworks.

1.3 Big Data Characteristics - Four Important V of Big Data

(Dec. 16, Dec. 17)

Big data characteristics are as follows :

Fig. 1.3.1 : Big data characteristics (Volume : scale of data; Variety : different forms of data; Velocity : analysis of streaming data; Veracity : uncertainty of data)

1. Volume

- Huge amount of data is generated during big data applications.

- The amount of data generated as well as the storage volume is very big in size.

Fig. 1.3.2 : Data volume statistics (e.g. 2.5 billion gigabytes of new data generated every day; 25 petabytes of data collected every hour by a major retailer; four-fifths of the world's data is unstructured audio, video and RFID data)
Scanned by CamScanner
2. Velocity

- For time-critical applications faster processing is very important, e.g. share marketing, video streaming.

- The huge amount of data generated and stored requires a higher speed of processing.

- The amount of digital data will be doubled every 18 months, and it may repeat in less time in the future.

3. Variety

The type and nature of data has great variety.

Fig. 1.3.3 : Variety of data (structured and unstructured)

4. Veracity

- The data captured is not in a certain format.

- Data captured can vary greatly.

- So the accuracy of analysis depends on the veracity of the source data.

5. Additional characteristics

(a) Programmable

- It is possible with big data to explore all types of data by programming logic.

- Programming can be used to perform any kind of exploration because of the scale of the data.

(b) Data driven

- The data driven approach is possible for scientists, as the data collected is huge in amount.

(c) Multi attributes

- It is possible to deal with many gigabytes of data that consist of thousands of attributes.

- All data operations are now happening on a larger scale.
(d) Iterative

More computing power lets you iterate on your models until you get them as per your own requirements.

1.4 Types of Big Data

1. Introduction

Fig. 1.4.1 : Types of big data (structured data, unstructured data, semi-structured data)

2. Structured data

- Structured data is generally data that has a definite length and format.

- Like RDBMS tables, which have a fixed number of columns and where data can be increased by adding rows.

Example :

Structured data includes marks data as numbers, dates, or data like words and numbers. Structured data is very simple to deal with, and easy to store in a database.

Sources of structured data

The data can be generated by humans or it can be generated by machines.

(i) Machine generated data

1. Sensor data : Radio frequency ID tags, medical devices, and Global Positioning System data.

2. Weblog data : All kinds of data about user activity.

3. Point-of-sale data : Data associated with sales.

4. Financial data : Stock-trading data or banking transaction data.

(ii) Human generated data

1. Input data : Survey data, response sheets and so on.

2. Click-stream data : Data is generated every time you click a link on a website.

3. Gaming-related data : Every move you make in a game can be recorded.

(iii) Tools that generate structured data

(i) Data Marts (ii) RDBMS (iii) Greenplum (iv) TeraData

3. Un-structured data

- Unstructured data is generally data collected in any available form, without restricting it to any format.

- Like audio, video data, Web blog data etc.

Example :

Unstructured data includes the video recordings of CCTV surveillance.

Sources of unstructured data

The data can be generated by humans or it can be generated by machines.

(i) Machine generated data

1. Satellite images : This includes weather data or other data from satellites.

2. Scientific data : This includes seismic imagery, weather forecasting data etc.

3. Photographs and video : This includes security video etc.

4. Radar or sonar data : This includes vehicular and oceanographic data.

(ii) Human generated data

1. Text data : The documents, logs, survey results, and e-mails in a company.

2. Social media : Generated from social media platforms such as YouTube, Facebook etc.

3. Mobile data : This includes data such as text messages and location information etc.

4. Website content : Site data like YouTube, Flickr, or Instagram.

(iii) Tools that generate unstructured data

1. Hadoop 2. HBase
3. Hive 4. Pig
5. Cloudera 6. MapR

4. Semi-structured data

- Along with structured and unstructured data, there is also semi-structured data.

- Semi-structured data is information that doesn't reside in an RDBMS.

- It may be organized in a tree pattern, which is easier to analyze in some cases.

- Examples of semi-structured data might include XML documents and NoSQL databases.

5. Hybrid data

- There are systems which make use of both types of data to achieve competitive advantages.

- Structured data offers simplicity, whereas unstructured data will give a lot of data about a topic.
Fig. 1.4.2 : Big data - what's the difference (unstructured sources such as social media chatter, text, comments, likes, followers, tags, digital video and audio, geo-spatial data; structured sources such as loyalty, e-commerce, third-party, weather, currency conversion, demographic, panel and POS data)

1.5 Big Data vs. Traditional Data Business Approach

The modern world is generating massive volumes of data at very fast rates. As a result, big data analytics is becoming a powerful tool for businesses looking to mine valuable data for competitive advantage.

Fig. 1.5.1 : Classic BI ("capture only what's needed" for structured and repeatable analysis) vs. big data analytics
1. Traditional business intelligence

- There are many systems distributed throughout the organization.

- The traditional data warehouse and business intelligence approach requires extensive data analysis work with each of the systems, and extensive transfer of data.

- Traditional Business Intelligence (BI) systems offer various levels and types of analyses on structured data, but they are not designed to handle unstructured data.

- For these systems Big Data may create big problems due to data that flows in either a structured or unstructured way.

- This makes them limited when it comes to delivering Big Data benefits.

- Many of the data sources are incomplete, do not use the same definitions, and are not always available.

- Saving all the data from each system to a centralized location is unfeasible.

Fig. 1.5.2 : Traditional business intelligence (business users determine what question to ask and IT structures the data to answer it; in big data analytics, IT delivers a platform for creative discovery and business explores what questions could be asked, e.g. brand sentiment, product strategy)

2. Big data analysis

- Big data means large or complex data sets that traditional data processing applications may not be able to process efficiently.

- Big data analytics involves data analysis, data capture, search, sharing, storage, transfer, visualization, querying and information security.

- The term is generally used for predictive analytics.

Fig. 1.5.3 : Big data analysis

- The efficiency of big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction and reduced risk.

- Cloud-based platforms can be used for the business world's big data problems.

- There can be some situations where running workloads on a traditional database may be the better solution.

Fig. 1.5.4 : Accelerating time-to-value (cumulative cost over months for a big data solution (MPP) versus a single-SKU data warehouse appliance)

3. Comparison of traditional and big data

Parameter : Traditional data | Big data

Data source : Mainly internal. | Both inside and outside the organization, including traditional.
Data structure : Pre-defined structure. | Unstructured in nature.
Data relationship : By default, stable and interrelated. | Unknown relationship.
Data location : Centralized. | Physically highly distributed.
Data analysis : After the complete build. | Intermediate analysis, as you go.
Data reporting : Mostly canned, with limited and pre-defined interaction paths. | Reporting in all possible directions across the data in real time mode.
Cost factor : Specialized high-end hardware and software. | Inexpensive commodity boxes in cluster mode.
CAP theorem : Consistency - top priority. | Availability - top priority.

4. Comparison between RDBMS and Hadoop

Sr. | Parameter | Traditional RDBMS | Hadoop

1 | Data size | Gigabytes (Terabytes) | Petabytes (Exabytes)
2 | Access | Interactive and batch | Batch - NOT interactive
3 | Updates | Read/write many times | Write once, read many times
4 | Structure | Static schema | Dynamic schema
5 | Integrity | High (ACID) | Low
6 | Scaling | Non-linear | Linear
7 | Query response time | Can be near immediate | Has latency (due to batch processing)

1.6 Tools used for Big Data

MapReduce : Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum

Storage : S3, Hadoop Distributed File System

Servers : EC2, Google App Engine, Elastic Beanstalk, Heroku

NoSQL : ZooKeeper, MongoDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak, CouchDB

Processing : R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets, Tinkerpop
1.7 Data Infrastructure Requirements

Fig. 1.7.1 : Data infrastructure requirements (acquire : low predictable latency, high transaction volume, flexible data structures; organize : high throughput, in-place preparation, all data sources/structures; analyze)

1. Acquiring data

A high volume of data and transactions is the basic requirement of big data, and the infrastructure should support the same. Flexible data structures should be used. The amount of time required for this should be as little as possible.

2. Organizing data

As the data may be structured, semi-structured or unstructured, it should be organized in a fast and efficient way.

3. Analyzing data

Data analysis should be fast and efficient. It should support distributed computing.

1.8 Case Studies of Big Data Solutions

There are many cases in which big data solutions can be used effectively.

1. Healthcare and public health industry

- The computing power of big data analytics enables us to predict disease, allows us to find new cures and lets us better understand and predict disease patterns.

- For example, entire DNA strings can be decoded in minutes.

- Smart watches can be used to predict symptoms of various diseases.

- Big data techniques are already being used to monitor babies in a specialist premature and sick baby unit. By recording and analyzing every heart beat and breathing pattern of every baby, the unit was able to develop algorithms that can now predict infections 24 hours before any physical symptoms appear.

- Big data analytics allows us to monitor and predict the development of epidemics and disease. Integrating data from medical records with social media analytics enables us to monitor the spread of diseases.

2. Sports

- All sports popularly perform analysis using big data analytics.

- In cricket and football matches you will observe many predictions, which are found correct most of the time.

- An example is the IBM SlamTracker tool for tennis tournaments.

- Video analytics also tracks the performance of every player in a cricket game, and sensor technology in sports equipment such as basketballs allows us to get feedback even using smart phones.

- Many elite sports teams also track athletes outside of the sporting environment, using smart technology to track nutrition and sleep, as well as social media conversations.

3. Science and research

- Science and research is currently being transformed by the new possibilities of big data.

- Experiments involve testing lots of possible test cases and generate huge amounts of data.

- Many advanced labs use the computing power of thousands of computers distributed across many data centers worldwide to analyze the data.

- This helps to improve performance in many areas of science and research.

4. Security enforcement

- Big data is applied to improving national security enforcement.

- The National Security Agency (NSA) in the U.S. uses big data analytics to foil terrorist plots.

- Big data techniques are used to detect and prevent cyber attacks.

- Police forces use big data tools to catch criminals and even predict criminal activity, and credit card companies use big data to detect fraudulent transactions.

5. Financial trading

- Automated Trading and High-Frequency Trading (HFT) is a new area where big data can play a role.

- Big data algorithms can be used to make trading decisions.

- Today, the majority of equity trading takes place via data algorithms that increasingly take into account signals from social media networks and news websites to make buy and sell decisions in split seconds.

Review Questions

Q. 1 Write a short note on Big Data.

Q. 2 Explain various applications of Big Data.

Q. 3 Give all characteristics of Big Data.

Q. 4 Explain the three Vs of Big Data.

Q. 5 Explain various types of big data in detail.

Q. 6 Why use Big Data over the traditional business approach?

Q. 7 Compare the traditional approach and the big data approach.

Q. 8 Explain various needs of Big Data.

Q. 9 Explain various tools used in Big Data.

Q. 10 Write a short note on :

(a) Types of Big Data

(b) Traditional vs Big Data business approach.
Chapter 2 : Introduction to Hadoop

Module - 1

Syllabus

Concept of Hadoop, Core Hadoop Components, Hadoop Ecosystem

2.1 Hadoop

- Hadoop is an open-source big data storage and processing software framework. Hadoop stores and processes big data in a distributed fashion on large clusters of commodity hardware. Massive data storage and faster processing are the two important aspects of Hadoop.

Fig. 2.1.1 : Hadoop cluster

- As shown in Fig. 2.1.1, a Hadoop cluster is a set of commodity machines networked together in one location, i.e. a cloud.

- These cloud machines are then used for data storage and processing. From individual clients, users can submit their jobs to the cluster. These clients may be present at locations remote from the Hadoop cluster.

- Hadoop runs applications on systems with thousands of nodes involving huge storage capabilities. As a distributed file system is used by Hadoop, data transfer rates among nodes are very fast.

- As thousands of machines are there in a cluster, users get uninterrupted service, and node failure is not a big issue in Hadoop even if a large number of nodes become inoperative.

- Hadoop uses distributed storage and transfers code to data. This code is tiny and consumes less memory.

- This code executes with the data there itself. Thus the time to fetch data and store results back is saved, as the data is locally available. Interprocess communication time is saved, which makes processing faster.

- The redundancy of data is an important feature of Hadoop, due to which node failures are easily handled.

- In Hadoop, the user need not worry about partitioning the data, data and task assignment to nodes, or communication between nodes. As Hadoop handles it all, the user can concentrate on data and operations on that data.

2.1.1 Hadoop - Features

1. Low cost

As Hadoop is an open-source framework, it is free. It uses commodity hardware to store and process huge data. Hence it is not very costly.

2. High computing power

Hadoop uses a distributed computing model. Due to this, a task can be distributed amongst different nodes and processed quickly. Clusters have thousands of nodes, which gives high computing capability to Hadoop.

3. Scalability

Nodes can be easily added and removed. Failed nodes can be easily detected. For all these activities very little administration is required.

4. Huge and flexible storage

Massive data storage is available due to the thousands of nodes in the cluster. It supports both structured and unstructured data. No preprocessing is required on data before storing it.

5. Fault tolerance and data protection

If any node fails, the tasks in hand are automatically redirected to other nodes. Multiple copies of all data are automatically stored. Due to this, even if any node fails, that data is available on some other nodes.

2.1.2 Hadoop and Traditional RDBMS

Hadoop | Traditional RDBMS

1. Hadoop stores both structured and unstructured data. | RDBMS stores data in a structured way.
2. SQL can be implemented on top of Hadoop as the execution engine. | SQL (Structured Query Language) is used directly.
3. Scaling out is not very expensive, as machines can be added or removed with ease and little administration. | Scaling up (upgradation) is very expensive.
4. The basic data unit is key/value pairs. | The basic data unit is relational tables.
5. With MapReduce we can use scripts and code to specify the actual steps in processing the data. | With SQL we state the expected result and the database engine derives it.
6. Hadoop is designed for offline processing and analysis of large-scale data. | RDBMS is designed for online transactions.

2.2 Hadoop System Principles

1. Scaling out

In a traditional RDBMS it is quite difficult to add more hardware and software resources, i.e. to scale up. In Hadoop capacity is added easily by adding machines, i.e. scaling out.

2. Transfer code to data

In an RDBMS, data is generally moved to the code and results are stored back. As data is moving, there is always a security threat. In Hadoop, small code is moved to the data and executed there itself, so the data stays local. Thus Hadoop co-locates processing with storage.

3. Fault tolerance

Hadoop is designed to cope with node failures. As a large number of machines are there, a node failure is a very common problem.

4. Abstraction of complexities

Hadoop provides proper interfaces between components for proper working.

5. Data protection and consistency

Hadoop handles system-level challenges as it supports data consistency.

2.3 Hadoop Physical Architecture

- Running Hadoop means running a set of resident programs. These resident programs are also known as daemons.

- These daemons may be running on the same server or on different servers in the network.

- All these daemons have some specific functionality assigned to them. Let us see these daemons.

Fig. 2.3.1 : Hadoop cluster topology

NameNode

1. The NameNode is known as the master of HDFS.

2. The NameNode hosts the JobTracker, which keeps track of files distributed to DataNodes.

3. The NameNode directs DataNodes regarding the low-level I/O tasks.

4. The NameNode is the only single-point-of-failure component.

DataNode

1. The DataNode is known as the slave of HDFS.

2. The client gets a DataNode's block addresses from the NameNode.

3. Using these addresses, the client communicates directly with the DataNode.

4. For replication of data, a DataNode may communicate with other DataNodes.

5. The DataNode continually reports local changes to the NameNode.

6. The DataNode receives instructions from the NameNode to create, move or delete blocks on the local disk.

Secondary NameNode (SNN)

1. State monitoring of the cluster HDFS is done by the SNN.

2. Every cluster has one SNN.

3. The SNN resides on its own machine.

4. On the same server, no other DataNode or TaskTracker daemons can run.

5. The SNN takes snapshots of the HDFS metadata at intervals by communicating constantly with the NameNode.

JobTracker

1. The JobTracker determines which files to process, node assignments for different tasks, task monitoring etc.

2. Only one JobTracker daemon per Hadoop cluster is allowed.

3. The JobTracker runs on a server as a master node of the cluster.

TaskTracker

1. Individual tasks assigned by the JobTracker are executed by the TaskTracker.

2. There is a single TaskTracker per slave node.

3. A TaskTracker may handle multiple tasks in parallel by using multiple JVMs.

4. The TaskTracker constantly communicates with the JobTracker. If the TaskTracker fails to respond to the JobTracker within a specified amount of time, then it is assumed that the TaskTracker has crashed. The corresponding tasks are rescheduled to other nodes in the cluster.

Fig. 2.3.2 : JobTracker and TaskTracker interaction

2.4 Hadoop Core Components

Hadoop has two core components, both of which run on clusters of commodity hardware :

- HDFS - Hadoop distributed file system (storage)

- MapReduce (processing)

Fig. 2.4.1 : Hadoop core components (MapReduce : JobTracker and TaskTrackers; HDFS cluster : NameNode and DataNodes)
2.4.1 HDFS (Hadoop Distributed File System)

- HDFS is a file system for Hadoop.

- It runs on clusters of commodity hardware.

- HDFS has the following important characteristics :

  o Highly fault-tolerant

  o High throughput

  o Supports applications with massive data sets

  o Streaming access to file system data

  o Can be built out of commodity hardware.

HDFS Architecture

- For distributed storage and distributed computation, Hadoop uses a master/slave architecture. The distributed storage system in Hadoop is called the Hadoop Distributed File System or HDFS. In HDFS a file is chopped into 64 MB chunks, known as blocks, and then stored.

- As previously discussed, an HDFS cluster has a Master (NameNode) and Slave (DataNode) architecture. The NameNode manages the namespace of the filesystem.

- In this namespace, the information regarding the file system tree, metadata for all the files and directories in that tree etc. is stored. For this it creates two files, the namespace image and the edit log, and stores information in them on a consistent basis.

- A client interacts with HDFS by communicating with the NameNode and DataNodes. The user does not know which NameNode and DataNodes are or will be assigned for functioning.

1. NameNode

- The NameNode is known as the master of HDFS.

- The NameNode hosts the JobTracker, which keeps track of files distributed to DataNodes.

- The NameNode directs DataNodes regarding the low-level I/O tasks.

- The NameNode is the only single-point-of-failure component.

2. DataNode

- The DataNode is known as the slave of HDFS.

- The client gets a DataNode's block addresses from the NameNode.

- Using these addresses, the client communicates directly with the DataNode.

- For replication of data, a DataNode may communicate with other DataNodes.

- The DataNode continually reports local changes to the NameNode.

- The DataNode receives instructions from the NameNode to create, move or delete blocks on the local disk.

Fig. 2.4.2 : HDFS architecture (the client performs metadata operations through the NameNode and block operations directly on the DataNodes; blocks are replicated across racks)
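The client-side view of this interaction can be made concrete with a short Java sketch. This is only an illustration, not from the text : the file path is made up, and a reachable HDFS cluster configured through the usual core-site.xml is assumed.

// A minimal sketch of a Java HDFS client (illustrative, assumed setup).
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // the client talks to the NameNode

        // Write : the client obtains block locations from the NameNode and then
        // streams bytes directly to the DataNodes holding those blocks.
        Path file = new Path("/user/demo/notes.txt");   // illustrative path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("HDFS stores this file as replicated blocks.\n");
        }

        // Read : again the NameNode supplies block addresses; the data itself
        // comes straight from a DataNode that holds a replica.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            System.out.println(in.readLine());
        }
    }
}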

2.4.2 MapReduce

(May 17, 8 Marks)

- MapReduce is a software framework. In MapReduce an application is broken down into a number of small parts.

- These small parts are also called fragments or blocks. These blocks can then be run on any node in the cluster.

- Data processing is done by MapReduce. MapReduce scales and runs an application across different cluster machines.

- The configuration changes required for scaling and running these applications are done by MapReduce itself. There are two primitives used for data processing by MapReduce, known as mappers and reducers.

- Mapping and reducing are the two important phases for executing an application program. In the mapping phase, MapReduce takes the input data, filters that input data and then transforms each data element through the mapper.

- In the reducing phase, the reducer processes all the outputs from the mapper, aggregates all the outputs and then provides a final result.

- MapReduce uses lists and key/value pairs for processing of data.
MapReduce core functions

1. Read input

Divides the input into small parts/blocks. These blocks then get assigned to a Map function.

2. Function mapping

It converts file data to smaller, intermediate <key, value> pairs.

3. Partition, compare and sort

- Partition function : With the given key and number of reducers, it finds the correct reducer.

- Compare function : Map intermediate outputs are sorted according to this compare function.

4. Function reducing

Intermediate values are reduced to smaller solutions and given to the output.

5. Write output

Gives the file output.

Fig. 2.4.3 : The general MapReduce data flow (input -> map -> shuffle and sort -> reduce -> output)

To understand how it works, let us see one example.

File 1 : "Hello Sachin Hello Sumit"

File 2 : "Goodnight Sachin Goodnight Sumit"

Count occurrences of each word across the different files.

Three operations will be there, as follows :

(i) Map

Map 1 : < Hello, 1 >, < Sachin, 1 >, < Hello, 1 >, < Sumit, 1 >

Map 2 : < Goodnight, 1 >, < Sachin, 1 >, < Goodnight, 1 >, < Sumit, 1 >

(ii) Combine

Combine Map 1 : < Sachin, 1 >, < Sumit, 1 >, < Hello, 2 >

Combine Map 2 : < Sachin, 1 >, < Sumit, 1 >, < Goodnight, 2 >

(iii) Reduce

< Sachin, 2 >, < Sumit, 2 >, < Goodnight, 2 >, < Hello, 2 >
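The same word count can be written against the Hadoop MapReduce Java API. The sketch below is a compact, conventional version (class and path names are ours, not from the text); note how the reducer doubles as the combiner, producing the per-file partial counts of step (ii).

// Word count with the Hadoop MapReduce Java API (illustrative sketch).
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map : emits <word, 1> for every word in the input split.
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) { word.set(token); ctx.write(word, ONE); }
            }
        }
    }

    // Reduce (also reused as the combiner) : sums the counts grouped by word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // step (ii) Combine
        job.setReducerClass(SumReducer.class);    // step (iii) Reduce
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}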

2.4.3 Hadoop - Limitation

- Hadoop can perform only batch processing, and access is sequential.

- Sequential access is time consuming.

- So a new technique is needed to get rid of this problem.

2.5 Hadoop - Ecosystem

Explain the Hadoop Ecosystem and briefly explain its components.
Explain the Hadoop Ecosystem with core components.

1. Introduction

- Hadoop can perform only batch processing, and access is sequential.

- Sequential access is time consuming, so a new technique is needed to get rid of this problem.

- The data in today's world is growing rapidly in size as well as scale, and shows no signs of slowing down.

- Statistics show that every year the amount of data generated is more than in previous years.

- The amount of unstructured data is much more than the structured information stored in rows and columns.

- Big Data actually comes from complex, unstructured formats : everything from websites, social media and email, to videos, presentations, etc.

- The pioneer in this field of data is Google, which designed scalable frameworks like MapReduce and the Google File System.

- Apache open source has started an initiative by the name Hadoop. It is a framework that allows for the distributed processing of such large data sets across clusters of machines.

Fig. 2.5.1 : Hadoop ecosystem

2. Ecosystem

- Apache Hadoop has 2 core projects :

  o Hadoop MapReduce

  o Hadoop Distributed File System

- Hadoop MapReduce is a programming model and software for writing applications which can process vast amounts of data in parallel on large clusters of computers.

- HDFS is the primary storage system. It creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.

- Other Hadoop-related projects are Chukwa, Hive, HBase, Mahout, Sqoop and ZooKeeper.

Fig. 2.5.2 : Apache Hadoop ecosystem (Ambari : provisioning, managing and monitoring Hadoop clusters; Hive, Pig, Sqoop and related tools; YARN MapReduce v2 : distributed processing framework; Flume : log collector; ZooKeeper : coordination; HDFS : Hadoop distributed file system)

2.6 ZooKeeper

1. ZooKeeper is a distributed, open-source coordination service for distributed applications, used by Hadoop.

2. The system is a simple set of primitives that distributed applications can build upon to implement higher level services for synchronization, configuration maintenance, and groups and naming.

Fig. 2.6.1 : ZooKeeper service

3. Such coordination services are prone to errors such as race conditions and deadlock.

4. The main goal behind ZooKeeper is to make it easier to build distributed applications.

5. ZooKeeper allows distributed processes to coordinate with each other using a shared hierarchical namespace organized like a standard file system.

6. The namespace is made up of data registers called znodes, and these are similar to files and directories.

7. ZooKeeper data is kept in-memory, which means it can achieve high throughput and low latency.
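As a small illustration of znodes, here is a hedged sketch using the ZooKeeper Java client; the connection string localhost:2181 and the znode path are assumptions for a local test server.

// Creating and reading a znode with the ZooKeeper Java client (illustrative).
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeSketch {
    public static void main(String[] args) throws Exception {
        // 3000 ms session timeout; the watcher simply ignores connection events here.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // Znodes form a shared hierarchical namespace, similar to a file system.
        if (zk.exists("/app-config", false) == null) {
            zk.create("/app-config", "v1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Reads are served from ZooKeeper's in-memory copy, giving low latency.
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}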

2.7 HBase

- HBase is a distributed column-oriented database.

- HBase is a Hadoop application built on top of HDFS.

- HBase is suitable for huge datasets where real-time read/write random access is required.

- HBase is not a relational database. Hence it does not support SQL.

- It is an open-source project and is horizontally scalable.

- Cassandra, CouchDB, Dynamo and MongoDB are some other databases similar to HBase.

- Data can be entered into HDFS either directly or through HBase.

- Consistent reads and writes and automatic failure support are provided.

- It can be easily integrated with Java.

- Data is replicated across the cluster, which is useful when some node fails.
2.7.1. Comparison of HDFS and HBase

1. HFSis a distributedfile system suitable forstoring large files. HBaseis a database built on top of the HDFS.

2. HDFSdoes not support fast individual record lookups. HBaseprovides fast lookups for larger r tables. S

3. It provides high latency batch processing.


Low latency random access.

2.7.2 Comparison of RDBMS and HBase

1. RDBMS uses a schema; data is stored according to its schema. | HBase is schema-less; only column families are defined.

2. Scaling is difficult. | Horizontally scalable.

3. RDBMS is transactional. | No transactions are there in HBase.

4. It has normalized data. | It has de-normalized data.

5. It is good for structured data. | It is good for semi-structured as well as structured data.

6. It is a row-oriented database. | It is a column-oriented database.

7. It is suitable for Online Transaction Processing (OLTP). | It is suitable for Online Analytical Processing (OLAP).

2.7.3 HBase Architecture

- The Master performs administration, cluster management, region management, load balancing and failure handling.

- A Region Server hosts and manages regions, and handles region splitting, read/write requests, client communication etc.

- A Region Server contains a Write Ahead Log (WAL) and may hold multiple regions. A region is made up of a Memstore and HFiles in which data is stored. ZooKeeper is required to manage all the services.

Fig. 2.7.1 : HBase database architecture (Java client APIs and external APIs (Thrift, Avro, REST) talk to Region Servers, which use the Write-Ahead Log and store data on HDFS)

2.7.4 Region Splitting Methods

1. Pre-splitting

Regions are created first and split points are assigned at the time of table creation. The initial set of region split points must be chosen very carefully, otherwise load distribution will be heterogeneous, which may hamper the cluster's performance. A sketch of pre-splitting through the Java admin API follows this list.

2. Auto splitting

This is the default action. It splits a region when one of its stores crosses the maximum configured value.

3. Manual splitting

Split regions which are not uniformly loaded.
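The following is a hedged sketch of pre-splitting with the HBase 2.x Java admin API; the table name, column family and split keys are illustrative assumptions, not from the text.

// Pre-splitting a table at creation time (illustrative HBase 2.x admin API).
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection();
             Admin admin = conn.getAdmin()) {
            TableDescriptor desc = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("shows"))          // assumed table
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"))
                    .build();
            // Four regions from the start : (-inf,"g"), ["g","n"), ["n","t"), ["t",+inf).
            // Split points matching the real rowkey distribution keep the load
            // homogeneous across Region Servers.
            byte[][] splitKeys = { Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t") };
            admin.createTable(desc, splitKeys);
        }
    }
}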

2.7.5 Region Assignment and Load Balancing

These are the standard procedures and cannot be changed further.

On startup

1. On startup, the AssignmentManager is invoked by the Master.

2. The information about existing region assignments is taken from META by the AssignmentManager.

3. If the RegionServer is still online, then the assignment is kept as it is.

4. If the RegionServer is not online, the LoadBalancerFactory is invoked for region assignment. The DefaultLoadBalancer will randomly assign the region to a RegionServer.

5. META is updated with this new RegionServer assignment. The RegionServer starts functioning upon opening the region.

When a RegionServer fails

1. Regions become unavailable when any RegionServer fails.

2. The Master finds which RegionServer has failed.

3. The region assignments done by that RegionServer then become invalid. The same process is followed for new region assignment as that of startup.

Region assignment upon load balancing

When there are no regions in transition, the cluster load is balanced by a load balancer by moving regions around. This redistributes the regions on the cluster. It is configured via hbase.balancer.period. The default value is 300000 (5 minutes).

2.7.6 HBase Data Model

- The data model in HBase is made up of different logical components such as Tables, Rows, Column Families, Columns, Cells and Versions.

- It can handle semi-structured data that may vary in data type, size and columns. Thus partitioning and distributing data across the cluster is easier.

Fig. 2.7.2 : HBase data model (rowkeys 01 and 02; columns Screen, MovieName, Ticket, Time and Day, e.g. "Harry Potter 1", 200, 6.00, Saturday, grouped into column families)

1. Tables

Tables are stored as a logical collection of rows in regions.

2. Rows

Each row is one instance of data. Each table row is identified by a rowkey. These rowkeys are unique and always treated as a byte[].

3. Column Families

Data in a row is grouped together as Column Families. These are stored in HFiles.

4. Columns

- A Column Family is made up of one or more columns.

- A column is accessed by column family : column name.

- There can be multiple columns within a Column Family, and rows within a table can have a varied number of columns.

5. Cell

A Cell stores data as a combination of rowkey, Column Family and Column (ColumnQualifier).

6. Version

On the basis of timestamps, different data versions are created. By default the number of versions is 3, but it can be configured to some other value as well.
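Putting the model together, here is a minimal sketch of the HBase Java client writing and reading one cell of the movie table of Fig. 2.7.2; the table and column family names are assumptions for illustration.

// Writing and reading a cell with the HBase Java client (illustrative).
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MovieTableSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection();
             Table table = conn.getTable(TableName.valueOf("movies"))) {

            // Cell = rowkey "01" + column family "show" + column qualifier.
            Put put = new Put(Bytes.toBytes("01"));
            put.addColumn(Bytes.toBytes("show"), Bytes.toBytes("MovieName"),
                          Bytes.toBytes("Harry Potter 1"));
            put.addColumn(Bytes.toBytes("show"), Bytes.toBytes("Ticket"),
                          Bytes.toBytes("200"));
            table.put(put);

            // Read the latest version of the cell back (3 versions kept by default).
            Result row = table.get(new Get(Bytes.toBytes("01")));
            System.out.println(Bytes.toString(
                row.getValue(Bytes.toBytes("show"), Bytes.toBytes("MovieName"))));
        }
    }
}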

2.8 HIVE

- Hive is a data warehouse infrastructure tool.

- It processes structured data in HDFS. Hive structures data into tables, rows, columns and partitions.

- It resides on top of Hadoop.

- It is used to summarize and analyze big data.

- It is suitable for Online Analytical Processing.

- It supports ad hoc queries. It has its own SQL-type language called HiveQL or HQL.

- SQL-type scripts can be created for MapReduce operations using Hive.

- Primitive datatypes like Integers, Floats, Doubles and Strings are supported by Hive.

- Associative Arrays, Lists, Structs etc. can be used.

- The Serialize and Deserialize APIs are used to store and retrieve data.

- Hive is easy to scale and has faster processing.
2.8.1 Architecture of HIVE

Fig. 2.8.1 : Hive architecture

These are the standard components and their roles are fixed.

1. User interface

Hive supports the Hive Web UI, the Hive command line, and Hive HD Insight, through which the user can easily process queries.

2. Meta store

Hive stores metadata, schema etc. in respective database servers known as metastores.

3. HiveQL process engine

HiveQL is used as the querying language to get information from the Metastore. It is an alternative to a MapReduce Java program. A HiveQL query can be written for a MapReduce job.

4. Execution engine

Query processing and result generation is the job of the execution engine. Its results are the same as those of MapReduce.

5. HDFS or HBASE

The Hadoop distributed file system or HBASE are the data storage techniques used to store data into the file system.
2.8.2 Working of HIVE

Fig. 2.8.2 : Hive and Hadoop communication

1. Execute Query : The command line or Web UI sends the query to the JDBC or ODBC driver to execute.

2. Get Plan : With the help of the query compiler, the driver checks the syntax and requirements of the query.

3. Get Metadata : The compiler sends a metadata request to the Metastore.

4. Send Metadata : The Metastore sends the required metadata as a response to the compiler.

5. Send Plan : The compiler checks the requirement and resends the plan to the driver. With this, the parsing and compiling of the query is complete.

6. Execute Plan : The driver sends the execute plan to the execution engine.

7. Execute Job : The execution engine sends the job to the JobTracker, which assigns it to a TaskTracker.

7.1 Metadata Operations : The execution engine can execute metadata operations with the Metastore.

8. Fetch Result : The execution engine receives the results from the Data nodes.

9. Send Results : The execution engine sends those resultant values to the driver.

10. Send Results : The driver sends the results to the Hive interfaces.

2.8.3 HIVE Data Models

The Hive data models contain the following components :

1. Databases

2. Tables

3. Partitions

4. Buckets or clusters

Partitions

A table is divided into smaller parts based on the value of a partition column. Queries can then be run on these slices of data for faster processing.

Buckets

Buckets give extra structure to the data that may be used for efficient queries. Different data required for queries is joined together, and thus queries can be evaluated quickly. The sketch below shows a partitioned, bucketed table being created and queried over JDBC.
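This is a hedged sketch, assuming a HiveServer2 instance at localhost:10000 and an illustrative sales table; the HiveQL strings are submitted through Hive's JDBC driver (hive-jdbc on the classpath), following steps 1-10 above.

// Submitting HiveQL through the Hive JDBC driver (illustrative).
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default");   // assumed HiveServer2
             Statement stmt = conn.createStatement()) {

            // The partition column 'day' slices the table so that queries
            // touching one day scan only that slice; buckets help joins.
            stmt.execute("CREATE TABLE IF NOT EXISTS sales (item STRING, amount DOUBLE) "
                       + "PARTITIONED BY (day STRING) "
                       + "CLUSTERED BY (item) INTO 4 BUCKETS");

            // The driver parses/compiles via the Metastore, the execution engine
            // runs the plan, and the results stream back to this client.
            try (ResultSet rs = stmt.executeQuery(
                     "SELECT item, SUM(amount) FROM sales "
                   + "WHERE day = '2017-12-01' GROUP BY item")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " : " + rs.getDouble(2));
                }
            }
        }
    }
}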

Review Questions

Q. 1 Write a short note on Hadoop.

Q. 2 What is Hadoop?

Q. 3 Explain the components of Core Hadoop.

Q. 4 Explain the Hadoop Ecosystem.

Q. 5 Explain the physical architecture of Hadoop.

Q. 6 What are the limitations of Hadoop?
Chapter 3 : Hadoop HDFS and MapReduce

Module - 2

Syllabus

Distributed File Systems : Physical Organization of Compute Nodes, Large-Scale File-System Organization, MapReduce : The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners, Details of MapReduce Execution, Coping With Node Failures, Algorithms Using MapReduce : Matrix-Vector Multiplication by MapReduce, Relational-Algebra Operations, Computing Selections by MapReduce, Computing Projections by MapReduce, Union, Intersection, and Difference by MapReduce, Hadoop Limitations

3.1 Distributed File Systems

- In the early days of computing, most computations were done on a standalone computer.

- A standalone computer is also called a compute node. It consists of a single processor along with its main memory, cache memory and a local disk.

- As the computations got more complex and the data started getting bigger, the need for more computation power was felt. This gave rise to the concept of parallel computing.

- But the problem with parallel computing is that it requires many processors and specialized hardware. This makes the whole parallel processing arrangement very costly.

- A cost-effective solution to this problem is to use the already existing standalone computers and combine their individual computing resources to create a big pool of computing resources, while still allowing the compute nodes to operate more or less independently.

- This new paradigm of computing is known as cluster computing. Fig. 3.1.1 shows clusters of compute nodes in general.

Fig. 3.1.1 : Clusters of compute nodes
3.1.1 Physical Organization of Compute Nodes

- The compute nodes are arranged in racks, with each rack holding around 8 to 64 compute nodes, as depicted in Fig. 3.1.2.

Fig. 3.1.2 : Compute nodes arranged in racks (nodes within a rack connect through a networking device such as a switch, router or hub)

- There are two levels of connection : intra-rack and inter-rack. The compute nodes in a single rack are connected through a gigabit Ethernet; this is known as the intra-rack connection. Additionally, the racks are connected to each other with another level of network or a switch, which is known as the inter-rack connection.

- One major problem with this type of setup is that there are a lot of interconnected components, and the more components there are, the higher the probability of failure, for example a single node failure or an entire rack failure.

- To make the system more robust against such types of failures, the following steps are taken :

- Duplicate copies of files are stored at several compute nodes. This is done so that even if a compute node crashes, the file is not lost forever. This feature is known as "Data Replication". Fig. 3.1.3 shows data replication. Here, the data item D1 is originally stored on Node 1 and a copy each is stored on Node 2 as well as Node 3. It means in total we have three copies of the same data item D1. This count is known as the "Replication Factor" (RF).

Fig. 3.1.3 : Data replication

- Computations are subdivided into tasks such that even if one task fails to complete execution, it may be restarted without affecting other tasks.

3.1.2 Large-Scale File-System Organization


— The conventionalfile systems present on standalone computers cannot take full advantage of cluster computing. For
this reason a newtypeoffile system Is needed. This newfile system Is called Distributed File System or DFS.

Scanned by CamScanner
ST"
=

Big Data Analytics(MU) 33 Hadoop HDFS and MapReduce

3
Thetypesoffiles which are mostsuited to be used with DFSare :
3.
© Very largesized files having size in TBs or more.
2
© Files with very less numberof update operations compared to read and append operations.
usually of 64 MBin size. Each chunkis normally
In DFS,files are divided into smaller units called “chunks”. A chunkis

It is also ensured that these three compute nodes are}
replicated and stored in three different compute nodes.
one copy of the chunk is available.
members ofdifferent racks so thatin the eventof rackfailure at least
.
adjusted by the user based on the demands ofthe application
Both the chunk size and the replication factor can be
acts
separate file called master node or name node.This file
All the chunks of a file and their locations are stored in a
_
The master nodeis also replicated Just like the individual}
asan indexto find the different chunks of a particular file.
chunks.
a directory. This directory in turn is replicated} _
Theinformation aboutthe master nodesand their replicas are storedin
of the locations wherethe directory copies reside.
ina similar fashion and all the participants of the DFS are aware
as : -
There are manydifferent implementationsof the DFS described above such

© Google File System (GFS),

© Hadoop Distributed File System (HDFS),



©. Colossus, which is an improved version of GFS.
3.
3.2 MapReduce

example.
Explain conceptof MapReduce using an
an example.
hat is the MapReduce? Explain therole of combinerwith the help of
in parallel, on large clusters of
MapReduce can be used to write applications to process large amounts ofdata,
which is easily available in the local
commodity hardware (a commodity hardware is nothing but the hardware
market)in a reliable manner.
amming modelfor distributed computing based onja\4|
MapReduce. is a processing technique as well as a progr
or java framework.
prog! ramming language
m contains two importantfunctions, namely Map and Reduce :
The MapReduce algorithi
pairs. How th
accept one or more chunks from a DFS and turn them into a sequence of key-value
o The Maptasks
for the Map function.
is determined bythe code written bythe user
input data is converted into key-value pairs
d keys are hel
alue pairs produced by the Maptasks. These sorte
o Amastercontroller collects and sorts the key-v
aiue'paits having
g the Reduc eta sks. This distri bution is done in such a wayso thatall the keywv
‘divided amon
e Reducetask.
samekeyare assigned to the sam
written by the user fort
s associated with a particular key. The code
o The Reduce tasks combineall of the value
combinationis done.
Reducefunction determines how the

Scanned by CamScanner
Big Data Analytics(MU) 3-4 Hadoop HDFS and MapReduce

3.2.1. The Map Tasks

Theinput for a Maptask is an elementand the outputis zero or morekey-value


pairs. An elementcould be anything
such asa tuple or an entire document.

Someuseful assumptions are made which are:

© Anelement is stored entirely in one chunk. That means one elementcannotbestored across multiple
chunks,
© Thetypes of keys andvalues both are arbitrary,

© Keys need notbe unique.

Let us understand the MapReduceoperations with an example.Let us suppose weare givena collection of documents
and thetask is to compute the counts of the numberof times each word occurs in that collection.

Here each documentis an input element. One Map taskwill be assigned one or more chunks and that Maptaskwill
processall the documentsin its assigned chunk(s).

— Theoutputwill be of the form:

(Wa 2), (Wa, 4), (Way 4), + (Wey 2)


Where wy, W2, W3,... Wp, are the wordsin the documentcollection thatis assigned to the Maptask.
— Ifa particular word w appears c times, then the outputwill have c number of(w,1) pairs.

3.2.2 Grouping by Key

— After the successful completion of all the Map tasks, the grouping of the key-value pairs is done by the master
controller.

— The numberof Reducetasks is set by the userin advance. The mastercontroller uses a hash function that maps each
key to the range0 to r-1.In this step all the key-value pairs are segregatedin files according to the hash function
output. Theser files will be the input to the Reducetasks.

— The mastercontroller then performs the grouping by key procedure to produce a sequence ofkey/list-of-values pairs
for each key k which areof the form (k,[Vs, V2, Va, -.- » Val), where (k, Vi), (K, V2), (Kk,V3), ».-, (ky Vn) are the key-value pairs
*
which were produced byall of the Maptasks.

3.2.3 The Reduce Tasks

— The input to a Reducetask is one or more keys and theirlist of associated values. And the output produced by the
pairs produced
Reducetask is a sequenceof zero or more key-value pairs which may bedifferent from the key-value
Maptasksare of the
by the Maptasks. But in most cases both the key-value pairs produced by the Reduce tasks and
sametype. : .

— Inthefinal step the outputs producedby all of the Reduce tasks are combined ina singlefile.
outputs will be of the form (w, s)
— {nour word count example, the Reducetaskswill add all the values for each key, The
pairs, where w Is a word andsls the numberoftimesIt appearsin the collection of documents.

— Fig. 3.2.1 showsthe varlous MapReduce phasesfor the word frequency counting example.

TochKnowladg’
Pusticatiess

Scanned by CamScanner
Te
|

W_Big Data Analytios(MU) Hadoop HDFS and MapReduce


——
Input Shufffil
ling Reducing . Findut
outp

IN ‘Bear, 2:
Car, 3

Deer

River, 4 River, 2

Fig. 3.2.1 : MapReduceprocess for word frequency


count
3.2.4 Combiners

S ag
- Acombineris a type of mediator betw
een the mapperphaseand the reducer
phase. The use of combinersistotally
optional. As a combiner sits between the
mapperand the reducer, it accepts the
Output of map phase as an input and
passesthe key-value pairs to the reduce operation.

— So to avoid such congestion we


can put someof the work of red
ucersin to the Map tasks. For
task If c number of (w, 1) key-va examplein a single Map
lue pairs appear having the sam
e key, they can be combinedi
wis the key and cis the number nto a pair (w, c), where |.
oftimes the word w appearsin
the set of documents handled
bythis Map task.

Scanned by CamScanner
Big Data Analyti
3-6 Hadoop HDFS and MapReduce
Combiner

Generation Y
of key-value <ky, V4>
pair <kp,Vg>

Combiner <—4

Reducer.

Fig. 3.2.2 : Position and Working Mechanism of Combiner

3.2.5 Details of MapReduce Execution

- Inthis section we will discuss in details how a MapReducebased program is executed. The user Program first creates a
Master controller process with the help of fork commandusing thelibrary provided by the MapReduce system as
depicted in Fig. 3.2.3:
- In addition to the Master process, the user program also forks a numberof workerprocesses. These processes run on
different compute nodes.

— The Masterhasthe following responsibilities :

© Creation of Map and Reducetasks,

© Assignmentof Map and Reducetasks to the Workers,

© Keeping track of the Map and Reducetasks(idle/executing/completed).


:
— AWorker process can beassigned either a Map task or a Reducetask but not both the tasks.

Scanned by CamScanner
uce
Hadoop HDFS and MapRed
Big Data Analytics(MU)

assign’,
_ Reduce

data
file
Intermediate
files

Fig. 3.2.3 : MapReduce program execution


ks. So for example in
— Every Maptaskcreates one intermediatefile in its local compute node for each of the Reducetas
two intermediate files which will be
the Fig. 3.2.3 there are two Reduce tasks, thus each of the Maptasks create
passed on as input to the Reducetasks.
educe based program.
All the Reducetasks together will producea single file as the final output of the MapR

3.2.6 Coping with Node Failures

— There are three types of nodefailures :

o Master nodefailure,

o Map worker nodefailure,

© Reduce worker node failure.

- Ifthe Master nodefails then the entire processhasto berestarted. This is the worstkindoffailure.
- Ifa Map worker nodefails then the Masterwill assign the tasks to someothi er available
i worker node evenif the task
had completed.

- Ifa Reduce workernodefails then the tasks are simply rescheduled on some other Reduce worker lat
er later,

3.3. Algorithms using MapReduce


- MapReduceIs notthe solution for every problem. . The The distribute
scharewayreslousdened distri y m only makes sensefor extremelylargefiles
file syste

- The main motivation behi ind the creation


i of Google’s implementation of MapReduce wastheir Pa;
which requiresvery large dimensional matrix-vector multiplications. eeRank algorithyy
—_ Inthis section wewill study a few algorithms whichfit nicely into the MapRed
uting.
uce style of comput i

Scanned by CamScanner
3-8 Hadoop HDFSand MapReduce

May 16, May19}

Let us consider a Matrix M ofsize n x n. Let mj denotethe elementin row i and columnj. Let us also consider a vector
v of length n and the j" th element of the vectoris represented as ve
The matrix-vector multiplication will Produceanother vector x whoseith elementx; is given by the formula:
n
Kedj =1my

In reallife applications such as Google’s PageRank the dimensionsof the matrices and vectors will be in trillions. Let us
at first take the case where although the dimension is large but it is also able to fit entirely in the main memory of
the compute node.

The Map task at each Map worker node works on a chunkof M and the entire v and produces the key-value pairs
(i, my). All the sum terms of the componentx; of the result vector of matrix-vector multiplication will be getting the
samekeyi.

— The Reducetasks sum all the valuesfor a particular key and producetheresult(i, x;).
In the case the wherethe vector v is toolarge tofit into the main memory of a computenode,an alternative approach
as shownin Fig. 3.3.1 is taken. The matrix M is divided into vertical stripes and the vector v is divided into horizontal
stripes havingthe following characteristics:

© The numberofstripes in M and the numberofstripes in v must be equal,


oO. The width ofall the stripes in M must be equal,
© The heightof all the stripesin v must be equal.

© The size of a stripe in v must be such thatit canfit conveniently in main memory of a compute node.
- Nowit is sufficient to multiply the jth stripe of M with the jth stripe of v. A chunk of a stripe of M and the
proceed as described earlier.
corresponding entirestripe ofvis assigned to each Maptask and the calculations
Panes

Matrix M Vector V
Fig. 3.3.1 ; Division of matrix and vectorInto stripes
Te Kaemtedgi<
Puntieatians:

Scanned by CamScanner
edu |
Hadoop HDFS an id Map Red
9
¥ Big Data Analytics(MU)

3.3.2 Relational-Algebra Operations

re ig! a
dby Rel
Me Nites

s. The schema ofa


A row ofthis table is called a tuple and the column hea dersare called attribute
— Arelation is a table.
her elation
, Aq), where is the name oft
relation is the set of its attributes. A relation is represented by R(A, Az ~
and A; Az, ..., Anare the attributes of R.
section we are going to
tions that can be performed onrelations. In this
— © Relational algebra defines the standard opera
discuss the following relational algebra operations :

o Selection,
© Projection,

o Union,

© Intersection,

o Difference.

1. Selection operation
applied on every tuplein the relation.
In case of a selection operation a constraint which is denoted by ‘Cis
the system as output.
Only those tuples which satisfy the specified constraint 'C’ will be retrieved and shownby
Selection operation in the relational algebra is represented by G¢(R).
Where, o —> representsselect operation
C > represents condition/constraint

R-— represents the relation.

2. Projection operation

Consider ‘S’ to be a subsetof columns/attributesfor a relation R.


E.g. Ifa relation contains a total of10 columns then consideronlyfirst 5 columnsfor processing.
From all the tuples/rowsonly the specified attributes/columnsare retrieved and shownas output.

The projection operation is represented in relational algebra as

mls (R)
Where, % represents project operation
S — represents subset
R represents therelation

3. Unlon,Intersection and difference operations

All the three operations operate on the rows of twodifferent relations. The basic requirementis that both of the
relations mustbe having the same schema.

Scanned by CamScanner
Big Data Analytics(MU)
Hadoop HDFS and MapReduce

Ve Contents (tuples)
from both relations Contents (tuples)
common in both relations
OO. |
AuB
AnB
Contents from A
contents from B

Fig. 3.3.2 : Union, Intersection and difference Venn diagra


ms
3.3.3 Computing Selections by MapReduce

MapReduceis way too powerful for selection Operations. Selections can


be done completely either in the Map phase
alone orin the Reduce phasealone. Here weshall discuss the former.

Generation of
key-value pair ——+ Output
(tt) !

(forward to console)
x o
Map phase Reduce

Fig. 3.3.3 : MapReduce Implementation ofselection operation

In the Map task, the constraint C is applied on each tuple t of the relation.

If Cis satisfied by a tuple t, then the key-valuepair(t, t) is produced for that particular tuple. Observe that here both
the key as well as the value are the tuple itself.

If Cis notsatisfied by t, then the Mapfunction will produce no outputforthattuple t.

As the processing is already finished in the Map function the Reduce function is the identity function. It will simply
forward on the key-value pairs to the output for display.

The outputrelation Is obtained from either the key part orthe value part as both are containingthe tuple t.

3.3.4 Computing Projections by MapReduce

In the Map task, from each tuple t in R the attributes not present in S are eliminated and a new tuple t’ is constructed.
‘The output of the Maptasks Is the key-value pairs (t’, t’).

The main job of the ReducetaskIs to ellminate the duplicate t’s as the output of the projection operation cannot have
duplicates.

Teed .
Publications

Scanned by CamScanner
Big Data Analytics(MU)

Tuples | construct ; weacies


new output
tuple asa
t key value pair
Relation R }
iy y
Eliminate |
Map phase entries from t
whose attributes
are notin S

Reduce phase

Fig. 3.3.4 : MapReduce implementation of projection operation

3.3.5 Union,Intersection and Difference by MapReduce

Union with MapReduce

- For the union operation R US, the tworelations R and S must have the same schema.Thosetuples which arepresent]
in either R or S or both mustbe present in the output.

- The only responsibility of the Map phase is of converting each tuplet into the key-valuepair (t, t).

- The Reducephaseeliminates the duplicates just as in the case of projection operation. Here for a key t there can be|
either 1 valueif it is present in only oneofthe relations or t can have valuesifit is presentin both the relations.In
either case the output producedby the Reducetaskwill be (t, t).

Convert
it
—Tuplet (t, Value n
7 to key-value i
i: pair t
i 4
Value 2
of relation ' key t
R ' can have Value 4
& A 1 or 2 values JQ
t
Mapphase ' Reduce phase
Fig. 3.3.5 : Union operation with MapReduce

Intersection with MapReduce


Forthe intersection operation RS,both the relations R and $ musthave the
same schema. Onlythose tuples whl
are present in both R andS should bepresentin the output.
The responsibility of the Map phaseIs sameasthat of
union operationi.e. conversion of a tuple ‘t’ in a given
rela
‘R’Into the key-value pair format(t, t).
Reducerwill produce the output (t, t) onlyIf both R and S haveth
etuple t. This can be donebychecking the nu
of values associated with the keyt. If the key t hasa list of two
values [t, t], then the Reduce task will produce #
output (t, t). If the key t has only one value [t], then the Reduce
rwill produce nothing as output.

Scanned by CamScanner
Convert
it

i to key-value '
pale WK Itkey't
''
ofrelation ' hasvalueinlist [t, t]
R | then generate(t, t)
& _A___olse NULL “y

Map phase t Reduce phase


Fig. 3.3.6 : Intersection operation with MapReduce

Difference with MapReduce

- For the difference operation R — S, both the relations R and S must have the same schema. The tuples which are
present only in R and notin will be present in the output.

- The Map phase will produce the key-value pairs (t, R) for every tuple in R and(t, S) for every tuplein S.

- The Reduce phasewill produce the output (t, t) onlyif the associated value of a keyt is [R].
aapevnnenennnnnRenennencn enn Picncen

Produce key-value
Pair (t, R)
For each key
Relation R
if we have
associated list
[R] thenoutputis
Produce key-value key-valuepair (t, t)
caved: Pair (t, S) else NULL

Relation §
% J
Map phase Reduce phase
Fig. 3.3.7 : Difference operation with MapReduce

3.4 HadoopLimitations
— Hadoopis a collection of open source projects created by Doug Cutting and Mike Cafarella in 2006.It was inspired by
Google’s MapReduce programming framework. Hadoopconsist of the following core modules:

© HadoopDistributed File System (HDFS) whichis the storage module and


© MapReduce programming modelwhichis the processing module.

— Thevariouslimitations of Hadoop are:


© Not fit for small files : Hadoop was designed to workwith big sizedfiles and it cannotefficiently handle small files
evenif there are a huge numberof smallfiles. A file is considered small if its size is less than the HDFS block size
which by default is 128 MB.
© Slow speed : Hadoop works with massive amounts ofdistributed data which slows downthe processing.

Scanned by CamScanner
Big Data Analytics(MU) 3-13 Hadoop HDFS and MapReducg
© Noteasy to use: The developer needsto write the code forall the operations which makesit very difficult
to use,
© Security : Hadoop does not support encryption which makes it vulnera
ble.
° Real-time data processing not supported : Hadoopis designed to support only batch processing and
hencereajJ
time processingfeatureis missing.

Noiteration support : Hadoopis not designed to support the feeding of the output of one stage to the inputof
the next stage of Processing.

© No caching: Intermediate results are not cached and this brings downthe performance.

Qt Explain the large-scale file-system organizationin detail.

Q2 Explain MapReduce frameworkwith suitable example.

as Whatare the different phasesinvolved in MapReduce technique ? Explain with example.

Q4 Whatare combiners ? Explain the position andsignificance of combiners.

Qs Whatis Relational Algebra ? Explain thedifferent operations performed


by Relational Algebra
Q.6 Explain selection and projection operation using MapReduce functionality.

Q.7 Explain Union, Intersection and Difference operations with MapReduce techniques.

Scanned by CamScanner
NoSQL

Syllabus
i
Introon
7 Nest NoSQLBusiness i i
Drivers, NoSQL Data Architecture Patterns: Key-value stores, Graph stores,
NoSQL architectural patterns, NoSQL Case Study,
tosa. in {Bigtable)stores, Document stores, Variations of
Analyzing big data with a shared-nothing
on solution for big data, Understanding the types of big data problems;
NoSQLsystems to handle big data
architecture; Choosing distribution models : master-slave versus peer-to-peer;
problems.

4.1. NoSQL (Whatis NoSQL?)

History
Carlo Strozzi in the year 1998.
The term NoSQL was first used by
noprovision of SQL Query
rce Database system in which there was
— He mentioned this name for his Open Sou

interface.
and actu ally comesin practice.
in USA, NoSQL was comesinto picture
In the early 2009,at conference held

2. Overview
em).
onal Database Management Syst
NoSQLis a nota RDBMS(Relati
or large amountofdatastored in dist
ributed environment.
NoSQL is specially designedf
trictions like RDBMS. It gives options to
fea tur e of NoS QLi s, it isn ot bounded by table schema res
The important sent in table.
thereis no suc! h column is pre
store somedata evenif
join operations.
NoSQLgenerally avoids

3. Need
book, Google+, Twitter and
qu irem ents are cha nge d lot . Data is easily available with Face
In real time, data re
others. user-generated
information, social graphs, geographic location data and other
The data that includes user
content. h can operate
ata ,it is nec ess ary to work with a technology whic
ntresources andd
To makeuse of such abunda
such data.
data.
y designed to operate such
— SQL databases are not ideall
e amount ofdata.
designed for operating hug
NoSQLdatabases specially

Scanned by CamScanner
Big Data Analytics(MU) 42 Nosay
4. Advantages
(i) Good resource Scalability.
(ii) Lower operational cos
t.
(iii) Supports semi-stru
cture data.
(iv) Nostatic schema.

(v) Supportsdistributed computing


.
(vi) Faster data Processing.

(vil) No complicated relationship


s.
(vili) Relatively simple data models.

5. Disadvantages

(i) Nota defined standard.

/ (ii) Limited query capabilities.

6. Companies working with NoSQL

(i) Google (ii) Facebook

(iii) Linkedin (iv) McGraw-Hill Education

4.2 NoSQLBasic Concepts


Theorem (Brewer’s Theorem) for NoSQL

CAP theorem states three basic requirements of NoSQLdatabases to design a distributed architecture.

{a) Consistency

Database must remain consistentstate like before, even after the execution of an operation.

(b) Availability

It indicates that NoSQLsystemis alwaysavailable without any downtime.

(c) Partition Tolerance

This meansthat the system continuesto function even the communication failure happens betweenserversi.e. if oné
server fails, other serverwill take over.

There are many combinations of NoSQLrules:

1, CA

Itis a single site cluster.

All nodesare alwaysin contact.

Partitioning system can block the system.

Scanned by CamScanner
Big Data Analytics(MU
43
2. «CP

Some data may 'Y not beaccessible always still it may


be consistent or accurate.
3. AP

— System is available under


Partitioning.
— Some part ofthe data
May be inconsistent.
4 BASE model
Relati
elational
databases have some rules to decide behaviour
of database transactions.
ACID model maintains the atomicity,
consistency, isolation and durability of database transactions.
— NoSQL turns the ACID model to the BASE
model.
5. BASE offers some guidelines

— Basic availability
— Soft state
— Eventual consistency

6. Datastorage

— NoSQLdatabasesuse the conceptofa key / value store.


— There are no schemarestrictions for NoSQL database.
— It simply stores values for each key and distributes them across the database,it offersefficient retrieval.

7. Redundancy and Scalability

— Toadd redundancy to a database, we can add duplicate nodes and configure replication.
- Scalability is simply a matter of adding additional nodes. There can be hash function designedtoallocate data to
server.

4.3 Case Study NoSQL (SQL vs NoSQL)


database.
SQL databases are Relational Databases (RDBMS); whereas NoSQLdatabase are non-relational

Data storage
as document based, key-value pairs, graph
SQL databases stores data in a table whereas NoSQLdatabases stores data
databases or wide-column stores.
some rows.
SQL data is stored in form of tables with
documents or graph based data with no standard schema
NoSQL data Is stored ascollection of key-value palr or
definitions.

Database schema
a which cannot be change very frequently, whereas NoSQL databases have
SQLda tabases have predefined schem
* nge any tlme for unstructure d data.
dynamic schema which can be cha
Complex querles
form for running complex query.
- SQL databases provides standard plat
for running complex queries.
— NoSQL doesnot provide any standard environment
as SQL query language.
- NoSQLare not as powerful

Scanned by CamScanner

Full form is Structu red Query Language. - is ional database.
Full form Not Only SQL or Non-relationa
2. |SQlisa declarative query language.
This is Not a declarative query language.
3. |SQL databases works On ACID properties,
NoSQLdatabase follows the Brewers CAP " eorem,

|Atomicity
Consistency
Consistency
‘Availability
Isolation
Partition Tolerance _
Durability
4, _|Structured and organized data
Unstructured and unreplicable data
5.
j
|Relational Databaseis table based. Key-Value pair storage, ColumnStore, DocumentStore, Graph|
databases.
6. Data andits relationshipsare stored in separ
ate|Nopre-defined schema.
tables,
7. |Tight consistency. Eventual consistency rather than ACID property.
8. |Examples :
Examples :
MysaL
MongoDB
Oracle
Big Table
MS SQL Neod4j
PostgreSQL Couch DB : L
SQLite Cassandra
DB2 IHBase

4.4 Business Drivers of NoSQL

enSate
WE
3

1. The growthof big data

— Big Data is one of the main driving factor of NoSQL for business.

— The hugearrayof datacollection actsas driving force for data


growth , i
2. Continuousavailability of data
,
The competition age demandsless downtime forb
etter companyr eputation.
— Hardware failures are possible but NoSQL data
base environments ar € built with
a distributed architecture 50
there are nosingle pointsoffailure.
— Ifone or more databaseservers goes down, the
other nodesin the system are able
to continue with operations
withoutanyloss ofdata.
So, NoSQLdatabase environments are
able to provide continuous availability.

Scanned by CamScanner
Big Data Analytics(MU) 45 NoSaL
3. Location independence

- Itis ability to read and write to a database regardless of wherethatI/O operation is done.
— The master/slave architectures and database sharding can sometimes meetthe need for location independent
read operations.

4, Moderntransactionalcapabilities

The transactions conceptis changing and ACID transactionsare no longer a requirementin database systems.

5. Flexible data models


— NoSQL has moreflexible data model as comparedto others.
— * ANoSQLdata modelis schema-less data modelnotlike RDBMS.

6. Better architecture
- The NoSQLhas morebusinessoriented architecture for a particular application.
= So, Organizations adopt a NoSQLplatform that allows them to keep their very high volume data.

7. Analytics and businessintelligence :


— Akey driver of implementing a NoSQLdatabase environmentis the ability to mining data to derive insights that
offers a competitive advantage.
- Extracting meaningful business information from very high volumesof data is a very difficult task for relational
database systems.
Modern NoSQLdatabase systemsdeliver integrated data analytics and better understanding of complex data sets
whichfacilitate flexible decision-making.

4.5 NoSQL Database Types

fferent ‘data architecture pal et

tadit joront NoSQL data architecture patterns.


$ any two. architecturalpeersof fescee

Different Architectural Patterns in NoSQL


, Amazon DynamoDB.
— Key-Value databases examples: Riak,Redis, Memcached, BerkeleyDB,upscaledb
astore, OrientDB , RavenDB
~ Documentdatabases examples : MongoDB, CouchDB,Terr
, HyperTable.
Columnfamily stores examples: Cassendra, HBase
, FlockDB.
- Graph Databases examples : Neo4j,InfiniteGraph

1. Key-value store databases


- Thisis very, simple NoSQLdatabase.
e data.
Itis specially designed for storing data as a schemafre
g with indexed key. . .
Such datais stored in a form of data alon ¥ =
Pupiications

Scanned by CamScanner
Examples
— Cassandra
— Azure Table Storage (ATS)
— DyanmobdB

Fig. 4.5.1
Use Cases
This type is generally used when you need quick performance for basic Create-Read-Update-Delete operations any
data is not connected.

Example
- Storing andretrieving session information fora Web pages.

- Storing userprofiles and preferences


— Storing shopping cart data for ecommerce

Limitations

It may not workwell for complexqueries attempting to connect multiple relations of data.

If data containslot of many-to-manyrelationships, a Key-Valuestoreislikely to show poorperformance.


2. Column store database
— Instead of'storing datain relational tuples (table rows), it is stored in cells groupedin columns.
It offers very high performanceanda highly scalable architecture.

cassandra

T®:
ELBASE

OM) nyPertaBe«
Amazon SimpleDB amazon
.

Fig. 4.5.2
Scanned by CamScanner
NoSQL
w Big Data Analytics(MU)

Examples

(i) HBase (ii) Big Table (iii) Hyper Table


Use Cases

Some commonexamples of Column-Family database include eventlogging and blogslike documentdatabases,


but the data would bestored in a different fashion.
have each row key formatted in such a way to
In logging, every application can write its own set of columns and
Promoteeasy lookup based on application and timestamp.
way to count or
Counters can be a unique usecase.It is possible to design application that needs an easy
incrementas events occurs.

3. Document database
Document databases works onconcept of key-valuestores where “documents” containsa lot of complex data.
Every documentcontainsa uniquekey,usedto retrieve the document.
tured
Key is used forstoring, retrieving and managing document-oriented information also known as semi-struc
data.

fee}) IBM Cloudant* ,

§ mongoDB
yar

AR .
{ea}
terrastore
JrientDB
Gd .
Couchbase
mR

RAV geet Aneel OS

Fig. 4.5.3

Examples

(i) MongoDB (ii) Couch DB

Use Cases

— The example of such system would be eventlogging system foran application or online blogging. ~
— Inonline blogging useracts like a document; each post a document; and each comment,like, or action would °

be a document.

Scanned by CamScanner
: me, post ¢ ontent, or timest
t ampof :
All docu ents would Id cicontain ntain info:
information about the type of data, userna

documentcreation.
Umitations
~ It's challenging for document store to handle a transaction that on multip
iple documen ts.
— Document databases maynotbegoodif data
is required in aggregation.
4. Graph database

Datais stored as a graph andtheir relationshipsare i acts like a node.


stored asa link between them wl hereas entit y
Examples

(i) Neo4j (ii) Polyglot

Neo4j

InfiniteGraph

neniDB

twitter / flockdb
Fig. 4.5.4
Use Cases

friends, friends of friends, likes,


and so on.
The Google Maps can help you
to use gra Phs
to easily model their
shortestroutesfor directions, data for finding
close locations or
building
Many recommendation
systems makeseffecti
ve use of this model,

Scanned by CamScanner
- NoSQL
Big Data Analytics(MU) 49

Limitations
variations.
- Graph Databas:
P es maynot beoffering better choice over other NoSQL
- pplication needstoscale horizontally this may introduces poor performance.
If applicati

- i needs to updateall nodes with a given parameter.


Notv ery efficient whenit

5. Comparison of NoSQLvariations
lability.
Key value store database

‘ Columnstore database High Moderate

Document store database High Variable (High) High

Graph database Variable Variable High |

4.6 Benefits of NoSQL


1. Big data analytics
rity of NoSQL.
Big datais one of main feature promotes growth and popula

— NoSQLhas good provision to handle suchbig data.

2. Better data availability


ronments.
NoSQL database workswith distributed envi
iple data servers.
ld provide good availability across mult
NoSQLdatabase environments shou
formance.
NoSQL databases supply high per

3. Location independence
i ion of database operation.
write database regardless oflocat
NoSQLdata base can read and
Management
47 Introduction to Big Data s: Sending
a. Peo ple upl oad /do wnl oad vid leos, audios, images from variety of device
y huge dat
Weall are surrounded b sApp, Twitter status, comments, online
edia messages, UP dating their Faceboo! k, What
text messages, multim
huge data.
ising etc. generates
shopping, online advert l growth of data the analysis of
era te and kee p hug e data too. Due to this expon entia
e to gen
As a result machines hav
becomes challeng' in
g and difficult.
that data dously
gh ve lo ci ty an d a va ri et y of da ta . This big datais increasing tremen
me, hi
’ means huge volu
- The term ‘Big Data
day by day. h a Big Data.
ande xis tin g tool s are faci ng difficulties to process suc
gement systems
- Traditional data ma! ina used for data analysis
ion and research. It is al iso widely
tistical educat
on e of th e ma in co mputing tools usedin sta
- Ris research.
l co mp ut in gi n oth er fields of scientific
~ and numerica

Tech!
Puptications

Scanned by CamScanner
ig Data Analytics(MU)
4.8 Big Data
We all are surrounded by hug

edata. People upload/downl
oad videos,audios, images from variety
Sending text Messages, mu
ltimedia messages, updati “eeut
shopping, online adv
ng the ir Fa ce bo ok , WhatsApp, Twiter a
ertising etc, “
Generates huge data. As a result machines have to generate and keep huge data too Du
. e to this exponentiai l grows
of data the analysis of that da
ta becomes challenging anddiff
icult.
The term ‘Big Data’ Meanshuge volume, high vel
ocity anda variety of data. This big data is increa
sing tremendous}
dayby day. Traditional data Managementsystemsandexistingtools are facing difficulties to process
sucha Big Data,
Big data is the Most important
technologies in modern world. It
is really critical to store and manage
Collection oflarge datasets that cannot It. Bigis
be processed using traditional computing
techniques.
Big Data includes huge volume, high velocity and extensibl
e variety of data. The data in it may be structured
Semi Structured data or Unstructured data. data
Big dataalso involves various tools, techniques
and frameworks.
Four Important of Big
Data

Scale of data Different


forms of data

Analysis of Uncertainty
of data

Fig. 4.8.1 : Four V ofbig data


1. Volume
Huge amountof data is generated during big
data applications,
2. - Velocity
Fortimecritical applications the faster
Processing is very important. E.g. share market ting, vi
deo streaming
3. Variety

The data maybe Structured, SemiStructured or Unstr


uctured.
4. Veracity
Data is not certain. Data captured can
vary greatly, So accuracy of analysis depen
ds on thever:
acity of the source data.
4.8.1 Tools Used for Big Data
1. Map Reduce
Hadoop,Hive, Pig, Cascading, Cascalog, mrjob, Caffe
ine, S4, MapR,Acunu,
Flume, Kafka, Azkaban,
2. Storage Oozie, Greenplum
$3, HadoopDistributed File System

Scanned by CamScanner
W_Big Data Analytics(MU) an NoSQL
3. Servers

EC2, Google App Engine, Elastic, Beanstalk,


Heroku
4. - NoSQL

Zookeeper, MongoDB, Cassandra, Redis, Big Table, Hbase,


Hyper table, Voldemort, Riak, Couch DB
5. Processing
R, Yahoo! Pipes, MechanicalTurk, Solr/Lucene,ElasticSearch, Datameer, BigSheets, Tinkerpop

4.8.2 Understanding Types of Big Data Problems


1. Acquiring data

High volumeof data and transactionsare the basic requirements of big data. Infrastructure should support the same.
Flexible data structures should be used for the same. The amountof time required for this should be asless as
possible.

Eee

, Predictablelatency
transactionvolurr
lexible data structures~

2. Organizing data

As the data maybestructured, semistructured or unstructuredit should be organizedin a fast and efficient way.

3. Analysing data

Data analysis should befaster andefficient. It should support distributed computing.

4.9 Four Ways of NoSQL to Operate Big Data Problems


4. Key-value store databases

- This is very simple NoSQL database.

— Itis specially designedforstoring data as a schemafree data.


Such data is stored in a form of data along with indexed key.

Fig. 4.9.1 : Exampleof unstructureddata for user records

Scanned by CamScanner
Big Data Analytics(MU)

Working

The schema-less format of a keyvalue databaseis required for data storage needs.
Thekey can be auto-generated while the value can beString.
,
The keyvalue uses a hash table in which thereexists a unique key and a pointer to a particularitem of data.

logical group of keys called as bucket, There can beidentical keys in different buckets.
It will improve performance becauseof the cache mechanisms that accompany the mappings.
is hash (Bucket+ Key),
Toread any value you need to know boththe key and the bucket becausethereal keyis a ha Y
Read Write values

Get(key) : It will returns the value associat


ed with key.
Put(key, value) : It will associates the
value with the key.
Multi-get(key1, key2,.., keyN) : It will returns the list of values associated with thelist of keys
.
Delete(key) : It will delete entry for the key from
thedata store.
2 Column store database

Instead ofstoring data in relationaltuples (table rows),i


t is stored in cells grouped in columns.
It offers very high performance and a highl
y scalable architecture.

Row-oriented database

2001-01-01

:
[275]200502-07 [tones[aim]
[S14 amme001_[ Young[sue]
Emp_no| Dept_id | Hire_date Emp_in] Emp_in
1 1 2001-01-01 Smith Bob
2002-02-01 Jones Jim
@}alalaln

olmimfafa

2002-05-01 Young Sue


2003-02-01 Stemie| Bil
1999-06-15 Aurora} Jack C Column-oriented database
2000-08-15 Jung Laura} |. ae Z

Scanned by CamScanner
Data Analytics(MU
4-13
created at runtime.
Read and write Is do
ne Using columns.
It offers fast search/
access and data aggr
egation,
Data Model

ColumnFamlly : Single struct


ure that can group Columns and
SuperColu mns
Key : The permanent nameof
the record having different
numbers of columns.
Keyspace : This define
s name of the applicati
on.
- Column It has an orderedlist of elem
ents with a name anda value defined.
Examples

(i) HBase (ii) Big Table (iii) Hyper Table


Documentdatabase

Document databases works on conceptofkey-value stores


where “documents” contains lot of complex data.
— Every document contains a unique key, used
to retrieve the document.
— Key is used for Storing, retrieving and managing docume
nt-oriented information also known as semi-structured
data.

Relational data model Documentdata model


Highly-structured table organization Connection of complex documents
with rigidly-defined data formats with arbitrary, nested data formats
and record structure. andvarying "record" format.

Fig. 4.9.3 : Document database

Working

- This type ofdataIs a collection of key value pairs Is compressed as a documentstore quite similar to a
key-value
store, , but the only differenceIs that the values stored Is known as “documents” has some defined structure and
encoding.

- The above example showsdata valuescollected In Colum family.

= JSON and XML are common encoding as above.

- Its schemaless data makes easy for JSON to handle data.

Examples

(i) Mongo DB
(ii) Couch DB

Techitnowledg’
Puniications

Scanned by CamScanner
g Data Analytics(MU)
4. Graph database

Data is stored as a graph and their relationships are stored as a link


between them whereasentity acts like a node.

Data graphs

Fig. 4.9.4 : Graph database

Working j
In a Graph NoSQLDatabase, a flexible graphical representation is used with edges, nodes and properties which.
provide index-free adjacency.

Data can beeasily transformed from one modelto the other using a Graph Base NoSQLdatabase.
These databases use edges and nodesto representandstore data.

These nodesare organised by somerelationships, whichis represented by edges between the nodes.

Both the nodesandtherelationships have properties.

Scanned by CamScanner
Big Data Analytics(MU) 415 NoSQL

Examples

(i) Neo4j (ii) Polyglot

4.10__Analyzing Big Data with a Shared-Nothing Architecture


Parallelism in databases represents oneof the most successful instances ofparallel computing system.
Types :

1. Shared Memory System (UNIX Fs)

2. Shared Disk System (ORACLE RAC)


3. Shared Nothing System (HDFS)

4. Hierarchical System

4.10.1 Shared Memory System


(a) Architecture details

- Multiple CPUs are attached to a common global shared memory via interconnection network or communication
bus.

- Shared memory architectures usually have large memory caches at each processor, so that referencing of the
shared memory is avoided whenever possible.

- Moreover, caches need to be coherent. That meansif a processor performs a write to a memory location, the
data in that memory location should be either updated at or removed cached data.

Interconnection network

P = Processor
‘Commonglobal shared memary(M) D=Disk
M= Memory

Fig. 4.10.1 : Shared memory system architecture

(b) Advantages

- Efficient communication betweenprocessors.

Data can be accessed by any processorwithout being moved from oneplaceto other.
y writes.
— Aprocessor can send messages to other processors muchfaster using memor
(c) Disadvantages

— Bandwidth problem

Notscalable beyond 32 or 64 processors, since the bus or interconnection networkwill get into a bottleneck.
ng time of processors.
~ More numberof processors can increase waiti

Scanned by CamScanner
4.10.2 Shared Disk System
(a) Architecture details
But,Y everyY processor has| local
— Multiple processors can access all disk directly via inter communication networ! k.
memory.

Shared disk has two advantages over shared memory.

Each processorhas its own memory; the memory busis not a bottleneck.

System offers a simple wayto provide a degreeoffault tolerance.


The systemsbuilt aroundthis architecturearecalled clusters.
'@]4— Local Memory

Shared memory disk


Fig. 4.10.2 : Shared memory disk architecture
(b) Advantages

Each CPU orprocessor hasits own local memory, so the memo


ry buswill not face bottleneck.
High degree offault toleranceis achieved.

Fault tolerance : Ifa processor(orits memory)f


ails, the other processor can take overits
databaseis present on disks and are accessibleto all Processors, tasks, since the

If one processor fails, other processors


can take over its tasks, since database
accessible from all processors. is On shared disk that can be

(¢) Disadvantages

— Some Memory loadis added to each processor.


— Limited Scalability : Not scalable
beyond certain point. The shared-disk
large amounts of data are shipped architecture faces this problem
through the interconnection network. because;
subsystem is a bottleneck. So now the interconnection to thedisk
— The basic problem with the
shared-memory and shared
added, existing CPUs are slowed
down
bandwidth

(d) Applications

Scanned by CamScanner
4.10.3 Shared Nothing Disk Syste;
(a) Architecture details

EacJ hproces: sor has iti s
own local memory and
reine “e local di
A processor at one no nn municate with another processorus
de ma y communni ing high speed communicatio net
i ost n work.
3
le whi.ch functiionsas Server for data that is stored on loc
al disk.
— Moreover,the ii interconnection net‘tworks for shared nothing systems areusually designed to bescalable, so that
wecan increasetr: ‘ansmi 7 :
ission capacity as more nodes are added to the network

Disk
M= Memory

Local Memory

Shared memory
Fig. 4.10.3 : Shared nothing architecture

(b) Advantages
interconnection network queries which
— In this type of architecture no need to go through all 1/0, Onlya single
ough the network.
access nonlocal disk can pass thr
.
erof CPU anddisk can be connected asdesired
- Wecana chieve High degree of parallelism.i.e. Numb
ure syst ems are morescalable andcaneasily support a large number ofprocessors.
Shared nothing architect

(c) Disadvantages
res since sending data
n and of nonlocaldisk access is higher than other twoarchitectu
= Cost of communicatio
n at both ends.
involves so’ ftware interactio
partitioning.
— Requires rigid data

(d) Applications
ase architecture.
tabase machin e uses shared nothing datab
~ The teradata da
otypes.
Gamm a research prot
~ Grace and the

TechKnsledga
punticat

Scanned by CamScanner
¥ Big Data Analytics(MU)
4-18
4.10.4 Hierarchical System

Architecture details
.
The hierarchical architecture comes with
combin: ed characterist
‘stiics of shared mem
ory, s hared disk and shred NOthini
architectures,

At the top level, the system


Consists of nodes connecte
d b y an interconnection network and they donot share diskg
Memory with one another,

This architecture is at
tempts to reduce the
com plexity of programming such systems yields to dis
memory architectures,
where logically there
is a si ingle shared memory, the memo
tributed virtua
system software, allows ry mapping hardware coupled witl
each Processor to view
th disjoint memories as a single virtual mem
ory.
The hierarchical architecture is also
referred to as nonuniform memory
Hierarchical arch architecture (NUMA).
itecture

IP Network
IP Network

Scanned by CamScanner
Big Data Analytics(MU)

Distribution
——— ofwork load
to slave nodes

databases,
As all nodes has the samepr
iority, so the requests fro
m data base users will be rec
irrespective of work load eived by any of the nodes
distribution.

plication mechanism where data pack


ets will be replicated for certain
Part of the data basewill get cras
hed.
Fig. 4.11.2 showsPeer to Peer
model.
Node 2

Node 1

Node 6

Node 4

Userrequest Node 5

Fig. 4.11.2 : Peer to peer model

Scanned by CamScanner
Big Data Analytics(MU) 4-20 NoSQu
4.11.1 Big Data NoSQL Solutions

There are many NoSQLsolutions to Bigdata problemsaslisted below,


1. Cassendra 2 DynamoDB

3. Neo4J 4. MongoDB

5. Hbase 6. Big Tables

4.11.1(A) Cassendra
1. Introduction

Cassendra is a distributed storage system mainly designed for managing large amount of structured across
multiple servers.

Cassandra run with hundreds ofservers and manages them.

Cassandra provides a simple data modelthat supports dynamic control overdat


a.
Cassandra system was designed to run with economic hardware to handle
large amount ofdata.
For Example, Facebook runs on thousandsofserverslocated in manydata
centres using Cassandra.
The cassendrafall in the category of NOSQL data bases. In NOSQL
the data bases doesn’ t have the schema ie. a
schema less data bases.

The Main aim behind the development of cassesndra


is to deal with the complexities involved in the Big
Data
which makes useofno. of nodes aswell asit tries to
avoid thecritical part known as Single Point Failure
.
Cassendra has DataReplication asits major
features, in which the data packetcan
duplicated no. of times.This
duplicationis generally limited to a pre defi
ned.threshold value. For Replication a Goss
ip Protocolis used.
Nodesor work stations Participating in
the database will have equal priority,
access Privileges yet they all are
connectedto each other with dedicated communic
ation links.
When userof a data base requests
to have read / write Operation then
any available node can receive
request andprocessit. this

Basic building blocks of cassesndra

Anode: The Node represents the


place wherethedat a is actually stored,
Thedata center : The Data centeri
s theset of nodes which proces
s same Category ofdata
The cluster : A group of data
centers is known as Cluster,
The commit log ; The Co
mmit log is a maintenanc
etool which recovers the data if some
failure occurs.

Apache’s HBase

MongoDB

Scanned by CamScanner
Features of Cassandra

W Scalability
Cassandrais highly scalable system; it al: .
ement.
Ai) 24X7Availability SO allow to add more hardwareas per data requir

- Cassandra less chancesoffailure.


-_ Itis always available for
business applicati, lons.
e
(iii) Good performanc
i ly, which means,it increases your outputs as you increase the number of
Cassandra system canbescale d uplinear
nodesin the cluster.
(iv) Good Data storage
Cassandra can si tore almost all possi1 ble data formats including: structured, semi-struc
tured, and
-
unstructured.

It can manageall changeto yourdatastructures as per your need.

(v) Data distribution


rs.
per need of data replication across multiple data cente
Cassandra system providesflexible data distribution,as
(vl) Transaction support
which is properties of
city, Consistency,Isolation, and Durability
Cassandra supports properties like Atomi
transaction i.e.ACID.

— Faster write Operations


de signed for using low cost .
- Cassandra system was mainly
ciency.
te ope rat ion s an d can st ‘ore huge amountofdata, with very good readeffi
Itis designed faster wri
4. History of Cassandra
g facebook inbox.
atF acebook for searchin
- Cassandra was developed
08.
was open sourced
byF acebookin July 20
- Cassandra system
tedinto Apache Incubator in 2 009.
- Cassandra wasaccep

4.11.1(B) Dynamo DB <>

ry JJ amazon
DynamoDB
Fig. 4.11.3

is ' .
1 Data Model table is a collection of various items and each item is a
D namoDB in form of
base called
Amazon’s NOSQL data WH lettnentetet
:
collection ofattributes.
Scanned by CamScanner
Big Data Analytics(MU)
ofits ci columns with dat a
f d sch
sch ema of
oO tab les
le: w th p primary ry key key andlist
tablele ha a fixed
al abase, a a tab
na relational dat
types.
- Alltuples are of same schema.
— DynamoDBData modelcontains,
|
o Table
o Items
o Attributes
requireto defineall of the attrijbute nam
es and data types:
— DynamoDBrequires only a primary key and does not
in advance.

— Each attribute of DynamoDBin an item is a name-valuepair.

Primary Key

identifiesifies item
ich h ident
In orderto create table we mustspecify primary key column name whic i n table uniquely.
set iin

Partition Key

- Asimple primary key hasonly one attribute knownasthe partition key. 4

— DynamoDBusesthepartition key as input to hash function forinternal use. / |


Partition Key and Sort Key

— Acomposite primary key is madeoftwoattributes.



— Thefirst attributeis the partition key, and the secondattributeis the sort key.
All items with the similarpartition key are stored together, sorting is done in sorted
orderusing sort key value.
DynamoDB Namespace

evils)ehcp)
(Hash’key) Ig Allnbutes

Cisse ce}
(Range key) CNtual oly ces

Scanned by CamScanner
Big Data Analytics(Mu)

1015
njectName = "JDB"
SBN = "111-111"
thors = ['Author 3"] ”
rice = 1543
ageCount = 5000
bl cation = TechMax

3. Data Type
ous data types
— Amazon DynamoDBsupports vari
Binary, Boolean, and Null.
- Scalar types : Number, String,
Map.
— Document types + List and
Set.
Number Set, and Binary
- Set types : String Set,

CRUD Operations
(a) Table Operations

(i) CreateTable
your account.
te new table on
— Itisused to crea mmand.
bl e w e ca n us e Describ eTable co
of ta
— Tocheck status

TechKnomledgi
Publications

Scanned by CamScanner
a
"ProjectionType": ‘string"

'ProvisionedTh ugh
"ReadCapacityUni nbe :
"WriteCapacityUnits": number

(ii) Readtables

The readtable operation used to read the tables which


are created b Y create table command
diagnost
with the help &
ic tools such aslist tables. Returnslist of table namesass
ociate cd with the current account.
(iii) DeleteTable
ibaaa

The DeleteTable operation deletes a


table anditems associated with it.
(iv) UpdateTable

Modifies throughput, global sec


ondary indexesfor a given tabl
e.
SE
Scanned by CamScanner
Big Data Analytics(MU)

{b) Item Operation


(i) BatchGetitem

(ili) Deleteltem

It will delete a single


PSE

item withhelpofits
pri mi
(iv) Getitem
_
; GetIte
; m operation returns a setof fF attr ites
ibu for i item with
it the give
i giver n prima
i ry ry key.
Senate et halasibeak

key

It wil ll create a new iitem, orreplaces an old item with


a iew item.
(vi) Updateitem
Edits an existing item's attributes to add new item to the table if it does not alreadyexist.

_ (c) Others
(i) Query
A Query operation uses the primary keyofa table to directly access itemsfrom thattable.
(ii) Scan
TheScan operation returns one or moreitems anditem attributes.
DynamoDB

Fig. 4.11.5

5. Data Access nn port layerservices.


es HTTP and HTTP:
S as a trans
a we l b se rv ic e us
DB is
- Amazon Dynamo ag e serialization fo
rmat.
be ;us ed as a me ss
ottaa’tion (JSON) can
— JavaScript Object No DB w eb service API.
s re qu e! st s to the Dynamo
de make
- Application co DKs).
Software Devel lopment Kits (S
| to use AWS ion, and connection management.
- Itis possible authel nication, serializat
of request
aries take ¢ are
- DynamoDB API libr

Scanned by CamScanner
Data Analytics(MU)

MapSuite
Map Data DynamoDB Extension Consume
Map Data

6. Fig. 4.11.6
Data Indexing
4
— In orderto have
efficient access to
data ina table, q
Primary key attrib Amazon Dynamo
utes, DB creates and
maintains indexe
s for the|

with attributes ot
herthan ‘the Prim
ary key,
Secondary Indexe
s

Scanned by CamScanner
Bitig Data Analytics(MU)
(MU) az NoSQL,
lo
Types of secondary indexes ;
{i) Global Secondary index
Anindex witha partition key andsort key,different from index onthe table.
{ii) Local Secondary Index

Anindexthat has the same Partition anddi


fferent sort key.

Q.1 Write a short note on Big Data.


Q.2 Compare SQLand NoSQLdatabases.
Q.3 Write a short note on Casssendra.

Q.4 Write a short note on DynamoDB.

Q0a0

Scanned by CamScanner
Mining Data Streams

Syllabus

The Stream Data Model: A Data-Stream-Management System, Examples of Stream ee


Queries, Issues in Stream Processing, Sampling Data techniques in a Sian, Filtering aedaeMMari
Filter with Analysis, Counting Distinct Elements in a Stream, Count-Distinct Problem; y Bendis
Algorithm, Combining Estimates, Space Requirements, Counting FrequentItems in 2 Stream, c im "
Methodsfor Streams, Frequent Itemsets in Decaying Windows, Counting Onesin a Window: The °
Exact Counts, The Datar-Gionis-Indyk-Motwani Algorithm, Query Answering in the ‘'DGIM Algorithm,
Decaying Windows.

5.1__ The Stream Data Model


ce T

A data streamis a flow of data which arrives at uncertain intervalsof time.

Thedataavailable from a stream is fundamentally different from the data stored in a conventio
nal databasein the
sense that the data available in a database is complete data and can be processed
at any time we wantit to be
processed. On the other hand,stream data is not completely available at any
one point. of time. Instead only some
datais available with which wehaveto carry on ourdesired processing.

Anotherproblem with stream data is the storage of


the entire data received from the stream. The rate
of arrival is so
rapid as well as the volumeof data is so hugethati
t is not Practically possible to save the entire data
in a database
Ratherit is necessary to summarize the stream data by taking approp
riate samples.

Generally, we have to perform the following fundam


ental operations to handle input data stream
:
(i) Gatherinformation from the input data stream,

(ii) Clear orfilter the data,

(iii), Apply standard modeling techniques,

(iv) Deploy the generatedsolutions,

Diagrammatically we may represent the


above four Operations as shownin Fig.
5.1.1

Scanned by CamScanner
ay ——=
Big Data Analytics(MU)
Mining Data Streams
—=—_=>——_— 5-2
3

Fig. 5.1.1 : Stream data operations


a database or processed
Whendata stream ordataflowarrivesat the compute node,then eitherit has to be storedin
immediately. Otherwise the datais lost forever.
= The major issuesin this storage/processing task are :

{i) The incomingrateofarrival of data is tremendous, and

(ii) Different streamscontainingdifferent categories of data items


compute node.
(images, text, alphanumeric, encoded, video data etc.) may be coming to the same
storagewill neverbe sufficient to accommodate it.
It is practically not feasible to store the entire stream data as the
data in some wayor other. Every algorithm dealing
- Thus,it becomes necessary to summarize the incoming stream
ed to
samples of the stream are taken andthe stream is filter
with stream data uses summarization. For this purpose,
computations.
remove those portions which are not required for the
in the stream using just the samples obtained in the
- The next step is the es timation of all the different elements
storage requirements.
previousste, p. This drastically reduces the
gories :
sified into the following two cate
- The summarization algo rithms may be clas
ms described above which make use of good sampli ng techniques, efficient
(i) Thefirst category involves algorith'
am elements.
(removal of irrelevant data items), and estimation of the stre
filtering
sed on thi e concept of window. A window is nothing but the
most
(ii) In the second cate gory, algorithms are ba a re la ti on in a
. Qu er ie s ar e th en ex ec ut ed on the window considering it as
recent n elements 0 f the st
ream
necessary to even summariz e the
aba se. But the siz e and/ or the number of windows may belarge, so it is
dat
windows.

5.1.1 A Data-Stream-Management System LVVOR OSes PAYEDaL

er similar to convention al relational database-management system.


system.
is vi ystream-management
A data-stream-management system
of 8 dat
Fig. 5.1.2 represents the organization s TechKnomledgi
Publications

Scanned by CamScanner
Big Data Anal tics(MU)

Ad - hoc querias 1
® input I
streams
1.2,3,4,5,6 ©*

ASD3N216P a
i z sooam
pres a
Data streams precessing
0110 101010010 processor
Time factor <—

©
Active / working

Limited
In size

Fig. 5.1.2:A data-stream-management system


Let us understand the Purpose of each
of the components of the system :
Input streams : The input streams
have the following characteristics:
© There can be one or more numbero
finput streams entering the system.
© The streams can have diffe
rent data types.
© The rate of data flow of each
stream maybedifferent.
© Within a stream the time inte :
rval between the arrival of data item
s maydiffer. For example,
data item arrives after 2 ms from thear suppose the second
rival ofthe first data item, then it
is not necessary that the third
will also arrive after 2 ms from thea data item
rrival of the seconddata item. It may
arrive earlier or even later.
2. Stream processor : All types of
Processing such as sampling,
cleaning,filtering, and queryi
are done here. Two types of quer ng on the input stream dat
ies are supporte d which a re standing querie a
s and ad-hoc queries. Wes |
both the query typesin details in the hall discuss ||
upcoming section,
|
}
|

Be are;
queries directly on the archival store
is not supported. Also,
comparedto thefetching ofdata fro . the fetching of data from this store tak
m the workingstore, es a lot oftime as
‘5. Output streams : The output cons
ists of the full

The difference between a conven


tional database.
in case of the database-man
agement system all of th

necessary Precautionary meas


ures,

Scanned by CamScanner
gig Data Analytios(MU)
6 Examples of Stream Sources Mining
ing DataData Streams
Strea!

Sensors are the


devices whichare sponsible for: read
r, i ing and sending the measur
ements of various kinds of phys
ical
» Wind speed,
Pressure, moisture
content, humidity, pollution level, surface height,

- High resolution image stream


s are relayed to the earth stat
i
of image data per day. Many such high
“resolution images are released for the public from time to time
by NASA as
well as ISRO.
- Lower resolution image streams are Produced by the
CCTV cameras Placed in and around important places
and
shopping centres. Now a days most of the Public places and some
ofthe private properties are under the surveillance
of CCTV cameras 24x7.

3. Internet services and web servicestraffic

- The networking components such asswitches and routers on the Internet receive streams ofIP packets and
route
them to the proper destination. These devices are becoming smarter day by day by helping in avoiding congestion and
detecting denial-of-service-attack,etc. . w
~ Websites receive manydifferenttypesof streams.Twitter receives millions of tweets, Google receives tens of millions
of search queries, Facebook receivesbillions of likes and comments,etc. These streams can be studied to gather
useful information such as the spreadofdiseasesor the occurrence of some suddeneventsuch as catastrophe.

51.3 Stream.Queries

Standing queries
Ad-hoc queries

Scanned by CamScanner
Big Data Analytics(MU)
Mining Data Stream,’

_Stansingqueries : eischeequeries.
. Fig. 5.1.3 : Query types
1. Standing queries
|
— Astanding query isa query which is stored in a
designated place inside the stream processor. The
standing queries arg
executed whenever the conditions
for that particular query becomes tru
e. |
For example,if we take the case of a temperature senso |
r then we might have the following standing queriesin
‘i the |
stream processor : |
|
© Whenever the temperature exceeds i
50 degrees centigrade, output an alert.
|
© Onarrival of a new temperature
reading, producethe averageof all the readings
arrived so far starting from the
beginning. |

© Outputthe maximum temperature ever recor


ded by the sensor, after every newreadingarrival. |
2. Ad-hoc queries

An ad-hoc query is not predefined andi |


s issued on the go at the currentstate of the j
ad-hoc queries cannot be determined streams. The nature ofthe
in advance.
To allow a widerange of arbitrary ad-hoc queri
esitis necessary tostorea sliding windowof
all the streams in the |
working storage. A sliding window is nothi
ng but the most recent elements in the stre
elements to be accommodated in theslidin am. The numberof
g window has to be determined beforehand. |
elements arrive, the oldest oneswill be removed
As and when new |
from the window and hence the name slidi
ng window.
— Instead of determining thesize of the sliding window |;
in advance we mayalso take another
the unit of time. In this approach the sliding window approach based on|
maybe desi igned to accommodate
say all the stream data |
for an houror a day or a month, etc. | |
— For example, a social networking website like Face |
book maywantto kn jow the number of
over the past one month. unique active users |

5.1.4 Issues in Stream Processing


ae

Theissues in stream Processing mainly


arise becauseof the following twobasic
reasons :
1. Therapid rateofarrival of stream data, and

The hugesize of the data when all of the inpu


t streamsare considered,
Becauseof the rapid arrival of stream
data , the processing speed also must match
the arrival speed of the data. T0 |
achieve this,

Scanned by CamScanner
2 Sampling Data Techniques
5. = ae 3 q ina Stream
ya sampling dt
A sample _ an a stream which adequately represents the entire stream. The answers to the queries on the
sample can be considered asthough theyare the answers to the queries on the whole stream.
Let us illustrate the conceptof sampling with the example of a stream of search engine queries. A search engine may
be interested in learning the behaviour:ofits users to provide more personalized search results or for showing
relevant advertisements. Each search query can be considered as a tuple having the following three components:
(user, query, time)
Obtaining a representative sample

- Thefirst step is to decide whatwill constitute the sample for the problem in hand. For instance, in the search query
stream wehavethe following two options:

o Take the sample of queries of each user, or

o Take the sampleof users and includeall the queries of the selected users.

Option number2 as a sampleisstatistically a better representation of the search query stream for answering the
queries related to the behaviour of the search engine users.

The next step is to decide what will be the size of the sample compared to the overall stream size. Here wewill
assume a sample sizeof 1/10th of the stream elements.
- Whena new user comesintothe system,a random integer between0 and is generated. If the numberis 0, then the
user’s search query is added to the sample.If the number is greater than 0, then the user’s search query is not added
to the sample. list of such usersis maintained which shows which user’s search query is to beincluded in the
sample.

> When a new search query comesin the stream from anyexisting user, then the list of users is consulted to ascertain
to beincludedin the sampleor not.
Whether the search query from the user is
For this method to work efficiently the list of users has to be kept in the main memory because otherwise disk access
a time-consuming task.
will be necessary to fetch thelist which is
ution to this
But as thelist of users growsit will be come adom
.
problem to accommodate it into the main memory Onesol
numbergenerator. The hash function will map a user to a number
Problem is to use a hash function as the rani hue
ry of the useris added to the sample and otherwiseitis not
between 0 to 9. Ifa user hashes to 0 then the search q
added,
ize of any rationalfraction a/b by using a hash function which maps a user to a
'n genera | we can create a samples
P i addedto the sampleif the hashvalueis less than a.
limber between 0 and b-1 andthe user's query Is
TechKnewledga:

Scanned by CamScanner
Each tuple in
the stream Co
nsi ists of n comport nents out of which a subset
of componentscalled key on
forthe sample is based, Which
For instance,
in the search
query ex ‘ample the key cons
User, query an ists of only one component useroutof t he three ¢;
dtime, Butit oMPONey
t is not always necessary to consideronly use
key or even th r as the key, we coul Id even make que
query a By
e pair (user, quer
y) as the key. Sthe|

the hash function has to do the


extra wol rk of Comb
ining th e

Scanned by CamScanner
Data Analytics(MU)
=8 ta
ing Data: Streams
i wi
o The criteria involve the look
in,
~
set is
huge andca nnot be stored in the mainSP of set
Memor y. Membership. In this case thefiltering becomesharderif the
.
Bloomfiltering is a filtering techni Que whichj i |
the criteria. of the tuples which donot satisfy
'S usedfor eliminating orrejecting most
example offiltering

_ Let us consider the exampleof s Pam email“ filtering. Let S be the set of safe and trusted email addresses. Assume that
the size of Sis onebillion email a
ddresses and the stream
consist of the pairs (email address,
j i email message). ;
_ The set S cannot be accommodated address is of minimu m 20 bytesin
ited in main memory because on average an email
size. So, to test for set membersh n S, jit becomes
ipip iin
ersh necessary to perform disk accesses. But as discussed earlier a disk
access is manytimes slowerthan main Memory access

- We can do the spamfilteringusing only the main memory and nodisk accesses with the helpof Bloomfiltering. In this
technique the main memoryis usedas bit array.

= Say we have 1 GB of main memory availablefor the filtering task. Since each byte consists of 8 bits, so 1 GB memory
contains8 billion bits. This means we havea bit arrayof8 billion locations.
- We now needa hash function h which will map each email addressin S to one ofthe billion locationsin the array.All

those locations which get mapped from areset to 1 and theotherlocationsareset to 0.

- As there are 1 billion email addresses in S and billion bits in main memory, so approximately 1/8th of the total
available bits will be set to 1. The exact countofbits that are set to 1 will be less than 1/8th because more than one
email address may hashto the samebit location.
to which it is
- Whenanewstream elementarrives, we simply need to hashit and check the contents of the bit location
the bit is 0, then the
hashed. If the bit is 1, then the emailis from a safe and trusted sender. On the other hand,if
email is a spam.

5.3.1 Bloom Filter with Analysis

Bloom filter are asfollo


WS >

The components of a
. / ;
Anarray of n bits initialized to O's.
ons :
ch maps 2 key to one of the bit locati
Asset H of k hash functions each of whi
hy hy, .., hy
keys- scantibonsei
Aset S consisting of m number of ~
ter al
Fig. 53.1illustrates the block diagram of Bloo™ al

Scanned by CamScanner
Mining Data
5-9
ig Data Analytics(MU) Input data stream

Elements
whose
keys are in'S'

Fig. 5.3.1: Bloom filter


s the set S andrejects all other tuples.
- Bloom filter accepts all those tuples whose keys are member of

— Initially all the bit locations in the array are set to 0. Each key is taken. from S and one by oneall of the k hash |
functionsin H are applied onthiskeyK.All bit locations producedby h,(K) are set to a.

hy(K), ha(K), --he(K)


— Onarrivalofa test tuple with keyK,all the k different hash functions from H are once again applied on thetest key kK|
If all the bit locations thus produced are 1, the tuple with the key K is accepted. If even a single bit location outof|
theseturns out to be 0, then the tuple is rejected.

Analysis of bloomfiltering

— The Bloomfilter suffers from the problem offalse positives. This means evenif a key is not a
memberofS,, thereisa - |}
chancethat it might get accepted by the Bloom filter.

Tofind the probability of a false positive we need to use the model


of a dart game in which darts are thrown at
targets. Assumethat thereare x darts and y targets and the probab
ility of anydart hitting any targetis equally likely,
then according to the theory of probability :

© The probability of a given dart not hitting a giventarget


is xt
x

© The The probal


probability of mo dart htingagven
itting agi targets:xy
x y=0-3
x xx
1
— Aswe know, (1-€) @ = 1
When € is small. Thus, wecan
say that the probability of no dart hitting
a given target is”
0-3x 4). (y
4
*<
ee
Scanned by CamScanner
unting
ng on the stre am dati a is that of co
le and important processi
Apart from sampling andfiltering, one more simp
the number of elementswhicharedistinct in the given data stream.
hm we can
mem ory . But with the help of hash ing and randomized alg orit
This also requires a large amount of main
amount of memory.
achieve @ good approximationof the countusing only a small

544 Count - Distinct Problem

ebookor Goog le which


Letu: illustrate the problem
ofcount-distinct with the example of a websitelike Amazon or Fi ac
their server and
to know the number of monthly active unique users. These numbers are useful in preparing
wants
well as for generation of advertisini g revenues.
otherinfrastructureforefficient load handling as
h have the
e assume that the elemen tsof the stream belong to a universal set. The universal set for sites whic
- Here wi Twitter,
rds). Examples of such sites are Facebook,
login facility will be the setofall logins (usernames and passwo
for searching, the universalsetwill be
the set of all
Amazon, etc. Butforsiteslike Google which do not require a login
IP addresses.
red on
keeptheentirelist of elements (users) that have appea
One wayof performingthedistinct user count will be to
structure such as hash table or search tree.
the stream in the main memory arrangedin some efficient
r she is already there in the list or not. If she is not there
— Whenever a new uservisits the website,it is checked whethe
, then no action is taken.
in thelist, she is addedto the list. If sheis already in thelist
le main memory. If the number of users
- This approach worksfinetill the number of users can easily fit the availab
simultaneously, thenit starts becoming a problem.
growsor there are a number of streams to be processed
of different ways :
~ This problem can besolved in a number
ementation.
nodes butit increases the complexity of impl
© Byusing a greater number of compute
ses the time complexity by a hugefactor.
© Bystoringthe list structure in secondary memory butit increa
Thus,instead of trying to find the exact countofthe distinct elements wecantry to find an approximate count. This
will ensure that the task is finished quickly using only a small portion of the main memory.

542 The Flajolet- Martin Algorithm


Wests MeNaera
:
in Flajolet-Martin (FM) algorithm to count distinct elementsina stream, —
attin Algorithm.
count diatinct elements in a stream? Explain Flajolet-M
InFlajolet-Martin algorithm in detall.
T he Flajolet-Martin
‘ algorithm is used for estimating the numberofunique elementsin a stream in a single pass. . Th The
ioe
complexity of this algorithm is O(n) and the space complexity is O(log m) where is the numberof elements iin

© stream and mis the number of unique elements.
Tech! <
Publications

Scanned by CamScanner
The major componentsofthis algorithm are :

© Acollection of hash functions, and


© A bit-string of length L such that 2'> n. A 64-bitstringi
ir s sufficient
i
for rr most cases.
Each incoming elementwill be hashed using all the hash functii ons. Highertl theis numberofdistin
ct elements
int,
stream,higherwill be the numberofdifferenthash values. , scat ve comthy,
Onapplying a hash function h on an elementof the stream e,tl ticeliste re :
he hashvalue h(e) is produced. into
equivalent binary bit-string. This bit string will end in
in some number ofzeroes. . For instance,
with 1 zero and 10001 ends with
no zeroes.
, is knownasthetaili length. If R denotes the
This count ofzeroes
maxiimumt
m ail ,len;igth el
of any element e Encounters,
thusfar in the stream, then the estimate for the
numberof unique elements in the stre
s
Nowto see thatthis estimate makes sense
we haveto use the following arguments usin
g Probability theory :
© Theprobability of h(e) havinga tai
l lengthof atleast ris 2°.
° The Probability that none of the
m distinct elements have tail leng
th ofat leastr is (ce
© The aboveexpression can also
be written as ((1—27F)27)™2—"
© And finally, we can reducei
t to em-rag (1-2-1)e-
1
Ifm>>2', the Probability of
finding a tail of lengthatleas
t r approaches 1.
Ifm<<2', the Probability of
finding a tail of length at least r appr
oaches0,
Wecan conclude that the
estimate of 2" is neither
going to be too low nor
Let us now understand too high.
the working of the algorith
m with an example
:
Stream:5, 3, 9,2, 7,11

Hashfunction :
nh) = 3x41 mod
32
h(5) = 3G) +1 mo
d 32.= 16 mod 32 =
16 = 10009
h3). = 33)+1 mod3
2= 10 mod 32 = 10= 01
nh) = 30) + 1 mo 01 0
d 32.= 28 mod 32 =
28 = 11109
A(2)= 9Q) +1 mod
32 =7 mod 32=
7 = 00111
h(7) = 97) +1
mod 32 = 22 mo
d 32 = 29 = 10
AC) = 911) 119
+1 mod 32 =34
Tail lengths: {4, mod 32 = 2 = og
1,2, 0, 1, 1} o19

Scanned by CamScanner
pig Data Analytics(MU)
5-12 : "Mining Data Streams
43 Combining Estimates
5.

Si
_ Thereare three approa _
Pproaches for combining the estimates from the different hash functions :
o Averageofthe estimates, or
o Median ofthe estimates, or
o The combinationof the
above two.
_ Ifwe take the averageof the estimate
. sto arrive at thefinal estimate thenit will be problematic in those cases where
oneor a few estimates are very large as compared to therest of the others.
Suppose theestimates from the various
hash functions are 16, 25, 18, 32, 900,23. The occurrence of 900will take the average estimate to the higher
side
although most ofthe other estimates are
notthat high.
- The median is notaffected by the problem described above. But a median
will always be a powerof2. So, for example
the estimate using a medianwill jumpfrom 2° = 256 to 2° = 512, and there
cannot be any estimate value in between.
So,if the real value of m is say, 400,then neither
256 nor 512 is a good estimate.
- Thesolutionto this is to use a combination of both the average and the median.
The hashfunctions are divided into
small groups. The estimates from the groups are averaged. Then the median of the averagesi
s calculated whichis the
final estimate.
- Noweven if a large value occurs in a group and makesits average large, the median
of the averageswill nullify its
effect on the final estimate. The groupsize should be a small multiple of log, m so that any possible
averagevalue is
obtained andthis will ensure that wegeta close estimate by using a sufficient number of hash functions.

5.4.4 Space Requirements

~ We do not need to store the elements of the stream in main memory.

~ The only data that needs to be stored in the main memory is the largesttail length computed so far by the hash
function on the stream elements.
> So, there will be as manytail lengths as the number of hash functio
ns and eachtail length is nothing but an integer
value,
~ Ifthere is only a single stream,millions of hash functionscan be used onit. But a million hash functionsare far more
than whatis necessary to arrive at a close estimate.
~ Only when there are multiple streams to be processed simultaneously, we haveto limit
the numberof hash functions

Per stream. Even in this case the time complexity of calculating the hash valuesis a bigger concern than the space
Constraint.
ms ina Stream
a Counting FrequentIte!

There are two main differences between 2 streal


m and file :

at somepoint.
© Avstream has no end while every file ends
ee SS

Scanned by CamScanner
Big Data Analytics(MU) Mining Data Strea
os Mg
© The time of arrival of a stream element cannot be predictedin advance whili e the da ta in a file is already avai, able, |
~ Moreover, the frequentitems in a stream at some point of time
5 i
may be diffe rentfrom the freq4uent items in the Same
stream at someother pointof
time.
~ i
To continue our discussion, we need to understand the concept of an itenee je market-basket modelof data
l init bockitlondledth aterne We
have twotypes of objects.Oneis items and the otheris baskets. Thesetofitemsina i
— In the next section we consider someof the sampling meth 5 i uent items in a
odsavailable for counting the freq Stream,
Wewill consider the stream elements as baskets of items.

5.5.1 Sampling Methodsfor Streams


— Thesimplest techniquefor estimating the frequent itemsetsin a stream is to collect
a few baskets and save them in i
file. Onthis file we can run any frequent-itemsets algorithm. This algorithm will producethe estimat
e of the current
frequent itemsetsin the stream. The other stream elements which arrive during the
execution ofthealgorithm can be
stored in a separate file for processing later.

— After the completion of the first iteration we can run another iterati
on ofthefrequent-itemsets algorithm with :
© Anew file of baskets, or

© The old file collected during the execution of the firstite


ration.
— Wecan go onrepeating the above Procedures for
moreiterations. This will result in a collection of
Ifa particular itemset occurs in a fraction of the frequent itemsets,
baskets thatis lower than the threshold, it can
collection. be droppedfrom the

© Anew segmentofthe baskets fro


m the stream can beusedas an input to
the algorithm in some iteration.
© Adding a few randomitemsets in the
collection and continuingtheiterations
,
5.5.2 FrequentItemsetsin Decaying
Windows
— .To use the concept of decaying
windows for finding the frequent
itemsets we need to keep the
following points in

© The stream elements are not


individual items, rather they are
baskets of items. This means
appearing together is considered a sing that manyitems
le elementhere.
© Thetarget here is tofind all of
the frequent itemsets and not j
other words, for each itemset receiv
ed weneed toinitiate the

unmanageable,

— To solve the first issue, we


have to considerall
each suchitem:

Scanned by CamScanner
iq pata Analytics(MU)
EB 5-14 ing Data Streams
fo solve the secondissue, onh Only thoseite:
3 -
cored sets are scored whoseall immediate proper subsets are already being
st = '

6 Counting Onesina Window

56.1 The Cost of Exact Counts


_ Again, there are two majorappr
oaches to the counti ng problem :
4, Exact count

2. Approximate count

- For the exact count approach, we needtostore the entire N-bit window in the main memory. Otherwiseit will not be
possible to compute theexact countofthe desired elements. Let us try to understandit with the following arguments.

- Let us suppose instead of storing the N-bit window in main memory, we store an n-bit representation of the N-bit
window where N> n.

- Now, number ofpossible N-bit window sequences= 2".

and, number of possible n-bit window representations = 2".

Since, N > n,it implies 2"> 2".


- Clearly we can see that the numberofrepresentations are notsufficient to representall possible window sequences.
Hence by the pigeon-hole principle there exists at least two different window sequences p and q which are
represented by the same window representation in main memory.

~ As p #q, it means they y must differ in at least one bit position. But since both of them are having the same
representation, the answt er to the query of the numberof 1’s in the last k bits is going to be the same for both p and q,
whichis clearly wrong. ,

~ Thus, from the above discussion we can cont clude that it is totally necessary to store the entire N-bit window in
Memory for exact counts.

56.2 The DGIM Algorithm (Datar - Gionis — Indyk - Motwani)


MU — May 16, May 18, Dec. 18, May 19
Rie

ion oof DGIM algorithm an N-bit window is representedusing O(log” N)bits.


simpl version
'n the simplest 7
. ber of 1's in the windowin caseofthis algorithm is 50%.
The maximum errorin the estimation of the num

roneet

Scanned by CamScanner
The twobasic components
ofthis algorithm are:
© Timestamps, and
© Buckets,

Size=8
Storage Space
requirements

gorithm can be
1. Asingle bucketis determined as
represente follows
d using O(log N) bi
2. ts. .
Number Of buck
etsis (log N)
,
3. Total space Fequir
ed = (log? N). "

5.6.

Scanned by CamScanner
its, th
! . one e size - en weobsey rvetl it 7
the size 1 buck ets,
inside poth ,
Thise
et.a e
meansin this case the oldest
4 bu tt , , 2 bucket and ath ees
e size 4 buckh
ine fom
timestam p buck eti s the size
qhusthe estimate of the number
o} f 1s Sin
i the latest
16 bits = (4/2) +2'
+1.+1=6, But the actual number of 1's is 7-
6.4 Decaying Windows
Windows?
Sanee
Ex
irreVeena

The problem with thesliding


outside the window.
the use of exponential
. is
A solution to this
eigh tage s to ly deca
i
ying win d lows whiich take into accountall of the ele
i
stream butassigndifferent wei them. The recent elements are given more al
ae ered the

older elements.
This type of windowis suitabl le for answeriing the queries on th @ most common recent ele!
most popular current movies, , or the eome”
Most popularitems bought on Flipkart recently,’ or the im
etc.

Supposethe elementsof a stream are :


Cty Cpr O3y oor Ct

wheree,is the oldest element ande,is the mostrecent.

Then the exponentially decaying windowis the following sum :


t-1 :
Jeo eit“
where cis a small constant such as 10°.

OO
vs decaying window
Fig. 5.6.1: Fixed length window
fixed len gth sliding window and an exponenti
ally decaying window of equ al i
s Fig. 5.6.1 showsthe difference bel tween a ow.
s the fixed length wind
weight. The rectangular box repre: sent
following steps are taken:
When a new elem entets:arrivesin t he stream, the

© Multiply the current sum by the term (1-c}, and


© Add ers. ut of of
at we don’t need to worry aboutthe older elements going out
ow is th
jo r ad va nt ag e of the decaying wind
ie ma
f new elements: -
e window onarrival o
uestions
int he context of Big Data.
Qy Whatis data stream ? Explain data stream operations
Q2 What is stream Data Model ? Explain in d otal.
Ra Saky with neat diagram.
lain the data-stream- management system

Scanned by CamScanner
ities
Big, Data Analytics(MU; 17 SESi
Q.4 List and explain various Datastream sources.
Q.5 What are stream Queries ? Explaindifferent Categories of stream Queries.
Q.6 Discuss different issuesin Data stream processing:
Q.7
‘ i 2
Whatis sampling ofDatain a stream ? How do weobtain representative sample ?
size?
Q.8 Explain General Sampling problem. Whatis effect on stream if we vary the sample
Q.9 Explainthefiltering process ofdata streamswith suitable example. *
Q. 10 Whatis bloomfilter ? Explain Bloomfiltering process with neat diagram.

Q.11 Whatis bloomfilter ? Analyze the Bloom filter for performance.

Q.12 Explain countdistinct problem with suitable example.

Q. 13 Explain Flajolet-Martin algorithm in detail.

Q. 14 Explain the process of combining the Estimates. Also commenton space requirements.

Q .15 Howfrequentitemsin a stream are counted ?


Q.16 Whatis cost of exact counts ?
Q.17 Explain Datar-Gionis-Indyk-MotwaniAlgorithm in detail.
Q.18 What is Decaying windows ? Explain in detail.

Scanned by CamScanner
Finding Similar Items

syllabus it
r e s : D e f i n i t ion of a Distance Me: a
7
e a n 5: stances, Jat card Distance, Cosine Distance, Ed
u
Distance Meas i"ng - stance. sure, Eucli d Di
tance, Hamm Di
Dis

61 Distance Measures

in the
tance measur e. Let x and y be two points
cal led a spa ce. A spa ceis necessa ry to define any dis
Asetof poi nts is input, and produces the
d a: s a fun cti on wi hic h tak es the two points x and y as
define
space, then a distance measureis ction is denotedas :
The distance fun
points x and y as output.
distance between the two

dx, v) the following axioms:


by the function dis a real number which satisfies
- The output produced
can never be negative.
stan ce be! tween any two points
1. Non-negativity : The di

d(x, y) 20 is zer o.
stanc .¢ betwee n a point and itself
2. Zero distance : The di

d(x, y) = Oiffx=¥ from to x.


x to y is s a me as the distance
he di stance from
3. Symmetry: T
ween
ual to the distance bet
d(x, y) = dly, x) ce between x andy is al
wa ys sm al le r th an or eq
test path between two
_ 4 gle inequ
Triangle uality :
ineqality rd s, di st an ce measureis tlhe length of the shor
wo
x and y via another
points x and y.
z
d(x, y) < d(x, z) + d(2, ¥)

xee- re Y
lity
iangle Inequa
Fig. 6-1-1 : Tr

Scanned by CamScanner
Big Data Analytics(MU)
oat in de details :
measuresin
distance
In this section we shall discuss about thefollowing

(1) Euclidean Distance (2) Jaccard Distance

(4) Editit Dista


(3) Cosine Distance Distance

(5) Hamming Distance

6.1.1 Euclidean Distances

distance measures.
— The Euclidean distance is the most popularout ofall the di ifferent
they
on the i
Eucli dean space . If ie consi der an n-d ime nsional Euclidean space
— The Euclidean distance is measured
sider the two-dimensional Euclidean
each point in that space is a vector of n real numbers. For example, if we con
real numbers.
space then eachpoint in the spaceis represented by(xs, 2) where x, and x; are re
i
which Pace
in the n-dimensional space |s
— The most familiar Euclidean distance measure is known as the L,- norm
defined as:

zi
(Ext, X2,---Xnb [Yas Yar-Yal) = > jz, 21 Ci
— For the two-dimensionalspace the L,- norm will be :

(Exp, x2], [yi y2}) = 1-1)? + 2-2)?


— Wecan easily verify all the distance axioms on the Euclidean distance :

1. Non-negativity : The Euclidean distance can never be negativeasall the sum terms(x;— yi) are squared and the
square of any number whetherpositive or negative is always positive. So the final result will either be zero
ora
positive number.
2. Zerodistance: In case of the Euclidean distance from a pointto itself all the x/'s will
be equalto the y/s. This in
turn will makeall the sum terms(x,—y,) equalto zero. So thefinal result
will also be zero.
3. Symmetry :(x,- y,)*will always be equalto (y; x;)”. So the
Euclidean distance is always symmetric.
4. Triangle inequality : In Euclidean space, the length ofthe side of a
triangle is always less than or equal to the sum.
ofthe lengthsof the other twosides.
— Someother distance measuresthatare used on the
Euclidean space are:
1. L,-norm where ris a constant :

4 (xtX25Yi YoYnl) = (z. by)


i=l
2. L,-norm whichis commonly known
as Manhattan distance :
4 ([x1,.X2,...Xh [15 Yon ‘Yal) = x" sy;
i=
3. L.=-norm whichis defined as
:
4 (D1, X2,.--%nh [Yt Youe-Jql) = max (Ix; — y,)) Vi

Scanned by CamScanner
Finding SimilarItems.
=

L-norm = (0-6+ @—7p


w

» a+{s
el gi
+
wn

Q L,-norm = 10-614+4—71
443
= 7
4-71)
L.-norm = max (110-61,
8)
max (4, 3)
4

6.1.2 Jaccard Distance

Jaccard distance between twosétsis definedas :


- Jaccard distance is measured in the space of sets.

d(x, y) = 1—SIM(x, y)
sures the closeness of twosets. Jaccardsimilarity is given by the ratio of
SIM( x, y) is the Jac car d simi lari ty whi ch mea
of the unionof the sets x and y.
the size of the intersection and the size
nce :
~ Wecan verify the distance axio! ms on the Jaccarddista
ion of two sets can never be mor ethan thesize of the union. This means
1. Non-negativity : Thesize of the inters ect d(x, y) will never be negative.
than or equal to 1. Thus
the ratio SIM(x,y) will always bea value less
en xU X= XO x =x. In this caseeSSIM(x, y) = x/x = 1. Hence, d(x, y) = 1-1 = 0.In other
2. Zero distance : If x= y, th
is zero.
ce betwee n the same set and itself
wordsthe Jaccard distan Jaccard
ricx Uy =yUxand xy =y Ax, hence
bo th un io n as we ll as intersectio! n are symmet
3. Symmetry : As
distance is also symmetric d(x, y) = 4(¥- x). minhash function
ca ce ca n al so b e con: sidered as the probability that a random
quallity : Jac rd distan
Triangle ineequa
sets x and y to the same V alue.
does not map both the
(z) # (yD)
Pih(x)# h(y)] < Plh(x) #b@)] + Pth n.
io
ndom minhashfunct
Whereh is the ra
Techiinomledgi
Pubtications
Scanned by CamScanner
Big Data Analytics(MU) Finding Similar

6.1.3 Cosine Distance

The cosinedistance is measured in those spaces which have: dimensio i


ns. Examplesofs uch spa
— ces are :
Pi
1. Euclidean spaces in which
the vector components are
real numbers, and
2. \
Discrete versions of Euclidean Spa
cesin which the vector compon i
ents areinteg
Boolea n (0 (0 and 1) 1). :
ers or
~ Cosine distanceis the angle made by the two vectors from theor 5
igin to the twopoininttsin
s i
the sp: ace. The ranBege of thi
angle is between 0 to 180 degrees.
— Thesteps involved in calculating the cosine distance given
twovectors x and y are :
1. Find the dotproduct xy:

(Cr ractah Diese3ed = De" my


i=
2 Find the L-norms of both x and
y,
3. Divide the dot product xy bythe L,— norms of
x and y to get cos 8 where6 is the angle between x and y.
4. Finally, to get @ use the cos? function.

— Let us illustrate the concepts


with an Ex. 6.1.2.
Thedistance axioms on the Cos
ine distance may be verified as
follows :
1 Non-negativity ; The Tange oft
he possible values is from 0
to 180 degrees. So there is
distance. no question of negative

rotation from x to y.
Ex.6.1.2: Consider the following two
vectors in the Euclidean
Space ;
x=[1,2,—1], andy =[2, 4, 4),
Calculate the cosine dis
tance between x and
y.
Soin. :
,
Given x=[1,2,-1] ;y=[2,1,
q
@ XY = [2] + [2x1] + (C1) x1]
2424C1)=242-]

= 4-133
Gi) Tynorm forx = OC
H ye
L,normfory = VORPO
_ - :
P
= 8 = Ge e
», @- no rm of y) “6

Scanned by CamScanner
61

The points in the space for edit dis


tance ate strings
There are twodiffere
nt Methodsfor defi
ining and Calc
ulating edit di
1. The classical method stances ;

~The Edit distance between


two strings X andy
Only twoedit operations are all is the least number of edit operationsr
equired to convert x into y.
owed:
1. Insertion ofa single charac
ter, and
2. Deletion ofa single character
.
-. Toillustrate let us take the following twos
trings :

x= JKLMN

y = JLOMNP
~ For calculating the Edit distance between x and y we have to convert string x into string y using the edit
Operationsofinsertion and deletion.
~ Atfirst compare the character sequences in both thestrings :

x = J K LM N

y= Jeb Oo M N P

bee ddd -
@ @@ ® © © — Positions

@are having different characters in x and y. So we need to make the ni ecessa ry


, Cearly, positions®,@and@are
insertions and deletions at these three Po:sitions.
kK Loo
Lo P >
pid
@ oo
haracter K. The characters following K will be shifted one
to delete the cl
From position @ ofstring x, We ha ve
Position to the left.

Scanned by CamScanner
Finding Similar
Big Data Analytics(MU)

status of thestring xis:


— After the first edit operation (deletion)the

x= J LM N

@2OoO®
character L and before the character M,
Now the character O hasto be inserted at position 3 i.e. after the

After the secondeditoperation(insertion) the statusof thestring x is:

x= J L OM N

budibib
®©2O@OO ®@O

In the finalstep, the character P hasto beinsertedin thestring x at position 6.

Afterthe third andfinal edit operation (insertion) the status of the string is :

X= J LO MN P
Llib
9®@OO®O @

Sothe Edit distance betweenx andy is :

d(x, y) = Numberofdeletions + Numberofinsertions


= 14+2=3

(ii) Longest Common Subsequence (LCS)

The longest common subsequenceof twostrings x and y is a subsequence of maximum length which appears in
both x andy in the samerelative order, but not necessarily contiguous.

Letusillustrate the conceptoffinding the Edit distance using LCS. method with the sameset of strings as in ‘the
previous method:
x= JK LMN
y=JbtLOMN P

The longest commonsubsequence in x and y = JLMN.

Theformula ofEdit distance using LCS is :

d(x, y) = length ofstring x + length ofstring y — 2 x(length of LCS)


Here,

length ofstring x =
un

length ofstring y
Fn
tt

length of LCS

Scanned by CamScanner
ua
Big Data Analytics(MU) 6-7 Finding SimilarItems

So,

dQ, y) = 54+6-2% (4)


= 11-8

=3

- The distance axioms on the Edit distance maybeverified as follows :


‘i
vity : fas or moreinsertions and/or deletions are
1 gativity To convert onestring into another string atleast zero
Non-negati
necessary.So, the Edit distance can never be negative.

ro distance : Onlyin the case of two identical strings, the Edit distance will be zero.
Symmetry : The edit distance for converting string x into string y will be the same for converting string y into
string x as the sequenceofinsertions and deletions can be reversed.

Triangle inequality : The sum of the numberof edit operations required for convertingstring x into string z and
for converting the string
then string z into string y can neverbe less than the numberof edit operations required
x directly into the string y.

6.1.5 Hamming Distance

Hamming distance between twovectors is the numberof


Hammingdistance is applicable in the space of vectors.
other.
componentsin which they differ from each
following two vectors :
For example,let us consider the

x= 1 0001 1

1 1 1 0 1 0
y

Ibe i did

© ®@ © @ © © Positions

tsat positions 2, 3 and6 aredifferent.


ta nc e be tw ee n thi e above twovectorsis 3 because componen
~ The Hammingdis :
distance may be verified as follows
The distance axi ioms on the Hamming the Hamming
diffe r in at least zero or more component positions. So,
ors will
1, Non-negativity : Any two vect
negative.
distance can never be
stance will be zero.
identical vectors, the Hammingdi
Zero ro dist at ance + Only In the case of two
dist

be the same whetherx is com pared with y ory is compared with x.


Symmetry : Th e Hamming di stancewill rences between z and y
z, plus the numberofdiffe
T riangle ineq jual lity : The number
of differences between x and
f differences betweenx and y.
can neverbeI jess than the
number 0

Scanned by CamScanner
Big Data Anal MU, 6-8

Q.1 Whatis Jaccard similarity 7

Q.2 Whats distance Measure ? Explain different criteria's regarding distance measures.
Q.3 What do you mean by Euclidean distance ? Explain with example.
Q.4 Whatis Cosine distance ? Explain with suitable example.
Q.5 Consider following are the two vectors in Euclidean space X = [1, 2, - 1] and Y=
[2,1,1]. Calculatethecosine distance
between X and Y. ao i
Q.6. What is Edit distance ? Explain with classical method.
Q.7 Whatis Edit distance ? Explain with Longest Common Subsequence (LCS) method.

Qo0

Scanned by CamScanner
Clustering

Syllabus
CURE Algorithm, Stream-Computing AS P “
itream-Clustering, Algorithm, Initializing & Merging Buckets, Ans
wering
Queries

Cluster Using REpresentative i.e. CURE is very efficient data clustering algorithm for specifically large databases.
CUREis robustto outliers.

Traditional clustering algorithm :


point considered asa clusteri.e. clusters centroid
- In traditional clustering, it selects for any one point andit is only
approach.
se
o otherdatapoints of any otherclusters.It works in eclip
- Points in a cluster appearclose to each other comparedt
shape in better way.
to outliers and a
cluste ring algorit hmi s all-po ints approach makesalgorithm highly sensitive
= Drawback oftradit ional
oO
points.
minute change in position of data
ary shape.
nts approach not work on arbitr
~ Cluster centroid and all poi

7.2. CURE Algorithm

spherical clusters.
in s ic al as well as non-
s better ph er
~ CURE algorithm work ge database : sudipto Gul ha, Rajeev Rastogi,
Kyuseok Shim.
algo rithm for I: ar ch.
CURE: Anefficient cluste ring r an all-points or centroid approa
as re pr es ' entative cluste th
ed
h are 5 catter
It prefers a set of points whic eed uP clusteri
ng.
g to sp
sa mp li ng an d partitionin!
CURE uses ra ndom

Scanned by CamScanner
Big Data Analytics(MU) . 72

7.2.1 Overview of CURE (Cluster Using REpresentative)


Data

+
Make random sample
4
Makepartitioning of sample
J
Partially cluster partitions

L
Eliminate outliers

L
Clusterpartial clusters

L
Labeldatain disk

7.2.2 Hierarchical Clustering Algorithm

A ccentroid-based point‘c’ is chosen. All remaining scattered points are just at a fraction distance of to get shrunk

towardscentroid.

Such multiple scattered points help to discoverin nonspherical clusteri.e. elongated cluster.

Hierarchicalclustering algorithm uses such space whichis linearto inputsize n.


— 2 .
Worst-case time complexity is O(n’ long n) and it may reduce to O(n’) for lower dimensions.

CUREalgorithm : CUREcluster procedure


— It is similar to hierarchical clustering approach. Butit use sample point variant as cluster representative rather than
every pointin the cluster.
First set a target sample numberC. Thenwetry to select C well scattered samplepoints from cluster.
The chosen scattered points are shrunk towardsthe centroidin a fraction of « where0 <a <1.
3

ee) 4

Fig. 7.2.1
— These points are usedas representative of clusters andwill be usedas point in d,,,, cluster merging approach.

— After each merging, C sample points will be selected from original representative of previous clusters to represent '
newcluster.
we tool,
Scanned by CamScanner
W.. Big Data Analytics(MU) Clusterin
il fires
r mergingwill be stoppedunt
=
Cluste er is found.

Nearest| _Merge =

Nearest}

O
Fig. 7.2.2
i

7.2.2(A) Random Sampling andPartitioning Sample

g algorithm random samplingis used in case oflarge data sets.


Toreduces ize of input to CURE’s clusterin iency and.
- random samples, it provides tradeoff between effic
Good clusters can be obtained by moderate size
accuracy.
eachpartition get clustered
oning sampl e reduc es time requir ed for execut ion becaus: e beforefinal cluster made
Porti
eliminated outliers.
whenever it is in pre-clustered data format at
elling
7.2.2(B) Eliminate Outlier’s and Data Lab

er in cluster.
nerally less than numb
= Outliers points are ge m each cluster are labelled with data
set
re d, mu lt ip le re pr esentative points fro
gets cluste
- As random sample
remainders. all-points
ed to centroid or
app roa ch fou nd most efficient compar
sc attered point ie. CURE
— Clustering based on m.
itional cl justering algorith
approach of trad
thm)
RE (c lu st eri ing algori
Pseudo function ©' f CU

Scanned by CamScanner
¥ Big Data Analytics(MU) : 7-4 Clusterin
—— 9

1) >dist (wt

Scanned by CamScanner
y :
ata Analytics(MU)

s
Stream computingis usefuli in real time
_ system like count of items placed
on a conveyor belt.
_~ 1BM announce
saanestions to d stream
ees computit
puting '& sy: system in
i 2007, which
i runs 800
it microprocessors and it enables to software
pplicat Get split to task and rearrangedata into
answer.

_- AT1 technologies
iawrlatercy derives stream
CPU boresolve computing
r
tational phical Processors (GPUs) working
i with Graphical wiiwith high performance
irmi with
i i ji

- AT1 preferred stream computing to run application on GPU


instead of CPU.
Ekaboan_©
Keyboard: ‘stdin EDor a!

@stder
Fig. 7.3.1 : Standard stream for input, output and error

in approach to give guaranteed performance evenin worst


- BDMOAlgorithm has complexstru ictures andit is designed
case.
n.
Datar, R. Motwaniand L. OCallagha
~ BDMO designed byB. Bahcock, M.
Details of BDMO algorithm
of two.
ly partitioned andlatet ; summarized with help of bucket size and bucketis a power
(i) Stream of data areinitial
ket may start
buckets are one or twoofeach size within limit. Required buc
(i) Bucket size has few restricti ‘onssize of 48 and so on.
for example bucket size required are3,6, 12, 24,
with sized or twice to previo us
s mostly O (log N).
are rest rain ed in som e scenario, bucket
(ii) Bucket size od etc.
e,, timestamp, numb er of points in cluster, centri
(i) Bucket consists with conten ts like siz
Few well - known algorithm for data stream clus tering ar=:
A b) BIRCH
(b)
(a) small—Spaces algorithm
(a). C2IcM
(c) COBWEB

Scanned by CamScanner
Big Data Analytics(MU; 7-6 Cluster

A smallsize ‘p’ is chosen for bucket wherep is powerof2. Timestampofthis bucket belongs to a timestamp Of most
recentpoints of bucket.
Clustering of these points done by specific strategy. Method preferred for clustering at initial stage provide the
centriod or clustroids, it becomes recordfor eachcluster.
Let,

* ‘p’ be smallest bucket size.

* Every point, creates a new bucket, where bucketis time stampedalong with clusterpoints.

* Any bucket older than N is dropped

* If number of buckets are 3 ofsize p

P— mergeoldest two

— Then propagated merge maybelike (2, 4p»).

While merging buckets a new buckedcreated by review of sequenceofbuckets.

If any bucket with more timestamp than N time unit prior to current time, at such scenario nothingwi
ll be in window
of the bucketsuch bucketwill be dropped.
if we created p bucket then twoofthree oldest bucketwill get merged. The newly merged
bucketsize nearly ay as we
needed to merge bucketswith increasingsizes.
To merge two consecutive buckets we needsize of bucket twice thansize of 2 buckets
going to merge. Timestamp of
newly merged bucketis most recent timestamp from 2 consecutive buckets. By computing few paramet
ers decision of
cluster mergingis taken.

Let, k-meansEuclidean. A cluster represent with numberof points (n) and centriod (c).

Put p =k,or larger — k-meansclustering while creating bucket


nyc, +n,C,
To merge, n=n, +n, c= ny +n,

Let, a non Euclidean, a cluster represented using clusteroid


and CSD, To choose newclusteroid while merging, k-points
furthestare selected from clusteroids.
CSD,, (P) = CSD, (P) +N, (d” (P,c,) +d, (cy, c)) + CSD,(c,)
7.5 Answering Querles
— Given m, choose the smallest set of bucket such that It
covers the most recent m points. At most 2m points.
— Bucket construction and solution generation are the two
steps used for quarry rewriting in a shared — variable
bucket
algorithm,oneof theefficient approaches for answering
queries.

Scanned by CamScanner
Clustering
ata Analytics(MU)

orithm ?
What is clustering alg

What is CURE ?
ter algorithm.
Write procedure of CURE clus
sampli ng and partiton sampling.
Whatis sampling ? Explain random
r orithm.
Write pseudo function of cluste alg
cluster in CURE.
Write procedurefor merging
g?
Whatis stream computin
, stddir ?
Whatis stdin, stdout
m.
Explain BDMO algorith
ering ?
Wh atis bu ck et , howit is used forclust
q.10
g of bucket.
initializing and mergin
a.11 Explain in brief

Scanned by CamScanner
_Link Analysis

Syllabus
rs
PageRank Overview, Efficient computation of PageRank : PageRank Iteration Using MapReduce, Use of Combine
to Consolidate the Result Vector

8.1 Page Rank Definition

product of the Google i.e. search engines,


Google™ is one of the giants in Information Technology. The major
dominate theall other webservices.
m they exhibit is not up to
Before Google’'s search engine there were manysearch engines available but the algorith
the mark. These search engines were worked equivalent to a “web-crawler”.

Web-crawler is the web component whoseresponsibility is to identify, and list down the different terms found on
every web page encountered byit.

This listingofdifferenttermswill be storedinside the specialized data structure knownas an “inverted Index’
‘An inverted index data structure haslisting of different non-redundanttermsandit issues an individual pointer to all
available sources to which given termis related.

— Fig. 8.1.1 showsinvertedindexfunctionality.


Inverted
index

Fig. 8.1.1 : Inverted Index

Every term from the inverted indexwill be extracted and analyzed for the usageofthat term within the web page-

Scanned by CamScanner
4
ig Data Analytics(MU) 82 Link Analysis
Big DS “ ;
fag in aweb
a Percentage Within the given web page According to percentage of usage of terms
“has someusage
Every! term

pagetw! eiranked Tits willbe achieved by firing a search qu ery.


pagein which this term occurs proves
Examp! le : Ifa 8 given t erm ‘x’ is appeared in the Header of a web page thenthe
Ki A

to be muchrelevantrather than the term appeared in a paragraph text.

g.1.1 Importance of Page Ranks

n becomes a crucial activity a5 D2?


in the Information Technology world, the storage and retrievalofthe informatio
generation from every sectorincreases exponentially.
category web pages their arrangement by
Every day if world will face new challenges for managing the different
category,their ranking by searchcriteria etc.
1500 +
150 million web pages and by today we have
According tostatistics in 1998 World Wide Web has around
million web pages.
because every page contains following associated
Itis very muchdifficult to manage these huge number of web pages
parameters:

(i) Number of terms involved

(ii) Category of web page

(iii) Topics involved in a given web page

(iv) Usage ofinvolved topics by other web pages

(v) Quality of web pages etc.


pages
ed as, “A classical method used to arrange the web
Page Ranking : The term page Ranking can be defin
data
according to its objective and th
e usage of termsinvolved in it on the world wide web by using any link
- 7
structure.”
part
page and sergey Brin. This page-Ranking mechanism wasa
~ The page Ranking mechanism was developed byLarry .
a functional prototypein 1998
which was started in 1995and result into
of their research project
the Google.
- After that, shortly they founded

8.1.2 Links in Page Ranking

Ginn webpa gesexists in the somepart of world wide


web then all pages may
If we considerthat, there a re 150 million
b page-
1.7 billi on li nk s to different we
have approximately
Example th
Cinagiven domain of websites. They have intercon‘ nectionlinks between n them as
Suppose wehave 3 pages A, By
Shown in Fig. 8.1.2
~~ a

Scanned by CamScanner
Big Data Analytics(MU) Link Analysis
83
Webpage A

Webpage B

Fig. 8.1.2
The numberoflinksexists between two or more webpagescan becategorizeasfollows:

1. Backlinks

2. Forwardlinks

1. Backlinks
With reference to Fig. 8.1.2 A and are the Back links of web page‘C’ i.e. Backlink indicates given web pageis
referred by how many numberof other web pages.

Forwardlink

Forwardlink represents the fact that, how many webpageswill be referred by a given web pages.

Clearly,out of these twotypesoflinks backlinks are very important from Ranking of documents perspective.
‘A web page which contains numberofbacklinks is said to be important web page and will get upper position in
Ranking.

A page Ranking in mathematical formatcanbe represented as,

Rw =e Ve>By RM
Ny

Where,
: Represents the web page By
Ny: It represents numberofforwardlinks of page v.
C: It represents the Normalizations factor to make

IRhy = 10 Ri) =IR) +R, +... RD

A world wide web can be considered as the ‘Di-graph’ i.e. Directed graph Any graph
‘@’ is composed of tw?
fundamental componentsvertices and Edges.

W Kaenielt
Scanned by CamScanner
Big Data, Analytics(MU) Link Analysis
54
Gs,

| Lie

; ; Vertices/nod
Here,vertices or Nodescan be mapped to Page: .
Ss
o Ifwe consider a small 7
| Part of worl d wide web containi ng 4 web pages namedas P,, yy Pz,Par P3, Pa4 -
s and forward links to othe r
o Every page i has Back link pages.
o Fig. 8.1.3 showsthe above mentionedstructure.

Fig. 8.1.3

P,, P;and P, respectively.


In Fig. 8.1.3 page P, has Forwardlinks to page
3.
o Page P,has links to page and pageP

o Page 3 haslink to page 1 and


and page 3.
o Page 4 haslinks to page 2
web page Py haslinks to page
P,, P;and P,
h page P, in above
© Ifauserstarts surfing wit

Fig. 8.1.4
to 1/3.
oO
at P a B e P , / P3/ p, is equal
user W! ill be
© Probability that
2 itself is ‘0’.
user will be at page
© Probability that
then ©
oseu se! h a s ch os en page Pz .
© Supp
is 1/2, i
tha t use ! wi ll be at PABE Py
© Probabili ty ‘
be at page Pa is 1/2.
© Probability that user will
° Probabililiity that use! r will be an be represented using special structure known as “Transition
b surfing bya
These possibilities of we
Matrix’. ed of 1" pages ‘n’ rows and ‘n’ columns. Twopointerc andj will be to

n
In general, the transitio s Fr
row and colu! mn
Tepresent the current eee
Pe!

Scanned by CamScanner
Big Data Analytics(MU) 8-5 Link An;

Anygiven element can be represent as mj.

mj = 1/kif and onlyifthe page at j* column has forward links.


Additionally one of the forwardlinks to samepageitself.
The transition Matrix for above web can be represented as,

A BC D
A 0 12 1 «0
B 130 «O 2
M =
Cc 13. 0 O 1/2
, pL12 0 0
Matrix should be seen column wise Example 2
v2

Fig. 8.1.5

~. The transition Matrix for above graph can be represented as,

xX Y Z
x V2 2 0

M =Y 1/2 0 1

Z 0 1/2 0
x V3
y] =/13
Z. 1/3.
i -|i2 12 i [3]
1/2 v2 0 1/418 ...For first iteration
V6. o 12 o1 Li ae
[is] [i 1/2 i [|
V3} =/12 0 1] 172 iterati
\d iteration
14 o 12 of Liv6, .--For
Eee
Hence, with simplified page Rank algorithm critical problem has evolved ie. during eachitera
tion, the loop
accumulates the rank but neverdistributes rank to other pages.
To identify the location at which theuser will in near future one must have a Probability with a specialized function
knownas “A page Rank”.

All Transition Matrixeswill work on column vector orj” component.


Considera user “xyz” wantto surf the web consists of ‘n’ web pages. Every Page in a web has equal probability that
user willvisit that page at next instance.

Scanned by CamScanner
Link Analysis
pig Data Analytics(MU)
e a/n with
page will b
th
vector. component. The probability that the user will be at"
an
Consider a vector V, as 1

sam e initial vector V,. seation ¢ be


7 i
distribution of this situation can
__ At next instance, user will at one of ‘n’ available pages. The probability
represented as, MVgon next instanceit will be M(M.V,) and process continues.
o vectors
_ The probability that a userwill be present at i node on giveninstance is equal to elements position int
value.
Probability (x) = m;:V,

where,

(i) m, represents the probability of user movement atgiven instance from j" location to i™ location.
(ii) Vjrepresents the probabilities that useris at j" position for previous instance.

Traditionally, this process is known as “MarkerPrinciple”


~ Tohave Markovdistribution a system of graph should satisfy following constraints.
(i) Graph under consideration should be “strongly connected”i.e.

Every nodeis accessible other available nodes.

(ii) There should not be any dead ends.

8.1.3 Structure of the Web

The nodescan be also termed as


The webis nothing but the composition of number ofindividual independent nodes.
ed systems together.
workstations. We can imagine the webas setof different distribut
- Inevery distributed systems we haveclusters, cluster means a group of similar objects for the clustering in the context
a given distributed system is the
ofdistributed systems, the term object can be replaced with nodei.e. A clusterin
collection ofsimilar nodes.
d by analyzing the concept suchas :
~ Additionally thetermsimilarity can be determine
figuration
(i) Nodes having same hardware con
n
tem and other system and application software configuratio
(i) Nodes having same operating sys
es in w' eb should always be connected. This can be achieved theoreticall
~ Itis always recommendedthat, all the nod 'y
the cas e alw ays .
butpractically this is not
ld Wide W eb where, every nodeis connected with other nodes.
~ Fig. 8.1.6 shows the part of Wor

Scanned by CamScanner
¥ Big Data Analytics(MU) 87 Link Analysis
ES,
In practice any given webstructure is composed of 4 types of components :

1. Strongly Connected Components (SCC)


2. In- components

3. Out components

4. Disconnected components
. . ther for for tl the
1. Astrongly connected components Is nothing but the components whicharedirectly connected to each other
data exchange andtheyalso has forward and backwardlink to each other.
2. In-components : In-components are the integralpart of where it exhibit the relation with SCC suchthat,

Not recognized
from SCC
Fig. 8.1.7
3. Out-components : Out-components are the structures which shows following properties.
Reachable from
scc
®———————>.” Out components
Not recognized
to SCC

The in-componentand out components can have tendrils which represents in and out components.
Tendrils Tendrils

(A)

Fig. 8.1.9

In real time two majorissues wewill encountered:

(i) Dead end (ii) Spider traps

Scanned by CamScanner
if
rix is known aS ‘stochasticity and
The property of having sum = 1 for Most
e columnsin given transition mat
of

there are dead ends then someof the ans have ‘0’ entries.
ConsidertheFig. 8.1.10.

Fig. 8.1.10 : Node 3 is dead end


searching if we encountered on the
g Fig. 8.1. 10 we cam e to kno w that, Mode 3 is des ad end and while
By referrin Node 3.
asthere is not a single out link from
web surfingwill struck at that pag' e or node
Node3.i.e. at dead end then
Fig. 8.1.10 is,
Hence, transition matrix for 17200
o
138 0 O 12
M =
130 O 17

2 12 0 0

ends :
in g ar e th e wa ys to deal with dead ming links.
Follow that nodeby removing their inco
n delete
with dead el nds we ca ed th e same approachin
First approa' ch to deal mor e dea d ends whic! h has to be solv wi th
it will int! roduce
an ta ge o f th is approach is
Disadv
Nodes which are
recursive manner. a giv en gra ph to or web will be keptas it is the
total page rank for r th e calculation of
we del e te th e node but e set of oth er no des which acts as predecessors fo
Tho ugh consider th
‘e ee phG , bu t we, can
notavailable in gr!
page rank. ‘or nodes and h
8
consid!
Additionally we can
r th is pr oc ed ur e, some node:
Afte
ulations.
predecessorscalc Pa
io ns all ni odes has their
After some iter at
node deletion order.

Scanned by CamScanner
- .
Link Analysig
. 5 i
— Suppose we have graph containing nodes and these nodes are arranged in following ™: anner as shown | in
Fig. 8.1.11.

Fig. 8.1.11
If we observe the Fig. 8.1.11 to calculate the Page rank. Wefin
d that, Node is the dead end asit doesn’t have any
forwardlinks i.e. the links going out from
Node 5.
— So hence, to avoid the dead ends, delete
the Node5 andits corresponding are coming from Node 3. So now
the graph
G becomes.

Fig. 8.1.12 : Grapha after deletion


of node 5
— By observetheFig. 8.1.12 we cameto know that
now ‘node 3’ is ‘dead end’.
— Now as weare avoiding the dead ends. Hence delet
e ‘Node 3’and it is res
pective in coming edges.
becomes, Now, Graph G
,

Scanned by CamScanner
above graph will be,
the transition matrix for

if 0 120
M =| 120 1
12 12 0
lows =
we can have componentvector representation for above matrix as fol
18
3 ~ (1) Iteration 1
18
1/6

3/6 —(2) Iteration 2


- 26
32
Ssi2 — (3)Iteration 3
4n2
2/9

Final value for componentvector will be, 4/9


3/9
olw CIA oly

Pagerank for node 1

Page rank for node 2


I

Page rank for node 4 =

> We have to calculate the page rank for Node 3 and Node 5 with the exact opposite order of node deletion. Here
of predecessors.
Node 1, Node 2, Node 4 are in therole
rank of Node 3
~ Numberof successor to Node 1 = 3, Hence, the contribution from Node 1 for calculating the page

is 1/3
Node5 for calculating the page rank of node 3 Te
For Node 5 it has 2 successors. Hence the contribution from
X%9) + 2*9) =54 4,2) +(4x3) -8
Page rank of Node 3 =(3

For calculating the page rankof Node 5, Node 3 plays @ crucialrole. As Node3 has, numberof successors = 1 and a
of Node=5 node
Thas node 3 asits predecessor. Hence, we can conclude that Node 5 waspagerank same as that

As the aggregateof their page rankis greater than 1, so It doesn’t indicate the distribution for a Biven ;
user whois.
Surfing through that web page.Still it highlights the Importance of web pagerelatively.

Another Way to deal with dead endsIs configure the process for a given user by having ——
a “Taxation”. n that it is
“sumed to be moved through web known as
Is knownas “spider traps”,
Texation methodpoints to other problem also which
~N

Scanned by CamScanner
8-14 . Link Analysis
wv Big Data Analytics(MU)

(i) Spider traps


outlinks but they never goingto link with any
Spider traps is nothing butset of web pagesall of them containing the
otherpage.

ie. Spider Trap = Set of web pages with no dead ends but no edge goingoutside also (no forward link)
traps in realtime in a
Spidertraps can be sowed in the web with or withoutintention. There can be multiple spider
8.1.14 which showsthe part of web containing only
given webpage.Set but for demonstration purpose, consider Fig.
one spidertrap.

Fig. 8.1.14 : Graph with one spider- trap

The transition Matrix for the Fig. 8.1.14


0 2 0 0

13 0 O 1/7
=
"

138 0 O 1f

1/3 122 0 0
heultimate
If we proceed further by the same method stated in previous section for calculating the page rankthent
result that wegetwill be,
1/4

1/4
— (1) Iteration (1)
1/4

1/4
3/24
5/24
— (2)Iteration (2)
11/24
5/24
5/48
148
~ (3) Iteration(3)
29/48
MA8

Scanned by CamScanner
i Analytics(MU)
8-12 Link Analysis
a =.
21/288
31/288
205/283 ~ 4) Iteration (4)
31/288

At last iteration we get,

1
0
Thehighest page rankwill be given to Node3 as, there is no link which goes outfromit, but if has thelink to inside it.
So, user is going to struck at Node 3. As it is represented that numberof user are there at Node 3 so Node 3 has
greater importance.

To have a remedyto this problem we just configure the methodof calculating the page rank by injecting a new
concept known as‘teleporting’ or morespecifically a probability distribution of “teleporting” and we are not following
the links going out from given node.

To calculate the teleporting probability — calculate new componentvectorV,.,, for estimating page ranks suchthat,

Voew, = BM: V+(1-B)-e/n


Where,

B = Constant (Ranges from 0.8 to 0.9)


1
e = Aggregate vector ofall vectors with value
graph
n = Number of web pages/ nodes ina

If we do not dead endsin the graph then,


user _ Probability of not to choose out-link for the current page
Probability of introduction of New by same user
to a given web
entof Dv.
Anotherpossibility is user wil II not be able to move to any page as (1-B)e/n termis independ
ed with dead end s then,
ie. when we don’t encounter

Xv<1 but XV=0.

81.4 Using Page Rankin a Search Engine


e,
deciding the overall efficiency of the search engin
The pagerankcalculation plays crucialrole in
A basic crawling is done to have PaBe ranking tofetch the required information and the page,
When a user submit some query OF 3 request’ Seeeal abiieaderaticht an & secret algorithm
on a predefined criteria.’is
is based
triggered for the execution which fetches different web pag of if words *
of single word or collection
The user query generally in the form
a
teations

Scanned by CamScanner
8-13 Link Analysis
Big ig Data Analytics(MU)
iB the fetched web
riteria to arrange
— For exam) ple, the most popular search engine Google has 250+ such predefinedcriteria
pagesin some particular order. v rch
ae of user’s search query.
~ Every pageonthe web should possess minimum one wordor one phraseinit. Same asthat
s 5 i it a page will havet
— If the given web page doesn’t contain any word or phrase then thereis less probability that pag he
highest page rank.
Svanc
also m atters eomliwil
e.g. / ew
In page rankingcal culatiion, the place on the web page whe! re the pl phraseis appeared is
a ; the phrase appeare!
(phrase appears in the headerwill have more importance than in footer,
average importance)

8.2 Efficient Computation of Page Rank

— In previous sections we have studied that, how to calculate the page rank of the given webpagein a given web
structure.
— The efficiency in such complexcalculation is achieved as we have taken a small part of web i.e. for 4-5 nodesor pages
only. 7
— But,if scale-out this small for concept to a real-time condition billions of web pages V, a matrix — vector multiplication
we haveto computeoforderatleast 70-80 times till a component vector will stop changing its value.
— Forsuch real time complexity the solution proposed is use of Mapreducetechnique studied in Section 3.2, but such
usageis notthat straight forward,it has two handles to cross.
(i) The most important parameter that how to represent the transition matrix for such huge number of web pages.
If we try to represent the matrix for all available web pages whichare underconsideration then it is absolutely in
efficient for performing the calculations. One wayto handle this situation is to indicate the non-zero elements
only. .
(ii) One morethingis if we go for an alternative to mapreducefunctionality for performance and efficiency concerns
then we maythink for ‘combiners’ explained in Section 3.2.4.
— The combiners generally used to minimize the data more specifically an intermediate data result to be transferred
to
reducer task.

iit
KiVy KpV2 KoVo KyVg KoVy KVe

(Ki list(Vys Vor Vg) (Ke,list (Vp, V4, Vg)

\_Z
Fig. 8.2.1: Use of combiners

Scanned by CamScanner
Link Analysis
814
«Date. Analytics(MU)
ch more impact to reduce the effect of “thrashing”
iso, the striping concept doesn't have mu ‘
postin of env iro nme nt use d i.e. Distributed computing
U itself irrespect ive
All ©! putations performed by the CP / CPU is is to execu! te the
instructions
ti
DCE) or standalon e computing. Hence, the main task of processor
environment (DCE)
ta from the secondary storag
e. her
e featuring of da ry storage rat
and ni ot th
y in just fetching the data from seconda!
tuation processor /CPU is bus
ifin some commonlyoccurredsi
ionis knownas “Thrashing”
than executing the instruction then suchsituat
n Matrix
24 Representation of Transitio of links going ou
t from
and num be! r
deal with arebillions
web pages that we are going to
‘As we knowthat, number of
10 on an average. web
agiven web pages are
ind ica te the tra nsi tio n mat rix is to have a list of different
o. The best way to
entry ‘1! in billion pages is not zer
associatedvalues.
page which has entries ‘non-zero’ with
ke,
Thestructure will lookli
aro ontty,

~--» 4 - bytes —> forinteger


inates
co - ord
»8 - bytes —> for value
with double-precision

bytes
Abytes + 4 bytes + 8 bytes 16

Fig. 8.2.2

d of quadratic.
has linear nature instea
So, space required here :
1/number of oflink
links goin, ig
wise representation of nonzeroentriesi.e. . out
Wecan apply more compression by column ;
from a given web page
Acolumn is representedas ,
degree
linteger > to represent —? out
n-zero entry in tha t column
integer > torepresent — for every no

L
ry lo cation
yields a row number ofent

Scanned by CamScanner
8-415 Link Analysis
¥ Big Data Analytics(MU)
ectors depicted by V and View
Asingle pass of pagerankcalculationincludes calculation two componentv
View = B:M-V+(1—B):e/n i
Where, B = Constant (Ranges between0.8 to 0.9)
€ = components vectorofentries 1

n = Numberof web pages

M = Transition matrix
When‘n’ has small value then V andV,,.,, can be stored in primary memory or main memory for Map task.

If in real —timeV is big in size so that if can’t befit into main memory then we can gO forstriping method.

8.2.3 Use of Combiners to Aggregate the Result Vector


The pagerankingiteration with MapReducetask is not proven to be sufficient because '
at Map phase
(i) If we wantto adddifferent terms V,.wi-e. the i element of new resultant vector V, computed
combines the different values
This is equivalent to the usage of special structure “combiner” which actually
according to their key as shownin theFig 8.2.3.

(ii) If we are not going to use MapReduceatall.

Hence depending on the requirement situation and complexity of problem a method to be used should be decided.
Combiner

Generation
of key-value <kKy, Vy> <kj, vp
pair <ky.V5>

Combiner +—

Fig. 8.2.3 : Combiner working mechanism

Scanned by CamScanner
= ral nk question arises thi i
when page : giants
en the mation
in informat i. technology such i
as google will develop som ——
tot. Additionally there are somesecurity related issues such as “The spams”
gut :we do have
‘i destructive minded people
: in society whowill
i alwaystry -
to affect the system by performing seme
malicious activities. Hence for page ranking calculations “spammers” are catia into existence.
spammers haveintroduce thetools and techniquesthrou
; oe h which
i f i i Kk can be ;increas en
aselected multiple such intentional rise in the value of care rankis ae “ Pele m
“_ forlink spam spammersintroduce the webpagesitself forlink spamming.

93.1 Spam Farm Architecture

- The malicious web pagesintroduced by the spammersis known as spam farm.Fig. 8.3.1 showsthe basic architecture
of spam farm.

Sensetive
pageS Targated web page farm
forlink spam

Basic architecture of spam farm

Scanned by CamScanner
W_Big Data Analytics(MU) 8-17 Link Analysis
— Ifweconsider the spammersperspective then Fig. 8.3.1 can bedividedinto 3 basic blocks

1. Non-sensitive 2. Sensitive

3. Spam farm

1. Non-sensitive

Thesearethe pages which are generally not accessible to the spammerfor any spammingactivity. As these pagesare
not accessible to spammersso, theywill not affect by any activity performed by the spammer.

Mostof web pagesin a given webstructure will fall in this category.

2. Sensitive

Theseare the web pages which are generally accessible to the spammers for any spammingrelatedactivity. As these
Pages are accessible to the spammersso, they will get affected easily by spamming activity performed by the
spammer.
The effect of spamming on these pagesis generally indirect as these pages are not manipulated by the spammers
directly.

3. Spam farm

The spam farm is the collection of malicious web pages whichare usedto increase the numberoflinks pointed to and
coming out from a given web pageso, ultimately the page rank of a given pagei.e. target web pagewill increase
dramatically. There are other category web pages which supports the spamming activity by aggregating page ranking
i.e. a part of term (1 - B).

8.3.2 Spam Farm Analysis

In spamming activity we do page rankingcalculation by consideringsensitive, nonsensitive and


actual malicious spam
farm pages.
— The basic page rankingcalculation is done and byalternate method ‘B’ term is calculated which is also knownas
Taxation method.

The term ‘f’will depicts the fact that, how a part ‘of Page rankis segregated amongthe successor nodes for the
next
iteration. Actually B is the constant term ranges between 0.8 to 0.9 generally (0.85).
— We knowthat, there are some web pages whosupports the spamming activity.
- Soa pagerankof oneof such supporting pagecanbe calculated with thehelp of following formula
Pris) = B-Y¥/m+(1-8)/n
Where,

Pr(s) — It representthe page rankof randomly selected Supporting web page.

6 > Itis the constantrange from 0.8 to 0.9.


Y —Itrepresent page rank oftarget web pagesay ‘t’.

m — Itrepresent numberof web pages supporting spammingactivity,


n__— it representtotal number of web pages in a given web Structure,

Scanned by CamScanner
g pa ge with B multip|
(ii) page rank of supportin ‘iples.

B- (Pr (s)
where, Prd) = B-Y/mt(1-fyn
» We can concludethi at the page rank‘Y’ rt of target webpage‘t’ will bein the e fform o' if

Y =x+B-m(Bt,+28)
m” oo
_= x+B 2 y+B-(-B)xq
ma

a constant Q,
Here we can introduce

nk Spam
8.3.3 Dealing with Li page rank system.
fu nd am en ta l pr im ary things related to
on the
ffect oflin k spam h the link spammingthe
different search
- Aswe have seen thee ly he ncce
e to dea l wit
ystem complete
wil l di st ur b th e pag! e rank 5) izing the effect of lin
k spam.
- Link spam he lp in minim
which will
erent solu! tion,
engine thought of diff m ing they are
as follows:
with l i n k s p a m
io nd eventually
call y the re are t w o ways to deal wi ll ha ve to p- view of whole scenar a
- Basi m
h engine algorith ructure.
c! ini dexing st
() A traditional approa them from the
link
k spams and find the alternate way to do the
algorithm win find suchlin r w e b p age the sp
al mmerwill
pamme
m de le tes the s
But as soon as al go ri th
nk with reference to
ure for cal culation of page ra
spamming. ify the proced

(i) A modern way to dea}


below gradelink spams.

() Trust ranking

(ii) Spam mass ming that tt those web pages are


not the
pages by assu

.
\
In trust ranking the system !5
Part of spam farm.

Scanned by CamScanner
Big Data Analytics(MU) 19 Link Anal
— Such set of web pagesis termed as “topic”.
Consider a spam farm page wantto increase a page rank of trusted web page. So, spam page can.have a link to
trusted pagebut that trusted page will not established a link to spam page. .
(il) Spam mass

— In spam astechnique, the algorithm of page rankingwill calculate the page rank for every web page also the part of
thepage rank (affected part) whose contributor is spam page will be analysed. This analysis is done with the helpof
comparison between normalpage rank and pageranking obtainedthroughtrust ranking mechanism.
This comparison can be achieved throughfollowing formula :

pr(Sm) ==Pa
Where, P((S,) = page ranking by spam mass technique
P, = page ranking bytraditional method
P,t, = page ranking bytrust ranking method
If P.(s,,) < 0 i.e. negative

Or
Pr (s,,) > 0 but < 1 i.e. not close to 1 then that pageis not a spam pageelseit is a spam page.

8.4 Hubs andAuthorities

— The hubs and authorities is an extension to the concept of page raking. Hubs and authorities will add more
preciseness to the existing page rank mechanism.
= Theordinary,traditional page rankalgorithm will calculate the pagerankforall the web pagesavailablein a given web
structure. But user doesn’t wantto examine orviewall of these web pages. He/she just wantfirst 20 to 50 pages in an
average case. :
— Hence,the idea of hubs-and authorities will cameinto existence to haveefficiency and reduce workload calculating
page rank.
|
|
— Inhubs and authorities page rankwill be calculated for only those web Pages whowill fetch in resultant set of web }
i
pagesfor a given search query.
= Itisalso knownas, hyperlink induced topic search abbreviated as HITS.
— .The traditional pagerankcalculations have single view for a given web page. But hubs and authorities algorithm will
have twodifferent shadesof viewsfor a given web page.
1. Some web page has importance as theywill present signification information of given topic so these web pages
are knownasthe authorities,
tion of any randomly selected ‘spits nel <
2. Somewebpages has importance because they gives us the informa
theywill direct us to other web pagesto collect more information about the same. Such web pages known as
hubs.

Scanned by CamScanner

o5(MU) Link Analysis
r
1 rormalizing Hubsand Authority
a
af a page can be viewed.
section hubs andauthorities these are the two shades with which a web
i earlier
stated in
a given web page.
so, we can allot,2 typesofscoresfor

Hubbiness Authority
score score

How much good


!
How much good
:
aweb page in aweb page in
hubrole. authority role

Fig. 8.4.1
score
— represents hubbiness
h
ity score
a — represents author
. s of j" page-
th ge s ‘h’ wi ll gi ve measure of Hubbines
t b pa
> }"componen of a we
mea: sure of author
ity of j* page.
a web page ‘a’ will give
J" component of
~ jr
web.
pages in agiven
‘LM’ for web
ider link matrix
> Tohave thenotion of‘h’ and ‘a’ cons
resent as LMij
~ Any elementof LM can be rep abl ished from i" pageto j”
page.
LMij = 1if ali nk is est

LMij =09 otherwise

ange the result :


The transpose of LM will ch ae
ose of LMij om j" page to i* page
++ LM, represent transp
. T
ished fr
T _ 1 ifalinkis establ
T .
--. LM, = O otherwise
ng and outgoing
. tion matrix which maintain
the record of No. of incomi

Me
= M wh er e is a original trans
i lore that um ”
inks,
ks
difference between LM’ and Ms,

Scanned by CamScanner
jig Data Analytics(MU)

TEGO

ey
ee
core
coor
Given Matrixis- A =

Apply Transpose operation,

REE
ooo
-Hoo
Hoon

Now, Considerinitial Hub score as 1


-_mOoe
Hoo
“oon
bene

afa?at?42244?)
1
Ree

a? + 12 422 444)
2
"

Var +12 +22 44%)


ee
L (12 +12 +2? + 42)
0.2132
0.2132
0.4264
Lo.8528.
Iterate over K.

Q.1 Whatis Page Rank ? Explain the Inverted Index 2

Q.2 Whatis Page Rank? Explain Importance of Page Rank?

Q.3 WhatareLinks in Page Ranking? Explain in Detail.

Q.4 Whatare links in Page Ranking? Explain Back Links and Forward Links with suitable example?

Q.5 Explain the Structure of Web in the context of Link Analysis?


ents?
Q.6 Explain Structure of Web? Whatis thesignificance of In-Components and Out Compon
saieel
Scanned by CamScanner
- —<$
gig Data Analytics(MU) Link Analysis
ctu , b? Explai‘ n the Dead ends
at pi_nygs scuss iinn detail Stru re of we

8 Explain Structure of
Web? Explain Spider trap in detail
all,
4
ranking in search engine?
09 explain the role of Page
d in efficient computation of Web
a0 Explain the different modification suggeste leb pages.

the Page ranking Mechanism?


ott Whatis thrashing? Howit affect
use of Combiners.
0.12 Explain iterating Page Rank Process with MapReduce? Also commenton
iners.

? Explain in Detail.
0.13 Whatis Link Spam
ecture in detail.
0.14 Explain Spam Farm Archit
farm.
t on Non-Sensetive,sensitive and spam
n with neat diagram? Also commen
0,15 What is Spam Farm explai
rm Analysisin detail.
0.16 Explain the Spam Fa
and Spam Mass.
to de al wit h Lin k Sp am with Trust Ranking
? How
0.17 Whatis Link Spam
its Significance.
Authorities? Explain
0.18 Whatis Hubs and
aoa

Scanned by CamScanner
= a.
Module - 6

Syllabus

A Modelfor Recommendation Systems, Content-Based Recommendations,Collaborative Filtering

9.1 Recommendation System

— It is vast widely used now-a-days. It is likely a subclass of information filtering system. It is used to give
recommendationsfor books, games, news, movies, music,research articles, socialtags etc.
— It is also.useful for experts, financial services,life insurance, and social medialike Twitter etc.
— Collaborativefiltering and content-based filtering are the two approach used by recommendation system.

~ Collaborative filtering uses user’s past behaviour and apply somepredication about-user maylike and accordingly post
data. ,
— Content basedfiltering uses user's similar properties of data preferred by user.
— By using collaborative filtering and content based filtering a combine approach is developed i.e. Hybrid
recommendation system.

9.1.1 The Utility Matrix

‘A recommendation system prefers the preference of a utility matrix. Users and item’s these are entities used by
recommendation system. .
erences must be observed.
Users have preferenceto data andthese pref
some item category.
Every data itself is part ofutility matrix as it belongs to
Example : A table representing users rating of apps on scale 1 to 5, with 5 as highest rating Blank representsthat
usernotreplied on scale A1, A2 and A3 for Android 1, 2 and 3 i1, i2,.13 for iOS 1, 2, 3 users A, Band C givesrating.
Al A2 A3 i121 i2 1B

A|3 4 5 4

B 5

c}3 4.4 4

Fig. 9.1.1: A utility matrix representing ratings of apps on a scale Ito 5

Scanned by CamScanner
.

yo
s
9-2 Recommendati o!on Systems.
dati
very minutefraction of real =
ation of Android
it
typ ical users rating are al sci
enario if we consider actual number ofrapplic
~ and 105platform and number ofusers,
itis observed in table for someapps thereis less numberofres ponses.
matrix is to make some Predictions for blank spaces, these prediction are useful in
The 6 oal behind utility
.
recommendation system
.
‘A
As ‘at user gives rating 6 5 to i2 APP So we have to take in account parameters of app i2 like its GUI, memory
s if applicable etc.
consumption, usability, music/effect
nilarly ‘B’ user gives rating ing 5 to A2 app so wehaveto takesimilar parameterin consideration. By judging both apps
similarly 8
user A and B.
i2, A2 features and all we canputpredication what canbe further recommendedto

of fll rating anywhere stil it can be judge and predicted what kind
from user “C” response though there is no use
feature based appuser‘c’ should be recommended.
tems
91.2 Applications of Recommendation Sys
Amazon. Com

- CDNOW. Com

=. Quikr.com

- okx.com

- Drugstore.com

- eBay.com

- Moviefinder.com
endation system.
d seller/bu yer, trading website uses recomm
~ Reel.com and so manyonline goo
solidate in a single place
e Rec omm end ati on, New sAr ticles etc. are likely to be con
~ Product recommendation, M jovi
applications.
Recommendation System
41.3 Taxonomy for Application
Community
inputs
(history, attribute)

Scanned by CamScanner
¥ Big Data Analytics(MU) 9-3 Recommendation Systems
9.2 Content Based Recommendation

It focuses onitems anduserprofiles in form of weighted lists. Profile are helpful to discover properties of items.

9.2.1 Item Profile


- Anactor of drama or of movie is considered as an actor set, few viewers prefer drama or movie bytheir favourite
actor(s). ,

— Aset of teachers, some students prefer to be guided by few teacher(s)only.

— Theyearin which songs album release or made. Few viewers prefer old songs; somepreferto latest songs only, users
sorting of songs based onyear.

— So manyclasses are available which provides some data.

— Few domains has commonfeature for example a college and movie it has students, professors set and actors,
directors set respectively. Certain ratio is maintained as many student and few professors in quantity while many
actor works under one or two director guidance. Again every college and movie has year wise datasets as movie
released in a year by director and actor and college has passing student every year etc.

— Music (song album) and a book has same value featurelike songs writer/poet, year of release and author, publication
yearrespectively.
Sr. Mig. Package Contents
e idl :

Productwith .
feature

Community fo Lst of
"data recommendation

Users
(source of profile and contextual data)
Fig. 9.2.1 : Recommendation system parameters

9.2.2 Discovering Features of Documents

— Document collection and imagesare the twoclasses of items.

Weneed to extract features from documentsand images.

Scanned by CamScanner
Systems
9-4 Recommendation
S
l
Let say news articicles
S per but user
There are many articles in a newspa
inds of docu ment . in a ney
here many wspaper.
S reads very
few of them.A reco mmendation
es to a us er su pp os ed to be interested to read.
est for art icl 2
ny past
websites anindd websy s,subigg :
geem
simila rly th er e ar e so ma ; , blo gs could be reco
i

nenia n r in eerestes un:


f wecan classify blogs accordii gly to the topi s.
x

twill be morefriendly to usersi


‘it
cordin
“1
pic
tely, docum: entclasses cannotprovide le available in fiformationfeatures.
fortunately,

substitute is used for wordidentification on which characteri: terize topic of document.


by removin; g repeatedly used common words ie. elimination of stop word
s. The words
We ne ed to sor t do cu me nt
of
ain after elimination stop wordsare proceed further to counttheir TFi.e. .e. term firequency.
tet rs
umer fre ntis observed and calccula
ulatted.
ed
i.e. inve
IDF ie. invers' jocumentfrequency of each word in a docume
rse Doc
nt.
ofcharacter which characterize documeent
The high scoring wordsarethe group
are fixed to all similar
doc um: i est TF and IDF values of count. Then those n words
it high
- , n words found
Let,
in ent with
documents.
of feature set.
tfoundsthreshold becomesa part
Words whoseTF, IDF values of coun be
nts, there distance need to
a doc ume nt. To kno w simi larity between any two docume
such few words set represent
be done by either
measured betweensetsit can
(b) Cosine Distance
(a) Jaccard Distance

ures from Tags


92.3 Obtaining Item Feat e, Compositor,
ai re ava ila ble like Titl e, ISBN,Edition, Printing,Pric
base many features
~ Incase of publishe d book data title and price tag mostly
.
(r ea de r) of bo ok main ly concern with
“Editor, Copyright et c., but user or search value range for pric
e.
it em s by ta g it ems by enteri ing phrase
s of'sa me
- We can get number offeature e of tag valu! e like price ran
ge for book, colour
on featur
ai la bl e, us er s can search at item
- By keeping tag option av
pping.
shade for cloth etc in online sho r level.
enough tag awareness at use
mi n ta g is to cr ea te tags an d such
~ One proble

92.4 Representing Item Profile


Android OS
ra ti ng fo r mo bi le application based on
we might take th e
num eri ic al fo r instance
Some features are
Platform to be a feature. rs of it.
sta rs by some usel
This rating j real number like 1,234 a n d 5 ible average rating of
a wi ll n ot make a sense to get poss
ating is ar or 5 s t a r s
. mp o n e nt like st
By ly one co
'y keepéing options of on . cture implicit in
e observed withoutlose of stru
a
n application,
BY ken; sible averae
pos
eping 5 options in rating 4 good senting to items.
pre:
numbers, on ® nt of vector re ical features
‘ @ sin gle comp me s co m| ponents of vectors. Numer
Nu meri value based rating provides ter beco
tical paral me
yalued
real
Boo_ value and other
S integer Val ued oF Tech!
Publications
bout similarity of items.

Scanned by CamScanner
9-5 Recommendation Systems

9.2.5 UserProfiles
Bestia
Be
ted with the help of
Vectors are useful to describe items and user's preferences. Users and itemsrelation can be plot
utility matrix.
rating in 1-5 range.
lity matrix has some nonblank entries that are
Example: Considersimilar case like before bututi
) got
t
Consider, user U gives responses with average rate of 3 there art e three applications (Android OS based games
rated average of
ratings of 3, 4 and 5. Then userprofile of U, the component for application will have value i.e.
3-3, 4-3 and 5-3 i.e. value of 1

On otherhand,userv gives averagerating 4. So user v responses to application are 3, 5 and 2.


Theuserprofile for v has in the componentforapplication,the average of 3-4, 5-4 and 2-4,i.e. value — 2/3.

9.2.6 Recommending Items to Users based on Content

Between user's vector and item’s vector cosine distance can be computed with help of profile vectors for users and
items both.

It is helpful to estimate degree to which user will prefer as an item (i.e. prediction for recommendation).

If user’s and response(like 1 to 5 scale for mobile apps)vectorscosine angleis large positive fraction.It meansangle is
close to 0 and hencethereis very small consine distance betweenvectors.

If user’s and responsesvector cosine angleis large negative fraction. It means angleis close to degree of 180 whichis
a maximum possible cosine distance.

Cosinesimilarity function = for measuring cosine angle between twovector


v.V.
2 ested
Cos® = TVITVal
In vector space model

V, [W, 4, W,4, ... WNd]™


Where,

Wy: TF* IDF weight of term “¢ in‘d’ document

TF: Term Frequency

IDF : inverse document frequency

9.2.7 Classification Algorithm


. Itis used to know user’sinterest. By applying somefunction to new item we get some Probabili
ity ty which
which user may like.
lIKe-
Numericvaluesalso help to know aboutdegree ofinterest with someparticularitem,
Fewoftechniques arelisted as follows:

(1) Decision Tree and Rule Induction

(2) Nearest Neighbour Method

Scanned by CamScanner
ic Systems
a) Euc' jidean Distance Metr

cosine Similarity Function


f some other classification algorithm are:

; tH relevance feedback and Rocchio’s algorithm


| Q Linear classification

| 6) probabilistic methods
(a) Naive Bayes.

| Recommendation system in collaborativefiltering becominginteresting as few domains are used move by research
I scholar and academician like human-computerinteraction, information retrieval system and machinelearning.

Few famous recommender systems in somepopular fields like Ringo-music, Bellcore-video recommender (movies),
Jester-jokes etc.
widely used example ofcollaborative filtering and
Collaborative filtering began to use in the early 1990s. Most
fecommendation system is Amazon. com.
by
important. Recommendation must be get appreciated
Jo recommend among large set of values to users i is very
user else effort taken for it were worthless.
on of
4,10,000 title in its collection, so got proper selecti
> Netflix has 17,000 movies collection while Al mozon.com has
fecommendation is necessary.
soning
bec ome s adv anc ed wit h hel p of Bayesian interface, case-based rea
Toolbox used for collaborativefiltering
Method, information retrieval. s‘rating’ and
ce giv en by an y use r to an item is knowna
" eren
| Gllaborating filtering deals with ‘users’ and items. A pref
and Rating).
| Srepresented bytripletvalue set of (User, Item, matrix and it is referred asrating matrix
;
.
is u sed to create
@ sparx
~ Ral ting i ing) tem.
| tr
) lu at io n a n d us e of recommendation sys
e s rat ing
task’ are used foreva
:
"edict task’e
p and ‘re commend al es to apps
. ting matrix con 5 § tar sc
Table 9.3.1 : Sample rating ae 5
SS i | ee fi ngout:)'
ms) Y a ‘ i
ia Bae s(i
ts ae

3
cheat? te é

3
: 3 a

er ;
zp es

| 4
User A 4 5 3
f
B 3 |
User B 4 2
i
k a
3
User C

Scanned by CamScanner
4

¥ Big Data Analytics(MU)
d
—. Predict task tells abouutt preference may give
i n bya user or whatuser'slikely preference to an item?
e
Recon
mmene
dtask helpf
Pful ultto desigi nn-iti emslisi t for user’s need. Thes e n-itemsare not on basis of prediction Preference
se criteria to create recommendation
maybedifferent.
9.3.1 Measuring Similarity

— Among values ili AX js "


e Of utility matrix it is really a big question to measuresimilarity of itemsof users.
Table 9.3.2 : Utility matrix

User A 4 5 1

User B 5 5 4 5

User C 2 4

User D 3

Aboveutility matrix data is quite insufficient to put reliable conclusion. By considering values from A andC,theyrated
two appsin commonbut their ratings are diametrically very opposite.

9.3.2 Jaccard Distance

In this sets of items rated are considered while valuesin matrix are ignored.

d,(A,B) = 1-J(A,B)
_ AVBI-IANBI
~ IAUBI

be given by,
Alternatively Jacard distance can
AAB = (AUB)-(ANB)

zeis of 5, its Jaccord similarity be 1/5 and


For example, user A and User B have anintersectionofsize 1 and a unionsi
Jacard distance be 4/5.
e. .
: User A and User C haveJaccard similarity 2/4.and Jaccard distance is samei.
than A and B.
So, comparatively A and C are closer
one apP ie.
psbutuser A anduserB both rated nearly similar to
= User Aand User Chas very less matching choice ofap
Whatsapp.

9.3.3 Cosine Distance


from 1 to 5 to any ap| p thenit is considered as a 0 (zero)
— fuser doesn't give any rating

Scanned by CamScanner
e Aand User Bis,
jne one e 4xe 5. ‘
~ 0.380
: es + PYS+54+4"
a
A and userCis,
angle between user
cosine A x2+ R
5E 1x4
e 0.322
Vas+ ry +445

e implies a smaller angle.


se! to B compareto C as longer cosine valu
Ais clo
e Data
4 pounding th
o higher.rating and ass ign NULL to lowervalues.
goinding data by assigning one valuet like 3, 4 and5 will consider it as “4” and those having
ratings
ample, in our utility matrix few apps havingratings
for ex er rated keepit NULL.
wil | consid it as un nce C
like gand1
ion as Jaccar d be tw ee n A an d B is 3/ 4 and A and C its 1, si .
s
proach will give co
rrect conclus distance
This aP
A compared to B.
appears further from t matrixwill be.
rified by applying cosine distance tha
ach will give correct conclusion andit can beve
- This appro:
by 1 and 2and
atings)3, 4, 5 are replace
Table 9.3.3 : Utility matrix values(r
(NULL)
1 value (ratings) are kept unrated

an fies
WhatsApp. :
Apps(items)
:
* Users
1
1
User A

1 1 1
User B
1 1
User C
1
User D

; ng Rating
3, 5 Normalizi
e -
conv ert into nega tive whil e high rati ng get conv erted into positive as it is subtracted from averag
low rating get
te, this
his is known as Rating Normalization.
4 System
Pros and Cons in Recommendation Te

1g 0 llaborative
i Filtering

kn Ow! .
engineering efforts needed.
eneipity edge
;
\ Cong”
tiny,
in Fesults.
us, a Tech!
learning for market process. Pavticat

Scanned by CamScanner
wy Big Data Analytics(MU) 9-9 Recommendation Systems
SSS
Cons

(i) Rating feedbackis required.

(ii) New items and users facesto cold start.

9.4.2 Content-basedFiltering
Pros

(i) No. community requirement.

(ii) Items can be compared among themselves.

Cons

(i) Need of contentdescription.

(ii) New users facecold start.

Q.1 Whatis recommendation system?

Q.2 Enlist application of recommendation system and taxonomyfor application recommendation system.
Q.3 Explain utility matrix with example.

Q.4 Explain item profile of content based recommendation.


Q.5 Howitemprofile is represented ?

Q.6 Whatis a userprofile in content based recommendation ?

Q.7 Explainin collaborativefiltering.

Q.8 Whatis measuring similarity in collaborative filtering ?

Q.9 Whatis Jaccard distance andcosine distance in collaborativefiltering ?


Q.10 Explain rounding the data and normalized rating.
Q. 11. Write any two pros andconsfor Collaborativefiltering and content-basedfiltering.

Scanned by CamScanner
Mining Social Network Graph
Module - 6

Syllabus
Social Networks as Graphs, Clustering of Social-Network Graphs, Direct Discovery of Communities in a social graph

10.1 Introduction

See

= Social network idea cameinto theory andresearch in 1980s by Ferdinand Tonnis and Emile Durkhiem.Social network
is bind with domain like social links, social group.

— Major work started in 1930sin various areaslike mathematics, anthropology, psychology etc. I.L. Moreno provides
foundation for social network as provided a Moreno’s sociogram which representsocial links related with a person.

— Moreno’s sociogram example : Namethegirl with whom you would like to go to industrial visit tour.

Fig. 10.1.1 : Moreno’s soclogram with Industrialvisit tour

a Sociogram gives interpersonal relationship among members participated in group. Sociogram ‘present choice in

number.
~Numberof mutual choices
=. o o
C = "Number ofpossible mutual choices in the group

connected with M,nodesof network.


~The Bat rabasi-Albert (BA) model provides a network whichinitially

P
oe
ak
Where, K, - Degree ofnodei

j - All per existing node

Scanned by CamScanner
W Big Data Analytics(MU) 10-2 Mining Social Network Graph

The new nodesgives preference to get attach with heavily linked nodes.

Fig. 10.1.2 : Barabasi algorithm model shows steps of growth of network (M, = M = 2)
BA model is used to generate random scale free network. Scale-free network used in most of popular domain like the
internet, World Wide Web,citation network and few social networks.

Social network deals with large-scale data. After analyzing large data a hugesetof information can be achieved.

Linkedin, Facebookare vast widely used and very popular examplesfor social network. As wecanfind friends over the
network with 1°, 2™, 3™ connection or mutualfriends (i.e. friendsoffriend) in Linkedin and Facebook respectively.

Google+ is one of social network which gives link nodes in groups categorieslike Friends, Family, Acquaintances
following Featured on Google+etc.

Social network is huge platform to analyze data and obtain information.Furtherwill see efficient algorithm to discover
different graphs properties.

10.2 Social Network as Graphs

In generala graphis collection of set of edges (e) and set ofvertices (V). If there is an edge exists between any:two
nodes of graph then that noderelates with each other.

Graphsare categories by many parameters like orderedpairs of nodes, unordered pairs of nodes.
y matrix.
Some edgehasdirection, weight. Relationship amonggraphis explained with help of an adjacenc

Small network can be easily managedto construct a graph,it is quite impossible with huge/wide network.

Summary statistics and performance metrics are useful for design of graph fora large network.

Network and graphs can beelaborate with the help of few parameters like diameteri.e. largest distance between any
twonodes, centrality degree distribution.

Social website like Facebookuses undirectedsocial graphfor friends while directed graphusedin social website like
Twitter, Google+ (plus). Twitter gives connection like 1", 2™ , 3 and Google classify linked connectionin friends,
family, Acquaintances, Following etc.

Scanned by CamScanner
10-3 Mining Social Network Graph

Every node is distinct in a i .


network
social network asgraph are: andit is part of graph bysetoflinks. Some general parameter consider for any

( la) Degree : Numberofa djacent nodes (consi‘deri


dering both out degree andin-degree). Degree of node n, denot edby d(n).
Geodesi : .
{b) Geodesic Distance : Actual distance between two node n, and n, expressedby (i, j).
Density :| see
« ty It gives correctness of a graph,itis Useful to count closeness of network.”
(d) Centrality ty I It tells about degree centrality
ity ii.e. nodes appearancein the centre of network\ central
ity has types.
Example :

Amit Amar
© ©

© ) ©)
Mahesh Rahul Sachin

Fig. 10.2.1
Degree of each nodesareasfollows:

Amit

Amar

Mahesh
Rahul
Sachin

Density ofundirected graphis 0.6.


nodesis as follows
Geodesic Distances between two

Amit

Amar 1 ot 2 1 2
| Mahesh 1 2 = 1 2
Rahul 2 1 1 = 1
Sachin 2 2 2 1 =

Scanned by CamScanner
W Big Data Analytios(MU) 10-4 Mining Social Network Graph
Degreeofcentrality
, _ _d(a)
> = @-H
Closeness centrality

Co @) = Ch a) @-1)
Between’scentrality

Crea= Cqcay/[EME—2]
—1)(g-2)

C@) = jee BO Bx
8 = The numberof geodesics connecting jk
8x (0,) = The numberthatactori is on.

10.2.2 Varieties of Social Network

10.2.2(A) Collaborative Network

A network where each node has somevalueand asit gets connectedwith anothernodeits values get changed.

A tennis player has some records on his namein single. There are some other records on his nameassociated with
anotherplayer namein doubles.

Anode mayhave different values depending onits connection with neighbouring node.

Severalkindsof data are available having two or more commonnetworks.

10.2.2(B) Email Network

Whena noderepresentsan Email accountit is a single node. Every nodeof an e-mail is in link with at least one e-mail
account(i.e. sender mail ID and receiver mailID). .

Sometimes email are send from one side and sometime e-mail are send from both side in such scenario edges are
supposed weakandstrongrespectively.

10.2.2(C) Telephone Network

These nodes consist with values like phone numbers which givesit a distinct value.

As a call is placed between twouser nodes get someadditionalvalueslike timeofcall period of communicationetc.
In telephone network edge gets weight by numberofcalls modebyit to other. Networkassign edges with the way
they contact each otherlike frequently, rarely, never get connected.

10.3 Clustering of Social Network Graphs


Clustergives data into subsets ofrelated or linked objects. Custer coefficient gives degree to which various nodes ofa
graph tend to dustertogether.

ere
Scanned by CamScanner
f ee

Big Data Analytics(MU)


2S Mining Social Network Gi
Graphs are used to represent data basis of graph based
clustering algorithms. Clusters can be generated on
Y few
properties.
40.3-1 Distance Measurefor Social-Netw
s ork Graphs
Measuring a distanceis an essenti ,
Few graph edge has label, it
for applying clustering technique on’any graph.
P sential .
represents distance measure. Some a
. Edgeof graph ma y be unlabeled.
The distance d(x, * y) = if the re
of an edge i.e. nodes appear close asth ere is an edge. The
ence
‘i:
ere is
an exis tenc e or pres
distance d(x,5 y) = 1 me:‘ans no edgeor nodes appear distant
m

- Land can be used to representvalue


s for an existing edge.

Fig. 10.3.1 : Example fortriangle inequality


e triangle inequality
oo’ these are nott rue 2-val ued distan ce. measure. Uses of these values violat
- ‘Oand’ or “1 and .
e and nod es com bin ati on as shownin Fig. 10.3.1
whenthere ar e edg
node A and nodeC i.e.
) and edg e (B, C) but there is no any edge between
e (A,B
- In above example, there is edg
edge (A.C) ce of missing edge
val ue 1 to dis tan ce of an existing edge and 1.5 to distan
by assigning
- Above example can be valued
.
hod

visualization of data and


pat ter s fro m lar ge dataset domain. It is useful in
find
lar technique to
- Clustering is popu
.
hypothesis generation Cluster

erview ofclustering
Fig. 1 0.3.2 : Ov

rith
Various clustering algo
(A) Hierarchical
(B) k-means
(C) K-medoid
(D) Fuzzy C-m'

Scanned by CamScanner
wy Big Data Analytics( MU) 10-6 . Mining 9 Social Network Gi raph |iz
(A) Hlerarchical clustering
100 4
90
80 ! —k=8
70
<The
gcae
. +—k=6
sry 50
40 +—k=5

2 —k=4
20 Le <—k=3


91
[|
92 93 94 95 G6
3
97
+—k=1
98
Fig. 10.3.3 : Hierarchical Clustering example

There are twotypes ofhierarchical clustering :

(i) Agglomerative (bottom-up)

(ii) Divise (top-down)

(i) Agglomerative (bottom-up)


all documents are belongingto onecluster.
It starts with each documentassuming as a single cluster, ever almost

(ii) Divise (top-down)


ts own.
and samecluster. Every node generates clusterfori
It start.with all documentwhich are part of a single

(B) K-meansclustering
ng algorithm.
— [tis one of the unsupervised clusteri
it is an input to algorithm.
— Number of cluster represented by ‘g
ementation.
k on numerical data andit is easy to for impl
~— It is basically an iterative in nature, it wor
o estimate K (‘K’ is a user
sian Info rmat ion Crite rion (BIC) of Min imum Description Length (MDL) can be usedt
— Baye ‘
input)
version of K-meansalgorithm.
ure with K-medoids, K-medoidsis general
— Itis easy to work with any distance meas

(C) K-medoid clustering


and numerical.
— It work with quantitative variable types
idean distances do not work in better way.
— Inboth categoricalvariables and outliers Eucl
— Itis more intensive.
er.
— Compareto K-means,itis computationally costli
e distances available only).
and whe nd at a poi nts are not available (i.e. pair wis
— g
It is applied for cate | dat a

Scanned by CamScanner
/
Big Data Analytics(MU)
jer ork Gi raph
Social Netw
jeans Clustering (Fem)
JOT x sens
0) Fuzzy
It is unsupervised andit always con
verge: Se
Itallows one piece of data
which is part afof two or m lore clus
t ters,
itis used frequently in pattern
Tecognition
40.3.3 Betweenness
- To find communities am
ong social netwo:
with standard clustering me
thods.
~- nness
Betweethat is sh ,
auch th ortest Path available between two nodes. For example an edge(x, y) is betweennessof node a and
such e edge (x, y) lies on shortest path betweena and b.
- aand bare two di a
andb.
ifferent communities where edge(x, y) lies somewhereas shortest path between a
10.3.4 The Girvan - NewmanAlgorithm

= Itis published in 2002 by Michelle Girvan and mark Newmanfor:

o Community detection.

o To measure edge - betweenness amongall existing edges.

o Toremove edge having large valued betweenness.

© To option optimized modular function.


vertex betweennesscentrality.
- Girvan Newman algorithm checks for edge betweenness centrality and
through eachvertex on the network.
- Vertex betweennesscentrality is total number of shortest path that pass
ity) then every pathis adjusted to equal weight 1/N
If any ambiguity found with above (vertex betweennesscentral
vertices.
amongall N paths between two
test path which pass throughgiven edge.
- Edge betweennesscentrality is numberof shor
nness.
lete edgesof high betwee
Example : Successive de

Scanned by CamScanner
Ww Big Data Analytics(MU) 10-8 Mining Social Network Graph

Step 1

Step2

Step 3

@
— Standardprocessto successively deleting edges of high betweenness.

Scanned by CamScanner
Big Data Analytics(MU) g Social Network Graph
10-9
Find edge with high
4:
It betweenness of multiple edges of highest
. ghest
betweenness if thereis a tie- and remove those
- edges from graph
graph . mayaffect to Braph to get separate into multiple components. If so,this is first level of
regions in the portioning of
graph
292: Now,recalculal
and again remove the edge oredges of highest betweenness. It will| break few
a te all betweenness
i f
existing compone nt into smaller,if so, these are regions nested within larger region. Keep repeatation of tasks
ponent
by recalculating 6 all al betweenness and removing the edge or edges having highest betweenness.

40.3.5 Using Betweennessto Find Communities

- Itisan approach to find most shortest Path within a graph which connect two vertex.

highest betweenness are preferred to remove first,


- tis process of systematically removal of edges, edges having
processis continuedtill graph is brokeninto suitable count of connected components.
Example:

Fig. 10.3.4 : Betweenness score for graph example

path.
Between C andB thereare twopath,so edge(A,B), (B, D), (A, C) and(C, D) get credited by half a shortest
components
Clearly, edge (D, G) and (B, G) has highest betweenness,soit will get removedfirst, it will generate
namely {A, B, C, D} and {E,F, G, H}.
score6 i.e. (E.G) and (E,F). Later, removal with
By keeping removal with highest betweenness next removal are with
score i.e. (A, B), (B, D) and (C, D).

~ Finally graph remainsas =

©) ©
ss 5 and more are removed
Fig. 10.3.5 All edges with betweenne
an d C more clost e to “traitor”r” t to
each other than to B and D.In short B and aree “traito
at A
‘Communities’ implies th ity.
us e th ey ha ve fr ie nd G outside the commun
community {A, B, C, O} beca
ected.
H} and only F, G and H remain conn
Similarly G is “traitor” to grouP {EF G

Scanned by CamScanner
¥ Big Data Analytios(MU) 10-10 Mining Social Network Graph

10.4 Direct Discovery of Communities

Discovery of communities deals with large numberedgessearch from a graph.

Finding cliques

Cliques can bedefined as a set of nodes having edges between any twoofvertices.
Tofind clique is quitedifficult task. To find largest set of vertices where any twovertices needsto be connected within
a graph is known as maximum clique.

10.4.1 Bipirate Graph

— It is graph having vertices which can be partitioned into twodisjoint sets suppose set V and set U. Both V and sets
are not necessary of having same size.

— Agraphissaidto bebipirateif and onlyif it does not possesa cycle of an odd length.

Example :
Suppose we have 5 engines and 5 mechanics where each mechanic has differentskills and can handle different engine
by vertices in U. An edge between twovertices to shows that the mechanics has necessary skill to operate the engine
operated by
whichit is linked. By determining maximum matching we can maximize the number of engines being
workforce.

10.4.2 Complete Bipirate Graph


n vertices
- Agraph K,,, is said to be completebipirate graph as its vertex set partitioned into two subsets of m and
respectively.

- Two vertices are connectedif they belongtodifferent subsets.

kao kg
Fig. 10.4.1
10.5 Simrank

—_ Itis one of the approach ofanalyze a social network graphs.

— Graphsconsists of various types of nodes,simrank is useful to calculate the similarity from same type nodes.

= Simrankis useful for random walkers on a social graph while starting with a particular node.
— Simrank needscalculation andit is done at every starting nodefor limitedsizes graph.

wer TechKnowledg®

Scanned by CamScanner
ig Data Analytics(M Mining Social Network Graph
10-11
Z tl
Network
10.5.1 Random Walker on Social
meetto
jal network gra} .
hfounds dire cted. Random walker ofsocial graph can
sod graphis mostly undirected and web grap
any numberof neighboring nodeofit.

network
Fig. 10.5.1 : A tripartite graph example for random walker social

~ Suppose, example as shownin Fig. 10.5.1 U-users, T-tags, W-web pages.


to T, or T,.If
r prefer to go W,then in next attempt it will visit
- Atvery first step walker will go for U, or W,.If walke
meeteither T,, T, or T3.
walker preferto visit U, then in next attempt it will
reachable by user U, and U,.
T, and T, both tag placed on webpage W, and
a, network. Irrespective of start node
any node traversal can visit all n ode of
By keeping trend of visiting start at
as random walkeron social network.
walkercanvisit all node soit is known
tart
10.5.2 Random Walks with Res
dom.
y stop at some nodein ran
- Random node visiting W: alker ma on.
ilities putting in a matrixof transiti
ran dom wal ker may sto p can be calculate with helpofprobab
~ To know when a
degree K
matr ix of grap h G ente ring at row a colu mnb of M is 1/kif node b of graph having
Mis transition
- Suppose, O (zero).
node is a el: ise entry is
and one of the adjacent

Example :

ipartite social graph


Fi ig. 10.5.2: Asimpleb
and
of three imat ges, and two tags “Fog” and “Grass” Image and Image 3 has twotags
~ InFig, 10.5.2 network consists
”.
tag ie. “Grass tart at Image 1 probably
image 2 has onl ly one fe similar than Image2 and random walker withres
ar ab ly mo r
age 3 are comp
> Image 1 and Im to It.
te nt io n af te r applying alysis
support that in vee

Scanned by CamScanner
¥y Big Data Analytics(MU) 10-12 Mi ing Social Network Graph
Nodescan be keptin orderlike Image 1, Image 2, Image3, fog, grass. The transaction matrix for graphwill be like.
0 0 0 1/2 1/3
0 00 o 1
0 0 0 1/2 1/3
1/20 1/2 0 0
1/2 1 1/2 0 O
The fifth column of node “Grass” which is connected to each of image node.If thereforeit has some degreelike 3
then non-zeroentries to node “Grass” column must haveto be 1/3.
The image nodescorrespondtofirst three rows and first three columnsoftransaction matrix so entry 1/3 appears in
thefirst three rowsof column5.Since “fog” node does not have an edgetoeitheritself or “Grass” node.

— Let, B be probability of random walker, so 1-B is probability the walker will teleport to initial node N. ey is column
vector that has 1 in the row for node N otherwiseits0 (zero).

— Inera ofBig data, approx 2.5 quintillion byte data increasing per day. In 2004, Google introduced Map Reduce usedin
search engine.
bya pair of key
— Map Reduceusedfor processing and to generate large data sets. Map function gives data processed
set while Reduce function used to merge those data values.

— Mapand reduce functionlikewise function introduced in Lisp also.

Scanned by CamScanner
1GBig Data Analytos(MU) 7 Mining Social Network Graph
Map-Reduce processdoeswith help of threestages :

o Mapping
o Shuffle

o Reducing

- Counting oftriangle is helpful to know community around any nodewithin social network;it helps to know ‘clustering
co-efficient’.

- Let, agraph G = (V.E.) a simple undirected and unweighted graph.


Let, n lvl

m \El

TW) set of neighbors of v


{WeVl(v,weE}

dv wl
Cluster co-efficient (cc(v)) for a node where v € Vis,

(8) coe = I{@,w)e Elue T(v), we Ti)


Above expressiongivescluster coefficient for node v

- Map Reduceis used in page ranking.It is also useful in :

(i) Web accesslog states,


(i) Inverted index construction

(iii) Documentclustering
(iv) Statistical machine translation
(v) Machine learning
(vi). Weblink-graph reversal
(vil) Distributed sorting

(viii) Distributed pattern based search


(ix) Machinetranslation

21° Whatis sociogram and Barbasi-Alber algorithm?


82 How can a social network treated as graph ?
Q3 Explain the graph parameters listed below :
(i) Degree
(i) Geosestic distance
(ii) Density

Ww ‘Teck!
Publications.

Scanned by CamScanner
Big Data Analytics(MU) 10-14 Mining Social Network Graph
Q.4 How degree,closeness, between's centrally is measured ?
Q.5 Whatis social network ? Explain anyonetype in details. ‘
Q.6 Explain following clustering algorithm in short :
(a) Hierarchical
(b) K-means
© K-medoid
(d) Fuzzy C-means

Q.7 Explain Girvan-Newman algorithm in details with help of suitable example.

Q.8 Whatis betweenness ? Explain use of betweenness to find communities.


Q.9- Explain bipirate graph and complete bipirate graph.

Q. 10 Whatis simrank ? Explainin brief.

Q. 11 Whatis MapReduce ? Enlistits application ?


Ou

Scanned by CamScanner

You might also like