BDA Techmax (Searchable)
2. To introduce programming skills to build simple solutions using big data technologies such as MapReduce and scripting for NoSQL, and the ability to write parallel algorithms for multiprocessor execution.
3. To teach the fundamental techniques and principles in achieving big data analytics with scalability and streaming capability.
4. To enable students to have skills that will help them to solve complex real-world problems for decision support.
5. To provide an indication of the current research approaches that are likely to provide a basis for tomorrow's solutions.
1. Understand the key issues in big data management and its associated applications for business decisions and strategy.
2. Develop problem solving and critical thinking skills in fundamental enabling techniques like Hadoop, MapReduce and NoSQL in big data analytics.
3. Collect, manage, store, query and analyze various forms of Big Data.
4. Interpret business models and scientific computing paradigms, and apply software tools for big data analytics.
5. Adapt adequate perspectives of big data analytics in various applications like recommender systems, social media applications, etc.
6. Solve complex real-world problems in various applications like recommender systems, social media applications, health and medical systems, etc.
Pre-requisites : Some prior knowledge about Java programming, basics of SQL, data mining and machine learning methods would be beneficial.
Scanned by CamScanner
Hadoop HDFS and MapReduce
2.1 Distributed File Systems : Physical Organization of Compute Nodes, Large-Scale File-System Organization.
2.2 MapReduce : The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners, Details of MapReduce Execution, Coping With Node Failures.
2.3 Algorithms Using MapReduce : Matrix-Vector Multiplication by MapReduce, Relational-Algebra Operations, Computing Selections by MapReduce, Computing Projections by MapReduce, Union, Intersection, and Difference by MapReduce.
2.4 Hadoop Limitations (Refer Chapter 3)
NoSQL
3.1 Introduction to NoSQL, NoSQL Business Drivers.
3.2 NoSQL Data Architecture Patterns : Key-value stores, Graph stores, Column family (Bigtable) stores, Document stores, Variations of NoSQL architectural patterns, NoSQL Case Study.
3.3 NoSQL solution for big data, Understanding the types of big data problems; Analyzing big data with a shared-nothing architecture; Choosing distribution models : master-slave versus peer-to-peer; NoSQL systems to handle big data problems. (Refer Chapter 4)
Chapter 1 : Introduction to Big Data 1-1 to 1-11
1.1 Introduction to Big Data Management
1.2 Big Data
1.3 Big Data Characteristics - Four Important V of Big Data
1.4 Types of Big Data
1.5 Big Data vs. Traditional Data Business Approach
1.6 Tools used for Big Data

Chapter 3 : Hadoop HDFS and Map Reduce 3-1 to 3-13
Distributed File Systems : Physical Organization of Compute Nodes, Large-Scale File-System Organization, MapReduce : The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners, Details of MapReduce Execution, Coping With Node Failures, Algorithms Using MapReduce : Matrix-Vector Multiplication by MapReduce, Relational-Algebra Operations, Computing Selections by MapReduce, Computing Projections by MapReduce, Union, Intersection, and Difference by MapReduce, Hadoop Limitations
Syllabus :
The Stream Data Model : A Data-Stream-Management System, Examples of Stream Sources, Stream Queries, Issues in Stream Processing, Sampling Data Techniques in a Stream, Filtering Streams : Bloom Filter with Analysis, Counting Distinct Elements in a Stream, Count-Distinct Problem, Flajolet-Martin Algorithm, Combining Estimates, Space Requirements, Counting Frequent Items in a Stream, Sampling Methods for Streams, Frequent Itemsets in Decaying Windows, Counting Ones in a Window : The Cost of Exact Counts, The Datar-Gionis-Indyk-Motwani Algorithm, Query Answering in the DGIM Algorithm, Decaying Windows.

Chapter 4 : NoSQL 4-1 to 4-27
4.1 NoSQL (What is NoSQL?)
4.2 NoSQL Basic Concepts
4.3 Case Study NoSQL (SQL vs NoSQL)
4.4 Business Drivers of NoSQL
4.5 NoSQL Database Types
4.6 Benefits of NoSQL
4.7 Introduction to Big Data Management
4.8 Big Data
4.8.1 Tools Used for Big Data
4.8.2 Understanding Types of Big Data Problems
4.9 Four Ways of NoSQL to Operate Big Data Problems

Chapter 5 : Mining Data Streams
5.1 The Stream Data Model
5.1.1 A Data-Stream-Management System
5.1.2 Examples of Stream Sources
5.1.3 Stream Queries
5.1.4 Issues in Stream Processing
5.2 Sampling Data Techniques in a Stream
5.3 Filtering Streams
5.3.1 Bloom Filter with Analysis
5.4 Counting Distinct Elements in a Stream
5.4.1 Count-Distinct Problem
5.4.2 The Flajolet-Martin Algorithm
Chapter 10 : Mining Social Network Graph 10-1 to 10-14
10.3.3 Betweenness 10-7
Introduction to Big Data
Syllabus
Introduction to Big Data, Big Data characteristics, Types of Big Data, Traditional vs. Big Data business approach, Case Study of Big Data Solutions.
- As a result, multiple processing machines have to generate and keep huge data too. Due to this exponential growth of data, data analysis becomes a very much required task for day-to-day operations.
- The term 'Big Data' means huge volume, high velocity and a variety of data.
- Traditional data management systems and existing tools are facing difficulties in processing such Big Data.
- R is one of the main computing tools used in statistical education and research. It is also widely used for data analysis and numerical computing in other fields of scientific research.
Fig. 1.1.1 : Big data analysis
1.2 Big Data
- We all are surrounded by huge data. People upload/download videos, audios and images from a variety of devices.
- They send text messages and multimedia messages, update their Facebook, WhatsApp and Twitter status, comment, shop online, advertise online, etc.
- All of this generates a huge amount of data.
- As a result, machines have to generate and keep huge data too. Due to this exponential growth of data, the analysis of that data becomes challenging and difficult.
- The term 'Big Data' means huge volume, high velocity and a variety of data. This big data is increasing tremendously day by day. Traditional data management systems and existing tools are facing difficulties in processing such Big Data.
- Big data is one of the most important technologies in the modern world. It is really critical to store and manage it. Big data is a collection of large datasets that cannot be processed using traditional computing techniques.
- Big Data includes huge volume, high velocity and an extensible variety of data. The data in it may be structured, semi-structured or unstructured. Big data also involves various tools, techniques and frameworks.
1. Volume
- A huge amount of data is generated by big data applications.
- The amount of data generated, as well as the storage volume, is very big in size.
Fig. 1.3.2 : Statistics on the growing volume of data (e.g. 2.5 billion gigabytes of new data generated every day; 4/5ths of the world's data is unstructured)
2. Velocity
- For time-critical applications, faster processing is very important, e.g. share marketing and video streaming data.
- The huge amount of data generated and stored requires a higher processing speed.
- The amount of digital data doubles every 18 months, and in future it may double in even less time.
3. Variety
- Data comes in a variety of forms : structured and unstructured.
Fig. 1.3.3
4. Veracity
- The data captured is not in a certain format.
- Data captured can vary greatly.
- So the accuracy of analysis depends on the veracity of the source data.
5. Additional characteristics
(a) Programmable
- It is possible with big data to explore all types of data by programming logic.
- Programming can be used to perform any kind of exploration because of the scale of the data.
(d) Iterative
With more computing power you can iterate on your models until you get them as per your own requirements.
1. Introduction
- Tools like Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie and Greenplum DB are used for big data.
- Big data can be grouped into : structured data, unstructured data and semi-structured data.
2. Structured data
- Structured data is generally data that has a definite length and format.
- For example, RDBMS tables have a fixed number of columns, and data can be increased by adding rows.
Example :
Structured data includes marks data as numbers, dates, or data like words and numbers. Structured data is very simple to deal with, and easy to store in a database.
1. Sensor data : Radio frequency ID tags, medical devices, and Global Positioning System data.
3. Un-structured data
- Unstructured data is generally data collected in any available form, without restricting it to any format.
Example :
Unstructured data includes video recordings of CCTV surveillance.
1. Satellite images : This includes weather data or the data from satellites, etc.
2. Scientific data : This includes seismic imagery, weather forecasting data, etc.
Semi-structured data
ctured data.
ctured data, there's also a semi-stru
Along with structured and unstru
ina RDBMS.
— Se mi-structured data is
information that doesn't reside
e in some cases.
tern whichis easier to analyz
- itt may organized in tre ‘ee pat
nts and NoSQLdatabases.
mpl es of sem i-s tru ctu red data might include XML docume
Exa
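The tree pattern mentioned above can be made concrete with a small sketch. The XML below is a hypothetical example (the student records and field names are illustrative, not from this book): two records share a tree shape, yet one record carries an extra nested field with no schema change — exactly what a fixed RDBMS row would not allow.

```python
import xml.etree.ElementTree as ET

# A small semi-structured XML document: each <student> is a subtree, and
# records may differ in shape (Ravi has two <marks> elements, Asha has one).
doc = """
<students>
  <student id="1">
    <name>Asha</name>
    <marks subject="DBMS">82</marks>
  </student>
  <student id="2">
    <name>Ravi</name>
    <marks subject="DBMS">74</marks>
    <marks subject="BDA">91</marks>
  </student>
</students>
"""

root = ET.fromstring(doc)

# Walk the tree: each record is analyzed on its own, whatever its shape.
for student in root.findall("student"):
    name = student.findtext("name")
    scores = {m.get("subject"): int(m.text) for m in student.findall("marks")}
    print(name, scores)
```

A relational table would force both records into the same columns; the tree lets each record carry only the fields it has.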
5. Hybrid data
- There are systems which will make use of both types of data to achieve competitive advantages.
- Structured data offers simplicity, whereas unstructured data will give a lot of data about a topic.
Fig. 1.4.2 : Structured vs. un-structured data (classic BI : structured and repeatable analysis)
1. Traditional business intelligence
- The traditional data warehouse and business intelligence approach required extensive data analysis work with each of the systems and extensive transfer of data.
- Traditional Business Intelligence (BI) systems offer various levels and types of analyses on structured data, but they are not designed to handle unstructured data.
- For these systems, Big Data may create big problems due to data that flows in either a structured or unstructured way.
- Many of the data sources are incomplete, do not use the same definitions, and are not always available.
- Saving all the data from each system to a centralized location is unfeasible.
Fig. 1.5.2 : Traditional business intelligence (business users determine what question to ask and IT structures the data to answer that question; a big data platform instead delivers a platform for creative discovery, where the business explores what questions could be asked, e.g. brand sentiment, product strategy)
2. Big data analysis
- Big data means large or complex data sets that traditional data processing applications may not be able to process efficiently.
- Big data analytics involves data analysis, data capture, search, sharing, storage, transfer, visualization, querying and information security.
- The term is generally used for predictive analytics.
- The efficiency of big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction and reduced risk.
- Cloud-based platforms can be used for the business world's big data problems.
- There can be some situations where running workloads on a traditional database may be the better solution.
Fig. 1.5.4 : Accelerating time-to-value (cost versus months for a data warehouse appliance, single SKU)
Parameter : Traditional data | Big Data
- Data relationship : By default, stable and interrelated. | Unknown relationship.
- Data location : Centralized. | Physically highly distributed.
- Data analysis : After the complete build. | Intermediate analysis, as you go.
- Data reporting : Mostly canned, with limited and pre-defined interaction paths. | Reporting in all possible directions across the data in real-time mode.
- Cost factor : Specialized high-end hardware and software. | Inexpensive commodity boxes in cluster mode.

Parameter : Traditional RDBMS | MapReduce
- Data size : Gigabytes (Terabytes) | Petabytes (Exabytes)
- Query response time : Can be near immediate | Has latency (due to batch processing)
Tools used for Big Data :
- MapReduce : Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum
- Servers : EC2, Google App Engine, Elastic Beanstalk, Heroku
- Storage :
- NoSQL Databases : MongoDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak, CouchDB, ZooKeeper
- Processing : R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets, Tinkerpop
1.7 Data Infrastructure Requirements
1. Acquiring data
A high volume of data and transactions is a basic characteristic of big data, and the infrastructure should support it. Flexible data structures should be used. The amount of time required for this should be as little as possible.
2. Organizing data
As the data may be structured, semi-structured or unstructured, it should be organized in a fast and efficient way.
3. Analyzing data
Data analysis should be fast and efficient. It should support distributed computing.
There are many cases in which big data solutions can be used effectively.
1. Healthcare
- The computing power of big data analytics enables us to predict disease, allows us to find new cures, and helps us better understand and predict disease patterns.
- For example, entire DNA strings can be decoded in minutes.
- Smart watches can be used to predict symptoms of various diseases.
- Big data techniques are already being used to monitor babies in a specialist premature and sick baby unit. By recording and analyzing every heart beat and breathing pattern of every baby, the unit was able to develop algorithms that can now predict infections 24 hours before any physical symptoms appear.
- Big data analytics allows us to monitor and predict the development of epidemics and disease. Integrating data from medical records with social media analytics enables us to monitor such diseases.
2. Sports
- Video analytics also track the performance of every player in a cricket game, and sensor technology in sports equipment such as basketballs allows us to get feedback even using smart phones.
- Many elite sports teams also track athletes outside of the sporting environment, using smart technology to track nutrition and sleep, as well as social media conversations.
3. Science and research
- Science and research is currently being transformed by the new possibilities big data offers.
- Experiments involve testing lots of possible test cases and generate huge amounts of data.
- Many advanced labs use the computing power of thousands of computers distributed across many data centers worldwide to analyze the data.
4. Security enforcement
5. Financial trading
- Automated Trading and High-Frequency Trading (HFT) is a new area where big data can play a role.
- Big data algorithms can be used to make trading decisions.
- Today, the majority of equity trading takes place via data algorithms that increasingly take into account signals from social media networks and news websites to make buy and sell decisions in split seconds.
Review Questions
Q. 1 Write a short note on Big Data.
Q. 2 Explain various applications of Big Data.
Q. 3 Give all characteristics of Big Data.
Q. 4 Explain the three Vs of Big Data.
Q. 5 Explain various types of big data in detail.
Q. 6 Why use Big Data over the traditional business approach ?
Q. 7 Compare the traditional approach and the big data approach.
Q. 8 Explain various needs of Big Data.
Q. 9 Explain various tools used in Big Data.
Q. 10 Write a short note on :
(a) Types of Big Data
(b) Traditional vs. Big Data business approach
Introduction to Hadoop
Module - 4
Syllabus
2.1 Hadoop
- Hadoop runs applications on systems with thousands of nodes involving huge storage capabilities. As a distributed file system is used by Hadoop, data transfer rates among nodes are very fast.
- As thousands of machines are there in a cluster, the user gets uninterrupted service, and node failure is not a big issue in Hadoop even if a large number of nodes become inoperative.
- Hadoop uses distributed storage and transfers code to data. This code is tiny and also consumes less memory.
- This code executes with the data there itself. Thus the time to fetch data and store results back is saved, as the data is locally available. Thus interprocess communication time is saved, which makes processing faster.
- The redundancy of data is an important feature of Hadoop, due to which node failures are easily handled.
- In Hadoop the user need not worry about partitioning the data, data and task assignment to nodes, or communication between nodes. As Hadoop handles it all, the user can concentrate on the data and the operations on that data.
1. Low cost
As Hadoop is an open-source framework, it is free. It uses commodity hardware to store and process huge data. Hence it is not very costly.
3. Scalability
Nodes can be easily added and removed. Failed nodes can be easily detected. For all these activities very little administration is required.
Hadoop | RDBMS
1. Hadoop stores both structured and unstructured data. | RDBMS stores data in a structured way.
2. SQL can be implemented on top of Hadoop. | SQL (Structured Query Language) is used, with a dedicated execution engine.
3. Scaling out is not that expensive, as machines can be added or removed with ease and little administration. | Scaling up (upgradation) is very expensive.
4. Basic data unit is key/value pairs. | Basic data unit is relational tables.
5. With MapReduce we can use scripts and code to tell the actual steps in processing the data. | With SQL we can state the expected result and the database engine derives it.
6. Hadoop is designed for offline processing and analysis of large-scale data. | RDBMS is designed for online transactions.
2. Transfer code to data
In an RDBMS, data is generally moved to the code and results are stored back. As data is moving, there is always a security threat. In Hadoop, small code is moved to the data and it is executed there itself. Thus data stays local, and Hadoop co-locates processing and storage.
3. Fault tolerance
- Running Hadoop means running a set of resident programs. These resident programs are also known as daemons.
- These daemons may be running on the same server or on different servers in the network.
- All these daemons have some specific functionality assigned to them. Let us see these daemons.
Secondary NameNode (SNN)
The SNN takes snapshots of the HDFS metadata at intervals by communicating constantly with the NameNode.
JobTracker
The JobTracker determines the files to process, node assignments for different tasks, task monitoring, etc.
TaskTracker
Fig. 2.3.2 : JobTracker and TaskTracker interaction
Fig. : HDFS cluster (admin node, name node, data nodes)
2.4.1 HDFS
- HDFS is a file system for Hadoop. It is :
o Highly fault-tolerant
o High throughput
HDFS Architecture
- For distributed storage and distributed computation, Hadoop uses a master/slave architecture. The distributed storage system in Hadoop is called the Hadoop Distributed File System, or HDFS. In HDFS a file is chopped into 64 MB chunks, known as blocks, and then stored.
- As previously discussed, an HDFS cluster has a Master (NameNode) and Slave (DataNode) architecture. The NameNode manages the namespace of the filesystem.
- In this namespace, the information regarding the file system tree, metadata for all the files and directories in that tree, etc. is stored. For this it creates two files, the namespace image and the edit log, and stores information in them on a consistent basis.
- A client interacts with HDFS by communicating with the NameNode and DataNodes. The user does not know about the assignment of the NameNode and DataNodes for functioning, i.e. which NameNode and DataNodes are or will be assigned.
1. NameNode
2. DataNode
- The DataNode is known as the slave of HDFS.
Fig. : HDFS architecture (block ops, DataNodes, replication, racks)
2.4.2 MapReduce (May 17, 8 Marks)
- MapReduce is a software framework. In MapReduce an application is broken down into a number of small parts.
- These small parts are also called fragments or blocks. These blocks can then be run on any node in the cluster.
- Data processing is done by MapReduce. MapReduce scales and runs an application on different cluster machines.
- The configuration changes required for scaling and running these applications are done by MapReduce itself. There are two primitives used for data processing by MapReduce, known as mappers and reducers.
- Mapping and reducing are the two important phases for executing an application program. In the mapping phase, MapReduce takes the input data, filters that input data and then transforms each data element in the mapper.
- In the reducing phase, the reducer processes all the outputs from the mapper, aggregates all the outputs and then provides a final result.
- MapReduce uses lists and key/value pairs for processing of data.
MapReduce core functions
1. Read input
Divides input into small parts / blocks. These blocks then get assigned to a Map function.
2. Function mapping
- Partition function : With the given key and number of reducers, it finds the correct reducer.
- Compare function : Map intermediate outputs are sorted according to this compare function.
4. Function reducing
Intermediate values are reduced to smaller solutions and given to the output.
5. Write output
Gives file output.
Input → Map → Shuffle and sort → Reduce → Output
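The partition function described above can be sketched in a few lines. This is an illustrative Python sketch, not Hadoop's actual partitioner (Hadoop's default hashes the key object in Java); the stable character-sum hash here is an assumption made so that the routing is deterministic.

```python
# Sketch of a MapReduce partition function: every intermediate (key, value)
# pair is routed to one reducer based on a hash of the key, so all pairs
# with the same key meet at the same reducer.

def partition(key: str, num_reducers: int) -> int:
    """Return the index of the reducer responsible for this key."""
    # A simple, stable hash (assumption of this sketch): Python's built-in
    # hash() is randomized per process, so we avoid it here.
    h = sum(ord(c) for c in key)
    return h % num_reducers

pairs = [("hello", 1), ("world", 1), ("hello", 1)]
num_reducers = 4

# Group the intermediate pairs by the reducer they are routed to.
routed = {}
for key, value in pairs:
    routed.setdefault(partition(key, num_reducers), []).append((key, value))
```

Because the hash depends only on the key, both ("hello", 1) pairs are guaranteed to land in the same partition, which is what lets the reduce phase see all values for a key together.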
(i) Map
Map 1 : < Hello, 1 >    Map 2 : < Goodnight, 1 >
(ii) Combine
(iii) Reduce
< Sachin, 2 >
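The map / combine / reduce flow above can be simulated end-to-end in a single process. This is a minimal word-count sketch, not Hadoop itself: the two input strings stand in for input splits, and the shuffle is simulated by a sort-and-group step.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Map task: emit a (word, 1) pair for every word in the input split.
    return [(word, 1) for word in line.split()]

def combine(pairs):
    # Combiner: pre-aggregate counts locally on the map side,
    # shrinking the data that crosses the (simulated) network.
    out = {}
    for word, count in pairs:
        out[word] = out.get(word, 0) + count
    return list(out.items())

def reduce_phase(shuffled):
    # Reduce task: after grouping by key (the shuffle), sum the counts.
    return {word: sum(c for _, c in group)
            for word, group in groupby(sorted(shuffled), key=itemgetter(0))}

splits = ["Hello Sachin", "Goodnight Sachin"]
intermediate = [pair for split in splits for pair in combine(map_phase(split))]
result = reduce_phase(intermediate)
print(result)   # {'Goodnight': 1, 'Hello': 1, 'Sachin': 2}
```

Note how "Sachin", emitted by both map tasks, ends up at a single reducer with count 2 — the same < Sachin, 2 > pair shown above.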
- Hadoop can perform only batch processing, and data is accessed only in a sequential manner.
2.5 Hadoop Ecosystem
1. Introduction
- Statistics show that every year the amount of data generated is more than in previous years.
- The amount of unstructured data is much more than the structured information stored in rows and columns.
- Big Data actually comes from complex, unstructured formats : everything from websites, social media and email to videos, presentations, etc.
- The pioneer in this field is Google, which designed scalable frameworks like MapReduce and the Google File System.
- The Apache open source community started an initiative by the name Hadoop; it is a framework that allows for the distributed processing of such large data sets across clusters of machines.
2. Ecosystem
o Hadoop MapReduce
o Hadoop Distributed File System
- Hadoop MapReduce is a programming model and software for writing applications which can process vast amounts of data in parallel on large clusters of computers.
- HDFS is the primary storage system; it creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
- Other Hadoop-related projects are Chukwa, Hive, HBase, Mahout, Sqoop and ZooKeeper.
Fig. 2.5.2 : Apache Hadoop ecosystem (Ambari for provisioning, managing and monitoring Hadoop clusters; ZooKeeper for coordination; Flume; HDFS, the Hadoop distributed file system, at the base)
2.6 ZooKeeper
1. ZooKeeper is a distributed, open-source coordination service for distributed applications, used by Hadoop.
2. This system provides a simple set of primitives that distributed applications can build upon to implement higher level services for synchronization, configuration maintenance, groups and naming.
Fig. 2.6.1
3. Coordination services are prone to errors such as race conditions and deadlock.
4. The main goal behind ZooKeeper is to make distributed applications easier to build and use.
5. ZooKeeper allows distributed processes to coordinate with each other using a shared hierarchical namespace organized like a standard file system.
6. The namespace is made up of data registers called znodes, and these are similar to files and directories.
7. ZooKeeper data is kept in-memory, which means it can achieve high throughput and low latency.
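The znode namespace described in points 5 and 6 can be modeled with a toy in-memory tree. This is only a sketch of the data model — a real ZooKeeper is a replicated network service reached through a client library — and the class names and paths below are hypothetical.

```python
# Toy in-memory model of ZooKeeper's hierarchical namespace: znodes are
# addressed by slash-separated paths, like files, but every znode can
# both hold a small piece of data AND have children, like a directory.

class ZNode:
    def __init__(self, data=b""):
        self.data = data        # the data register stored at this znode
        self.children = {}      # child znodes, keyed by name

class ToyZooKeeper:
    def __init__(self):
        self.root = ZNode()

    def create(self, path, data=b""):
        # Walk down to the parent, then attach the new znode.
        parts = path.strip("/").split("/")
        node = self.root
        for part in parts[:-1]:
            node = node.children[part]      # parents must already exist
        node.children[parts[-1]] = ZNode(data)

    def get(self, path):
        # Follow the path from the root and return the znode's data.
        node = self.root
        for part in path.strip("/").split("/"):
            node = node.children[part]
        return node.data

zk = ToyZooKeeper()
zk.create("/config", b"")
zk.create("/config/workers", b"3")
```

Distributed processes coordinating through ZooKeeper would read and write such shared paths (e.g. a configuration value under /config) instead of messaging each other directly.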
2.7 HBase
1. HDFS is a distributed file system suitable for storing large files. HBase is a database built on top of HDFS.
2. HDFS does not support fast individual record lookups. HBase provides fast lookups for large tables.
RDBMS | HBase
5. It is good for structured data. | It is good for semi-structured as well as structured data.
7. It is suitable for Online Transaction Processing (OLTP). | It is suitable for Online Analytical Processing (OLAP).
2.7.3 HBase Architecture
- The Master performs administration, cluster management, region management, load balancing and failure handling.
- A Region Server hosts and manages regions, and handles region splitting, read/write requests, client communication, etc.
- A Region contains a Write Ahead Log (WAL). A Region Server may have multiple regions. A region is made up of a Memstore and HFiles in which data is stored. ZooKeeper is required to manage all the services.
1. Pre-splitting
Regions are created first and split points are assigned at the time of table creation. The initial set of region split points are to be used very carefully, otherwise load distribution will be heterogeneous, which may hamper cluster performance.
2. Auto splitting
This is the default action. It splits a region when one of the stores crosses the maximum configured value.
3. Manual splitting
When a region server fails.
Fig. 2.7.2 : HBase data model (column families)
1. Tables
2. Rows
Each row is one instance of data. Each table row is identified by a rowkey. These rowkeys are unique and always treated as a byte[].
3. Column Families
Data in a row are grouped together as Column Families. These are stored in HFiles.
4. Columns
5. Cell
A Cell stores data as a combination of rowkey, Column Family and the Column (ColumnQualifier).
6. Version
On the basis of timestamp, different data versions are created. By default the number of versions is 3, but it can be configured to some other value as well.
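The data model above — cells addressed by (rowkey, column family, column qualifier) and keeping up to three timestamped versions by default — can be modeled with nested dictionaries. This is a toy sketch of the model only, not the HBase client API; the class and method names are hypothetical.

```python
from collections import defaultdict

# Toy model of the HBase data model: a cell is addressed by
# (rowkey, column family, column qualifier), and each cell keeps up to
# MAX_VERSIONS timestamped values (HBase's default is 3).
MAX_VERSIONS = 3

class ToyHTable:
    def __init__(self):
        # rowkey -> family -> qualifier -> list of (timestamp, value)
        self.rows = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

    def put(self, rowkey, family, qualifier, timestamp, value):
        versions = self.rows[rowkey][family][qualifier]
        versions.append((timestamp, value))
        versions.sort(reverse=True)      # newest version first
        del versions[MAX_VERSIONS:]      # drop versions beyond the limit

    def get(self, rowkey, family, qualifier):
        """Return the newest value stored in the cell."""
        return self.rows[rowkey][family][qualifier][0][1]

t = ToyHTable()
for ts in (1, 2, 3, 4):                  # four writes, only 3 versions kept
    t.put(b"row1", "cf", "q", ts, f"v{ts}")
```

After the four writes, a read returns the newest value ("v4") and the oldest version ("v1") has been discarded, matching the default of three versions per cell.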
2.8 HIVE
- Hive is a data warehouse infrastructure tool.
- It processes structured data in HDFS. Hive structures data into tables, rows, columns and partitions.
2.8.1 Architecture of HIVE
2. Meta store
Hive stores metadata, schema, etc. in respective database servers known as metastores.
3. HiveQL process engine
HiveQL is used as the querying language to get information from the Metastore. It is an alternative to a MapReduce Java program. A HiveQL query can be written for a MapReduce job.
4. Execution engine
Query processing and result generation is the job of the Execution engine. Its results are the same as those of MapReduce.
5. HDFS or HBASE
The Hadoop distributed file system or HBASE are the data storage techniques used to store data into the file system.
2.8.2 Working of HIVE
Fig. : Working of Hive (JobTracker, TaskTracker)
1. Execute Query : The Command Line or Web UI sends the query to the JDBC or ODBC driver to execute.
2. Get Plan : With the help of the query compiler, the driver checks the syntax and requirements of the query.
3. Get Metadata : The compiler sends a metadata request to the Metastore for getting data.
4. Send Metadata : The Metastore sends the required metadata as a response to the compiler.
5. Send Plan : The compiler checks the requirement and resends the plan to the driver. Thus the parsing and compiling of a query is complete.
6. Execute Plan : The driver sends the execute plan to the execution engine.
7. Execute Job : The execution engine sends the job to the JobTracker, which assigns it to a TaskTracker.
7.1 Metadata Operations : The execution engine can execute metadata operations with the Metastore.
8. Fetch Result : The execution engine receives the results from the Data nodes.
9. Send Results : The execution engine sends those resultant values to the driver.
10. Send Results : The driver sends the results to the Hive interfaces.
1. Databases
2. Tables
3. Partitions
4. Buckets or clusters
Partitions
A table is divided into smaller parts based on the value of a partition column. Queries can then be made on these slices of data for faster processing.
Buckets
Buckets give extra structure to the data that may be used for efficient queries. Data required together for queries lands in the same bucket, so queries can be evaluated quickly.
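The two layout ideas above can be sketched in plain Python. This is an illustrative model of what Hive does on disk, not HiveQL itself; the table, column names and bucket count are hypothetical.

```python
# Sketch of Hive's partitions and buckets:
#  - partitions: rows are physically grouped by a partition column's value,
#    so a query filtering on that column reads only the matching slice
#    (partition pruning),
#  - buckets: within a partition, rows are hashed on a column into a
#    fixed number of files.
NUM_BUCKETS = 4

rows = [
    {"name": "a", "year": 2015, "id": 10},
    {"name": "b", "year": 2016, "id": 11},
    {"name": "c", "year": 2016, "id": 12},
]

# Build the partitioned + bucketed layout: {year: {bucket: [rows]}}
layout = {}
for row in rows:
    part = layout.setdefault(row["year"], {b: [] for b in range(NUM_BUCKETS)})
    part[row["id"] % NUM_BUCKETS].append(row)   # bucketed by hash of "id"

# A query like "SELECT * WHERE year = 2016" touches only one partition.
hits = [r for bucket in layout[2016].values() for r in bucket]
```

Because the 2015 slice is never opened, the query cost depends on the size of one partition rather than the whole table; bucketing by the join key similarly lets matching buckets be joined file-against-file.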
Review Question
Q.1 Write a short note on Hadoop.
Hadoop HDFS and MapReduce
Module - 2
Syllabus
Distributed File Systems : Physical Organization of Compute Nodes, Large-Scale File-System Organization, MapReduce : The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners, Details of MapReduce Execution, Coping With Node Failures, Algorithms Using MapReduce : Matrix-Vector Multiplication by MapReduce, Relational-Algebra Operations, Computing Selections by MapReduce, Computing Projections by MapReduce, Union, Intersection, and Difference by MapReduce, Hadoop Limitations
Big Data Analytics(MU) 3 Hadoop HDFS and MapReduce
3.1.1 Physical Organization of Comput
e Nodes
The compute nodesare arranged in rack
s with each rack holdin; g around 8 to 64 compute nodes as depicted in
Fig. 3.1.2.
Networking
“-® Device
(Switch, Router, Hub)
Nodes
- There are two levels of connections : intra-rack and inter-rack. The compute nodes in a single rack are connected through a gigabit Ethernet and this is known as the intra-rack connection. Additionally, the racks are connected to each other with another level of network or a switch, which is known as the inter-rack connection.
- One major problem with this type of setup is that there are a lot of interconnected components, and the more the number of components, the higher is the probability of failure, for example, a single node failure or an entire rack failure.
- To make the system more robust against such types of failures the following steps are taken :
- Duplicate copies of files are stored at several compute nodes. This is done so that even if a compute node crashes, the file is not lost forever. This feature is known as "Data Replication". Fig. 3.1.3 shows data replication. Here, the data item D1 is originally stored on Node 1 and a copy each is stored on Node 2 as well as Node 3. It means in total we have three copies of the same data item D1. This count is known as the "Replication Factor" (RF).
Fig. 3.1.3 : Data replication
- Computations are subdivided into tasks such that even if one task fails to complete execution, it may be restarted without affecting other tasks.
Scanned by CamScanner
- The types of files which are most suited to be used with DFS are :
o Very large sized files having size in TBs or more.
o Files with very few update operations compared to read and append operations.
- In DFS, files are divided into smaller units called "chunks". A chunk is usually of 64 MB in size. Each chunk is normally replicated and stored in three different compute nodes. It is also ensured that these three compute nodes are members of different racks, so that in the event of rack failure at least one copy of the chunk is available.
- Both the chunk size and the replication factor can be adjusted by the user based on the demands of the application.
- All the chunks of a file and their locations are stored in a separate file called the master node or name node. This file acts as an index to find the different chunks of a particular file. The master node is also replicated just like the individual chunks.
- The information about the master nodes and their replicas is stored in a directory. This directory in turn is replicated in a similar fashion, and all the participants of the DFS are aware of the locations where the directory copies reside.
- There are many different implementations of the DFS described above.
Q. Explain the concept of MapReduce using an example.
Q. What is MapReduce ? Explain the role of a combiner with the help of an example.
- MapReduce can be used to write applications to process large amounts of data, in parallel, on large clusters of commodity hardware (commodity hardware is nothing but the hardware which is easily available in the local market) in a reliable manner.
- MapReduce is a processing technique as well as a programming model for distributed computing based on the Java programming language or Java framework.
- The MapReduce algorithm contains two important functions, namely Map and Reduce :
o The Map tasks accept one or more chunks from a DFS and turn them into a sequence of key-value pairs. How the input data is converted into key-value pairs is determined by the code written by the user for the Map function.
o A master controller collects and sorts the key-value pairs produced by the Map tasks. These sorted keys are then divided among the Reduce tasks. This distribution is done in such a way that all the key-value pairs having the same key are assigned to the same Reduce task.
o The Reduce tasks combine all of the values associated with a particular key. The code written by the user for the Reduce function determines how the combination is done.
o An element is stored entirely in one chunk. That means one element cannot be stored across multiple chunks.
o The types of keys and values both are arbitrary.
- Let us understand the MapReduce operations with an example. Let us suppose we are given a collection of documents and the task is to compute the counts of the number of times each word occurs in that collection.
- Here each document is an input element. One Map task will be assigned one or more chunks and that Map task will process all the documents in its assigned chunk(s).
- After the successful completion of all the Map tasks, the grouping of the key-value pairs is done by the master controller.
- The number of Reduce tasks is set by the user in advance. The master controller uses a hash function that maps each key to the range 0 to r-1. In this step all the key-value pairs are segregated in files according to the hash function output. These r files will be the input to the Reduce tasks.
- The master controller then performs the grouping by key procedure to produce a sequence of key/list-of-values pairs for each key k, which are of the form (k, [v1, v2, v3, ..., vn]), where (k, v1), (k, v2), (k, v3), ..., (k, vn) are the key-value pairs which were produced by all of the Map tasks.
- The input to a Reduce task is one or more keys and their lists of associated values. The output produced by the Reduce task is a sequence of zero or more key-value pairs, which may be different from the key-value pairs produced by the Map tasks. But in most cases the key-value pairs produced by the Reduce tasks and the Map tasks are of the same type.
- In the final step the outputs produced by all of the Reduce tasks are combined in a single file.
- In our word count example, the Reduce tasks will add all the values for each key. The outputs will be of the form (w, s) pairs, where w is a word and s is the number of times it appears in the collection of documents.
- Fig. 3.2.1 shows the various MapReduce phases for the word frequency counting example.
Fig. 3.2.1 : MapReduce phases for the word frequency counting example
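The word-count job described above can be sketched in a single Python process. This is an illustrative simulation of the map, shuffle and reduce phases, not Hadoop's actual runtime; the function names are hypothetical:

```python
from collections import defaultdict

# Map task: emit a (word, 1) pair for every word occurrence in a document.
def map_task(document):
    for word in document.split():
        yield (word.lower(), 1)

# Shuffle: the master groups pairs by key; a hash function assigns each
# key to one of the r Reduce tasks (range 0 .. r-1).
def shuffle(pairs, r=4):
    reduce_inputs = [defaultdict(list) for _ in range(r)]
    for key, value in pairs:
        reduce_inputs[hash(key) % r][key].append(value)
    return reduce_inputs

# Reduce task: sum all the counts associated with one word.
def reduce_task(key, values):
    return (key, sum(values))

documents = ["deer bear river", "car car river", "deer car bear"]
pairs = [p for doc in documents for p in map_task(doc)]
result = {}
for grouped in shuffle(pairs):
    for key, values in grouped.items():
        k, s = reduce_task(key, values)
        result[k] = s
print(result["car"])    # 3
print(result["river"])  # 2
```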
- A combiner is a type of mediator between the mapper phase and the reducer phase. The use of combiners is totally optional. As a combiner sits between the mapper and the reducer, it accepts the output of the map phase as an input and passes the key-value pairs to the reduce operation.
Fig. 3.2.2 : The combiner sits between the generation of key-value pairs by the mapper and the reducer
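The benefit of a combiner can be seen in a small sketch: it runs the reducer's logic locally on one map task's output, so fewer key-value pairs have to cross the network. This is an illustrative simulation with hypothetical names, not Hadoop's Combiner class:

```python
from collections import Counter

# A map task whose output is pre-aggregated by a local combiner.
def map_with_combiner(document):
    pairs = [(w, 1) for w in document.split()]
    # the combiner applies the same summing logic as the reducer,
    # but only to the pairs produced by this one map task
    combined = Counter()
    for word, count in pairs:
        combined[word] += count
    return list(combined.items())

doc = "deer bear deer deer bear"
raw = [(w, 1) for w in doc.split()]   # 5 pairs leave the mapper without a combiner
combined = map_with_combiner(doc)     # only 2 pairs leave it with a combiner
print(len(raw), len(combined))
print(dict(combined)["deer"])
```

A combiner is only safe when the reduce operation is associative and commutative (as summation is), since it may be applied zero, one or several times.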
- In this section we will discuss in detail how a MapReduce based program is executed. The user program first creates a Master controller process with the help of the fork command, using the library provided by the MapReduce system, as depicted in Fig. 3.2.3.
- In addition to the Master process, the user program also forks a number of worker processes. These processes run on different compute nodes.
Fig. 3.2.3 : Execution of a MapReduce program — the Master assigns Map and Reduce tasks to workers, which read input data files and write intermediate files
o Master node failure : If the Master node fails then the entire process has to be restarted. This is the worst kind of failure.
o Map worker failure : If a Map worker node fails then the Master will assign its tasks to some other available worker node, even if the task had completed.
o Reduce worker failure : If a Reduce worker node fails then its tasks are simply rescheduled on some other Reduce worker later.
- Let us consider a matrix M of size n x n. Let m_ij denote the element in row i and column j. Let us also consider a vector v of length n, whose jth element is represented as v_j.
- The matrix-vector multiplication will produce another vector x whose ith element x_i is given by the formula :
x_i = Σ_{j=1..n} m_ij v_j
- In real-life applications such as Google's PageRank, the dimensions of the matrices and vectors will be in the trillions. Let us at first take the case where, although the dimension is large, the vector is still able to fit entirely in the main memory of the compute node.
- The Map task at each Map worker node works on a chunk of M and the entire v, and produces the key-value pairs (i, m_ij v_j). All the sum terms of the component x_i of the result vector of the matrix-vector multiplication will thus get the same key i.
- The Reduce tasks sum all the values for a particular key i and produce the result (i, x_i).
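The map and reduce steps just described can be sketched as follows. This is an illustrative single-process simulation (hypothetical function names), assuming M is stored sparsely as (i, j, m_ij) triples and the whole vector v fits in memory:

```python
from collections import defaultdict

# Map task: for each stored matrix entry m_ij, emit (i, m_ij * v_j).
def map_task(chunk_of_M, v):
    for i, j, m_ij in chunk_of_M:
        yield (i, m_ij * v[j])

# Reduce task: sum all the terms contributing to component x_i.
def reduce_task(i, terms):
    return (i, sum(terms))

M = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]  # a 2x2 matrix as (i, j, value)
v = [10, 20]

grouped = defaultdict(list)
for i, term in map_task(M, v):
    grouped[i].append(term)
x = dict(reduce_task(i, terms) for i, terms in grouped.items())
print(x)  # {0: 50, 1: 110}
```

Here x_0 = 1·10 + 2·20 = 50 and x_1 = 3·10 + 4·20 = 110, matching the formula above.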
- In the case where the vector v is too large to fit into the main memory of a compute node, an alternative approach as shown in Fig. 3.3.1 is taken. The matrix M is divided into vertical stripes and the vector v is divided into horizontal stripes, having the following characteristic :
o The size of a stripe in v must be such that it can fit conveniently in the main memory of a compute node.
- Now it is sufficient to multiply the jth stripe of M with the jth stripe of v. A chunk of a stripe of M and the corresponding entire stripe of v is assigned to each Map task, and the calculations proceed as described earlier.
Fig. 3.3.1 : Division of matrix M and vector v into stripes
Relational-Algebra Operations by MapReduce
The standard relational-algebra operations that can be computed using MapReduce are :
o Selection
o Projection
o Union
o Intersection
o Difference
1. Selection operation
- In case of a selection operation, a constraint, which is denoted by 'C', is applied on every tuple in the relation.
- Only those tuples which satisfy the specified constraint 'C' will be retrieved and shown by the system as output.
- Selection operation in the relational algebra is represented by σ_C(R).
Where, σ → represents the select operation
C → represents the condition/constraint
2. Projection operation
- Projection operation in the relational algebra is represented by π_S(R).
Where, π → represents the project operation
S → represents the subset of attributes
R → represents the relation
- All the three set operations (union, intersection and difference) operate on the rows of two different relations. The basic requirement is that both of the relations must have the same schema.
(Figure : A ∪ B contains the contents (tuples) from both relations; A ∩ B contains the contents (tuples) common to both relations.)
(Figure : Selection operation with MapReduce — in the Map phase each qualifying tuple t generates the key-value pair (t, t), which the Reduce phase forwards to the output.)
- In the Map task, the constraint C is applied on each tuple t of the relation.
- If C is satisfied by a tuple t, then the key-value pair (t, t) is produced for that particular tuple. Observe that here both the key as well as the value are the tuple itself.
- As the processing is already finished in the Map function, the Reduce function is the identity function. It will simply forward the key-value pairs to the output for display.
- The output relation is obtained from either the key part or the value part, as both contain the tuple t.
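The selection σ_C(R) can be sketched as a tiny map/reduce pair. This is an illustrative simulation with hypothetical names, not Hadoop code:

```python
# Map: emit (t, t) for every tuple t of R that satisfies the constraint C.
def selection_map(tuples, C):
    for t in tuples:
        if C(t):
            yield (t, t)

# Reduce: the identity function — simply forward the tuple.
def identity_reduce(key, values):
    return (key, key)

R = [(1, "a"), (2, "b"), (3, "a")]
C = lambda t: t[1] == "a"              # the constraint C
pairs = list(selection_map(R, C))
output = [identity_reduce(k, [v])[0] for k, v in pairs]
print(output)  # [(1, 'a'), (3, 'a')]
```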
- In the Map task, from each tuple t in R, the attributes not present in S are eliminated and a new tuple t' is constructed. The output of the Map tasks is the key-value pairs (t', t').
- The main job of the Reduce task is to eliminate the duplicate t's, as the output of the projection operation cannot have duplicates.
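The projection π_S(R) can be sketched the same way: the Map phase strips attributes, and the grouping by key lets the Reduce phase drop the duplicates this creates. An illustrative simulation with hypothetical names:

```python
from collections import defaultdict

# Map: keep only the attribute positions listed in S, emit (t', t').
def projection_map(tuples, S):
    for t in tuples:
        t_prime = tuple(t[i] for i in S)
        yield (t_prime, t_prime)

# Reduce: one output tuple per distinct t', eliminating duplicates.
def dedup_reduce(key, values):
    return (key, key)

R = [(1, "a", 10), (2, "a", 20), (3, "b", 30)]
grouped = defaultdict(list)
for k, v in projection_map(R, S=[1]):   # keep only the second attribute
    grouped[k].append(v)
output = sorted(dedup_reduce(k, vs)[0] for k, vs in grouped.items())
print(output)  # [('a',), ('b',)]
```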
- For the union operation R ∪ S, the two relations R and S must have the same schema. Those tuples which are present in either R or S or both must be present in the output.
- The only responsibility of the Map phase is converting each tuple t into the key-value pair (t, t).
- The Reduce phase eliminates the duplicates just as in the case of the projection operation. Here a key t can have either 1 value, if it is present in only one of the relations, or 2 values, if it is present in both the relations. In either case the output produced by the Reduce task will be (t, t).
Fig. 3.3.5 : Union operation with MapReduce
- For the intersection operation R ∩ S, the Map phase again converts each tuple t of R and S into the key-value pair (t, t).
- In the Reduce phase, if a key t has the value list [t, t] (i.e., the tuple is present in both relations), then the pair (t, t) is generated; otherwise nothing (NULL) is output.
(Figure : Intersection operation with MapReduce)
- For the difference operation R − S, both the relations R and S must have the same schema. The tuples which are present only in R and not in S will be present in the output.
- The Map phase will produce the key-value pair (t, R) for every tuple in R and (t, S) for every tuple in S.
- The Reduce phase will produce the output (t, t) only if the associated value list of a key t is [R].
Fig. 3.3.7 : Difference operation with MapReduce
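The three set operations above can be sketched with one shared tagging scheme (for union and intersection the text emits (t, t) pairs; tagging each tuple with its relation, as the difference operation does, works uniformly for all three). An illustrative simulation with hypothetical names, assuming R and S contain no duplicates within themselves:

```python
from collections import defaultdict

# Map: tag every tuple with the name of the relation it came from.
def tag_map(R, S):
    for t in R:
        yield (t, "R")
    for t in S:
        yield (t, "S")

# Group pairs by key, then keep the keys whose value list passes reduce_fn.
def run(map_fn, reduce_fn, R, S):
    grouped = defaultdict(list)
    for k, v in map_fn(R, S):
        grouped[k].append(v)
    return sorted(k for k, vals in grouped.items() if reduce_fn(vals))

R, S = [1, 2, 3], [2, 3, 4]
union        = run(tag_map, lambda vals: True, R, S)            # any tag at all
intersection = run(tag_map, lambda vals: len(vals) == 2, R, S)  # tagged by both
difference   = run(tag_map, lambda vals: vals == ["R"], R, S)   # value list is [R]
print(union)         # [1, 2, 3, 4]
print(intersection)  # [2, 3]
print(difference)    # [1]
```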
3.4 Hadoop Limitations
- Hadoop is a collection of open source projects created by Doug Cutting and Mike Cafarella in 2006. It was inspired by Google's MapReduce programming framework. Hadoop consists of the following core modules :
o Not easy to use : The developer needs to write the code for all the operations, which makes it very difficult to use.
o Security : Hadoop does not support encryption, which makes it vulnerable.
o Real-time data processing not supported : Hadoop is designed to support only batch processing, and hence the real-time processing feature is missing.
o No iteration support : Hadoop is not designed to support the feeding of the output of one stage to the input of the next stage of processing.
o No caching : Intermediate results are not cached and this brings down the performance.
Q.7 Explain Union, Intersection and Difference operations with MapReduce techniques.
NoSQL
Syllabus
Introduction to NoSQL, NoSQL Business Drivers, NoSQL Data Architecture Patterns : Key-value stores, Graph stores, Column family (Bigtable) stores, Document stores, Variations of NoSQL architectural patterns, NoSQL Case Study, NoSQL solution for big data, Understanding the types of big data problems; Analyzing big data with a shared-nothing architecture; Choosing distribution models : master-slave versus peer-to-peer; NoSQL systems to handle big data problems.
1. History
- The term NoSQL was first used by Carlo Strozzi in the year 1998.
- He mentioned this name for his open source database system, in which there was no provision of an SQL query interface.
- In early 2009, at a conference held in the USA, NoSQL came into the picture and actually came into practice.
2. Overview
- NoSQL is not a RDBMS (Relational Database Management System).
- NoSQL is specially designed for large amounts of data stored in a distributed environment.
- The important feature of NoSQL is that it is not bounded by table schema restrictions like RDBMS. It gives options to store some data even if no such column is present in the table.
- NoSQL generally avoids join operations.
3. Need
- In real time, data requirements have changed a lot. Data is easily available with Facebook, Google+, Twitter and others.
- The data includes user information, social graphs, geographic location data and other user-generated content.
- To make use of such abundant resources and data, it is necessary to work with a technology which can operate on such data.
- SQL databases are not ideally designed to operate on such data.
- NoSQL databases are specially designed for operating on huge amounts of data.
4. Advantages
(i) Good resource scalability.
(ii) Lower operational cost.
(iii) Supports semi-structured data.
(iv) No static schema.
5. Disadvantages
- The CAP theorem states three basic requirements of NoSQL databases when designing a distributed architecture.
(a) Consistency : The database must remain in a consistent state, as before, even after the execution of an operation.
(b) Availability : This means that the system continues to function even if a communication failure happens between servers, i.e. if one server fails, another server will take over.
1. CA
2. CP
— Basic availability
— Soft state
— Eventual consistency
6. Data storage
- To add redundancy to a database, we can add duplicate nodes and configure replication.
- Scalability is simply a matter of adding additional nodes. A hash function can be designed to allocate data to servers.
Data storage
- SQL databases store data in tables, whereas NoSQL databases store data as document based stores, key-value pairs, graph databases or wide-column stores.
- SQL data is stored in the form of tables with rows.
- NoSQL data is stored as a collection of key-value pairs or documents or graph based data, with no standard schema definitions.
Database schema
- SQL databases have a predefined schema which cannot be changed very frequently, whereas NoSQL databases have a dynamic schema which can be changed any time for unstructured data.
Complex queries
- SQL databases provide a standard platform for running complex queries.
- NoSQL does not provide any standard environment for running complex queries.
- NoSQL is not as powerful as the SQL query language.
Sr. No. | SQL | NoSQL
1. | Full form is Structured Query Language. | Full form is Not Only SQL or Non-relational database.
2. | SQL is a declarative query language. | NoSQL is not a declarative query language.
3. | SQL databases work on the ACID properties (Atomicity, Consistency, Isolation, Durability). | NoSQL databases follow Brewer's CAP theorem (Consistency, Availability, Partition Tolerance).
4. | Structured and organized data. | Unstructured and unpredictable data.
5. | Relational database is table based. | Key-value pair storage, Column store, Document store, Graph databases.
6. | Data and its relationships are stored in separate tables. | No pre-defined schema.
7. | Tight consistency. | Eventual consistency rather than the ACID properties.
8. | Examples : MySQL, Oracle, MS SQL, PostgreSQL, SQLite, DB2. | Examples : MongoDB, BigTable, Neo4j, CouchDB, Cassandra, HBase.
- Big Data is one of the main driving factors of NoSQL for business.
3. Location independence
- It is the ability to read and write to a database regardless of where that I/O operation is done.
- The master/slave architectures and database sharding can sometimes meet the need for location independent read operations.
4. Modern transactional capabilities
- The transactions concept is changing, and ACID transactions are no longer a requirement in database systems.
6. Better architecture
- NoSQL has a more business oriented architecture for a particular application.
- So, organizations adopt a NoSQL platform that allows them to keep their very high volume data.
Examples
- Cassandra
- Azure Table Storage (ATS)
- DynamoDB
Fig. 4.5.1
Use Cases
- This type is generally used when you need quick performance for basic Create-Read-Update-Delete operations and the data is not connected.
Example
- Storing and retrieving session information for Web pages.
Limitations
- It may not work well for complex queries attempting to connect multiple relations of data.
Examples : Cassandra, HBase, HyperTable, Amazon SimpleDB.
Fig. 4.5.2
3. Document database
- Document databases work on the concept of key-value stores where "documents" contain a lot of complex data.
- Every document contains a unique key, used to retrieve the document.
- The key is used for storing, retrieving and managing document-oriented information, also known as semi-structured data.
Examples : MongoDB, Terrastore, OrientDB, Couchbase.
Fig. 4.5.3
Use Cases
- An example of such a system would be an event logging system for an application, or online blogging.
- In online blogging a user acts like a document; each post is a document; and each comment, like, or action would be a document.
- All documents would contain information about the type of data, username, post content, or timestamp of document creation.
Limitations
- It's challenging for a document store to handle a transaction that spans multiple documents.
- Document databases may not be good if data is required in aggregation.
4. Graph database
Examples : Neo4j, InfiniteGraph, OrientDB, FlockDB (Twitter).
Fig. 4.5.4
Use Cases
Limitations
- Graph databases may not offer a better choice over the other NoSQL variations.
- If the application needs to scale horizontally, this may introduce poor performance.
5. Comparison of NoSQL variations
- Scalability : Key-value store databases offer good scalability.
- Location independence : A NoSQL database can read and write regardless of the location of the database operation.
4.7 Introduction to Big Data Management
- We all are surrounded by huge data. People upload/download videos, audios and images from a variety of devices. Sending text messages, multimedia messages, updating their Facebook, WhatsApp and Twitter statuses, comments, online shopping, online advertising etc. generates huge data.
- As a result machines have to generate and keep huge data too. Due to this exponential growth of data, the analysis of that data becomes challenging and difficult.
- The term 'Big Data' means huge volume, high velocity and a variety of data. This big data is increasing tremendously day by day.
- Traditional data management systems and existing tools are facing difficulties to process such Big Data.
- R is one of the main computing tools used in statistical education and research. It is also widely used for data analysis and numerical computing in other fields of scientific research.
4.8 Big Data
- Big data is one of the most important technologies in the modern world. It is really critical to store and manage it. Big data is a collection of large datasets that cannot be processed using traditional computing techniques.
- Big Data includes huge volume, high velocity and an extensible variety of data. The data in it may be structured data, semi-structured data or unstructured data.
- Big data also involves various tools, techniques and frameworks.
(Figure : Four important characteristics of Big Data, including analysis of uncertainty of data.)
3. Servers
- High volume of data and transactions are the basic requirements of big data, and the infrastructure should support the same, with predictable latency and flexible data structures. The amount of time required for this should be as little as possible.
2. Organizing data
- As the data may be structured, semi-structured or unstructured, it should be organized in a fast and efficient way.
3. Analysing data
Working
- The schema-less format of a key-value database is required for data storage needs.
- The key can be auto-generated while the value can be a String.
- The key-value store uses a hash table in which there exists a unique key and a pointer to a particular item of data.
- A logical group of keys is called a bucket. There can be identical keys in different buckets.
- It will improve performance because of the cache mechanisms that accompany the mappings.
- To read any value you need to know both the key and the bucket, because the real key is hash(Bucket + Key).
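The hash(Bucket + Key) scheme above can be sketched in a few lines of Python. This is an illustrative toy store with hypothetical names, not any specific product's API:

```python
import hashlib

# A bucketed key-value store: the real storage key is hash(bucket + key),
# so the same key can exist independently in different buckets.
class KeyValueStore:
    def __init__(self):
        self.table = {}  # the underlying hash table

    def _real_key(self, bucket, key):
        return hashlib.sha1((bucket + ":" + key).encode()).hexdigest()

    def put(self, bucket, key, value):
        self.table[self._real_key(bucket, key)] = value

    def get(self, bucket, key):
        # both the bucket and the key are needed to read a value back
        return self.table.get(self._real_key(bucket, key))

store = KeyValueStore()
store.put("sessions", "user42", {"cart": 3})
store.put("profiles", "user42", {"name": "Asha"})  # same key, other bucket
print(store.get("sessions", "user42"))  # {'cart': 3}
print(store.get("profiles", "user42"))  # {'name': 'Asha'}
```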
(Figure : Read/write of values in a row-oriented database — an employee table with columns Emp_no, Dept_id, Hire_date, Emp_ln, Emp_fn and rows such as (1, 1, 2001-01-01, Smith, Bob).)
- Columns are created at runtime.
- Read and write is done using columns.
- It offers fast search/access and data aggregation.
Data Model
Working
- This type of data is a collection of key-value pairs, compressed, as a document store quite similar to a key-value store, but the only difference is that the values stored (known as "documents") have some defined structure and encoding.
Examples
(i) MongoDB
(ii) CouchDB
4. Graph database
Working
- In a Graph NoSQL database, a flexible graphical representation is used, with edges, nodes and properties, which provides index-free adjacency.
- Data can be easily transformed from one model to the other using a Graph Base NoSQL database.
- These databases use edges and nodes to represent and store data.
- These nodes are organised by some relationships, which are represented by edges between the nodes.
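The node/edge model with index-free adjacency can be sketched as follows. This is a toy illustration with hypothetical names, not Neo4j's API:

```python
# Nodes carry properties; relationships are stored directly on the node,
# so traversal follows edges without any global index lookup.
class Node:
    def __init__(self, **props):
        self.props = props
        self.edges = []  # outgoing relationships as (label, target) pairs

    def relate(self, label, other):
        self.edges.append((label, other))

alice = Node(name="Alice")
bob = Node(name="Bob")
blog = Node(title="NoSQL basics")
alice.relate("FRIEND_OF", bob)
alice.relate("WROTE", blog)

# Traversal: follow only the edges with a given label.
friends = [n.props["name"] for lbl, n in alice.edges if lbl == "FRIEND_OF"]
print(friends)  # ['Bob']
```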
Examples
4.10.1 Shared Memory System
- Multiple CPUs are attached to a common global shared memory via an interconnection network or communication bus.
- Shared memory architectures usually have large memory caches at each processor, so that referencing of the shared memory is avoided whenever possible.
- Moreover, the caches need to be coherent. That means if a processor performs a write to a memory location, the cached copies of that data should be either updated or removed.
(Figure : Shared memory architecture — processors (P) connected via an interconnection network to a common global shared memory (M) and disks (D).)
(b) Advantages
- Data can be accessed by any processor without being moved from one place to another.
- A processor can send messages to other processors much faster using memory writes.
(c) Disadvantages
- Bandwidth problem.
- Not scalable beyond 32 or 64 processors, since the bus or interconnection network will become a bottleneck.
- A larger number of processors can increase the waiting time of processors.
4.10.2 Shared Disk System
(a) Architecture details
- Multiple processors can access all disks directly via the interconnection network, but every processor has local memory.
- Each processor has its own memory; the memory bus is not a bottleneck.
(¢) Disadvantages
(d) Applications
4.10.3 Shared Nothing Disk System
(a) Architecture details
- Each processor has its own local memory and its own local disk.
- A processor at one node may communicate with another processor using a high speed communication network.
- Each node functions as a server for the data that is stored on its local disk.
- Moreover, the interconnection networks for shared nothing systems are usually designed to be scalable, so that we can increase the transmission capacity as more nodes are added to the network.
Fig. 4.10.3 : Shared nothing architecture (P = Processor, M = local memory, D = Disk)
(b) Advantages
- In this type of architecture there is no need to pass all I/O through the interconnection network; only queries which access nonlocal disks pass through the network.
- We can achieve a high degree of parallelism, i.e. the number of CPUs and disks can be connected as desired.
- Shared nothing architecture systems are more scalable and can easily support a large number of processors.
(c) Disadvantages
- The cost of communication and of nonlocal disk access is higher than in the other two architectures, since sending data involves software interaction at both ends.
- Requires rigid data partitioning.
(d) Applications
- The Teradata database machine uses the shared nothing database architecture.
- Grace and the Gamma research prototypes.
4.10.4 Hierarchical System
Architecture details
- The hierarchical architecture combines the characteristics of the shared memory, shared disk and shared nothing architectures.
- This architecture attempts to reduce the complexity of programming such systems and yields distributed virtual memory architectures, where logically there is a single shared memory; the memory mapping hardware, coupled with system software, allows each processor to view the disjoint memories as a single virtual memory.
- The hierarchical architecture is also referred to as nonuniform memory architecture (NUMA).
(Figure : Hierarchical architecture — clusters of nodes connected via an IP network.)
(Figure : Master-slave model — distribution of work load from the master node to the slave nodes.)
- In peer-to-peer databases, as all nodes have the same priority, requests from database users will be received by any of the nodes, irrespective of the work load distribution.
(Figure : Peer-to-peer model — a user request may be received by any of the nodes, e.g. Node 1 to Node 6.)
4.11.1 Big Data NoSQL Solutions
1. Cassandra 2. Apache's HBase 3. Neo4j 4. MongoDB
4.11.1(A) Cassandra
1. Introduction
- Cassandra is a distributed storage system mainly designed for managing large amounts of structured data across multiple servers.
Features of Cassandra
(i) Scalability : Cassandra is a highly scalable system; it allows adding more hardware as per the data requirement.
(ii) 24x7 Availability
Fig. 4.11.3 : Amazon DynamoDB
1. Data Model
- Amazon's NoSQL database called DynamoDB stores data in the form of tables. A table is a collection of various items and each item is a collection of attributes.
- In a relational database, a table has a fixed schema with a primary key and a list of its columns with data types.
- All tuples are of the same schema.
- The DynamoDB data model contains :
o Table
o Items
o Attributes
- DynamoDB requires only a primary key and does not require defining all of the attribute names and data types in advance.
Primary Key
- In order to create a table we must specify a primary key column name which identifies the items in a table uniquely.
(Figure : An item's primary key — the Partition key (Hash key) and the Sort key (Range key) — together with its other attributes.)
Id = 1015
ProjectName = "JDB"
ISBN = "111-111"
Authors = ["Author 3"]
Price = 1543
PageCount = 5000
Publication = TechMax
3. Data Types
- Amazon DynamoDB supports various data types :
- Scalar types : Number, String, Binary, Boolean, and Null.
- Document types : List and Map.
- Set types : String Set, Number Set, and Binary Set.
4. CRUD Operations
(a) Table Operations
(i) CreateTable
- It is used to create a new table in your account.
- To check the status of a table we can use the DescribeTable command.
"ProjectionType": "string"
"ProvisionedThroughput": {
    "ReadCapacityUnits": number,
    "WriteCapacityUnits": number
}
(ii) ReadTables
(iii) DeleteItem
- Deletes a single item with the help of its primary key.
(iv) GetItem
- The GetItem operation returns a set of attributes for the item with the given primary key.
(c) Others
(i) Query
- A Query operation uses the primary key of a table to directly access items from that table.
(ii) Scan
- The Scan operation returns one or more items and item attributes.
Fig. 4.11.5 : DynamoDB
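The difference between GetItem, Query and Scan on a table keyed by a partition key and a sort key can be sketched in plain Python. This is a toy simulation (hypothetical class and method names), not the real DynamoDB or boto3 API:

```python
# A table keyed by (partition key, sort key), as DynamoDB tables are.
class Table:
    def __init__(self):
        self.partitions = {}  # partition key -> {sort key -> item}

    def put_item(self, pk, sk, item):
        self.partitions.setdefault(pk, {})[sk] = item

    def get_item(self, pk, sk):
        # GetItem: direct lookup by the full primary key
        return self.partitions.get(pk, {}).get(sk)

    def query(self, pk):
        # Query: all items sharing one partition key, ordered by sort key
        return [item for sk, item in sorted(self.partitions.get(pk, {}).items())]

    def scan(self):
        # Scan: every item in the whole table
        return [i for p in self.partitions.values() for i in p.values()]

books = Table()
books.put_item("TechMax", "111-111", {"Price": 1543})
books.put_item("TechMax", "222-222", {"Price": 900})
books.put_item("OtherPub", "333-333", {"Price": 500})
print(books.get_item("TechMax", "111-111")["Price"])  # 1543
print(len(books.query("TechMax")))                    # 2
print(len(books.scan()))                              # 3
```

The sketch mirrors why Query is cheap (it touches one partition) while Scan grows with the whole table.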
(Fig. 4.11.6 : MapSuite — map data consumed through the DynamoDB extension.)
Data Indexing
- In order to have efficient access to data in a table, Amazon DynamoDB creates and maintains indexes for the primary key attributes.
- Secondary indexes allow querying the data with attributes other than the primary key.
Types of secondary indexes
(i) Global Secondary Index : An index with a partition key and sort key different from those on the table.
(ii) Local Secondary Index
Mining Data Streams
Syllabus
- The data available from a stream is fundamentally different from the data stored in a conventional database, in the sense that the data available in a database is complete data and can be processed at any time we want it to be processed. On the other hand, stream data is not completely available at any one point of time. Instead, only some data is available with which we have to carry on our desired processing.
(Figure : A data-stream-management system — several input streams (e.g. 1, 2, 3, 4, 5, 6, ...; A S D 3 N 2 1 6 P ...; 0110101010010 ...) enter the stream processor, which answers standing and ad-hoc queries using a limited-size active/working store and an archival store.)
- Ad-hoc queries directly on the archival store are not supported. Also, the fetching of data from this store takes a lot of time as compared to the fetching of data from the working store.
5. Output streams : The output consists of the full ...
5.1.2 Examples of Stream Sources
- The networking components such as switches and routers on the Internet receive streams of IP packets and route them to the proper destination. These devices are becoming smarter day by day, helping in avoiding congestion, detecting denial-of-service attacks, etc.
- Websites receive many different types of streams. Twitter receives millions of tweets, Google receives tens of millions of search queries, Facebook receives billions of likes and comments, etc. These streams can be studied to gather useful information such as the spread of diseases or the occurrence of some sudden event such as a catastrophe.
5.1.3 Stream Queries
Fig. 5.1.3 : Query types (standing queries and ad-hoc queries)
1. Standing queries
- A standing query is a query which is stored in a designated place inside the stream processor. The standing queries are executed whenever the conditions for that particular query become true.
- For example, if we take the case of a temperature sensor then we might have the following standing queries in the stream processor :
  o Whenever the temperature exceeds 50 degrees centigrade, output an alert.
  o On arrival of a new temperature reading, produce the average of all the readings arrived so far starting from the beginning.
5.2 Sampling Data Techniques in a Stream
- A sample is a subset of a stream which adequately represents the entire stream. The answers to the queries on the sample can be considered as though they are the answers to the queries on the whole stream.
- Let us illustrate the concept of sampling with the example of a stream of search engine queries. A search engine may be interested in learning the behaviour of its users to provide more personalized search results or for showing relevant advertisements. Each search query can be considered as a tuple having the following three components :
(user, query, time)
Obtaining a representative sample
- The first step is to decide what will constitute the sample for the problem in hand. For instance, in the search query stream we have the following two options :
  o Take a sample of the individual search queries, or
  o Take a sample of users and include all the queries of the selected users.
- Option number 2 as a sample is statistically a better representation of the search query stream for answering the queries related to the behaviour of the search engine users.
- The next step is to decide what will be the size of the sample compared to the overall stream size. Here we will assume a sample size of 1/10th of the stream elements.
- When a new user comes into the system, a random integer between 0 and 9 is generated. If the number is 0, then the user's search queries are added to the sample. If the number is greater than 0, then the user's search queries are not added to the sample. A list of such users is maintained which shows which user's search queries are to be included in the sample.
- When a new search query comes in the stream from any existing user, then the list of users is consulted to ascertain whether the search query from the user is to be included in the sample or not.
- For this method to work efficiently the list of users has to be kept in the main memory, because otherwise disk accesses will be necessary to fetch the list, which is a time-consuming task.
- But as the list of users grows it will become a problem to accommodate it into the main memory. One solution to this problem is to use a hash function as the random number generator. The hash function will map a user to a number between 0 and 9. If a user hashes to 0 then the search query of the user is added to the sample, and otherwise it is not added.
- In general, we can create a sample of size of any rational fraction a/b by using a hash function which maps a user to a number between 0 and b-1, and the user's query is added to the sample if the hash value is less than a.
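The hash-based a/b sampling just described can be sketched in a few lines of Python. The use of md5 as the bucketing hash is an illustrative choice, not something the text prescribes; any hash that spreads users evenly over the b buckets works.

```python
import hashlib

def in_sample(user: str, a: int, b: int) -> bool:
    """Hash the user id into one of b buckets; keep the user's queries
    iff the bucket number is less than a, giving an a/b sample of users."""
    bucket = int(hashlib.md5(user.encode()).hexdigest(), 16) % b
    return bucket < a

# every query of a selected user enters the sample; others are dropped
stream = [("alice", "q1"), ("bob", "q2"), ("alice", "q3")]
sample = [(u, q) for (u, q) in stream if in_sample(u, 1, 10)]
```

Because the decision is a function of the user alone, all queries of a selected user land in the sample, which is exactly the property the representative sample needs.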
The general sampling problem
- Each tuple in the stream consists of n components, out of which a subset of components is called the key, on which the sample is based.
- For instance, in the search query example the key consists of only one component, user, out of the three components user, query and time. But it is not always necessary to consider only user as the key; we could even make query the key, or even the pair (user, query) the key.
o The criteria involve the lookup of set membership. In this case the filtering becomes harder if the set is huge and cannot be stored in the main memory.
- Bloom filtering is a filtering technique which is used for eliminating or rejecting most of the tuples which do not satisfy the criteria.
An example of filtering
- Let us consider the example of spam email filtering. Let S be the set of safe and trusted email addresses. Assume that the size of S is one billion email addresses and the stream consists of the pairs (email address, email message).
- The set S cannot be accommodated in main memory because on average an email address is of minimum 20 bytes in size. So, to test for set membership in S, it becomes necessary to perform disk accesses. But as discussed earlier a disk access is many times slower than a main memory access.
- We can do the spam filtering using only the main memory and no disk accesses with the help of Bloom filtering. In this technique the main memory is used as a bit array.
- Say we have 1 GB of main memory available for the filtering task. Since each byte consists of 8 bits, 1 GB of memory contains 8 billion bits. This means we have a bit array of 8 billion locations.
- We now need a hash function h which will map each email address in S to one of the 8 billion locations in the array. All the bits to which some email address in S hashes are set to 1.
- As there are 1 billion email addresses in S and 8 billion bits in main memory, approximately 1/8th of the total available bits will be set to 1. The exact count of bits that are set to 1 will be less than 1/8th because more than one email address may hash to the same bit location.
- When a new stream element arrives, we simply need to hash it and check the contents of the bit location to which it is hashed. If the bit is 1, then the email is from a safe and trusted sender. On the other hand, if the bit is 0, then the email is a spam.
The components of a Bloom filter :
1. An array of n bits initialized to 0's.
2. A set H of k hash functions h1, h2, ..., hk, each of which maps a key to one of the n bit locations.
3. A set S consisting of m number of keys.
Fig. 5.3.1 illustrates the block diagram of Bloom filtering.
Fig. 5.3.1 : Bloom filtering (an input data stream is filtered down to the elements whose keys are in S)
- Initially all the bit locations in the array are set to 0. Each key K is taken from S and, one by one, all of the k hash functions in H are applied on this key K. All bit locations produced by hi(K) are set to 1.
Analysis of Bloom filtering
- The Bloom filter suffers from the problem of false positives. This means even if a key is not a member of S, there is a chance that it might get accepted by the Bloom filter.
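A minimal in-memory Bloom filter along these lines can be sketched as follows. Deriving the k hash functions by salting md5 is an illustrative choice only; the text does not prescribe particular hash functions.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: an n-bit array and k hash functions."""
    def __init__(self, n_bits: int, k: int):
        self.n, self.k = n_bits, k
        self.bits = bytearray((n_bits + 7) // 8)  # n bits, initially all 0

    def _locations(self, key: str):
        # derive k hash functions from one base hash by salting with i
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.n

    def add(self, key: str):
        for loc in self._locations(key):
            self.bits[loc // 8] |= 1 << (loc % 8)

    def might_contain(self, key: str) -> bool:
        # False means definitely not in S; True may be a false positive
        return all(self.bits[l // 8] >> (l % 8) & 1
                   for l in self._locations(key))
```

Every key that was added is always accepted; a key that was not added is rejected unless all k of its bit locations happen to be set, which is exactly the false-positive case discussed above.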
The major components of this algorithm are :
- Hash function : h(x) = 3x + 1 mod 32
- Applying h to the stream elements 5, 3, 9, 2, 7, 11 :
  h(5) = 3(5) + 1 mod 32 = 16 mod 32 = 16 = 10000
  h(3) = 3(3) + 1 mod 32 = 10 mod 32 = 10 = 01010
  h(9) = 3(9) + 1 mod 32 = 28 mod 32 = 28 = 11100
  h(2) = 3(2) + 1 mod 32 = 7 mod 32 = 7 = 00111
  h(7) = 3(7) + 1 mod 32 = 22 mod 32 = 22 = 10110
  h(11) = 3(11) + 1 mod 32 = 34 mod 32 = 2 = 00010
- Tail lengths : {4, 1, 2, 0, 1, 1}
- The largest tail length is R = 4, so the estimate of the number of distinct elements is 2^R = 2^4 = 16.
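The worked example above can be checked with a few lines of Python. The hash h(x) = 3x + 1 mod 32 and the stream 5, 3, 9, 2, 7, 11 are taken from the example; the 5-bit width matches mod 32.

```python
def tail_length(v: int, n_bits: int = 5) -> int:
    """Number of trailing zeros in the n_bits-wide binary form of v."""
    if v == 0:
        return n_bits
    t = 0
    while v & 1 == 0:
        v >>= 1
        t += 1
    return t

h = lambda x: (3 * x + 1) % 32
stream = [5, 3, 9, 2, 7, 11]
tails = [tail_length(h(x)) for x in stream]   # [4, 1, 2, 0, 1, 1]
estimate = 2 ** max(tails)                    # Flajolet-Martin estimate 2^R = 16
```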
5.4.3 Combining Estimates
- There are three approaches for combining the estimates from the different hash functions :
  o Average of the estimates, or
  o Median of the estimates, or
  o The combination of the above two.
- If we take the average of the estimates to arrive at the final estimate, then it will be problematic in those cases where one or a few estimates are very large as compared to the rest. Suppose the estimates from the various hash functions are 16, 25, 18, 32, 900, 23. The occurrence of 900 will take the average estimate to the higher side although most of the other estimates are not that high.
- The median is not affected by the problem described above. But a median will always be a power of 2. So, for example, the estimate using a median will jump from 2^8 = 256 to 2^9 = 512, and there cannot be any estimate value in between. So, if the real value of m is, say, 400, then neither 256 nor 512 is a good estimate.
- The solution to this is to use a combination of both the average and the median. The hash functions are divided into small groups. The estimates from the groups are averaged. Then the median of the averages is calculated, which is the final estimate.
- Now even if a large value occurs in a group and makes its average large, the median of the averages will nullify its effect on the final estimate. The group size should be a small multiple of log2 m so that any possible average value is obtained, and this will ensure that we get a close estimate by using a sufficient number of hash functions.
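The grouped-average-then-median rule can be sketched directly, using the six estimates from the example above:

```python
from statistics import mean, median

def combine(estimates, group_size):
    """Average within groups, then take the median of the group averages."""
    groups = [estimates[i:i + group_size]
              for i in range(0, len(estimates), group_size)]
    return median(mean(g) for g in groups)

ests = [16, 25, 18, 32, 900, 23]
# groups: (16, 25), (18, 32), (900, 23) -> averages 20.5, 25, 461.5
# the median of the averages, 25, is not dragged up by the 900 outlier
robust = combine(ests, 2)
```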
- The only data that needs to be stored in the main memory is the largest tail length computed so far by each hash function on the stream elements.
- So, there will be as many tail lengths as the number of hash functions, and each tail length is nothing but an integer value.
- If there is only a single stream, millions of hash functions can be used on it. But a million hash functions are far more than what is necessary to arrive at a close estimate.
- Only when there are multiple streams to be processed simultaneously do we have to limit the number of hash functions per stream. Even in this case the time complexity of calculating the hash values is a bigger concern than the space constraint.
5.5 Counting Frequent Items in a Stream
o A stream has no end while every file ends at some point.
o The time of arrival of a stream element cannot be predicted in advance, while the data in a file is already available.
- Moreover, the frequent items in a stream at some point of time may be different from the frequent items in the same stream at some other point of time.
- To continue our discussion, we need to understand the concept of an itemset in the market-basket model of data. We have two types of objects : one is items and the other is baskets. The set of items in a basket is known as an itemset.
- In the next section we consider some of the sampling methods available for counting the frequent items in a stream. We will consider the stream elements as baskets of items.
- After the completion of the first iteration we can run another iteration of the frequent-itemsets algorithm with :
  o A new file of baskets, or
- To solve the second issue, only those itemsets are scored whose immediate proper subsets are all already being scored.
2. Approximate count
- For the exact count approach, we need to store the entire N-bit window in the main memory. Otherwise it will not be possible to compute the exact count of the desired elements. Let us try to understand it with the following arguments.
- Let us suppose instead of storing the N-bit window in main memory, we store an n-bit representation of the N-bit window, where N > n. Since there are 2^N possible windows but only 2^n possible representations, at least two different windows p and q must share the same representation.
- As p is not equal to q, they must differ in at least one bit position. But since both of them have the same representation, the answer to the query of the number of 1's in the last k bits is going to be the same for both p and q, which is clearly wrong for some k.
- Thus, from the above discussion we can conclude that it is necessary to store the entire N-bit window in memory for exact counts.
The two basic components of this algorithm are :
o Timestamps, and
o Buckets.
Storage space requirements
1. A single bucket can be represented using O(log N) bits.
2. The number of buckets is O(log N).
3. Total space required = O(log^2 N).
- Suppose we query the count of 1's in the latest 16 bits. Inside these 16 bits we observe two buckets of size 1, one bucket of size 2, and the oldest bucket, of size 4, which is only partially inside the window.
- For the partially included oldest bucket we count half of its size. Thus the estimate of the number of 1's in the latest 16 bits = (4/2) + 2 + 1 + 1 = 6. But the actual number of 1's is 7.
5.6.4 Decaying Windows
- An exponentially decaying window gives more weight to the recent elements and progressively smaller weight to the older elements.
- This type of window is suitable for answering the queries on the most common recent elements, e.g. the most popular current movies, or the most popular items bought on Flipkart recently, etc.
Fig. 5.6.1 : Fixed length window vs decaying window
- Fig. 5.6.1 shows the difference between a fixed length sliding window and an exponentially decaying window of equal weight. The rectangular box represents the fixed length window.
- When a new element arrives in the stream, the following steps are taken :
Q.4 List and explain various data stream sources.
Q.5 What are stream queries ? Explain the different categories of stream queries.
Q.6 Discuss different issues in data stream processing.
Q.7 What is sampling of data in a stream ? How do we obtain a representative sample ?
Q.8 Explain the general sampling problem. What is the effect on the sample if we vary the sample size ?
Q.9 Explain the filtering process of data streams with a suitable example.
Q.10 What is a Bloom filter ? Explain the Bloom filtering process with a neat diagram.
Q.11 Explain the process of combining the estimates. Also comment on the space requirements.
Finding Similar Items
Syllabus
Distance Measures : Definition of a Distance Measure, Euclidean Distances, Jaccard Distance, Cosine Distance, Edit Distance, Hamming Distance
6.1 Distance Measures
- A set of points is called a space. A space is necessary to define any distance measure. Let x and y be two points in the space; then a distance measure is defined as a function which takes the two points x and y as input, and produces the distance between the two points x and y as output. The distance function is denoted as d(x, y).
- A distance measure must satisfy the following axioms :
1. Non-negativity : d(x, y) >= 0, i.e. the distance can never be negative.
2. Zero distance : The distance between a point and itself is zero, i.e. d(x, x) = 0.
3. Symmetry : d(x, y) = d(y, x).
4. Triangle inequality : d(x, y) <= d(x, z) + d(z, y).
Fig. 6.1.1 : Triangle inequality
- In this section we shall discuss the following distance measures in detail.
Euclidean distance
- The Euclidean distance is the most popular out of all the different distance measures.
- The Euclidean distance is measured on the Euclidean space. If we consider an n-dimensional Euclidean space, then each point in that space is a vector of n real numbers. For example, if we consider the two-dimensional Euclidean space then each point in the space is represented by (x1, x2) where x1 and x2 are real numbers.
- The most familiar Euclidean distance measure is known as the L2-norm, which in the n-dimensional space is defined as :
  d([x1, x2, ..., xn], [y1, y2, ..., yn]) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2 )
- For the two-dimensional space the L2-norm will be :
  d = sqrt( (x1 - y1)^2 + (x2 - y2)^2 )
1. Non-negativity : The Euclidean distance can never be negative as all the sum terms (xi - yi)^2 are squared, and the square of any number, whether positive or negative, is always positive. So the final result will either be zero or a positive number.
2. Zero distance : In case of the Euclidean distance from a point to itself, all the xi's will be equal to the yi's. This in turn will make all the sum terms (xi - yi)^2 equal to zero. So the final result will also be zero.
3. Symmetry : (xi - yi)^2 will always be equal to (yi - xi)^2. So the Euclidean distance is always symmetric.
4. Triangle inequality : In Euclidean space, the length of a side of a triangle is always less than or equal to the sum of the lengths of the other two sides.
- Some other distance measures that are used on the Euclidean space are :
1. Lr-norm, where r is a constant :
   d([x1, ..., xn], [y1, ..., yn]) = ( |x1 - y1|^r + ... + |xn - yn|^r )^(1/r)
   The L1-norm (Manhattan distance) is obtained with r = 1, and the L-infinity-norm, the limit as r grows, is the maximum of |xi - yi| over all i.
Ex. 6.1.1 : Consider the two points x = (10, 4) and y = (6, 7) in the two-dimensional Euclidean space.
  L1-norm = |10 - 6| + |4 - 7|
          = 4 + 3
          = 7
  L-infinity-norm = max(|10 - 6|, |4 - 7|)
          = max(4, 3)
          = 4
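The L1, L2 and L-infinity calculations above can be sketched generically (the function name is illustrative):

```python
def l_r_norm(x, y, r):
    """L_r distance between two equal-length points x and y."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

x, y = (10, 4), (6, 7)
l1 = sum(abs(a - b) for a, b in zip(x, y))    # 4 + 3 = 7
linf = max(abs(a - b) for a, b in zip(x, y))  # max(4, 3) = 4
l2 = l_r_norm(x, y, 2)                        # sqrt(16 + 9) = 5.0
```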
Jaccard distance
- The Jaccard distance between two sets x and y is defined as :
  d(x, y) = 1 - SIM(x, y)
- SIM(x, y) is the Jaccard similarity, which measures the closeness of two sets. The Jaccard similarity is given by the ratio of the size of the intersection and the size of the union of the sets x and y.
- We can verify the distance axioms on the Jaccard distance :
1. Non-negativity : The size of the intersection of two sets can never be more than the size of the union. This means the ratio SIM(x, y) will always be a value less than or equal to 1. Thus d(x, y) will never be negative.
2. Zero distance : If x = y, then x union x = x intersection x = x. In this case SIM(x, y) = |x|/|x| = 1. Hence, d(x, y) = 1 - 1 = 0. In other words the Jaccard distance between a set and itself is zero.
3. Symmetry : As both union as well as intersection are symmetric (x union y = y union x and x intersection y = y intersection x), the Jaccard distance is also symmetric : d(x, y) = d(y, x).
4. Triangle inequality : The Jaccard distance can also be considered as the probability that a random minhash function does not map both the sets x and y to the same value. Then
   P[h(x) != h(y)] <= P[h(x) != h(z)] + P[h(z) != h(y)]
   where h is the random minhash function.
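The Jaccard distance of two finite sets is a one-liner over Python's set operators:

```python
def jaccard_distance(x: set, y: set) -> float:
    """d(x, y) = 1 - |x & y| / |x | y| (Jaccard distance of two sets)."""
    return 1 - len(x & y) / len(x | y)

d = jaccard_distance({1, 2, 3}, {2, 3, 4})  # 1 - 2/4 = 0.5
```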
Cosine distance
- The cosine distance between two vectors x and y is the angle between them, i.e. the angle of rotation from x to y.
Ex. 6.1.2 : Consider the following two vectors in the Euclidean space :
  x = [1, 2, -1], and y = [2, 1, 1].
Calculate the cosine distance between x and y.
Soln. :
Given x = [1, 2, -1] ; y = [2, 1, 1]
(i)   x . y = [1 x 2] + [2 x 1] + [(-1) x 1]
            = 2 + 2 - 1 = 3
(ii)  L2-norm of x = sqrt(1^2 + 2^2 + (-1)^2) = sqrt(6)
      L2-norm of y = sqrt(2^2 + 1^2 + 1^2) = sqrt(6)
(iii) cos(theta) = (x . y) / ((L2-norm of x)(L2-norm of y)) = 3 / (sqrt(6) x sqrt(6)) = 3/6 = 1/2
      theta = cos^-1(1/2) = 60 degrees
- Hence the cosine distance between x and y is 60 degrees.
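Ex. 6.1.2 can be checked numerically with a few lines of Python:

```python
import math

def cosine_distance_deg(x, y):
    """Angle between vectors x and y, in degrees."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return math.degrees(math.acos(dot / (nx * ny)))

theta = cosine_distance_deg([1, 2, -1], [2, 1, 1])  # 60.0
```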
Edit distance
- Consider the two strings :
  x = JKLMN
  y = JLOMNP
- For calculating the Edit distance between x and y we have to convert string x into string y using the edit operations of insertion and deletion.
- At first compare the character sequences in both the strings :
  x = J K L M N
  y = J L O M N P
      1 2 3 4 5 6   (positions)
- The first edit operation is the deletion of the character K from x :
  x = J L M N
- Now the character O has to be inserted at position 3, i.e. after the character L and before the character M :
  x = J L O M N
- After the third and final edit operation (insertion of P at the end) the status of the string is :
  x = J L O M N P
- Thus x has been converted into y using 3 edit operations, so the Edit distance between x and y is 3.
- The longest common subsequence (LCS) of two strings x and y is a subsequence of maximum length which appears in both x and y in the same relative order, but not necessarily contiguously.
- Let us illustrate the concept of finding the Edit distance using the LCS method with the same set of strings as in the previous method :
  x = JKLMN
  y = JLOMNP
  length of string x = 5
  length of string y = 6
  length of LCS = 4   (the LCS is JLMN)
So,
  Edit distance = length of x + length of y - 2 x (length of LCS) = 5 + 6 - 2 x 4 = 3
- We can verify the distance axioms on the Edit distance :
1. Non-negativity : The number of edit operations can never be negative.
2. Zero distance : Only in the case of two identical strings will the Edit distance be zero.
3. Symmetry : The edit distance for converting string x into string y will be the same as for converting string y into string x, as the sequence of insertions and deletions can be reversed.
4. Triangle inequality : The sum of the number of edit operations required for converting string x into string z and then string z into string y can never be less than the number of edit operations required for converting string x directly into string y.
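The LCS formula can be sketched with a simple memoized recursive LCS, which is fine for short strings like the ones in the example:

```python
from functools import lru_cache

def edit_distance_lcs(x: str, y: str) -> int:
    """Insert/delete-only edit distance: len(x) + len(y) - 2 * len(LCS)."""
    @lru_cache(maxsize=None)
    def lcs(i: int, j: int) -> int:
        if i == len(x) or j == len(y):
            return 0
        if x[i] == y[j]:
            return 1 + lcs(i + 1, j + 1)
        return max(lcs(i + 1, j), lcs(i, j + 1))
    return len(x) + len(y) - 2 * lcs(0, 0)

d = edit_distance_lcs("JKLMN", "JLOMNP")  # 5 + 6 - 2*4 = 3
```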
Hamming distance
- Consider the two bit vectors :
  x = 1 0 0 0 1 1
  y = 1 1 1 0 1 0
      1 2 3 4 5 6   (positions)
- The vectors differ at positions 2, 3 and 6. Hence the Hamming distance between x and y is 3.
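The Hamming distance is simply a count of differing positions:

```python
def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

d = hamming("100011", "111010")  # differs at positions 2, 3 and 6 -> 3
```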
Q.2 What is a distance measure ? Explain the different criteria regarding distance measures.
Q.3 What do you mean by Euclidean distance ? Explain with an example.
Q.4 What is Cosine distance ? Explain with a suitable example.
Q.5 Consider the following two vectors in Euclidean space : X = [1, 2, -1] and Y = [2, 1, 1]. Calculate the cosine distance between X and Y.
Q.6 What is Edit distance ? Explain with the classical method.
Q.7 What is Edit distance ? Explain with the Longest Common Subsequence (LCS) method.
Clustering
Syllabus
CURE Algorithm, Stream-Computing, A Stream-Clustering Algorithm, Initializing and Merging Buckets, Answering Queries
CURE Algorithm
- Clustering Using REpresentatives, i.e. CURE, is a very efficient data clustering algorithm, specifically for large databases.
- CURE is robust to outliers.
- The CURE algorithm works well on spherical as well as non-spherical clusters.
- CURE ("An efficient clustering algorithm for large databases") was proposed by Sudipto Guha, Rajeev Rastogi and Kyuseok Shim.
- It prefers a set of scattered points as the representatives of a cluster rather than an all-points or centroid approach.
- CURE uses random sampling and partitioning to speed up clustering.
The steps of CURE are :
1. Make a random sample
2. Make a partitioning of the sample
3. Partially cluster the partitions
4. Eliminate outliers
5. Cluster the partial clusters
6. Label the data in disk
- A centroid-based point 'c' is chosen. All remaining scattered points are shrunk towards the centroid by a fraction of their distance to it.
- Such multiple scattered points help to discover non-spherical clusters, i.e. elongated clusters.
Fig. 7.2.1
- These points are used as representatives of clusters and will be used as the points in the d_min cluster merging approach.
- After each merging, c sample points will be selected from the original representatives of the previous clusters to represent the new cluster.
- Cluster merging is repeated until the required number of clusters is found : at each step the nearest pair of clusters is merged.
Fig. 7.2.2 : Merging the nearest clusters
- Outlier points are generally far fewer than the points inside a cluster.
- Once the random sample gets clustered, multiple representative points from each cluster are used to label the remainder of the data set.
- Clustering based on scattered points, i.e. the CURE approach, is found to be more efficient than the centroid or all-points approach of traditional clustering algorithms.
Pseudo function of CURE (clustering algorithm)
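The pseudo function referred to above is not legible in this scan. As a substitute illustration only, here is a minimal Python sketch of CURE's representative-selection and shrinking step; the farthest-point selection heuristic, the default c = 4 and the shrink fraction alpha = 0.2 are illustrative assumptions, not values taken from the text.

```python
import math

def representatives(cluster, c=4, alpha=0.2):
    """Pick c well-scattered points from a cluster (farthest-point
    heuristic), then shrink each toward the centroid by fraction alpha."""
    n = len(cluster)
    centroid = tuple(sum(coord) / n for coord in zip(*cluster))
    # start with the point farthest from the centroid
    reps = [max(cluster, key=lambda p: math.dist(p, centroid))]
    while len(reps) < min(c, n):
        # next representative: farthest from those chosen so far
        reps.append(max((p for p in cluster if p not in reps),
                        key=lambda p: min(math.dist(p, r) for r in reps)))
    # shrink each representative toward the centroid
    return [tuple(ri + alpha * (ci - ri) for ri, ci in zip(r, centroid))
            for r in reps]
```

With `alpha = 0` the representatives stay scattered; with `alpha = 1` they all collapse to the centroid, recovering the centroid-based behaviour the text contrasts CURE against.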
Stream Computing
- Stream computing is useful in real time systems, like the count of items placed on a conveyor belt.
- IBM announced a stream computing system in 2007, which runs 800 microprocessors and enables software applications to split tasks and rearrange the data into an answer.
- ATI Technologies derives stream computing with low latency, in which the CPU cores work together with Graphical Processors (GPUs) for high performance computation.
Fig. 7.3.1 : Standard streams for input (stdin), output (stdout) and error (stderr)
- A small size 'p' is chosen for a bucket, where p is a power of 2. The timestamp of a bucket is the timestamp of the most recent point of the bucket.
- Clustering of these points is done by a specific strategy. Whichever method is preferred for clustering at the initial stage provides the centroids or clustroids, and these become the record for each cluster.
- Every new point creates a new bucket, where the bucket is time stamped along with the cluster points.
- If any bucket has a timestamp more than N time units prior to the current time, then nothing of that bucket will be in the window, and such a bucket will be dropped.
- If we have created p buckets of the same size, then two of the three oldest buckets get merged. The newly merged bucket nearly doubles in size, and merging may cascade as we need to merge buckets of increasing sizes.
- To merge two consecutive buckets, the size of the new bucket is twice the size of the 2 buckets going to merge. The timestamp of the newly merged bucket is the most recent timestamp from the 2 consecutive buckets. By computing a few parameters the decision of cluster merging is taken.
- Let us take k-means in the Euclidean space. A cluster is represented by its number of points (n) and its centroid (c).
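With this (n, c) record, merging two clusters during bucket merging reduces to simple arithmetic: the counts add, and the new centroid is the point-count-weighted average of the two centroids. A minimal sketch (the function name is illustrative):

```python
def merge_clusters(n1, c1, n2, c2):
    """Merge two cluster records (count, centroid): counts add and the
    new centroid is the weighted average of the two centroids."""
    n = n1 + n2
    c = tuple((n1 * a + n2 * b) / n for a, b in zip(c1, c2))
    return n, c

n, c = merge_clusters(2, (0.0, 0.0), 2, (4.0, 4.0))  # 4 points at (2.0, 2.0)
```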
Q.1 What is a clustering algorithm ?
Q.2 What is CURE ?
Q.3 Write the procedure of the CURE clustering algorithm.
Q.4 What is sampling ? Explain random sampling and partition sampling.
Q.5 Write the pseudo function of the clustering algorithm.
Q.6 Write the procedure for merging clusters in CURE.
Q.7 What is stream computing ?
Q.8 What are stdin, stdout and stderr ?
Q.9 Explain the BDMO algorithm.
Q.10 What is a bucket ? How is it used for clustering ?
Q.11 Explain in brief initializing and merging of buckets.
Link Analysis
Syllabus
PageRank Overview, Efficient Computation of PageRank : PageRank Iteration Using MapReduce, Use of Combiners to Consolidate the Result Vector
8.1 PageRank Overview
- A web-crawler is the web component whose responsibility is to identify and list down the different terms found on every web page encountered by it.
- This listing of different terms will be stored inside a specialized data structure known as an "inverted index".
- An inverted index data structure has a listing of different non-redundant terms, and it issues an individual pointer to all available sources to which a given term is related.
- Every term from the inverted index will be extracted and analyzed for the usage of that term within the web page.
Scanned by CamScanner
4
ig Data Analytics(MU) 82 Link Analysis
Big DS “ ;
fag in aweb
a Percentage Within the given web page According to percentage of usage of terms
“has someusage
Every! term
Scanned by CamScanner
Fig. 8.1.2 : Links between web pages A, B and C
The links that exist between two or more web pages can be categorized as follows :
1. Backlinks
2. Forward links
1. Backlinks
- With reference to Fig. 8.1.2, A and B are the backlinks of web page 'C', i.e. a backlink indicates by how many other web pages a given web page is referred.
2. Forward links
- A forward link represents how many web pages are referred to by a given web page.
- Clearly, out of these two types of links, backlinks are very important from the ranking-of-documents perspective.
- A web page which contains a large number of backlinks is said to be an important web page and will get an upper position in ranking.
  R(u) = c * sum over (v in B_u) of R(v) / N_v
Where,
  B_u : Represents the set of web pages which point to page u (its backlinks).
  N_v : Represents the number of forward links of page v.
  c : Represents the normalization factor to make the total rank of all pages constant.
- The world wide web can be considered as a 'di-graph', i.e. a directed graph. Any graph 'G' is composed of two fundamental components : vertices and edges.
- Here, the vertices or nodes can be mapped to pages.
- If we consider a small part of the world wide web containing 4 web pages named as P1, P2, P3, P4 :
  o Every page has backlinks and forward links to the other pages.
  o Fig. 8.1.3 shows the above mentioned structure.
Fig. 8.1.3
Fig. 8.1.4
o The probability that the user will be at page P1 / P3 / P4 is equal to 1/3.
o The probability that the user will be at page P2 itself is '0'.
o Suppose the user has chosen page P2, then :
o The probability that the user will be at page P1 is 1/2.
o The probability that the user will be at page P4 is 1/2.
- These possibilities of web surfing by a user can be represented using a special structure known as a "Transition Matrix".
- In general, the transition matrix for 'n' pages is composed of 'n' rows and 'n' columns. Two indices i and j will be used to represent the current row and column.
          A    B    C    D
      A [ 0   1/2   1    0  ]
  M = B [ 1/3  0    0   1/2 ]
      C [ 1/3  0    0   1/2 ]
      D [ 1/3 1/2   0    0  ]
- The matrix should be read column wise : column j holds the probabilities of moving from page j to each page i.
Example 2
Fig. 8.1.5
          X    Y    Z
      X [ 1/2  1/2  0 ]
  M = Y [ 1/2   0   1 ]
      Z [  0   1/2  0 ]
- The initial rank vector distributes the rank equally : [x, y, z] = [1/3, 1/3, 1/3].
- For the first iteration : M [1/3, 1/3, 1/3] = [1/3, 1/2, 1/6].
- For the second iteration : M [1/3, 1/2, 1/6] = [5/12, 1/3, 1/4].
- Hence, with the simplified PageRank algorithm a critical problem has evolved : during each iteration, a loop in the graph accumulates rank but never distributes rank to other pages.
- To identify the location at which the user will be in the near future, one must have a probability distribution given by a specialized function known as "PageRank".
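The two iterations of Example 2 can be reproduced with a few lines of Python; exact fractions are used so the results match the hand calculation.

```python
from fractions import Fraction as F

# column-stochastic transition matrix M from Example 2 (Fig. 8.1.5)
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(1)],
     [F(0),    F(1, 2), F(0)]]

def step(M, v):
    """One power-iteration step: v' = M v."""
    n = len(v)
    return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

v0 = [F(1, 3)] * 3   # initial rank vector, equal shares
v1 = step(M, v0)     # [1/3, 1/2, 1/6]
v2 = step(M, v1)     # [5/12, 1/3, 1/4]
```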
- Consider a vector V0 with each component equal to 1/n. The probability that the user will be at any given page is then 1/n.
- The next estimate of the rank vector is V1 = M V0, where :
(i) m_ij represents the probability of user movement at a given instance from the j-th location to the i-th location.
(ii) V_j represents the probability that the user is at the j-th position at the previous instance.
- In practice any given web structure is composed of 4 types of components :
1. Strongly connected components (SCC)
2. In-components
3. Out-components
4. Disconnected components
1. A strongly connected component is nothing but a set of components which are directly connected to each other for data exchange, and they also have forward and backward links to each other.
2. In-components : In-components are an integral part of the web which exhibit a relation with the SCC such that the SCC can be reached from them, but they are not reachable from the SCC.
Fig. 8.1.7 : In-components (not reachable from the SCC)
3. Out-components : Out-components are the structures which show the following properties : they are reachable from the SCC, but the SCC is not reachable from them.
Fig. 8.1.8 : Out-components (reachable from the SCC)
- The in-components and out-components can have tendrils, which represent in and out components hanging off them.
Fig. 8.1.9 : Tendrils
- The property of having the sum of each column equal to 1 in a given transition matrix is known as 'stochasticity'. If there are dead ends then some of the columns have all '0' entries, and the matrix is no longer stochastic.
- Consider Fig. 8.1.10.
Following are the ways to deal with dead ends :
- The first approach to deal with dead ends is to delete such nodes by removing their incoming links.
- The disadvantage of this approach is that it may introduce more dead ends, which have to be handled with the same approach in a recursive manner.
- Though we delete a node, the total page rank for the given graph or web will be kept as it is. For the nodes which are no longer available in graph G, we can consider the set of other nodes which act as their predecessors for the calculation of page rank.
- After some iterations all remaining nodes have their page ranks; the ranks of deleted nodes are then calculated from their predecessors, in the reverse of the node deletion order.
- Suppose we have a graph containing 5 nodes, and these nodes are arranged in the manner shown in Fig. 8.1.11.
Fig. 8.1.11
- If we observe Fig. 8.1.11 to calculate the page rank, we find that Node 5 is the dead end as it doesn't have any forward links, i.e. links going out from Node 5.
- So, to avoid the dead end, delete Node 5 and its corresponding arc coming from Node 3. So now the graph G becomes :
- The transition matrix for the above graph (on the remaining nodes 1, 2 and 4) will be :
        [  0   1/2   0 ]
  M =   [ 1/2   0    1 ]
        [ 1/2  1/2   0 ]
- Starting from the component vector [1/3, 1/3, 1/3], the first iteration gives :
  Iteration (1) : [1/6, 1/2, 1/3]
- We have to calculate the page ranks for Node 3 and Node 5 in the exact opposite order of node deletion. Here Node 1, Node 2 and Node 4 are in the role of predecessors.
- The number of successors of Node 1 = 3. Hence the contribution from Node 1 for calculating the page rank of Node 3 is 1/3 of the rank of Node 1. A predecessor with 2 successors similarly contributes 1/2 of its rank.
- Page rank of Node 3 = the sum of such contributions from all of its predecessors.
- For calculating the page rank of Node 5, Node 3 plays a crucial role. As Node 3 has number of successors = 1 and Node 5 has Node 3 as its predecessor, we can conclude that Node 5 has the same page rank as that of Node 3.
- As the aggregate of their page ranks is greater than 1, it doesn't indicate the true distribution for a given user who is surfing through those web pages. Still, it highlights the relative importance of the web pages.
- Another way to deal with dead ends is to configure the process so that the given user is assumed to move through the web with an occasional random jump; this configuration is known as "taxation".
- The taxation method points to another problem as well, which is known as "spider traps".
- i.e. Spider trap = a set of web pages with no dead ends, but also no edge going outside the set (no forward link).
- Spider traps can be sowed in the web with or without intention. There can be multiple spider traps in real time in a given set of web pages, but for demonstration purposes consider Fig. 8.1.14, which shows a part of the web containing only one spider trap.
    M = | 0    1/2  0  0   |
        | 1/3  0    0  1/2 |
        | 1/3  0    1  1/2 |
        | 1/3  1/2  0  0   |
- If we proceed further by the same method stated in the previous section for calculating the page rank, the ultimate results that we get will be :

    [1/4, 1/4, 1/4, 1/4]            ... (1) Iteration (1)

    [3/24, 5/24, 11/24, 5/24]       ... (2) Iteration (2)

    [5/48, 7/48, 29/48, 7/48]       ... (3) Iteration (3)
    [21/288, 31/288, 205/288, 31/288]   ... (4) Iteration (4)

and, in the limit, the vector approaches [0, 0, 1, 0].

- The highest page rank is given to Node 3, as there is no link which goes out from it, but it has links coming into it. So a user is going to get stuck at Node 3. Since the model then says that all users end up at Node 3, Node 3 gets a disproportionately high importance.
- To have a remedy for this problem, we configure the method of calculating the page rank by injecting a new concept known as "teleporting", or more specifically a probability distribution for teleporting, so that we are not always following the links going out from a given node.
- To calculate with the teleporting probability, we compute a new component vector V_new for estimating the page ranks such that V_new = β·M·V + (1 - β)·e/n.
- For example, the most popular search engine, Google, has 250+ such predefined criteria to arrange the fetched web pages in some particular order.
- Every page on the web should possess at least one word or one phrase of the user's search query.
- If the given web page doesn't contain any such word or phrase, then there is less probability that the page will have the highest page rank.
- In the page-ranking calculation, the place on the web page where the phrase appears also matters, e.g. a phrase appearing in the header will have more importance than the phrase appearing in the footer, while a phrase in the body has average importance.
- In the previous sections we have studied how to calculate the page rank of a given web page in a given web structure.
- The efficiency of such a complex calculation was easy to achieve because we had taken only a small part of the web, i.e. 4-5 nodes or pages.
- But if we scale this small concept out to the real-time condition of billions of web pages, we have to compute a matrix-vector multiplication at least 70-80 times, until the component vector stops changing its value.
- For such real-time complexity, the proposed solution is the use of the MapReduce technique studied in Section 3.2, but such usage is not that straightforward; it has two hurdles to cross :
(i) The most important question is how to represent the transition matrix for such a huge number of web pages. If we try to represent the full matrix for all the web pages under consideration, it is absolutely inefficient for performing the calculations. One way to handle this situation is to store the non-zero elements only.
(ii) One more thing : if we go for an alternative within the MapReduce functionality for performance and efficiency concerns, we may think of 'combiners', explained in Section 3.2.4.
- Combiners are generally used to minimize the data, more specifically the intermediate results, to be transferred to the reducer task.
Fig. 8.2.1 : Use of combiners (key-value pairs (k1, v1), (k2, v2), ... are merged before reaching the reducer)
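As a sketch of the idea (the mapper and the key names here are illustrative, not part of any particular Hadoop API), a combiner pre-aggregates a mapper's (key, value) pairs before they travel to the reducer:

```python
from collections import defaultdict

def mapper(page, rank, out_links):
    """Map task: emit an equal share of this page's rank to each successor."""
    for dest in out_links:
        yield dest, rank / len(out_links)

def combiner(pairs):
    """Combiner: pre-sum the values per key so less data reaches the reducer."""
    acc = defaultdict(float)
    for key, value in pairs:
        acc[key] += value
    return dict(acc)

# One mapper's output for a page with rank 0.3 whose out-links hit
# destination "a" twice and "b" once:
pairs = list(mapper("p1", 0.3, ["a", "b", "a"]))
combined = combiner(pairs)   # {"a": 0.2, "b": 0.1}
```

The reducer then receives one pre-summed pair per key instead of the full stream of intermediate pairs.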
- All computations are performed by the CPU itself, irrespective of the environment used, i.e. a distributed computing environment (DCE) or standalone computing. Hence, the main task of the processor is to execute the instructions, and not the fetching of data from secondary storage.
- If, in some commonly occurring situation, the processor/CPU is busy just fetching data from secondary storage rather than executing instructions, then such a situation is known as "thrashing".
- Also, the striping concept doesn't have much impact in reducing the effect of thrashing.
8.2.1 Representation of the Transition Matrix

- As we know, the number of web pages that we are going to deal with is in the billions, and the number of links going out from a given web page is 10 on average.
- So the fraction of entries in the matrix that are '1' is, across billions of pages, close to zero. The best way to represent the transition matrix is to have, for each web page, a list of its non-zero entries with their associated values.
- The structure will look like :

Fig. 8.2.2 : One entry : 4 bytes + 4 bytes + 8 bytes = 16 bytes

- So, the space required here has a linear nature instead of quadratic.
- We can apply more compression by a column-wise representation of the non-zero entries, whose common value is 1/(number of links going out from the given web page).
- A column is represented as :
  o  one integer to represent the out-degree, and
  o  one integer for every non-zero entry in that column, which yields the row number of the entry's location.
- A single pass of the page-rank calculation includes the calculation of two component vectors, depicted by V and V_new :

    V_new = β·M·V + (1 - β)·e/n

  Where, β = constant (ranges between 0.8 and 0.9)
         e = component vector with all entries 1
         M = transition matrix

- When 'n' has a small value, then V and V_new can be stored in primary (main) memory for the Map task.
- If, in real time, V is so big that it can't fit into main memory, then we can go for the striping method.
- Hence, depending on the requirements of the situation and the complexity of the problem, the method to be used should be decided.
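The formula can be sketched directly on the four-node spider-trap example above (β = 0.85 is assumed here):

```python
# Iterate V_new = beta*M*V + (1-beta)*e/n on the 4-node spider-trap example.
M = [
    [0.0, 1/2, 0.0, 0.0],
    [1/3, 0.0, 0.0, 1/2],
    [1/3, 0.0, 1.0, 1/2],
    [1/3, 1/2, 0.0, 0.0],
]

def pagerank(M, beta=0.85, iters=60):
    n = len(M)
    v = [1 / n] * n
    for _ in range(iters):
        v = [beta * sum(M[i][j] * v[j] for j in range(n)) + (1 - beta) / n
             for i in range(n)]
    return v

v = pagerank(M)
# With taxation, the spider trap (node 3) keeps the largest rank but
# no longer absorbs all of it.
```

The rank vector still sums to 1, and the trapped node's share stays bounded away from 1.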
- When the question of manipulating page rank arises, the giants in information technology, such as Google, develop some solution to it. Additionally, there are some security-related issues, such as "the spams".
- We do have destructive-minded people in society who will always try to affect the system by performing some malicious activities. Hence, for page-ranking calculations, "spammers" came into existence.
- Spammers have introduced tools and techniques through which the page rank of a selected page can be increased; such an intentional rise in the value of a page rank is known as link spam. For link spam, spammers introduce web pages themselves for link spamming.
- The malicious web pages introduced by the spammers are known as a spam farm. Fig. 8.3.1 shows the basic architecture of a spam farm.

Fig. 8.3.1 : Spam farm architecture (non-sensitive pages, sensitive pages, and the targeted web-page farm for link spam)
- If we consider the spammer's perspective, then Fig. 8.3.1 can be divided into 3 basic blocks :
1. Non-sensitive  2. Sensitive  3. Spam farm

1. Non-sensitive

These are the pages which are generally not accessible to the spammer for any spamming activity. As these pages are not accessible to spammers, they will not be affected by any activity performed by the spammer.

2. Sensitive

These are the web pages which are generally accessible to the spammers for any spamming-related activity. As these pages are accessible to the spammers, they will get affected easily by the spamming activity performed by the spammer. The effect of spamming on these pages is generally indirect, as these pages are not manipulated by the spammers directly.

3. Spam farm

The spam farm is the collection of malicious web pages which are used to increase the number of links pointed to and coming out from a given web page, so that ultimately the page rank of the given page, i.e. the target web page, will increase dramatically. There are other-category web pages which support the spamming activity by aggregating page rank, i.e. a part of the term (1 - β).

- The term β depicts how a part of the page rank is segregated among the successor nodes for the next iteration. Actually, β is a constant term ranging between 0.8 and 0.9 (generally 0.85).
- We know that there are some web pages which support the spamming activity.
- So, the page rank of one such supporting page can be calculated with the help of the following formula :

    Pr(s) = β·y/m + (1 - β)/n

  Where, y = page rank of the target page t, m = number of supporting pages, and n = total number of pages in the web.
- The page rank y of the target page t is then the sum of :
(i)  its contribution x from links outside the spam farm, and
(ii) the page rank of each supporting page multiplied by β, i.e. β·Pr(s) for each of the m supporting pages, where Pr(s) = β·y/m + (1 - β)/n.
- We can conclude that the page rank y of the target web page t will be of the form :

    y = x + β·m·(β·y/m + (1 - β)/n)
      = x + β²·y + β·(1 - β)·m/n

- Solving for y, we can introduce a constant c :

    y = x/(1 - β²) + c·(m/n),   where c = β/(1 + β)
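The closed form can be checked numerically; β = 0.85 is assumed here, and x, m, n are the quantities defined above:

```python
# Closed-form rank of a spam-farm target page:
# y = x/(1 - beta^2) + c*(m/n), with c = beta/(1 + beta).
def target_rank(x, m, n, beta=0.85):
    c = beta / (1 + beta)
    return x / (1 - beta**2) + c * m / n

# With beta = 0.85 the external contribution x is amplified by
# 1/(1 - 0.85^2), roughly 3.6, and each supporting page adds c/n to y.
amplification = 1 / (1 - 0.85**2)
```

This is why even a modest external rank x, multiplied by ~3.6 and padded with m supporting pages, gives the spammer a disproportionate target rank.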
8.3.3 Dealing with Link Spam

- As we have seen, the effect of link spam is on the fundamental, primary things related to the page-rank system : link spamming will disturb the page-rank system completely. Hence, to deal with link spamming, the different search engines thought of different solutions which will help in minimizing the effect of link spam.
- Basically, there are two ways to deal with link spamming. They are as follows :
(i)  A traditional approach : the search-engine algorithm will find such link spams and eventually delete them from the indexing structure. But as soon as the algorithm deletes the spammer's web page, the spammer will find an alternate way to do the link spamming.
(ii) Trust ranking : modify the procedure for calculation of page rank with reference to a set of trusted pages, i.e. pages which the system is confident are not part of a spam farm.
- Such a set of trusted web pages is termed a "topic".
- Consider a spam-farm page that wants to increase the page rank of a trusted web page : the spam page can have a link to the trusted page, but that trusted page will not establish a link back to the spam page.
(iii) Spam mass

- In the spam-mass technique, the page-ranking algorithm calculates the page rank for every web page, and the part of the page rank (the affected part) whose contributor is a spam page is analysed. This analysis is done with the help of a comparison between the normal page rank and the page rank obtained through the trust-ranking mechanism.
- This comparison can be achieved through the following formula :

    SM(p) = (P_r - P_t) / P_r

  Where, SM(p) = spam mass of page p
         P_r   = page rank by the traditional method
         P_t   = page rank by the trust-ranking method

- If SM(p) < 0, i.e. negative, or SM(p) > 0 but not close to 1, then that page is not a spam page; else it is a spam page.
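A sketch of this test (the sample rank values here are made up for illustration, and the "close to 1" threshold is left to the implementer):

```python
def spam_mass(p_rank, t_rank):
    """Fraction of a page's rank that cannot be explained by trusted pages."""
    return (p_rank - t_rank) / p_rank

# A page whose trust rank accounts for nearly all of its page rank is clean;
# a page whose rank comes almost entirely from elsewhere looks like spam.
clean = spam_mass(0.002, 0.0019)    # close to 0 -> not spam
spammy = spam_mass(0.002, 0.0001)   # close to 1 -> likely spam
```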
8.4 Hubs and Authorities

- The hubs-and-authorities concept is an extension of the concept of page ranking. Hubs and authorities add more preciseness to the existing page-rank mechanism.
- The ordinary, traditional page-rank algorithm calculates the page rank for all the web pages available in a given web structure. But the user doesn't want to examine or view all of these web pages; he/she just wants the first 20 to 50 pages in an average case.
- Hence, the idea of hubs and authorities came into existence, to gain efficiency and to reduce the workload of calculating page ranks.
- In hubs and authorities, the page rank is calculated only for those web pages which are fetched in the resultant set of web pages for a given search query.
- It is also known as hyperlink-induced topic search, abbreviated HITS.
- Traditional page-rank calculations have a single view of a given web page, but the hubs-and-authorities algorithm has two different shades of view for a given web page :
1. Some web pages have importance because they present significant information on a given topic; these web pages are known as authorities.
2. Some web pages have importance because they give us information on any randomly selected topic by directing us to other web pages where we can collect more information about the same. Such web pages are known as hubs.
8.4.1 Normalizing Hubs and Authorities

- As stated in the earlier section, hubbiness and authority are the two shades with which a web page can be viewed.
- So, we can allot 2 types of scores for a given web page :

Fig. 8.4.1 : Hubbiness score and authority score

- h represents the hubbiness score.
- a represents the authority score.
- The j-th component of vector h gives a measure of the hubbiness of the j-th page.
- The j-th component of vector a gives a measure of the authority of the j-th page.
- To have the notion of h and a, consider the link matrix LM for the web pages in a given web.
- Any element of LM can be represented as LMij, where LMij = 1 if a link is established from the i-th page to the j-th page, and LMij = 0 otherwise.
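A sketch of the h/a iteration using a tiny 3-page link matrix (the matrix itself, and the scaling rule of dividing by the largest component, are illustrative assumptions consistent with the normalization described above):

```python
# HITS sketch: a = LM^T * h, h = LM * a, scaling each vector so its
# largest component is 1 after every step.
LM = [            # LM[i][j] = 1 if page i links to page j
    [0, 1, 1],
    [1, 0, 1],
    [0, 1, 0],
]

def hits(LM, iters=50):
    n = len(LM)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iters):
        a = [sum(LM[i][j] * h[i] for i in range(n)) for j in range(n)]
        a = [x / max(a) for x in a]
        h = [sum(LM[i][j] * a[j] for j in range(n)) for i in range(n)]
        h = [x / max(h) for x in h]
    return h, a

h, a = hits(LM)
```

Page 2 ends up the strongest authority (two good hubs point at it), while page 0 ends up the strongest hub (it points at both strong authorities).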
- In the worked example, given a 0/1 link matrix A for a small web, the hub and authority score vectors are computed iteratively, and after each step the vector is normalized by dividing every component by the square root of the sum of the squares of its components.
Q. 4  What are links in page ranking ? Explain back links and forward links with suitable examples.
Q. 8  Explain the structure of the web. Explain spider traps in detail.
Q. 9  Explain the role of page ranking in a search engine.
Q. 10 Explain the different modifications suggested for efficient computation of page ranks of web pages.
Q. 13 What is link spam ? Explain in detail.
Q. 14 Explain spam farm architecture in detail.
Q. 15 What is a spam farm ? Explain with a neat diagram. Also comment on non-sensitive, sensitive and spam-farm pages.
Q. 16 Explain spam farm analysis in detail.
Q. 17 What is link spam ? How to deal with link spam using trust ranking and spam mass ?
Q. 18 What are hubs and authorities ? Explain their significance.
Recommendation Systems

Module - 6

9.1 Introduction
- It is vast and widely used nowadays. It is essentially a subclass of information-filtering systems. It is used to give recommendations for books, games, news, movies, music, research articles, social tags, etc.
- It is also useful for experts, financial services, life insurance, and social media like Twitter, etc.
- Collaborative filtering and content-based filtering are the two approaches used by recommendation systems.
- Collaborative filtering uses a user's past behaviour and applies some prediction about what the user may like, and accordingly presents data.
- Content-based filtering uses similar properties of the data preferred by the user.
- By using collaborative filtering and content-based filtering together, a combined approach is developed, i.e. the hybrid recommendation system.
- A recommendation system infers preferences from a utility matrix. Users and items are the entities used by a recommendation system.
- Users have preferences for items, and these preferences must be observed.
- Every data item is part of the utility matrix, as it belongs to some item category.
Example : A table representing users' ratings of apps on a scale of 1 to 5, with 5 as the highest rating. A blank represents that the user has not rated the app. A1, A2, A3 stand for Android apps 1, 2, 3 and i1, i2, i3 for iOS apps 1, 2, 3; users A, B and C give the ratings.

          A1   A2   A3   i1   i2   i3
    A      3    4              5    4
    B           5
    C      3    4    4              4
- Typical users' ratings are a very minute fraction of the real scenario if we consider the actual number of applications on the Android and iOS platforms and the number of users.
- It is observed in the table that for some apps there is a smaller number of responses.
- The goal behind the utility matrix is to make some predictions for the blank spaces; these predictions are useful in a recommendation system.
- As user 'A' gives a rating of 5 to the i2 app, we have to take into account the parameters of app i2, like its GUI, memory consumption, usability, music/effects if applicable, etc.
- Similarly, user 'B' gives a rating of 5 to the A2 app, so we have to take similar parameters into consideration. By judging both apps i2 and A2, their features and all, we can make a prediction of what can further be recommended to users A and B.
- From user 'C''s responses, though there is no full set of ratings anywhere, it can still be judged and predicted what kind of feature-based app user 'C' should be recommended.
9.1.2 Applications of Recommendation Systems

- Amazon.com
- CDNOW.com
- Quikr.com
- OLX.com
- Drugstore.com
- eBay.com
- Moviefinder.com
- Reel.com, and so many online goods seller/buyer and trading websites use recommendation systems.
- Product recommendations, movie recommendations, news articles, etc. are likely to be consolidated in a single place by such applications.

9.1.3 Taxonomy for Application Recommendation Systems

Fig. : Community inputs (history, attributes)
9.2 Content Based Recommendation
- It focuses on items and user profiles in the form of weighted lists. Profiles are helpful to discover the properties of items.
- For example, the year in which a song's album was released : a few viewers prefer old songs, some prefer only the latest songs, so users sort songs based on the year.
- A few domains have common features. For example, a college and a movie : the one has sets of students and professors, the other sets of actors and directors, respectively. A certain ratio is maintained, as there are many students and few professors, while many actors work under the guidance of one or two directors. Again, every college and movie has year-wise datasets, as a movie is released in a year by a director and actors, and a college has passing students every year, etc.
- Music (a song album) and a book have the same kinds of feature values, like the songwriter/poet and year of release, and the author and publication year, respectively.

Fig. 9.2.1 : Recommendation system parameters (products with features, community data, and users as the source of profile and contextual data, producing a list of recommendations)
- Let's say news articles : there are many articles and many kinds of documents in a newspaper, but a user reads very few of them. A recommendation system suggests articles to a user who is supposed to be interested in reading them, based on the user's past behaviour.
- Similarly, there are so many websites and blogs; in the same way, blogs could be recommended.
9.2.5 User Profiles
- Vectors are useful to describe items and users' preferences. Users' and items' relations can be plotted with the help of the utility matrix.
- Example : Consider a similar case as before, but the utility matrix has some non-blank entries that are ratings in the 1-5 range.
- Consider user U, who gives responses with an average rating of 3. There are three applications (Android-OS-based games) that got ratings of 3, 4 and 5. Then in the user profile of U, the components for these applications will have the values 3-3, 4-3 and 5-3, i.e. 0, 1 and 2, whose average is 1.
- Between a user's vector and an item's vector, the cosine distance can be computed with the help of the profile vectors for both users and items.
- It is helpful to estimate the degree to which a user will prefer an item (i.e. a prediction for recommendation).
- If the cosine of the angle between the user's and the item's vectors is a large positive fraction, it means the angle is close to 0, and hence there is a very small cosine distance between the vectors.
- If the cosine is a large negative fraction, the angle is close to 180 degrees, which is the maximum possible cosine distance.
(a) Euclidean distance metric
(b) Probabilistic methods, e.g. Naive Bayes
9.3 Collaborative Filtering

- Recommendation systems based on collaborative filtering are becoming interesting, as a few domains are used more by research scholars and academicians, like human-computer interaction, information-retrieval systems and machine learning.
- A few famous recommender systems exist in popular fields, like Ringo (music), the Bellcore video recommender (movies), Jester (jokes), etc.
- Collaborative filtering began to be used in the early 1990s. The most widely used example of collaborative filtering and recommendation systems is Amazon.com.
- Recommending from among a large set of values to users is very important. A recommendation must get appreciated by the user, else the effort taken for it was worthless.
- Netflix has a 17,000-movie collection, while Amazon.com has 4,10,000 titles in its collection, so a proper selection for recommendation is necessary.
- The toolbox used for collaborative filtering becomes advanced with the help of Bayesian inference, case-based reasoning methods, and information retrieval.
- Collaborative filtering deals with 'users' and 'items'. A preference given by any user to an item is known as a 'rating', and it is represented by the triplet (User, Item, Rating).
- The ratings are used to create a sparse matrix, and it is referred to as the rating matrix.
- The 'predict task' and 'recommend task' are used for the evaluation and use of a recommendation system.

Table 9.3.1 : Sample rating matrix containing 5-star-scale ratings of apps

            App 1   App 2   App 3
  User A      4       5       3
  User B      4       2       3
  User C              3
- The predict task tells what preference may be given by a user, i.e. what is the user's likely preference for an item?
- The recommend task is helpful to design an n-item list for the user's needs. These n items are not necessarily chosen on the basis of the predicted preference; the criteria to create the recommendation may be different.
9.3.1 Measuring Similarity

            i1   i2   i3   i4   i5   i6
  User A     4    5              1
  User B     5         5    4
  User C          2              4    5
  User D               3

- The above utility-matrix data is quite insufficient to draw reliable conclusions. Considering the values from A and C : they rated two apps in common, but their ratings are diametrically opposite.
- In the Jaccard measure, the sets of items rated are considered, while the values in the matrix are ignored :

    d_J(A, B) = 1 - J(A, B) = (|A ∪ B| - |A ∩ B|) / |A ∪ B|

- Alternatively, the Jaccard distance can be given via the symmetric difference A △ B = (A ∪ B) - (A ∩ B), so that d_J(A, B) = |A △ B| / |A ∪ B|.
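A small sketch of the Jaccard distance on rated-item sets (the item names match the illustrative matrix above):

```python
def jaccard_distance(a, b):
    """1 minus the ratio of shared items to all items rated by either user."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

# Users A and C rate two apps in common out of four rated overall:
rated_by_A = {"i1", "i2", "i5"}
rated_by_C = {"i2", "i5", "i6"}
d = jaccard_distance(rated_by_A, rated_by_C)   # 1 - 2/4 = 0.5
```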
- The cosine of the angle between User A and User B is :

    cos(A, B) = (4 × 5) / (√(4² + 5² + 1²) · √(5² + 5² + 4²)) ≈ 0.380

- The cosine of the angle between User A and User C is :

    cos(A, C) = (5 × 2 + 1 × 4) / (√(4² + 5² + 1²) · √(2² + 4² + 5²)) ≈ 0.322
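The two cosines above can be checked with a short sketch (blanks in the utility matrix are treated as zeros; the vector layout follows the illustrative matrix above):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two rating vectors (blanks treated as 0)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = lambda w: math.sqrt(sum(x * x for x in w))
    return dot / (norm(u) * norm(v))

A = [4, 5, 0, 0, 1, 0]
B = [5, 0, 5, 4, 0, 0]
C = [0, 2, 0, 0, 4, 5]
cos_ab = cosine(A, B)   # ~0.380
cos_ac = cosine(A, C)   # ~0.322
```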
- If we round the data, treating ratings of 3, 4 and 5 as '1' and leaving ratings of 1 and 2 (and blanks) empty, the utility matrix becomes :

            i1   i2   i3   i4   i5   i6
  User A     1    1
  User B     1         1    1
  User C                         1    1
  User D               1
9.3.5 Normalizing Ratings

- In normalization, we subtract each user's average rating from their ratings. A low rating gets converted into a negative number, while a high rating gets converted into a positive one, as it is compared against the average; this is known as rating normalization.
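A minimal sketch of rating normalization (the `user_avg` parameter is an illustrative convenience, matching the earlier example where user U's overall average is 3):

```python
def normalize(ratings, user_avg=None):
    """Subtract the user's average rating; None marks an unrated item."""
    rated = [r for r in ratings if r is not None]
    if user_avg is None:
        user_avg = sum(rated) / len(rated)
    return [None if r is None else r - user_avg for r in ratings]

# User U's overall average rating is 3; three rated apps 3, 4, 5 become 0, 1, 2:
profile = normalize([3, 4, 5], user_avg=3)
```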
9.4 Pros and Cons in Recommendation Systems

9.4.1 Collaborative Filtering

Pros
- No knowledge-engineering efforts are needed.
- Serendipity in results.
- It learns about the market process.

Cons

9.4.2 Content-based Filtering

Pros

Cons
Q. 2 Enlist the applications of recommendation systems and the taxonomy for application recommendation systems.
Q.3 Explain utility matrix with example.
Mining Social Network Graph
Module - 6
Syllabus
Social Networks as Graphs, Clustering of Social-Network Graphs, Direct Discovery of Communities in a social graph
10.1 Introduction
- The social network idea came into theory and research in the 1890s through Ferdinand Tönnies and Émile Durkheim. A social network is bound up with domains like social links and social groups.
- Major work started in the 1930s in various areas like mathematics, anthropology, psychology, etc. J.L. Moreno provided a foundation for social networks with Moreno's sociogram, which represents the social links related to a person.
- Moreno's sociogram example : name the girl with whom you would like to go on the industrial visit tour.
- A sociogram gives the interpersonal relationships among the members participating in a group. A sociogram presents choices in numbers :

    CP = (Number of mutual choices) / (Number of possible mutual choices in the group)

  Where, K_i = degree of node i.
- New nodes give preference to getting attached to heavily linked nodes.

Fig. 10.1.2 : Barabási-Albert algorithm model showing the steps of growth of a network (M0 = M = 2)

- The BA model is used to generate random scale-free networks. Scale-free networks appear in most popular domains, like the Internet, the World Wide Web, citation networks and a few social networks.
- A social network deals with large-scale data. After analyzing this large data, a huge set of information can be obtained.
- LinkedIn and Facebook are vastly, widely used and very popular examples of social networks. We can find friends over the network with 1st, 2nd and 3rd connections or mutual friends (i.e. friends of friends) in LinkedIn and Facebook respectively.
- Google+ is one social network which groups linked nodes in categories like Friends, Family, Acquaintances, Following, Featured on Google+, etc.
- A social network is a huge platform to analyze data and obtain information. Further, we will see efficient algorithms to discover different graph properties.
10.2 Social Networks as Graphs

- In general, a graph is a collection of a set of edges (E) and a set of vertices (V). If an edge exists between any two nodes of the graph, then those nodes are related to each other.
- Graphs are categorized by many parameters, like ordered pairs of nodes and unordered pairs of nodes.
- Some edges have a direction or a weight. The relationships within a graph are explained with the help of an adjacency matrix.
- A small network can easily be managed by constructing a graph; this is quite impossible with a huge/wide network.
- Summary statistics and performance metrics are useful for the design of a graph for a large network.
- Networks and graphs can be elaborated with the help of a few parameters, like the diameter (i.e. the largest distance between any two nodes), centrality, and the degree distribution.
- A social website like Facebook uses an undirected social graph for friends, while a directed graph is used in social websites like Twitter and Google+. Twitter gives connections like 1st, 2nd and 3rd, and Google classifies linked connections into Friends, Family, Acquaintances, Following, etc.
- Consider five people, Amit, Amar, Mahesh, Rahul and Sachin, connected as shown in Fig. 10.2.1.

Fig. 10.2.1

- The degrees of the nodes are : Amit = 2, Amar = 2, Mahesh = 2, Rahul = 3, Sachin = 1.
- The shortest-path distances between the nodes are :

            Amit   Amar   Mahesh   Rahul   Sachin
  Amit       -      1       1        2       3
  Amar       1      -       2        1       2
  Mahesh     1      2       -        1       2
  Rahul      2      1       1        -       1
  Sachin     3      2       2        1       -
Degree centrality :

    C_D(a) = d(a) / (g - 1)

Closeness centrality :

    C_C(a) = (g - 1) / Σ_b d(a, b)

Betweenness centrality :

    C_B(i) = Σ_{j<k} g_jk(i) / g_jk,   normalized as C'_B(i) = C_B(i) / [(g - 1)(g - 2) / 2]

  Where, g_jk    = the number of geodesics connecting j and k
         g_jk(i) = the number of those geodesics that actor i is on
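These measures can be sketched for the five-person graph of Fig. 10.2.1 (the edge list below is inferred from the distance table above):

```python
# Degree and closeness centrality for the five-person example graph.
from collections import deque

edges = [("Amit", "Amar"), ("Amit", "Mahesh"), ("Amar", "Rahul"),
         ("Mahesh", "Rahul"), ("Rahul", "Sachin")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
g = len(adj)

def bfs_distances(src):
    """Shortest-path distances from src to every node."""
    dist, queue = {src: 0}, deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

degree = {v: len(adj[v]) / (g - 1) for v in adj}          # C_D(a) = d(a)/(g-1)
closeness = {v: (g - 1) / sum(bfs_distances(v).values())  # C_C(a)
             for v in adj}
```

Rahul, the best-connected person, gets degree centrality 3/4 and closeness 4/5, while the peripheral Sachin scores lowest.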
- In a network, each node has some value, and as it gets connected with another node, its values get changed.
- A tennis player has some records to his name in singles; there are some other records on his name, associated with another player's name, in doubles.
- A node may have different values depending on its connections with neighbouring nodes.
- When a node represents an e-mail account, it is a single node. Every e-mail links at least two accounts (i.e. the sender mail ID and the receiver mail ID).
- Sometimes e-mails are sent from one side only and sometimes from both sides; in such scenarios the edges are considered weak and strong respectively.
- These nodes contain values like phone numbers, which give them distinct values.
- As a call is placed between two users, the nodes get some additional values, like the time of the call, the period of communication, etc.
- In a telephone network, an edge gets its weight from the number of calls made over it. The network assigns edges according to the way the endpoints contact each other : frequently, rarely, or never connected.
Fig. 10.3.2 : Overview of clustering

Various clustering algorithms :
(A) Hierarchical
(B) K-means
(C) K-medoid
(D) Fuzzy C-means
(A) Hierarchical clustering

Fig. 10.3.3 : Hierarchical clustering example (a dendrogram over points 91-98; cutting it at different heights gives k = 8 down to k = 1 clusters)
(B) K-means clustering

- It is one of the unsupervised clustering algorithms.
- The number of clusters, represented by 'k', is an input to the algorithm.
- It is basically iterative in nature; it works on numerical data, and it is easy to implement.
- The Bayesian Information Criterion (BIC) or Minimum Description Length (MDL) can be used to estimate K ('K' is a user input).
- It is easy to work with any distance measure with K-medoids; K-medoids is a generalized version of the K-means algorithm.
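A minimal sketch of the K-means iteration on 1-D data (Lloyd's algorithm; the data and the fixed seed are illustrative):

```python
import random

def kmeans(points, k, iters=25, seed=0):
    """Lloyd's algorithm on 1-D points: assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

centers = kmeans([1, 2, 3, 10, 11, 12], k=2)   # two well-separated groups
```

On this data the two centroids settle at 2.0 and 11.0, the means of the two obvious groups.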
(D) Fuzzy C-means clustering (FCM)

- It is unsupervised, and it always converges.
- It allows one piece of data to be part of two or more clusters.
- It is used frequently in pattern recognition.

10.3.3 Betweenness

- Finding communities in social networks is difficult with standard clustering methods.
- The betweenness of an edge (x, y) counts the shortest paths that pass through it : the edge (x, y) has betweenness with respect to nodes a and b if (x, y) lies on the shortest path between a and b.
- If a and b are in two different communities, the edge (x, y) lies somewhere on the shortest path between a and b.
10.3.4 The Girvan-Newman Algorithm

- It is used for community detection.
Fig. : Steps 1, 2 and 3 of successive edge removal

- The standard process is to successively delete the edges of highest betweenness.
Step 1 : Find the edge with the highest betweenness (or the multiple edges of highest betweenness, if there is a tie) and remove those edges from the graph. This may cause the graph to separate into multiple components. If so, this is the first level of regions in the partitioning of the graph.

Step 2 : Now recalculate all betweenness values and again remove the edge or edges of highest betweenness. This will break some existing components into smaller ones; if so, these are regions nested within the larger regions. Keep repeating the tasks of recalculating all betweenness values and removing the edge or edges having the highest betweenness.

- It is an approach based on finding the shortest paths within a graph which connect two vertices.
- Between C and B there are two paths, so the edges (A, B), (B, D), (A, C) and (C, D) each get credited with half a shortest path.
- Clearly, the edges (D, G) and (B, G) have the highest betweenness, so they will get removed first; this generates the components {A, B, C, D} and {E, F, G, H}.
- Continuing the removal of the edges with the highest betweenness, the next removals are those with score 6, i.e. (E, G) and (E, F), and later the removals with the next-highest score, i.e. (A, B), (B, D) and (C, D).
Fig. 10.3.5 : All edges with betweenness 5 and more are removed
- 'Communities' here implies that A and C are closer to each other than to B and D. In short, B and D are "traitors" to the community {A, B, C, D} because they have the friend G outside the community.
- Similarly, G is a "traitor" to the group {E, F, G, H}, and only F, G and H remain connected.
Finding cliques

- A clique can be defined as a set of nodes having edges between any two of its vertices.
- Finding a clique is quite a difficult task. Finding the largest set of vertices in a graph where any two vertices are connected is known as the maximum-clique problem.

Bipartite graphs

- A bipartite graph is a graph whose vertices can be partitioned into two disjoint sets, say set V and set U. Both V and U are not necessarily of the same size.
- A graph is said to be bipartite if and only if it does not possess a cycle of odd length.

Example :

Suppose we have 5 engines and 5 mechanics, where each mechanic has different skills and can handle different engines. The engines are represented by the vertices in V and the mechanics by the vertices in U. An edge between two vertices shows that the mechanic has the necessary skill to operate the engine to which it is linked. By determining a maximum matching, we can maximize the number of engines being operated by the workforce.

Fig. 10.4.1
10.5 Simrank

- Graphs consist of various types of nodes; SimRank is useful to calculate the similarity between nodes of the same type.
- SimRank works through random walkers on a social graph, starting from a particular node.
- SimRank needs its calculation to be done at every starting node, so it is practical only for limited-size graphs.
10.5.1 Random Walker on a Social Network

- A social graph is mostly undirected, while a web graph is found to be directed. A random walker on a social graph can move to any one of the neighbouring nodes of its current node.

Fig. 10.5.1 : A tripartite graph example for a random walker on a social network

Example :
- The nodes can be kept in the order Image 1, Image 2, Image 3, Fog, Grass. The transition matrix for the graph will then be :

    M = | 0    0   0    1/2   1/3 |
        | 0    0   0    0     1/3 |
        | 0    0   0    1/2   1/3 |
        | 1/2  0   1/2  0     0   |
        | 1/2  1   1/2  0     0   |

- The fifth column is for the node "Grass", which is connected to each of the image nodes. Since it therefore has degree 3, the non-zero entries in the "Grass" column must be 1/3.
- The image nodes correspond to the first three rows and first three columns of the transition matrix, so the entry 1/3 appears in the first three rows of column 5. The "Fog" node does not have an edge to either itself or the "Grass" node.
- Let β be the probability that the random walker follows an edge, so 1 - β is the probability that the walker will teleport to the initial node N. e_N is the column vector that has 1 in the row for node N and 0 otherwise.
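A sketch of the random walk with restart on the matrix above (β = 0.8 is assumed; the limiting vector measures the similarity of every node to the start node):

```python
# Random walk with restart: v <- beta*M*v + (1-beta)*e_N,
# starting (and teleporting back to) node N = Image 1.
M = [
    [0,   0, 0,   1/2, 1/3],
    [0,   0, 0,   0,   1/3],
    [0,   0, 0,   1/2, 1/3],
    [1/2, 0, 1/2, 0,   0  ],
    [1/2, 1, 1/2, 0,   0  ],
]
names = ["Image 1", "Image 2", "Image 3", "Fog", "Grass"]

def walk_with_restart(M, start, beta=0.8, iters=200):
    n = len(M)
    v = [0.0] * n
    v[start] = 1.0
    for _ in range(iters):
        v = [beta * sum(M[i][j] * v[j] for j in range(n))
             + ((1 - beta) if i == start else 0.0)
             for i in range(n)]
    return v

similarity = walk_with_restart(M, start=0)   # similarity of each node to Image 1
```

Image 3, which shares both neighbours (Fog and Grass) with Image 1, ends up more similar to Image 1 than Image 2, which shares only Grass.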
- In the era of Big Data, approximately 2.5 quintillion bytes of data are created per day. In 2004, Google introduced MapReduce, used in its search engine.
- MapReduce is used for processing and generating large data sets. The Map function processes the data, producing key-value pairs, while the Reduce function is used to merge those data values.
- The Map-Reduce process works with the help of three stages :
  o  Mapping
  o  Shuffle
  o  Reducing
- Counting triangles is helpful to know the community around any node within a social network; it helps to compute the 'clustering coefficient'.
- The clustering coefficient cc(v) for a node v ∈ V is :

    cc(v) = 2 · |E(N(v))| / (d_v · (d_v - 1))

  where d_v is the degree of v and |E(N(v))| is the number of edges among the neighbours of v.
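A minimal sketch of the clustering coefficient (the tiny example graph is illustrative):

```python
def clustering_coefficient(adj, v):
    """2*|edges among v's neighbours| / (d_v*(d_v - 1))."""
    nbrs = adj[v]
    d = len(nbrs)
    if d < 2:
        return 0.0
    links = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
    return 2 * links / (d * (d - 1))

# A triangle a-b-c plus a pendant node d attached to c:
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
cc_a = clustering_coefficient(adj, "a")   # both neighbours linked -> 1.0
cc_c = clustering_coefficient(adj, "c")   # 1 linked pair of 3 -> 1/3
```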
Applications of MapReduce include :

(iii) Document clustering
(iv)  Statistical machine translation
(v)   Machine learning
(vi)  Web link-graph reversal
(vii) Distributed sorting
Scanned by CamScanner
Big Data Analytics(MU) 10-14 Mining Social Network Graph
Q. 4 How are degree, closeness and betweenness centrality measured ?
Q. 5 What is a social network ? Explain any one type in detail.
Q. 6 Explain the following clustering algorithms in short :
(a) Hierarchical
(b) K-means
(c) K-medoid
(d) Fuzzy C-means