
BDA Module 3


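Cosine distance

Given by

$$\cos\theta = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$

For x = (1, 2, -1) and y = (2, 1, 1):

$$x \cdot y = (1 \times 2) + (2 \times 1) + (-1 \times 1) = 3$$

$$\lVert x \rVert = \sqrt{1 + 4 + 1} = \sqrt{6}, \qquad \lVert y \rVert = \sqrt{4 + 1 + 1} = \sqrt{6}$$

$$\cos\theta = \frac{3}{\sqrt{6}\,\sqrt{6}} = \frac{1}{2}, \qquad \theta = 60^\circ$$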
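The cosine computation can be checked with a few lines of Python. This is only a sketch: the function name `cosine_angle_degrees` is made up for illustration, and the vectors (1, 2, -1) and (2, 1, 1) are assumed to be the ones used in the worked example.

```python
import math


def cosine_angle_degrees(x, y):
    """Angle between vectors x and y, in degrees (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))  # L2 norm of x
    norm_y = math.sqrt(sum(b * b for b in y))  # L2 norm of y
    cos_theta = dot / (norm_x * norm_y)
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


angle = cosine_angle_degrees([1, 2, -1], [2, 1, 1])
print(round(angle, 6))
```

A smaller angle means the two vectors point in more similar directions.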

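Euclidean distance

$$d(x, y) = \sqrt{\sum_{i} (x_i - y_i)^2}$$

d(x, y) = 0 when x = y.
It satisfies the symmetric property because $(x_i - y_i)^2 = (y_i - x_i)^2$.
For example, for (2, 7) and (6, 4):

$$d = \sqrt{(2 - 6)^2 + (7 - 4)^2} = \sqrt{(-4)^2 + (3)^2} = \sqrt{16 + 9} = \sqrt{25} = 5$$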
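As a quick check of the arithmetic, here is a minimal Python sketch of the Euclidean (L2) distance. The function name `euclidean_distance` is an illustrative choice, and the points (2, 7) and (6, 4) are assumed to be the ones in the example.

```python
import math


def euclidean_distance(x, y):
    """L2 distance between two points given as equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


print(euclidean_distance((2, 7), (6, 4)))  # sqrt(16 + 9)
# Symmetry: swapping the arguments does not change the result.
print(euclidean_distance((6, 4), (2, 7)))
```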
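Jaccard Distance

$$d(x, y) = 1 - \mathrm{SIM}(x, y), \qquad \mathrm{SIM}(x, y) = \frac{|x \cap y|}{|x \cup y|}$$

that is, SIM(x, y) is the ratio of the sizes of the intersection and the union of x and y.

d(x, y) is non-negative because the size of the intersection cannot exceed the size of the union.
d(x, y) = 0 when x = y, because x ∪ x = x and x ∩ x = x.

d(x, y) = d(y, x), i.e. it satisfies the symmetric property, because both union and intersection are symmetric.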

y=i 5,8)7 lo

=4

2
3
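The Jaccard distance can be sketched in Python as below. The function name `jaccard_distance` and the two sample sets are hypothetical; they are not the sets from the worked example in the notes.

```python
def jaccard_distance(x, y):
    """1 minus (size of intersection / size of union) of two sets."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        # Two empty sets are identical, so treat their distance as 0.
        return 0.0
    return 1.0 - len(x & y) / len(union)


# Hypothetical sets, chosen only to illustrate the formula.
a = {1, 2, 3, 4}
b = {2, 3, 4, 5}
print(jaccard_distance(a, b))  # intersection has 3 elements, union has 5
```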
Edit distance makes sense when the points are strings. The distance between two strings
x = x1 x2 ... xn and y = y1 y2 ... ym is the smallest
number of insertions and deletions of single
characters that convert x to y.

x = abcde
y = acfdeg

1. Delete b
2. Insert f after c
3. Insert g after e

Edit distance = 3
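With only insertions and deletions allowed, the edit distance equals len(x) + len(y) minus twice the length of a longest common subsequence (LCS). The sketch below uses that identity; the function names are illustrative, and the strings `abcde` and `acfdeg` are assumed to be the ones in the example.

```python
def lcs_length(x, y):
    """Length of a longest common subsequence, by dynamic programming."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(x, y):
    """Insert/delete-only edit distance between strings x and y."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


print(edit_distance("abcde", "acfdeg"))  # LCS is "acde", length 4
```

Note that this is not the Levenshtein distance, which also allows substitutions.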

Hamming distance

Let x and y be two strings consisting of 0's and 1's. The Hamming distance between x and y is
the number of positions in which these strings differ.
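A minimal Python sketch of the Hamming distance follows. The function name `hamming_distance` and the two bit strings are assumptions made for illustration.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


# Hypothetical bit strings: they differ in positions 2, 4 and 5.
print(hamming_distance("10101", "11110"))
```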

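BDA Sums

1) Cosine Distance: it is the angle between two vectors. The smaller the angle, the greater the similarity, and vice versa.
Application: recommendation systems.

$$\cos\theta = \frac{\text{dot product of the vectors}}{\text{product of the L2 norms of both vectors}}$$

Dot product: $x \cdot y = (1)(2) + (2)(1) + (-1)(1) = 3$
L2 norm for x: $\sqrt{(1)^2 + (2)^2 + (-1)^2} = \sqrt{6}$
L2 norm for y: $\sqrt{(2)^2 + (1)^2 + (1)^2} = \sqrt{4 + 1 + 1} = \sqrt{6}$

$$\cos\theta = \frac{3}{\sqrt{6}\,\sqrt{6}} = \frac{3}{6}$$
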
2) Jaccard Distance: used to find the distance between two sets.

A={L2,3
6={i24,5

AUB
2
5

5
4) Hamming Distance: used for boolean vectors, i.e. vectors which contain only 0's and 1's.

Euclidean distance (L2 norm) between (2, 7) and (6, 4):

$$\sqrt{(6 - 2)^2 + (4 - 7)^2} = \sqrt{(4)^2 + (-3)^2} = \sqrt{16 + 9} = \sqrt{25} = 5$$

DGIM (Datar-Gionis-Indyk-Motwani)

Elements: (1) Timestamps (2) Buckets
Used to find the number of 1's in a stream of data.
Rules for forming a bucket:
(1) Every bucket should contain at least a single 1 in it.
(2) The right side of a bucket should strictly start from 1.
(3) The length of a bucket is equal to the number of 1's in it.
(4) Every bucket length should be a power of 2.
(5) As we move to the left, the bucket size should not decrease.
(6) No more than two buckets can have the same size.
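The bucket rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the class name `DGIM`, the choice to store each bucket as a (timestamp of its right end, size) pair, and the convention of counting half of the oldest bucket in the estimate are taken from the usual textbook description of the algorithm rather than from these notes.

```python
class DGIM:
    """Approximate count of 1's in the last `window_size` bits of a stream."""

    def __init__(self, window_size):
        self.n = window_size
        self.time = 0
        # Newest bucket first; each bucket is (timestamp_of_right_end, size).
        self.buckets = []

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket once its right end leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.n:
            self.buckets.pop()
        if bit != 1:
            return  # a 0 never creates or changes a bucket
        self.buckets.insert(0, (self.time, 1))
        # Rule (6): whenever three buckets share a size, merge the two
        # oldest into one bucket of double size; this may cascade.
        i = 0
        while i + 2 < len(self.buckets):
            a, b, c = self.buckets[i:i + 3]
            if a[1] == b[1] == c[1]:
                self.buckets[i + 1] = (b[0], 2 * b[1])
                del self.buckets[i + 2]
                i += 1
            else:
                break

    def estimate(self, k=None):
        """Estimate the number of 1's among the last k bits (k <= window)."""
        k = self.n if k is None else k
        cutoff = self.time - k
        sizes = [size for ts, size in self.buckets if ts > cutoff]
        if not sizes:
            return 0
        # Count every bucket fully except the oldest, which counts half.
        return sum(sizes) - sizes[-1] // 2


dgim = DGIM(window_size=24)
for bit in [1] * 10:  # hypothetical stream of ten 1's
    dgim.add(bit)
print([size for _, size in dgim.buckets], dgim.estimate())
```

Only the oldest bucket may straddle the window boundary, which is why it alone is counted at half its size.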

Alsownte exane br
or all o boul re
all gb


Consider the following stream:

N = 20

Buckets:

2
4 4
2 2

Hubs and Authorities

1. Hub score
2. Authority score

Hub: these are pages that link to authorities.

E.g. a list of newspapers.

Authority: these are pages which contain the information.

E.g. a home page.

PageRank
PageRank is the function that assigns a real number to each
page in the Web.

The higher the PageRank of a page, the more important it is.
It is used by Google Search to rank websites.
PageRank is named after Larry Page.

Dead ends:

Web pages with no outlinks are called dead ends.
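Finding dead ends in a link graph is a one-line check. The sketch below assumes the Web graph is stored as a dictionary from each page to the list of pages it links to; the function name `dead_ends` and the three-page graph are hypothetical.

```python
def dead_ends(links):
    """Pages that have no outlinks in an adjacency-list link graph."""
    return sorted(page for page, outlinks in links.items() if not outlinks)


# Hypothetical graph: C links to nothing, so it is a dead end.
links = {"A": ["B", "C"], "B": ["C"], "C": []}
print(dead_ends(links))
```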

3) Edit Distance: used to calculate the distance between
two points when the points are
represented as strings.
x = abcde
y = acfdeg

Delete b
Insert f after c
Insert g after e
Edit distance = 3
Map Tasks
A chunk is a collection of elements, and no element is stored across two chunks.
Technically, all inputs to Map tasks and outputs from Reduce tasks are of the key-value-pair
form.

The Map function takes an input element as its argument and produces zero or more key-value
pairs.

Grouping by Key
As soon as the Map tasks have all completed successfully, the key-value pairs are grouped by
key, and the values associated with each key are formed into a list of values.
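Grouping by key can be illustrated with a small Python sketch. The function name `group_by_key` and the sample pairs are hypothetical; a real MapReduce system performs this step itself, across machines.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect the values of all (key, value) pairs into one list per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


# Hypothetical Map output for a word count.
map_output = [("big", 1), ("data", 1), ("big", 1)]
print(group_by_key(map_output))
```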


The user typically tells the MapReduce system what r, the number of Reduce tasks, should be. Then the master controller
picks a hash function that applies to keys and produces a bucket number from 0 to r - 1.
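The assignment of keys to Reduce tasks can be sketched as below. The function name `bucket_for_key` is hypothetical, and `zlib.crc32` is only a stand-in for whatever hash function the master controller picks; it is used here because, unlike Python's built-in `hash` on strings, it gives the same value on every run.

```python
import zlib


def bucket_for_key(key, r):
    """Map a key to a bucket number in the range 0 .. r - 1."""
    return zlib.crc32(str(key).encode("utf-8")) % r


r = 4  # assumed number of Reduce tasks
for word in ["big", "data", "analytics"]:
    print(word, bucket_for_key(word, r))
```

Because the bucket depends only on the key, all pairs with the same key reach the same Reduce task.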
Reduce Tasks

The Reduce function's argument is a pair consisting of a key and its list of associated values.
The output of the Reduce function is a sequence of zero or more key-value pairs.
These key-value pairs can be of a type different from those sent from Map tasks to Reduce tasks,
but often they are the same type.

A Reduce task receives one or more keys and their associated value lists.

A Reduce task executes one or more reducers. The outputs from all the Reduce tasks are merged
into a single file.
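The relationship between a reducer, a Reduce task and the merged output can be sketched as follows. All three function names are hypothetical, and summing the values is just one possible Reduce function (the word-count case).

```python
def reduce_sum(key, values):
    """A reducer: one key and its value list in, zero or more pairs out."""
    return [(key, sum(values))]


def run_reduce_task(grouped):
    """A Reduce task applies the reducer to each key it was assigned."""
    output = []
    for key in sorted(grouped):
        output.extend(reduce_sum(key, grouped[key]))
    return output


def merge_outputs(task_outputs):
    """Concatenate the outputs of all Reduce tasks into one result."""
    merged = []
    for out in task_outputs:
        merged.extend(out)
    return merged


task_1 = run_reduce_task({"big": [1, 1], "analytics": [1]})
task_2 = run_reduce_task({"data": [1, 1, 1]})
print(merge_outputs([task_1, task_2]))
```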
Combiners

Instead of the Map tasks producing many pairs (w, 1), (w, 1), ..., we could
apply the Reduce function within the Map task, before the output of the Map tasks is subject to
grouping and aggregation.

These key-value pairs would thus be replaced by one pair with key w and value equal to the sum
of all the 1's in those pairs.
That is, the pairs with key w generated by a single Map task
would be replaced by a pair (w, m), where m is the number of times that w appears in that task's input.
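A combiner for word counting can be sketched in a few lines of Python. The function names `map_words` and `combine` and the sample chunk are assumptions for illustration only.

```python
from collections import Counter


def map_words(chunk):
    """Map function for word count: emit (w, 1) for every word w."""
    return [(word, 1) for word in chunk.split()]


def combine(pairs):
    """Combiner: sum the 1's per word inside a single Map task."""
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return sorted(counts.items())


pairs = map_words("big data big big")  # hypothetical chunk
print(pairs)           # four pairs, three of them ("big", 1)
print(combine(pairs))  # one pair per distinct word
```

The combiner shrinks the data sent from Map tasks to Reduce tasks without changing the final counts, because addition can be applied in any order and grouping.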
Q. What is PageRank?
PageRank is a function that assigns a real number to each page in the Web.

The intent is that the higher the PageRank of a page, the more
"important" it is.


Big Data Analytics Notes By Prof. Santosh Tamboli Sir...

There is not one fixed algorithm for assignment of PageRank and variations on the basic idea can
alter the relative PageRank of any two pages.
The Web can be represented as a directed graph, where pages are the nodes, and there is an arc from
page p1 to page p2 if there are one or more links from p1 to p2.
E.g.

Page A has links to each of the other three pages; page B has links to A and D only; page C has a link
only to A; and page D has links to B and C only.
Suppose a random surfer starts at page A in the graph above. There are links to B, C, and D, so
this surfer will next be at each of those pages with probability 1/3, and has zero probability of
being at A. A random surfer at B has, at the next step, probability 1/2 of being at A, 1/2 of being
at D, and 0 of being at B or C.
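The random-surfer step for this four-page example can be sketched in Python. The names `LINKS` and `step` are illustrative; the link structure is the one described in the text (A links to B, C, D; B to A, D; C to A; D to B, C), and no taxation or teleporting is modelled.

```python
# Outlinks of each page, as described in the example.
LINKS = {
    "A": ["B", "C", "D"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "C"],
}


def step(dist, links):
    """One random-surfer step: spread each page's probability over its outlinks."""
    nxt = {page: 0.0 for page in links}
    for page, prob in dist.items():
        for target in links[page]:
            nxt[target] += prob / len(links[page])
    return nxt


start_at_a = {"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0}
print(step(start_at_a, LINKS))  # B, C and D each get 1/3; A gets 0

start_at_b = {"A": 0.0, "B": 1.0, "C": 0.0, "D": 0.0}
print(step(start_at_b, LINKS))  # A and D each get 1/2
```

Applying `step` to the distribution (3/9, 2/9, 2/9, 2/9) for (A, B, C, D) returns the same distribution, so that vector is a fixed point of this step for the example graph.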
DGIM Algorithm
Datar-Gionis-Indyk-Motwani Algorithm
This algorithm is used to find the number
of 1's in a data stream.
This algorithm uses O(log² N) bits to
represent a window of N bits.
It allows us to estimate the number of
1's in the window with an error of no
more than 50%.
This algorithm consists of two components:
1) Timestamp
2) Bucket
Each bit that arrives has a timestamp.
For example, if a bit arrives at position 100, then its timestamp
becomes 100.
The window size is generally taken as a multiple
of 2, and the window is divided into buckets.
So the window contains a data string consisting
of 0's and 1's.

Rules for forming buckets

1) The right side of the bucket should
always start with 1. If it starts with
0, then it is neglected.
For example, a bucket has size 4 when there
are four 1's in it and its right side is 1.

2) Every bucket should have at least one
1, else no bucket can be formed.

3) All bucket sizes should be powers of 2,
i.e. 1, 2, 4, 8, 16 and so on.

4) The buckets cannot decrease in size as
we move towards the left side, i.e. sizes
are in increasing order towards the left.
For example:
2 2 (allowed)
2 X (not allowed)

Example of DGIM algorithm

N = 24 (window size)

(4) (2)

How to add a new bit arriving from the right

If the new bit = 0: no change in the buckets, as
shown below.

[Diagrams: the window of bits and its buckets, redrawn after each arriving bit. Panel captions: "When new bit = 0 enters" (two panels) and "When new bit = 1 enters" (two panels). Bucket timestamps shown include 87, 92, 95, 98, 100 and 102.]

You can continue this algorithm if

current timestamp - leftmost bucket timestamp < N

For example: 103 - 87 = 16
16 < N = 24
So continue this algorithm; otherwise
STOP.
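The continuation test is a single comparison. The sketch below uses a hypothetical function name, `bucket_in_window`, and assumes the values from the example (current timestamp 103, leftmost bucket timestamp 87, N = 24).

```python
def bucket_in_window(current_timestamp, bucket_timestamp, window_size):
    """True while the leftmost bucket still lies inside the window."""
    return current_timestamp - bucket_timestamp < window_size


print(bucket_in_window(103, 87, 24))  # 16 < 24, so continue
print(bucket_in_window(111, 87, 24))  # 24 is not < 24, so stop
```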

Final answer to the query

How many 1's are there in the last 20 bits?

[Diagram: the window's bit string divided into buckets.]

No. of 1's in the last 20 bits = 11
