Locality sensitive hashing
Lecture notes of AI506 by Kijung Shin
Introduction
Motivation
A Common Metaphor
• Many problems can be expressed as finding “similar” sets.
• Find near-neighbors in high-dimensional space.
• Examples :
• Pages with similar words.
• Customers who purchased similar products.
• Images with similar features.
Problem for Today’s presentation
• Given :
• High-dimensional data points $x_1, x_2, \dots, x_N$.
• Some distance function $d(x_1, x_2)$.
• Goal :
• Find all pairs of data points $(x_i, x_j)$ that are within some distance threshold $s$, i.e. $d(x_i, x_j) \le s$.
• Note :
• The time complexity of the naïve solution is $O(N^2)$.
• Documents are so large or so many that they can’t fit in main memory.
The Big Picture
Documents as High-Dimensional Data
• Step 1 : Shingling
• Convert documents to sets.
• This is a preprocessing stage.
• Simple approaches :
• Document ⇒ { words in document }
• Document ⇒ { words in document } ∖ { meaningless words }
• These don’t work well for this application. Why?
• Both approaches map “Football is more exciting than Baseball” and “Baseball is more exciting than Football” to the same set.
• Need to account for the ordering of words!
• Solution : shingling!
Step 1 : Shingling
• A k-shingle for a document is a sequence of k tokens that appears in the document.
• Example : k=2, document D=abcab
• Shingling -> {ab, bc, ca, ab} -> {ab, bc, ca}
• Hash the shingles -> {1, 5, 7} = [1, 0, 0, 0, 1, 0, 1]
• If you are still worried about word order, pick k large enough.
• k=5 is OK for short documents.
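The shingling example above can be sketched in a few lines of Python. This is a minimal illustration, assuming character-level tokens; the names `shingles`, `to_bit_vector` and the id table `SHINGLE_IDS` are hypothetical, and the ids 1, 5, 7 are simply copied from the slide rather than produced by a real hash function.

```python
def shingles(text, k):
    """Return the set of character k-shingles that appear in `text`."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


# Hypothetical shingle-to-id table; the ids are taken from the slide's example.
SHINGLE_IDS = {"ab": 1, "bc": 5, "ca": 7}


def to_bit_vector(ids, size):
    """Encode a set of 1-based ids as a 0/1 vector of length `size`."""
    vector = [0] * size
    for i in ids:
        vector[i - 1] = 1
    return vector


doc_shingles = shingles("abcab", 2)            # duplicates collapse in the set
doc_ids = {SHINGLE_IDS[s] for s in doc_shingles}
doc_vector = to_bit_vector(doc_ids, 7)
```

Using a set makes the repeated shingle `ab` collapse automatically, which matches the second arrow in the example.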
Step 2 : Min-hashing
• We have just completed pre-processing.
• The computational cost problem remains.
• We have N=1,000,000 documents.
• $N(N-1)/2 \approx 5 \times 10^{11}$ comparisons
• Computation capacity : $10^6$ cmp/sec ⇒ $5 \times 10^5$ sec ≈ 5 days (taking a day as roughly $10^5$ sec)
• For 10 million documents, it takes more than a year.
• We need to reduce the computational cost using the min-hash algorithm.
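The cost estimate above can be reproduced with a short calculation. This is a sketch under the slide's stated assumption of $10^6$ comparisons per second; the function name `naive_comparison_seconds` is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def naive_comparison_seconds(num_docs, comparisons_per_sec=1e6):
    """Time needed to compare every pair of documents once."""
    pairs = num_docs * (num_docs - 1) // 2
    return pairs / comparisons_per_sec


# One million documents: several days of pairwise comparisons.
days_for_1m = naive_comparison_seconds(1_000_000) / SECONDS_PER_DAY

# Ten million documents: 100x more pairs, so well over a year.
years_for_10m = naive_comparison_seconds(10_000_000) / (365 * SECONDS_PER_DAY)
```

With an exact day length of 86,400 seconds the first figure comes out slightly under 6 days, which agrees with the slide's rounded estimate.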
Step 2 : Min-hashing
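[Figure: min-hashing turns the input matrix into a signature matrix]
• The large input matrix is compressed into a signature matrix with the same number of columns (one per document) but only a single row.
• Similarity of columns ≈ similarity of signatures, in probability.
• So we can save time when comparing two documents while preserving the information in the documents with high probability.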
Step 2 : Min-hashing
• Similarity of two sets : Jaccard similarity and Jaccard distance
• $\mathrm{sim}(C_1, C_2) = \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|}$
• $d(C_1, C_2) = 1 - \mathrm{sim}(C_1, C_2)$
• Goal :
To find a hash function $h(\cdot)$ such that
• If $\mathrm{sim}(C_1, C_2)$ is high, then $h(C_1) = h(C_2)$ with high prob.
• If $\mathrm{sim}(C_1, C_2)$ is low, then $h(C_1) \ne h(C_2)$ with high prob.
where the output of $h(\cdot)$ is small enough to fit in RAM.
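The two definitions above translate directly into Python set operations. This is a minimal sketch; the function names are hypothetical, and returning 1.0 for two empty sets is an assumed convention that the slides do not address.

```python
def jaccard_similarity(a, b):
    """|a ∩ b| / |a ∪ b| for two sets."""
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    """1 minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


# Two shingle-id sets sharing 2 of 4 distinct ids.
example_similarity = jaccard_similarity({1, 2, 3}, {2, 3, 4})
example_distance = jaccard_distance({1, 2, 3}, {2, 3, 4})
```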
Step 2 : Min-hashing
• There is a suitable hash function for the Jaccard similarity :
• Min-Hashing
• $h_\pi(C) = \min_{i : C(i) = 1} \pi(i)$
• $\pi$ : a random permutation $\{1, 2, \dots, n\} \to \{1, 2, \dots, n\}$.
• $n$ : the size of the shingle dictionary.
• $\mathrm{sim}(C_1, C_2) = \Pr_\pi[h_\pi(C_1) = h_\pi(C_2)]$
• If $\mathrm{sim}(C_1, C_2)$ is high, then $h_\pi(C_1) = h_\pi(C_2)$ with high prob.
• If $\mathrm{sim}(C_1, C_2)$ is low, then $h_\pi(C_1) = h_\pi(C_2)$ with low prob.
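The min-hash definition can be written out with an explicit permutation. This is a small sketch for illustration only: it stores the permutation as a dictionary, which would not scale to a real shingle dictionary, and the names `random_permutation` and `minhash` are hypothetical.

```python
import random


def random_permutation(n, rng):
    """Return a random permutation of {1, ..., n} as a dict i -> pi(i)."""
    values = list(range(1, n + 1))
    rng.shuffle(values)
    return {i: values[i - 1] for i in range(1, n + 1)}


def minhash(column, perm):
    """h_pi(C): the smallest permuted index over the elements present in C."""
    return min(perm[i] for i in column)


# A fixed permutation of {1, 2, 3} makes the result easy to check by hand:
# elements 1 and 3 map to 3 and 2, so the min-hash is 2.
fixed_perm = {1: 3, 2: 1, 3: 2}
fixed_hash = minhash({1, 3}, fixed_perm)

sampled_perm = random_permutation(7, random.Random(0))
```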
Step 2 : Min-hashing
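• Proof
• Example : $S = \{1, 3, 4, 5, 7\}$, $\pi : \{1, 2, \dots, 7\} \to \{1, 2, \dots, 7\}$
• Under a uniformly random $\pi$, each element of $S$ is equally likely to attain $\min \pi(S)$, so element 1 attains it with probability $\frac{1}{5}$.
• $h_\pi(C_1) = h_\pi(C_2) \Leftrightarrow \min_{i : C_1(i) = 1} \pi(i) = \min_{i : C_2(i) = 1} \pi(i)$
• $\Leftrightarrow$ the minimum of $\pi$ over $C_1 \cup C_2$ is attained at an element of $C_1 \cap C_2$, because :
• $\pi(x_1) \ne \pi(x)$ for all $x \in C_2$ when $x_1 \in C_1 \setminus C_2$
• $\pi(x_2) \ne \pi(x)$ for all $x \in C_1$ when $x_2 \in C_2 \setminus C_1$
• $\pi(x) = \pi(y)$ for $x \in C_1$, $y \in C_2$ if and only if $x = y \in C_1 \cap C_2$
• Each of the $|C_1 \cup C_2|$ elements is equally likely to attain the minimum, so
$\Pr_\pi[h_\pi(C_1) = h_\pi(C_2)] = \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|}$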
Step 2 : Min-Hashing
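$\mathbb{E}_\pi[\mathbb{1}(h_\pi(C_1) = h_\pi(C_2))] = \Pr_\pi[h_\pi(C_1) = h_\pi(C_2)] = \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|} = \mathrm{sim}(C_1, C_2)$
Use the Monte Carlo method : estimate this expectation by averaging the indicator over several random permutations.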
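The Monte Carlo idea can be sketched as follows: draw several random permutations, check whether the two min-hash values agree under each one, and report the fraction of agreements. The function name `estimate_jaccard`, the number of permutations and the fixed seed are all assumptions made for this illustration.

```python
import random


def estimate_jaccard(c1, c2, n, num_perms=2000, seed=0):
    """Estimate sim(c1, c2) as the fraction of permutations with equal min-hashes."""
    rng = random.Random(seed)
    items = list(range(1, n + 1))
    matches = 0
    for _ in range(num_perms):
        values = items[:]
        rng.shuffle(values)                      # one random permutation pi
        perm = dict(zip(items, values))
        h1 = min(perm[i] for i in c1)            # h_pi(C_1)
        h2 = min(perm[i] for i in c2)            # h_pi(C_2)
        matches += h1 == h2
    return matches / num_perms


c1 = {1, 3, 4, 5, 7}
c2 = {1, 2, 3, 6}
exact = len(c1 & c2) / len(c1 | c2)              # 2 / 7
estimate = estimate_jaccard(c1, c2, n=7)
```

More permutations give a tighter estimate, at the price of a longer signature per document.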
Step 3 : LSH(Locality Sensitive Hashing)
• Goal : Find all pairs of data points $(x_i, x_j)$ that are within some distance threshold, $d(x_i, x_j) \le s$.
• General idea of LSH : Find all candidate pairs whose similarity must be evaluated.
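One common way to generate candidate pairs is the banding technique, which the worked cases in this step appear to assume (b bands of r rows each). The sketch below is illustrative rather than the lecture's own implementation: the function name `candidate_pairs` and the toy signatures are hypothetical, and each band's tuple is used directly as the bucket key via Python's dictionary hashing.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(signatures, bands, rows):
    """Return pairs of document ids whose signatures agree on at least one band."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc_id, signature in signatures.items():
            # The slice of the signature belonging to this band is the bucket key.
            key = tuple(signature[band * rows:(band + 1) * rows])
            buckets[key].append(doc_id)
        for docs in buckets.values():
            for a, b in combinations(sorted(docs), 2):
                candidates.add((a, b))
    return candidates


# Toy signatures of length 4, split into 2 bands of 2 rows.
toy_signatures = {
    "d1": [1, 2, 3, 4],
    "d2": [1, 2, 9, 9],   # agrees with d1 on the first band only
    "d3": [7, 7, 7, 7],
}
toy_candidates = candidate_pairs(toy_signatures, bands=2, rows=2)
```

Only the candidate pairs then need an exact similarity check, which is what avoids the quadratic number of comparisons.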
Step 3 : LSH(Locality Sensitive Hashing)
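• Case 1 : $\mathrm{sim}(C_1, C_2) = 0.8$, $s = 0.8$, $b = 20$, $r = 5$
• We want $C_1, C_2$ to be a candidate pair.
• So we want to hash them to at least 1 common bucket.
• With $c_{i,1}, c_{i,2}$ denoting band $i$ of the two signatures :
$\Pr[C_1, C_2 \text{ in common bucket}] = 1 - \Pr[C_1, C_2 \text{ not in common bucket}]$
$= 1 - \prod_i \Pr[c_{i,1}, c_{i,2} \text{ not in common bucket}]$
$= 1 - \prod_i (1 - \Pr[c_{i,1}, c_{i,2} \text{ in common bucket}])$
$= 1 - \prod_i (1 - 0.8^5)$
$\approx 1 - (1 - 0.328)^{20}$
$\approx 99.965\%$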
Step 3 : LSH(Locality Sensitive Hashing)
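• Case 2 : $\mathrm{sim}(C_1, C_2) = 0.3$, $s = 0.8$, $b = 20$, $r = 5$
• We want to hash $C_1, C_2$ to NO common buckets.
• With $c_{i,1}, c_{i,2}$ denoting band $i$ of the two signatures :
$\Pr[C_1, C_2 \text{ in common bucket}] = 1 - \Pr[C_1, C_2 \text{ not in common bucket}]$
$= 1 - \prod_i \Pr[c_{i,1}, c_{i,2} \text{ not in common bucket}]$
$= 1 - \prod_i (1 - \Pr[c_{i,1}, c_{i,2} \text{ in common bucket}])$
$= 1 - \prod_i (1 - 0.3^5)$
$= 1 - (1 - 0.00243)^{20}$
$\approx 4.75\%$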
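Both cases evaluate the same expression, $1 - (1 - \mathrm{sim}^r)^b$, so a small helper makes it easy to check the numbers or try other settings. This is a sketch; the function name `candidate_probability` is hypothetical, and the band hashes are assumed to behave independently, as in the derivation.

```python
def candidate_probability(sim, rows, bands):
    """Probability that two columns share a bucket in at least one band."""
    same_band = sim ** rows                 # all rows of one band agree
    return 1.0 - (1.0 - same_band) ** bands


# Case 1: similar pair, should almost always become a candidate.
p_similar = candidate_probability(0.8, rows=5, bands=20)

# Case 2: dissimilar pair, should rarely become a candidate.
p_dissimilar = candidate_probability(0.3, rows=5, bands=20)
```

Raising `rows` makes each band stricter (fewer false positives), while raising `bands` gives more chances to collide (fewer false negatives).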