SlideShare ist ein Scribd-Unternehmen logo
1 von 21
Downloaden Sie, um offline zu lesen
Duplicate detection
Kira Radinsky
Based on the Standford slides by Christopher Manning
and Prabhakar Raghavan
Duplication
• ~30% of the content on the Web is near-duplicate pages
– Pages with content that is nearly identical to that of other
pages
• Issues:
– Index duplicate content only once
– Return only one version in the search results
– How can near-duplicate pages be identified in a scalable
and reliable manner?
Duplicate/Near-Duplicate Detection
• Duplication: Exact match can be detected
with fingerprints
• Near-Duplication: Approximate match
– Compute syntactic similarity with an edit-distance
measure
– Use similarity threshold to detect near-duplicates
• E.g., Similarity > 80% => Documents are near
duplicates
• Not transitive though sometimes used transitively
Computing Similarity - Shingles
Features for Similarity (Shingles (Word N-Grams))
• Segments of a document (natural or artificial breakpoints)
• K-shingling of a document transforms the document into a
set containing all windows of k contiguous terms
• E.g., 4-shingling of
“My name is Inigo Montoya. You killed my father. Prepare to die”:
{
my name is inigo
name is inigo montoya
is inigo montoya you
inigo montoya you killed
montoya you killed my,
you killed my father
killed my father prepare
my father prepare to
father prepare to die
}
Computing Similarity – distance metric
Similarity Measurement
• Denote by Sk(d) the k-shingling of document d
• Definition: the resemblance of d1 and d2,
R(d1,d2) = |Sk(d1)  Sk(d2)| / |Sk(d1)  Sk(d2)|
• The distance measure Δ(d1,d2) = 1-R(d1,d2) is a
metric
Shingles + Set Intersection
1. Computing exact set intersection of shingles
between all pairs of documents is
expensive/intractable
– Approximate using a cleverly chosen subset of
shingles from each (a sketch)
2. Estimate (size_of_intersection /
size_of_union) based on a short sketch
Doc
A Shingle set A Sketch A
Doc B
Shingle set B Sketch B
Jaccard
Sketch of a document
Create a sketch vector (of size ~200) for each document
– Documents that share ≥ t (usually 80%) corresponding
vector elements will be considered near duplicates
– Definitions:
• Let f map all k-shingles in the universe to 0..2m (e.g., f =
fingerprinting)
• Let p be a random permutation on 0..2m
– For doc D, sketchD is as follows:
• Sketch Option 1:
Fm(d)=minm(Π(Sk(d)) be the m smallest numbers after
applying Π to the k-shingling of d
• Sketch Option 2:
Vn(d) = { t ε Π(Sk(d)) | t = 0 mod n }
From Shingles to Sketches (cont.)
• A function X is an unbiased estimator of a value Y if E(X)=Y
• Theorem: when choosing Π u.a.r., the following functions are
unbiased estimators of R(d1,d2):
• | minm(Fm(d1)  Fm(d2))  Fm(d1)  Fm(d2) | /
| minm(Fm(d1)  Fm(d2)) |
• | Vn(d1)  Vn(d2) | / | Vn(d1)  Vn(d2) |
• Either Fm(d) or Vn(d) can be chosen as d’s sketch
– Fm(d) has the advantage that it is of fixed size
– Vn(d) is easier to compute
Multiple Sketches
• Let pi be a random permutation on 0..2m
• For doc D, sketchD[ i ] is as follows:
– Let f map all shingles in the universe to 0..2m (e.g.,
f = fingerprinting)
– Let pi be a random permutation on 0..2m
– Pick MIN {pi(f(s))} over all shingles s in D
Computing Sketch[i] for Doc1
Document 1
264
264
264
264
Start with 64-bit f(shingles)
Permute on the number line
with pi
Pick the min value
Test if Doc1.Sketch[i] = Doc2.Sketch[i]
Document 1 Document 2
264
264
264
264
264
264
264
264
Are these equal?
Test for 200 random permutations: p1, p2,… p200
A B
However…
Document 1 Document 2
264
264
264
264
264
264
264
264
A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is
common to both (i.e., lies in the intersection)
Claim: This happens with probability
Size_of_intersection / Size_of_union
Why?
BA
Set Similarity of sets Ci , Cj
• View sets as columns of a matrix A
– one row for each element in the universe.
– aij = 1 indicates presence of item i in set j
• Example
ji
ji
ji
CC
CC
)C,Jaccard(C



C1 C2
0 1
1 0
1 1 Jaccard(C1,C2) = 2/5 = 0.4
0 0
1 1
0 1
Key Observation
• For columns Ci, Cj, four types of rows
Ci Cj
A 1 1
B 1 0
C 0 1
D 0 0
• Let A = # of rows of type A
• Claim
CBA
A
)C,Jaccard(C ji


Min Hashing
• Randomly permute rows
• Hash h(Ci) = index of first row with 1 in
column Ci
• Surprising Property
• Why?
– Both are A/(A+B+C)
– Look down columns Ci, Cj until first non-Type-D
row
– h(Ci) = h(Cj)  type A row
   jiji C,CJaccard)h(C)h(CP 
Min-Hash sketches
• Pick P random row permutations
• MinHash sketch
SketchD = list of P indexes of first rows with 1 in
column C
• Similarity of signatures
– Let sim[sketch(Ci),sketch(Cj)] = fraction of
permutations where MinHash values agree
– Observe E[sim(sig(Ci),sig(Cj))] = Jaccard(Ci,Cj)
Example
C1 C2 C3
R1 1 0 1
R2 0 1 1
R3 1 0 0
R4 1 0 1
R5 0 1 0
Signatures
S1 S2 S3
Perm 1 = (12345) 1 2 1
Perm 2 = (54321) 4 5 4
Perm 3 = (34512) 3 5 4
Similarities
1-2 1-3 2-3
Col-Col 0.00 0.50 0.25
Sig-Sig 0.00 0.67 0.00
Algorithm for Clustering Near-Duplicate
Documents
1. Compute the sketch of each document
2. From each sketch, produce a list of <shingle, docID> pairs
3. Group all pairs by shingle value
4. For any shingle that is shared by more than one document, output a
triplet <smaller-docID, larger-docID, 1> for each pair of docIDs sharing
that shingle
5. Sort and aggregate the list of triplets, producing final triplets of the
form <smaller-docID, larger-docID, # common shingles>
6. Join any pair of documents whose number of common shingles
exceeds a chosen threshold using a “Union-Find” algorithm
7. Each resulting connected component of the UF algorithm is a cluster
of near-duplicate documents
Implementation nicely fits the “map-reduce” programming paradigm
Implementation Trick
• Permuting universe even once is prohibitive
• Row Hashing
– Pick P hash functions hk: {1,…,n}{1,…,O(n)}
– Ordering under hk gives random permutation of
rows
• One-pass Implementation
– For each Ci and hk, keep slot for min-hash value
– Initialize all slot(Ci,hk) to infinity
– Scan rows in arbitrary order looking for 1’s
• Suppose row Rj has 1 in column Ci
• For each hk,
– if hk(j) < slot(Ci,hk), then slot(Ci,hk)  hk(j)
Example
C1 C2
R1 1 0
R2 0 1
R3 1 1
R4 1 0
R5 0 1
h(x) = x mod 5
g(x) = 2x+1 mod 5
h(1) = 1 1 -
g(1) = 3 3 -
h(2) = 2 1 2
g(2) = 0 3 0
h(3) = 3 1 2
g(3) = 2 2 0
h(4) = 4 1 2
g(4) = 4 2 0
h(5) = 0 1 0
g(5) = 1 2 0
C1 slots C2 slots

Weitere ähnliche Inhalte

Was ist angesagt?

An Introduction to Functional Programming using Haskell
An Introduction to Functional Programming using HaskellAn Introduction to Functional Programming using Haskell
An Introduction to Functional Programming using HaskellMichel Rijnders
 
Functional programming with haskell
Functional programming with haskellFunctional programming with haskell
Functional programming with haskellfaradjpour
 
Chapter 7.3
Chapter 7.3Chapter 7.3
Chapter 7.3sotlsoc
 
Chunked, dplyr for large text files
Chunked, dplyr for large text filesChunked, dplyr for large text files
Chunked, dplyr for large text filesEdwin de Jonge
 
Scala collections wizardry - Scalapeño
Scala collections wizardry - ScalapeñoScala collections wizardry - Scalapeño
Scala collections wizardry - ScalapeñoSagie Davidovich
 
Introduction To Algebird: Abstract algebra for analytics
Introduction To Algebird: Abstract algebra for analyticsIntroduction To Algebird: Abstract algebra for analytics
Introduction To Algebird: Abstract algebra for analyticsKnoldus Inc.
 
mat lab introduction and basics to learn
mat lab introduction and basics to learnmat lab introduction and basics to learn
mat lab introduction and basics to learnpavan373
 
Data Streaming Algorithms
Data Streaming AlgorithmsData Streaming Algorithms
Data Streaming Algorithms宇 傅
 
Java Arrays
Java ArraysJava Arrays
Java ArraysOXUS 20
 
1.Array and linklst definition
1.Array and linklst definition1.Array and linklst definition
1.Array and linklst definitionbalavigneshwari
 
IR-ranking
IR-rankingIR-ranking
IR-rankingFELIX75
 
Docopt, beautiful command-line options for R, user2014
Docopt, beautiful command-line options for R,  user2014Docopt, beautiful command-line options for R,  user2014
Docopt, beautiful command-line options for R, user2014Edwin de Jonge
 

Was ist angesagt? (20)

Arrays
ArraysArrays
Arrays
 
Cis166 final review c#
Cis166 final review c#Cis166 final review c#
Cis166 final review c#
 
An Introduction to Functional Programming using Haskell
An Introduction to Functional Programming using HaskellAn Introduction to Functional Programming using Haskell
An Introduction to Functional Programming using Haskell
 
Functional programming with haskell
Functional programming with haskellFunctional programming with haskell
Functional programming with haskell
 
2-D array
2-D array2-D array
2-D array
 
130717666736980000
130717666736980000130717666736980000
130717666736980000
 
Chapter 7.3
Chapter 7.3Chapter 7.3
Chapter 7.3
 
Chunked, dplyr for large text files
Chunked, dplyr for large text filesChunked, dplyr for large text files
Chunked, dplyr for large text files
 
GLSL
GLSLGLSL
GLSL
 
Scala collections wizardry - Scalapeño
Scala collections wizardry - ScalapeñoScala collections wizardry - Scalapeño
Scala collections wizardry - Scalapeño
 
Introduction To Algebird: Abstract algebra for analytics
Introduction To Algebird: Abstract algebra for analyticsIntroduction To Algebird: Abstract algebra for analytics
Introduction To Algebird: Abstract algebra for analytics
 
mat lab introduction and basics to learn
mat lab introduction and basics to learnmat lab introduction and basics to learn
mat lab introduction and basics to learn
 
Data Streaming Algorithms
Data Streaming AlgorithmsData Streaming Algorithms
Data Streaming Algorithms
 
Java Arrays
Java ArraysJava Arrays
Java Arrays
 
1.Array and linklst definition
1.Array and linklst definition1.Array and linklst definition
1.Array and linklst definition
 
IR-ranking
IR-rankingIR-ranking
IR-ranking
 
Lecture 4
Lecture 4Lecture 4
Lecture 4
 
Docopt, beautiful command-line options for R, user2014
Docopt, beautiful command-line options for R,  user2014Docopt, beautiful command-line options for R,  user2014
Docopt, beautiful command-line options for R, user2014
 
Arrays in VbScript
Arrays in VbScriptArrays in VbScript
Arrays in VbScript
 
Vector3
Vector3Vector3
Vector3
 

Andere mochten auch

Progressive duplicate detection
Progressive duplicate detectionProgressive duplicate detection
Progressive duplicate detectionieeepondy
 
Duplicate detection
Duplicate detectionDuplicate detection
Duplicate detectionjonecx
 
A study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismsA study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismseSAT Journals
 
An adaptive algorithm for detection of duplicate records
An adaptive algorithm for detection of duplicate recordsAn adaptive algorithm for detection of duplicate records
An adaptive algorithm for detection of duplicate recordsLikan Patra
 
novel and efficient approch for detection of duplicate pages in web crawling
novel and efficient approch for detection of duplicate pages in web crawlingnovel and efficient approch for detection of duplicate pages in web crawling
novel and efficient approch for detection of duplicate pages in web crawlingVipin Kp
 
The Duplicitous Duplicate
The Duplicitous DuplicateThe Duplicitous Duplicate
The Duplicitous DuplicateAnish Raivadera
 
Production&creative
Production&creativeProduction&creative
Production&creativeRed Keds
 
Post-buckling analysis of a simply supported compound beams made of two symme...
Post-buckling analysis of a simply supported compound beams made of two symme...Post-buckling analysis of a simply supported compound beams made of two symme...
Post-buckling analysis of a simply supported compound beams made of two symme...IOSR Journals
 
Pura Tirtha Empul (Bali)
Pura Tirtha Empul (Bali)Pura Tirtha Empul (Bali)
Pura Tirtha Empul (Bali)F. Ovies
 
От кирпича к мультиканальности, от fuckupов - к лидерству.
От кирпича к мультиканальности, от fuckupов  - к лидерству.От кирпича к мультиканальности, от fuckupов  - к лидерству.
От кирпича к мультиканальности, от fuckupов - к лидерству.Promodo
 
Engage 2013 - Webtrends Streams
Engage 2013 - Webtrends StreamsEngage 2013 - Webtrends Streams
Engage 2013 - Webtrends StreamsWebtrends
 
Engage 2013 - SEM Optimization
Engage 2013 - SEM OptimizationEngage 2013 - SEM Optimization
Engage 2013 - SEM OptimizationWebtrends
 
2009 Heinz Marketing Holiday Cocktail Collection
2009 Heinz Marketing Holiday Cocktail Collection2009 Heinz Marketing Holiday Cocktail Collection
2009 Heinz Marketing Holiday Cocktail CollectionHeinz Marketing Inc
 
From Mao to More: Catching up with the next generation of talent in China
From Mao to More: Catching up with the next generation of talent in China From Mao to More: Catching up with the next generation of talent in China
From Mao to More: Catching up with the next generation of talent in China MSL
 
Lean UX for Design Teams (Crushing the Boulder)
Lean UX for Design Teams (Crushing the Boulder)Lean UX for Design Teams (Crushing the Boulder)
Lean UX for Design Teams (Crushing the Boulder)Janice Fraser
 
Efficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data SetsEfficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data SetsPradeeban Kathiravelu, Ph.D.
 
[2014 CodeEngn Conference 11] 김기홍 - 빅데이터 기반 악성코드 자동 분석 플랫폼
[2014 CodeEngn Conference 11] 김기홍 - 빅데이터 기반 악성코드 자동 분석 플랫폼[2014 CodeEngn Conference 11] 김기홍 - 빅데이터 기반 악성코드 자동 분석 플랫폼
[2014 CodeEngn Conference 11] 김기홍 - 빅데이터 기반 악성코드 자동 분석 플랫폼GangSeok Lee
 

Andere mochten auch (20)

Progressive duplicate detection
Progressive duplicate detectionProgressive duplicate detection
Progressive duplicate detection
 
Duplicate detection
Duplicate detectionDuplicate detection
Duplicate detection
 
A study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismsA study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanisms
 
An adaptive algorithm for detection of duplicate records
An adaptive algorithm for detection of duplicate recordsAn adaptive algorithm for detection of duplicate records
An adaptive algorithm for detection of duplicate records
 
novel and efficient approch for detection of duplicate pages in web crawling
novel and efficient approch for detection of duplicate pages in web crawlingnovel and efficient approch for detection of duplicate pages in web crawling
novel and efficient approch for detection of duplicate pages in web crawling
 
The Duplicitous Duplicate
The Duplicitous DuplicateThe Duplicitous Duplicate
The Duplicitous Duplicate
 
Production&creative
Production&creativeProduction&creative
Production&creative
 
TIPOGRAMAS LEGOLAND
TIPOGRAMAS LEGOLANDTIPOGRAMAS LEGOLAND
TIPOGRAMAS LEGOLAND
 
Post-buckling analysis of a simply supported compound beams made of two symme...
Post-buckling analysis of a simply supported compound beams made of two symme...Post-buckling analysis of a simply supported compound beams made of two symme...
Post-buckling analysis of a simply supported compound beams made of two symme...
 
Web components
Web componentsWeb components
Web components
 
Pura Tirtha Empul (Bali)
Pura Tirtha Empul (Bali)Pura Tirtha Empul (Bali)
Pura Tirtha Empul (Bali)
 
От кирпича к мультиканальности, от fuckupов - к лидерству.
От кирпича к мультиканальности, от fuckupов  - к лидерству.От кирпича к мультиканальности, от fuckupов  - к лидерству.
От кирпича к мультиканальности, от fuckupов - к лидерству.
 
Engage 2013 - Webtrends Streams
Engage 2013 - Webtrends StreamsEngage 2013 - Webtrends Streams
Engage 2013 - Webtrends Streams
 
Engage 2013 - SEM Optimization
Engage 2013 - SEM OptimizationEngage 2013 - SEM Optimization
Engage 2013 - SEM Optimization
 
2009 Heinz Marketing Holiday Cocktail Collection
2009 Heinz Marketing Holiday Cocktail Collection2009 Heinz Marketing Holiday Cocktail Collection
2009 Heinz Marketing Holiday Cocktail Collection
 
From Mao to More: Catching up with the next generation of talent in China
From Mao to More: Catching up with the next generation of talent in China From Mao to More: Catching up with the next generation of talent in China
From Mao to More: Catching up with the next generation of talent in China
 
EAGLENEST
EAGLENESTEAGLENEST
EAGLENEST
 
Lean UX for Design Teams (Crushing the Boulder)
Lean UX for Design Teams (Crushing the Boulder)Lean UX for Design Teams (Crushing the Boulder)
Lean UX for Design Teams (Crushing the Boulder)
 
Efficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data SetsEfficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data Sets
 
[2014 CodeEngn Conference 11] 김기홍 - 빅데이터 기반 악성코드 자동 분석 플랫폼
[2014 CodeEngn Conference 11] 김기홍 - 빅데이터 기반 악성코드 자동 분석 플랫폼[2014 CodeEngn Conference 11] 김기홍 - 빅데이터 기반 악성코드 자동 분석 플랫폼
[2014 CodeEngn Conference 11] 김기홍 - 빅데이터 기반 악성코드 자동 분석 플랫폼
 

Ähnlich wie Tutorial 4 (duplicate detection)

3 - Finding similar items
3 - Finding similar items3 - Finding similar items
3 - Finding similar itemsViet-Trung TRAN
 
image compresson
image compressonimage compresson
image compressonAjay Kumar
 
Image compression
Image compressionImage compression
Image compressionAle Johnsan
 
ECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERINGECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERINGGeorge Simov
 
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...Mail.ru Group
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligencevini89
 
Bekas for cognitive_speaker_series
Bekas for cognitive_speaker_seriesBekas for cognitive_speaker_series
Bekas for cognitive_speaker_seriesdiannepatricia
 
Bekas for cognitive_speaker_series
Bekas for cognitive_speaker_seriesBekas for cognitive_speaker_series
Bekas for cognitive_speaker_seriesdiannepatricia
 
Report on Efficient Estimation for High Similarities using Odd Sketches
Report on Efficient Estimation for High Similarities using Odd Sketches  Report on Efficient Estimation for High Similarities using Odd Sketches
Report on Efficient Estimation for High Similarities using Odd Sketches AXEL FOTSO
 
Introduction to complex networks
Introduction to complex networksIntroduction to complex networks
Introduction to complex networksVincent Traag
 
Exploiting the query structure for efficient join ordering in SPARQL queries
Exploiting the query structure for efficient join ordering in SPARQL queriesExploiting the query structure for efficient join ordering in SPARQL queries
Exploiting the query structure for efficient join ordering in SPARQL queriesLuiz Henrique Zambom Santana
 
ImageCompression.ppt
ImageCompression.pptImageCompression.ppt
ImageCompression.pptdudoo1
 
ImageCompression.ppt
ImageCompression.pptImageCompression.ppt
ImageCompression.pptssuser6d1fca
 
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)Matthew Lease
 
Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)fridolin.wild
 
similarity1 (6).ppt
similarity1 (6).pptsimilarity1 (6).ppt
similarity1 (6).pptssuserf11a32
 

Ähnlich wie Tutorial 4 (duplicate detection) (20)

3 - Finding similar items
3 - Finding similar items3 - Finding similar items
3 - Finding similar items
 
image compresson
image compressonimage compresson
image compresson
 
Image compression
Image compressionImage compression
Image compression
 
ECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERINGECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERING
 
Nn
NnNn
Nn
 
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
 
Bekas for cognitive_speaker_series
Bekas for cognitive_speaker_seriesBekas for cognitive_speaker_series
Bekas for cognitive_speaker_series
 
Bekas for cognitive_speaker_series
Bekas for cognitive_speaker_seriesBekas for cognitive_speaker_series
Bekas for cognitive_speaker_series
 
Report on Efficient Estimation for High Similarities using Odd Sketches
Report on Efficient Estimation for High Similarities using Odd Sketches  Report on Efficient Estimation for High Similarities using Odd Sketches
Report on Efficient Estimation for High Similarities using Odd Sketches
 
Introduction to complex networks
Introduction to complex networksIntroduction to complex networks
Introduction to complex networks
 
Realtime Analytics
Realtime AnalyticsRealtime Analytics
Realtime Analytics
 
Lec10 matching
Lec10 matchingLec10 matching
Lec10 matching
 
Exploiting the query structure for efficient join ordering in SPARQL queries
Exploiting the query structure for efficient join ordering in SPARQL queriesExploiting the query structure for efficient join ordering in SPARQL queries
Exploiting the query structure for efficient join ordering in SPARQL queries
 
ImageCompression.ppt
ImageCompression.pptImageCompression.ppt
ImageCompression.ppt
 
ImageCompression.ppt
ImageCompression.pptImageCompression.ppt
ImageCompression.ppt
 
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
 
Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)
 
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
 
similarity1 (6).ppt
similarity1 (6).pptsimilarity1 (6).ppt
similarity1 (6).ppt
 

Mehr von Kira

Tutorial 14 (collaborative filtering)
Tutorial 14 (collaborative filtering)Tutorial 14 (collaborative filtering)
Tutorial 14 (collaborative filtering)Kira
 
Tutorial 12 (click models)
Tutorial 12 (click models)Tutorial 12 (click models)
Tutorial 12 (click models)Kira
 
Tutorial 11 (computational advertising)
Tutorial 11 (computational advertising)Tutorial 11 (computational advertising)
Tutorial 11 (computational advertising)Kira
 
Tutorial 10 (computational advertising)
Tutorial 10 (computational advertising)Tutorial 10 (computational advertising)
Tutorial 10 (computational advertising)Kira
 
Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)Kira
 
Tutorial 8 (web graph models)
Tutorial 8 (web graph models)Tutorial 8 (web graph models)
Tutorial 8 (web graph models)Kira
 
Tutorial 7 (link analysis)
Tutorial 7 (link analysis)Tutorial 7 (link analysis)
Tutorial 7 (link analysis)Kira
 
Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)Kira
 
Tutorial 5 (lucene)
Tutorial 5 (lucene)Tutorial 5 (lucene)
Tutorial 5 (lucene)Kira
 
Tutorial 3 (b tree min heap)
Tutorial 3 (b tree min heap)Tutorial 3 (b tree min heap)
Tutorial 3 (b tree min heap)Kira
 
Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)Kira
 
Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Kira
 
Tutorial 13 (explicit ugc + sentiment analysis)
Tutorial 13 (explicit ugc + sentiment analysis)Tutorial 13 (explicit ugc + sentiment analysis)
Tutorial 13 (explicit ugc + sentiment analysis)Kira
 

Mehr von Kira (13)

Tutorial 14 (collaborative filtering)
Tutorial 14 (collaborative filtering)Tutorial 14 (collaborative filtering)
Tutorial 14 (collaborative filtering)
 
Tutorial 12 (click models)
Tutorial 12 (click models)Tutorial 12 (click models)
Tutorial 12 (click models)
 
Tutorial 11 (computational advertising)
Tutorial 11 (computational advertising)Tutorial 11 (computational advertising)
Tutorial 11 (computational advertising)
 
Tutorial 10 (computational advertising)
Tutorial 10 (computational advertising)Tutorial 10 (computational advertising)
Tutorial 10 (computational advertising)
 
Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)
 
Tutorial 8 (web graph models)
Tutorial 8 (web graph models)Tutorial 8 (web graph models)
Tutorial 8 (web graph models)
 
Tutorial 7 (link analysis)
Tutorial 7 (link analysis)Tutorial 7 (link analysis)
Tutorial 7 (link analysis)
 
Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)
 
Tutorial 5 (lucene)
Tutorial 5 (lucene)Tutorial 5 (lucene)
Tutorial 5 (lucene)
 
Tutorial 3 (b tree min heap)
Tutorial 3 (b tree min heap)Tutorial 3 (b tree min heap)
Tutorial 3 (b tree min heap)
 
Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)
 
Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)
 
Tutorial 13 (explicit ugc + sentiment analysis)
Tutorial 13 (explicit ugc + sentiment analysis)Tutorial 13 (explicit ugc + sentiment analysis)
Tutorial 13 (explicit ugc + sentiment analysis)
 

Kürzlich hochgeladen

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 

Kürzlich hochgeladen (20)

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 

Tutorial 4 (duplicate detection)

  • 1. Duplicate detection Kira Radinsky Based on the Standford slides by Christopher Manning and Prabhakar Raghavan
  • 2. Duplication • ~30% of the content on the Web is near-duplicate pages – Pages with content that is nearly identical to that of other pages • Issues: – Index duplicate content only once – Return only one version in the search results – How can near-duplicate pages be identified in a scalable and reliable manner?
  • 3. Duplicate/Near-Duplicate Detection • Duplication: Exact match can be detected with fingerprints • Near-Duplication: Approximate match – Compute syntactic similarity with an edit-distance measure – Use similarity threshold to detect near-duplicates • E.g., Similarity > 80% => Documents are near duplicates • Not transitive though sometimes used transitively
  • 4. Computing Similarity - Shingles Features for Similarity (Shingles (Word N-Grams)) • Segments of a document (natural or artificial breakpoints) • K-shingling of a document transforms the document into a set containing all windows of k contiguous terms • E.g., 4-shingling of “My name is Inigo Montoya. You killed my father. Prepare to die”: { my name is inigo name is inigo montoya is inigo montoya you inigo montoya you killed montoya you killed my, you killed my father killed my father prepare my father prepare to father prepare to die }
  • 5. Computing Similarity – distance metric Similarity Measurement • Denote by Sk(d) the k-shingling of document d • Definition: the resemblance of d1 and d2, R(d1,d2) = |Sk(d1)  Sk(d2)| / |Sk(d1)  Sk(d2)| • The distance measure Δ(d1,d2) = 1-R(d1,d2) is a metric
  • 6. Shingles + Set Intersection 1. Computing exact set intersection of shingles between all pairs of documents is expensive/intractable – Approximate using a cleverly chosen subset of shingles from each (a sketch) 2. Estimate (size_of_intersection / size_of_union) based on a short sketch Doc A Shingle set A Sketch A Doc B Shingle set B Sketch B Jaccard
  • 7. Sketch of a document Create a sketch vector (of size ~200) for each document – Documents that share ≥ t (usually 80%) corresponding vector elements will be considered near duplicates – Definitions: • Let f map all k-shingles in the universe to 0..2m (e.g., f = fingerprinting) • Let p be a random permutation on 0..2m – For doc D, sketchD is as follows: • Sketch Option 1: Fm(d)=minm(Π(Sk(d)) be the m smallest numbers after applying Π to the k-shingling of d • Sketch Option 2: Vn(d) = { t ε Π(Sk(d)) | t = 0 mod n }
  • 8. From Shingles to Sketches (cont.) • A function X is an unbiased estimator of a value Y if E(X)=Y • Theorem: when choosing Π u.a.r., the following functions are unbiased estimators of R(d1,d2): • | minm(Fm(d1)  Fm(d2))  Fm(d1)  Fm(d2) | / | minm(Fm(d1)  Fm(d2)) | • | Vn(d1)  Vn(d2) | / | Vn(d1)  Vn(d2) | • Either Fm(d) or Vn(d) can be chosen as d’s sketch – Fm(d) has the advantage that it is of fixed size – Vn(d) is easier to compute
  • 9. Multiple Sketches • Let pi be a random permutation on 0..2m • For doc D, sketchD[ i ] is as follows: – Let f map all shingles in the universe to 0..2m (e.g., f = fingerprinting) – Let pi be a random permutation on 0..2m – Pick MIN {pi(f(s))} over all shingles s in D
  • 10. Computing Sketch[i] for Doc1 Document 1 264 264 264 264 Start with 64-bit f(shingles) Permute on the number line with pi Pick the min value
  • 11. Test if Doc1.Sketch[i] = Doc2.Sketch[i] Document 1 Document 2 264 264 264 264 264 264 264 264 Are these equal? Test for 200 random permutations: p1, p2,… p200 A B
  • 12. However… Document 1 Document 2 264 264 264 264 264 264 264 264 A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection) Claim: This happens with probability Size_of_intersection / Size_of_union Why? BA
  • 13. Set Similarity of sets Ci , Cj • View sets as columns of a matrix A – one row for each element in the universe. – aij = 1 indicates presence of item i in set j • Example ji ji ji CC CC )C,Jaccard(C    C1 C2 0 1 1 0 1 1 Jaccard(C1,C2) = 2/5 = 0.4 0 0 1 1 0 1
  • 14. Key Observation • For columns Ci, Cj, four types of rows Ci Cj A 1 1 B 1 0 C 0 1 D 0 0 • Let A = # of rows of type A • Claim CBA A )C,Jaccard(C ji  
  • 15. Min Hashing • Randomly permute rows • Hash h(Ci) = index of first row with 1 in column Ci • Surprising Property • Why? – Both are A/(A+B+C) – Look down columns Ci, Cj until first non-Type-D row – h(Ci) = h(Cj)  type A row    jiji C,CJaccard)h(C)h(CP 
  • 16. Min-Hash sketches • Pick P random row permutations • MinHash sketch SketchD = list of P indexes of first rows with 1 in column C • Similarity of signatures – Let sim[sketch(Ci),sketch(Cj)] = fraction of permutations where MinHash values agree – Observe E[sim(sig(Ci),sig(Cj))] = Jaccard(Ci,Cj)
  • 17. Example C1 C2 C3 R1 1 0 1 R2 0 1 1 R3 1 0 0 R4 1 0 1 R5 0 1 0 Signatures S1 S2 S3 Perm 1 = (12345) 1 2 1 Perm 2 = (54321) 4 5 4 Perm 3 = (34512) 3 5 4 Similarities 1-2 1-3 2-3 Col-Col 0.00 0.50 0.25 Sig-Sig 0.00 0.67 0.00
  • 18. Algorithm for Clustering Near-Duplicate Documents 1. Compute the sketch of each document 2. From each sketch, produce a list of <shingle, docID> pairs 3. Group all pairs by shingle value 4. For any shingle that is shared by more than one document, output a triplet <smaller-docID, larger-docID, 1> for each pair of docIDs sharing that shingle 5. Sort and aggregate the list of triplets, producing final triplets of the form <smaller-docID, larger-docID, # common shingles> 6. Join any pair of documents whose number of common shingles exceeds a chosen threshold using a “Union-Find” algorithm 7. Each resulting connected component of the UF algorithm is a cluster of near-duplicate documents Implementation nicely fits the “map-reduce” programming paradigm
  • 19.
  • 20. Implementation Trick • Permuting universe even once is prohibitive • Row Hashing – Pick P hash functions hk: {1,…,n}{1,…,O(n)} – Ordering under hk gives random permutation of rows • One-pass Implementation – For each Ci and hk, keep slot for min-hash value – Initialize all slot(Ci,hk) to infinity – Scan rows in arbitrary order looking for 1’s • Suppose row Rj has 1 in column Ci • For each hk, – if hk(j) < slot(Ci,hk), then slot(Ci,hk)  hk(j)
  • 21. Example C1 C2 R1 1 0 R2 0 1 R3 1 1 R4 1 0 R5 0 1 h(x) = x mod 5 g(x) = 2x+1 mod 5 h(1) = 1 1 - g(1) = 3 3 - h(2) = 2 1 2 g(2) = 0 3 0 h(3) = 3 1 2 g(3) = 2 2 0 h(4) = 4 1 2 g(4) = 4 2 0 h(5) = 0 1 0 g(5) = 1 2 0 C1 slots C2 slots