Haystack Live tallison_202010_v2

Duplicate and Near Duplicate
Detection at Scale
Tim Allison, Ph.D.
Data Scientist/Relevance Engineer
Artificial Intelligence, Analytics and Innovative
Development Organization
© 2020 California Institute of Technology. Government sponsorship
acknowledged.
Reference herein to any specific commercial product, process,
or service by trade name, trademark, manufacturer, or
otherwise, does not constitute or imply its endorsement by the
United States Government or the Jet Propulsion Laboratory,
California Institute of Technology.

jpl.nasa.gov
About me
• Data scientist (files and search), Jet Propulsion Laboratory, California Institute of Technology
• Chair/V.P. Apache Tika
• Committer Apache PDFBox, POI, Lucene/Solr, OpenNLP
• Member Apache Software Foundation
10/22/20

Outline
• Search system assessments, an overview of options
• Plug for text extraction assessment
• Duplicates and near duplicates – Case Study
• Exploration: Near duplicates with minhash
• Conclusion

Search System Assessment: 20,000 ft view
• Offline
    • Ground truth queries and expected docs
• Online
    • User behavior
    • User feedback
• Surveys, interviews
• Technical review
    • System
    • Data

System Assessment
Udo Kruschwitz and Charlie Hull, “Searching the Enterprise”, Foundations and Trends® in Information Retrieval, 11(1):1-142, July 2017, p. 16.

System Assessment
• Crawler configurations
• Text extraction configurations
• Schema and field configuration
• Query Parser configuration
    • Default Boolean operator
    • Fields, field boosts
    • …

Data Assessment
• File types, parser coverage
• Quality of text extraction (…languages)
• Quality of metadata: dates, duplicate titles/metadata
• Liveness of documents/URLs/URL redirects
• Duplicates and near duplicates

Plug for Text Extraction Assessment

Out of vocabulary (OOV) – Same file, different extractors
Metric              Tika 1.14             Tika 1.15-SNAPSHOT
Unique Tokens       786                   156
Total Tokens        1603                  272
LangId              zh-ch                 de
Common Words        0                     116
Alphabetic Tokens   1603                  250
OOV%                1-(0/1603) = 100%     1-(116/250) = 54%
Top N Tokens (Tika 1.14): 捳敨: 18 | 獴档: 14 | 略獴: 14 | m: 11 | 杮湥: 11 | 瑵捳: 11 | 畬杮: 11 | 档湥: 10 | 搠敩: 9 | 敮浨: 9
Top N Tokens (Tika 1.15-SNAPSHOT): die: 11 | und: 8 | von: 8 | deutschen: 7 | deutsche: 6 | 1: 5 | das: 5 | der: 5 | finanzministerium: 5 | oder: 5
Fixed encoding detection between 1.14 and 1.15
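The OOV% figures for the two extractors follow directly from the common-word and alphabetic-token counts. A minimal sketch of that arithmetic, assuming (as the slide's formula suggests) that OOV% is one minus the ratio of common words to alphabetic tokens. The function name is hypothetical and this is not tika-eval's actual code:

```python
def oov_percent(common_words: int, alphabetic_tokens: int) -> float:
    """Percentage of alphabetic tokens NOT found in the common-words list."""
    if alphabetic_tokens == 0:
        # Assumption: report 0% for an empty extract instead of dividing by zero.
        return 0.0
    return 100.0 * (1 - common_words / alphabetic_tokens)


# Counts taken from the slide: Tika 1.14 vs. Tika 1.15-SNAPSHOT.
print(round(oov_percent(0, 1603)))
print(round(oov_percent(116, 250)))
```

A high OOV% next to an implausible language id (zh-ch for a German file) is the signal that the extracted text is garbage rather than content.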

Quality of text extraction, an example
https://voyager.jpl.nasa.gov/pdf/sfos2003pdf/03_10_02-03_10_19.sfos.pdf
Language Id: Nepali (Out of Vocabulary 99%)

Unexplained Garbage at Beginning of File(???)
This unexplained garbage at the beginning of a file also occurs in several other PDF files identified as Nepali

From analytics to action
https://aviris.jpl.nasa.gov/proceedings/workshops/02_docs/2002_Ogura_1_web.pdf

Stored text vs. Optical Character Recognition
Text As Stored in File
!"#%$& (') *,+-).' / 0 1,23 *. 457698;:;<>=75?&@78;ACB
D(B7E;FHGJICBK5MLNBKOPBKF;B DJD Q R S.TVU9WNXMY[ZT]^W_S `badc
5KICedFgfh5 cji :;edF;A^5KEk<>Imln:;e[<>EnloedACICe a
lo<p57Eg5Kqsr;E;<jloe[E 8;O 6hedA5Kq adc 57ItedFk:;B c qsICf;B a
Text from Tesseract OCR
Constrained Least Squares Linear Spectral Unmixture by the Hybrid Steepest
Descent Method
Nobuhiko Ogura’ and Isao Yamada”
1 Introduction
A closed polyhedron is the intersection of finite number of closed half
spaces, i.e., the setof points satisfying finite number of lincar
incqualitics, and is widely used as a constraint in various application, for
example specifications or constraints in signal processing or estimation
problems, resource restrictions in financial applications and feasible sets of

Duplicates and Near Duplicates

Experimental Setup
• Development “web_index” (~12.5 million documents)
• Slightly out of date compared with production, but close enough
• Covers internal web, but not other “document-heavy” indices
• Safer to avoid heavy computation on production cluster
• Small enough to reindex with different field settings on dev cluster
• Use existing tools/metrics: no contrib modules or hand-coded algorithms

Duplicates!

How big of a problem are duplicates?
First “lesson learned” in Oleksiy Kovyrin’s recent talk “Sprinting to a crawl: Building an effective web crawler” at ElasticON Global 2020

Google has several patents for (near)duplicate detection
https://patents.google.com/?q=%22duplicate+documents%22&assignee=Google%2c+Llc&num=100&oq=assignee:(Google%2c+Llc)+%22duplicate+documents%22&sort=new

Google’s Guidance for Duplicates and Search Engine
Optimization (SEO)
https://support.google.com/webmasters/answer/66359?hl=en

File Types – Top 10 file types in web_index
File Type Count
text/html 8,894,038
image/gif 1,870,136
image/jpeg 1,094,937
image/png 319,710
text/plain 109,516
application/pdf 105,081
application/x-hdf 64,194
image/x-ms-bmp 26,377
application/xml 8,734
application/msword 7,414

Duplicates, near duplicates
• Digests
    • Literal bytes of a file are the same
• Text Digests
    • Extracted text from a document is the same
• Text Profile Digest (see next slide)
    • Require all words
    • Drop the rarer words in a document (default)
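The difference between a digest and a text digest can be sketched in a few lines. This is an illustration only: SHA-256 and the whitespace normalization are assumptions made for the example, and the function names are hypothetical, not what Nutch or tika-eval use internally.

```python
import hashlib


def digest(raw_bytes: bytes) -> str:
    """Digest of the literal bytes of a file."""
    return hashlib.sha256(raw_bytes).hexdigest()


def text_digest(extracted_text: str) -> str:
    """Digest of the extracted text, after collapsing whitespace (an assumption)."""
    normalized = " ".join(extracted_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Two files with different markup but the same extracted text.
file_a = b"<p>Hello world</p>"
file_b = b"<div>Hello world</div>"
print(digest(file_a) == digest(file_b))                         # byte digests differ
print(text_digest("Hello world") == text_digest("Hello  world\n"))  # text digests match
```

Files that differ only in markup, headers, or container format fall through a byte digest but collide on the text digest, which is why the two counts diverge so sharply on a web index.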

Nutch’s TextProfile
Data Search | JPL's Earth Science Airborne Program Jump to navigation
Earth Science Airborne Program JPL's Suborbital Earth Science
Instruments & Measurements Home › All Products › Instrument: Fourier
Transform Infrared Spectrometer (FTS) › Product Type: FTS_L2QR ›
Platform: C-23 Sherpa › Parameter: Atmospheric Chemistry › Platform
Type: Airborne › Campaign: Carbon in Arctic Reservoirs Vulnerability
Experiment (CARVE) Data Search Show Advanced search Temporal
Search Start Date Stop Date Free Text Search Enter search text Spatial
Search (Hold Shift to draw bounding box) + - Perform Search Sort By
Popularity (All Time) Popularity (This Month) Popularity (Users) Long
Name (A-Z) Short Name (A-Z) Grid Spatial Resolution Satellite Spatial
Resolution Start Date Stop Date Found 0 matching products(s). Browse
Products Campaign Any campaign Carbon in Arctic Reservoirs
Vulnerability Experiment (CARVE) (261) Parameter Any parameter
Atmospheric Chemistry (261) Instrument Any instrument Fourier
Transform Infrared Spectrometer (FTS) (261) Platform Any platform C-23
Sherpa (261) Platform Type Any platform type Airborne (261) Product
Type Any product type FTS_L2QR (261)
Term       Quantized Count
search     8
261        6
any        6
platform   6
type       6
airborne   4
date       4
Text Profile: “search 261 any platform type airborne date…”
Quantize counts, sort in descending order of frequency, and drop terms whose quantized count is below a threshold
https://airbornescience.jpl.nasa.gov/data
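The three steps on this slide (quantize, sort, drop) can be sketched as follows. This is a simplified illustration loosely modeled on those steps; the tokenizer, the quantization rule, and the defaults `quant_rate` and `min_token_len` are assumptions, not a copy of Nutch's code.

```python
import hashlib
import re
from collections import Counter


def text_profile(text, quant_rate=0.01, min_token_len=2):
    """Return (term, quantized_count) pairs, most frequent first."""
    tokens = [t for t in re.findall(r"\w+", text.lower()) if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return []
    max_freq = max(counts.values())
    # Quantization step: a fraction of the highest frequency, but at least 2
    # whenever any term repeats, so that terms seen once are dropped.
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1
    profile = []
    for term, freq in counts.items():
        quantized = (freq // quant) * quant
        if quantized >= quant:  # drop terms below the threshold
            profile.append((term, quantized))
    # Descending count, then term, so the ordering is deterministic.
    profile.sort(key=lambda pair: (-pair[1], pair[0]))
    return profile


def text_profile_digest(text, **kwargs):
    joined = "\n".join(f"{term} {count}" for term, count in text_profile(text, **kwargs))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


a = "search search platform platform search date"
b = "search search platform platform search date extra"
print(text_profile(a))
print(text_profile_digest(a) == text_profile_digest(b))
```

Because rare terms are dropped before hashing, two pages that differ only in a few infrequent words end up with the same text profile digest.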

Different Digest, Different Text Digest, Same Text Profile

Digests vs Text Digests vs Text Profile Digests in non-image documents
• Total non-image documents: 9.2 million
• Distinct digests: 8.6 million
• Distinct text digests: 5.2 million
• Distinct text profile (keep all words): 5.1 million
• Distinct text profile (drop infrequent words): 2.7 million
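These counts translate into a share of redundant documents under each definition of "duplicate". A back-of-the-envelope sketch using the rounded figures from the slide; the `redundancy` helper is hypothetical, and the results are only as precise as the rounded inputs:

```python
def redundancy(distinct_count: float, total_count: float) -> float:
    """Fraction of documents that repeat an already-seen digest."""
    return 1 - distinct_count / total_count


TOTAL_MILLIONS = 9.2  # non-image documents, in millions
distinct_millions = {
    "digest": 8.6,
    "text digest": 5.2,
    "text profile (keep all words)": 5.1,
    "text profile (drop infrequent words)": 2.7,
}

for name, distinct in distinct_millions.items():
    print(f"{name}: {redundancy(distinct, TOTAL_MILLIONS):.0%}")
```

On these rounded numbers, roughly 7% of non-image documents are byte-level duplicates, while about 43% repeat a text digest and about 71% repeat a default text profile.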

Number of Non-Image Documents with a Distinct Digest
Rank       Digest    Text Digest   Text Profile Digest
digest1    27,874    2,810,868     2,810,868
digest2    10,089    73,203        489,821
digest3    1,565     27,874        73,225
digest4    1,170     10,089        63,818
digest5    1,166     7,926         58,271
digest6    1,128     2,589         27,874
digest7    1,072     2,557         25,311
digest8    990       1,911         12,222
digest9    933       1,616         11,973
digest10   841       1,573         10,089

2.8 million?!
Yes! On the development index.
In production, there are ONLY 880k!
Text of the duplicated page: “Error page. The Web Server encountered an unknown runtime error. Cannot display page…”

Initial Takeaway
• Some easy fixes

Exploration: Near Duplicates with
MinHash

Experiments with MinHash
• Earlier proof-of-concept implemented by intern
• Filter available in Elasticsearch to allow for fuzzy hashing/near duplicate detection
• Default settings: digest 5-grams (see next slide), summarize digests into 512 tokens (buckets)
• Run a “MoreLikeThis” query; there is a more efficient algorithm, but it is not built into ES yet*
*Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-minhash-tokenfilter.html
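For readers new to MinHash, the idea can be sketched in plain Python. This is the classic "one minimum per hash function" variant, written for illustration; it is not the bucketed scheme the Elasticsearch filter uses, and the hash function (salted BLAKE2b), the 64 hash functions, and all names are assumptions.

```python
import hashlib


def shingles(text: str, n: int = 5) -> set:
    """Set of word n-grams (5-grams by default) in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(item: str, seed: int) -> int:
    # A different salt per seed simulates a family of independent hash functions.
    salt = seed.to_bytes(8, "big")
    h = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt)
    return int.from_bytes(h.digest(), "big")


def minhash_signature(items: set, num_hashes: int = 64) -> list:
    """For each hash function, keep the minimum hash over all shingles."""
    if not items:
        raise ValueError("need at least one shingle")
    return [min(_hash(s, seed) for s in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "the quick brown fox jumped over the lazy dog near the quiet river bank today"
doc_b = "the quick brown fox jumped over the lazy dog near the quiet river bank yesterday"
sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
print(estimated_jaccard(sig_a, sig_b))
```

The signature is a fixed-size summary, so two documents can be compared without holding their full shingle sets; more hash functions give a tighter estimate at higher cost.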

What’s a 5-gram?
• “the quick brown fox jumped over the lazy dog”
• “the quick brown fox jumped”
• “quick brown fox jumped over”
• ….
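The same sliding window can be written as a short function. A minimal sketch, assuming simple whitespace tokenization (a real analyzer would also lowercase and strip punctuation); the function name is hypothetical:

```python
def word_ngrams(text: str, n: int = 5) -> list:
    """Slide a window of n words across the text, one word at a time."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


for gram in word_ngrams("the quick brown fox jumped over the lazy dog"):
    print(gram)
```

A nine-word sentence yields five 5-grams; a document of N words yields N - 4.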

Experiments with MinHash: Findings
• Worked really well on a toy set of synthetic documents
• Performance is prohibitive on the full web_index (even with stored termvectors): estimated ~1 year to query every document in the index
• Note: speed was greatly improved by programmatically retrieving termvectors and creating our own terms query, but it was still not acceptable
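To put the one-year estimate in perspective, a back-of-the-envelope calculation shows the per-query latency it implies. This assumes one query per document, issued sequentially, over the ~12.5 million documents in web_index; the actual measured latency is not given in the slides, and the helper name is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
DOCS_IN_INDEX = 12_500_000  # approximate size of web_index


def implied_seconds_per_query(total_seconds: float, n_docs: int) -> float:
    """Average time per query if every document is queried once, sequentially."""
    return total_seconds / n_docs


print(f"{implied_seconds_per_query(SECONDS_PER_YEAR, DOCS_IN_INDEX):.2f} s per query")
```

Under those assumptions, a latency of only a few seconds per query is already enough to make a full pass over the index impractical.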

Experiments with MinHash: Conclusion
• There may be ways of improving performance with more shards, multithreading, smarter processing, or a different algorithm
• At this point, however, the problems with exact duplicates and/or text duplicates are large enough that further investigation of near duplicates via MinHash is not yet warranted

But why, why was MinHash SO slow?! Some ideas…
• Elasticsearch is optimized for queries of a few words, not 512 “words”
• Aside from exact duplicates, how much duplication do we have in 5-grams?

Index 5-grams
• Intuition: in plagiarism detection, a single shared 5-gram is indicative of duplication, so any given 5-gram should be extremely rare
• Finding: NOT AT ALL RARE on web_index
    • The 10,000th most common 5-gram appears in 12k files!
    • Most common:
        • “an unknown runtime error cannot”: 2.6 million files
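Counting how many documents each 5-gram appears in is a small exercise in document frequency. The talk did this by indexing 5-grams in the search engine; the sketch below is an in-memory stand-in for small samples, with a hypothetical function name and whitespace tokenization assumed.

```python
from collections import Counter


def ngram_document_frequency(docs, n=5):
    """Map each word n-gram to the number of documents that contain it."""
    df = Counter()
    for text in docs:
        words = text.lower().split()
        # A set, so each document counts at most once per n-gram.
        grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        df.update(grams)
    return df


sample_docs = [
    "error page the web server encountered an unknown runtime error cannot display page",
    "the web server encountered an unknown runtime error cannot display page again",
    "totally different text with nothing shared in it at all",
]
df = ngram_document_frequency(sample_docs)
print(df.most_common(3))
```

Sorting by document frequency surfaces error pages and boilerplate first, which is what makes the 5-gram index useful as a diagnostic even when MinHash itself is too slow.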

Shared 5-grams – Some Categories of Causes
• Actual duplication or near duplication
• Boilerplate
    • Web-page based (navigation, etc.)
    • Legal (copyright, branding)
• Machine generated logs

Actual duplication or near duplication

Boilerplate
• Webpage/Navigational
    • “science technology launch vehicle”: 1.4 million files for Mars Odyssey pages
    • “content announcements events opportunities people”: 500k on techconnect pages
• Legal
    • “research and development center staffed”: 640k

Example of Indexed Boilerplate
“science technology launch vehicle spacecraft”: 1.4 million files!!!
https://mars.nasa.gov/odyssey/mission/timeline/communicationsrelay/

Pause for relevance check
• If science, technology, “launch vehicle” and spacecraft appear in 1.4 million documents, how important will those words be in a user query?!

Boilerpipe output
Demo: https://boilerpipe-web.appspot.com/
Available as a handler in Tika: BoilerpipeHandler
Available as a python library: https://pypi.org/project/boilerpy3/

Google is removing boilerplate

Machine Generated Logs
“downlink monitor block has completed”: 14k documents

Takeaways from MinHash and 5-grams
• We have enough to work with for now with digests, text digests and text profile digests
• We can use 5-grams to identify:
    • Boilerplate content that we should remove if boilerpipe isn’t sufficient
    • Content that we might want to demote in relevance or remove from the index (machine generated logs?!)

Categories/causes of (near) duplication
• Exact duplicates
    • Same document, different URL
    • Documents with little or no text
• Near duplicates
    • Different formats: PDF vs HTML of same content
    • Versioning
    • Documents with little text
    • Asymmetric duplicates (A is contained entirely within B, but B is larger), e.g. email included in reply
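Asymmetric duplicates are a case where a symmetric similarity score understates the overlap. A minimal sketch contrasting Jaccard similarity with a one-sided containment score over word 5-grams; both measures are standard, but applying them here is an illustration, not something the talk reports doing, and the function names are hypothetical.

```python
def shingles(text: str, n: int = 5) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Symmetric overlap: shared shingles over all shingles."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a: set, b: set) -> float:
    """One-sided overlap: fraction of A's shingles that also occur in B."""
    return len(a & b) / len(a) if a else 0.0


original = "please review the attached report before the meeting on friday"
reply = "sounds good will do thanks " + original
a, b = shingles(original), shingles(reply)
print(jaccard(a, b), containment(a, b))
```

Here the original email is fully contained in the reply, so containment is 1.0 even though the Jaccard score is well below 1.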

Removal of (near) duplicates is problematic if…
• “Duplicate” documents differ in other key features (same text, but different images)
• Users need to find all versions of a versioned document
• A small difference in text is important, or the main point of the page is non-textual (see next slide)

Slightly different photo metadata

Recommendations, step 1
• Experiment with boilerpipe handler vs. top n 5-grams. Confirm that this doesn’t remove desired text; or identify triggers for boilerpipe handler
• Index token count, lang id, digest and text digest along with documents
• Add major sources of malignant duplicates to “skip list” at crawling stage
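The "top n 5-grams" alternative to boilerpipe could look something like the sketch below: any word covered by a known boilerplate 5-gram is dropped before indexing. This is a hypothetical illustration of the idea, not an implementation from the talk; the function name and the way the boilerplate set is supplied are assumptions.

```python
def strip_boilerplate(text: str, boilerplate_ngrams: set, n: int = 5) -> str:
    """Remove every word that falls inside a known boilerplate n-gram."""
    words = text.split()
    drop = [False] * len(words)
    for i in range(len(words) - n + 1):
        if " ".join(words[i:i + n]).lower() in boilerplate_ngrams:
            for j in range(i, i + n):
                drop[j] = True
    return " ".join(w for w, dropped in zip(words, drop) if not dropped)


# In practice this set would come from the most frequent 5-grams in the index.
boilerplate = {"science technology launch vehicle spacecraft"}
page = "Home Science Technology Launch Vehicle Spacecraft mission timeline details"
print(strip_boilerplate(page, boilerplate))
```

The check the first recommendation calls for still applies: a frequent 5-gram is not always boilerplate, so the candidate list needs review before anything is removed.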

Recommendations, step 2…some options
• Remove duplicates or prevent from insertion
• Add a duplicate identification process and
    • Group by duplicate digest in search results
    • Demote duplicates in search results
    • Allow users to select “include duplicates”
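Grouping by digest at result time can be sketched as a post-processing step over a ranked hit list. This is an application-side illustration; the hit structure and the `text_digest` field name are assumptions, and a search engine's own grouping or collapsing feature would normally do this server-side.

```python
def group_by_digest(ranked_hits, digest_field="text_digest"):
    """Keep the best-ranked hit per digest; attach the rest as duplicates."""
    groups = {}  # dicts preserve insertion order, so ranking order is kept
    for hit in ranked_hits:
        groups.setdefault(hit[digest_field], []).append(hit)
    return [{"top": hits[0], "duplicates": hits[1:]} for hits in groups.values()]


hits = [
    {"id": "a", "text_digest": "d1"},
    {"id": "b", "text_digest": "d2"},
    {"id": "c", "text_digest": "d1"},
]
for group in group_by_digest(hits):
    print(group["top"]["id"], [d["id"] for d in group["duplicates"]])
```

Keeping the duplicates attached rather than discarding them is what makes an “include duplicates” toggle possible later.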

Tools
• Quaerite (https://github.com/tballison/quaerite)
    • Copy indices Solr->ES and vice versa
    • List top n tokens (Solr only): TopNTokens
• tika-eval (https://cwiki.apache.org/confluence/display/TIKA/TikaEval)
    • Token counts
    • Language identification
    • Out of vocabulary %
    • Digest, Text digest, Text profile

Conclusion
• It depends™
• There is no easy button, but this analysis and discovery reveal critical areas for improvement and get us closer to solutions

Some References
• Manku, G., Jain, A. and Dash, A. “Detecting near-duplicates for web crawling.” WWW’07: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33026.pdf
• Early patented work at Google: https://www.cs.umd.edu/~pugh/google/Duplicates.pdf
• LSH at Uber for fraudulent trip detection: https://eng.uber.com/lsh/
• Minhash vs. SimHash: http://proceedings.mlr.press/v33/shrivastava14.pdf

Some Other References
• KNN and LSH in Elasticsearch: https://blog.insightdatascience.com/elastik-nearest-neighbors-4b1f6821bd62
• Minhash in Lucene: https://medium.com/@xingzeng/understanding-minhash-in-lucene-elasticsearch-e6799b78c0d7
• ssdeep and elastic: https://www.intezer.com/blog/intezer-analyze-community/intezer-community-tip-ssdeep-comparisons-with-elasticsearch/
