Duplicate and Near Duplicate
Detection at Scale
Tim Allison, Ph.D.
Data Scientist/Relevance Engineer
Artificial Intelligence, Analytics and Innovative
Development Organization
© 2020 California Institute of Technology. Government sponsorship
acknowledged.
Reference herein to any specific commercial product, process,
or service by trade name, trademark, manufacturer, or
otherwise, does not constitute or imply its endorsement by the
United States Government or the Jet Propulsion Laboratory,
California Institute of Technology.
jpl.nasa.gov
About me
• Data scientist (files and search) Jet Propulsion
Laboratory, California Institute of Technology
• Chair/V.P. Apache Tika
• Committer Apache PDFBox, POI, Lucene/Solr,
OpenNLP
• Member Apache Software Foundation
2© 2020 California Institute of Technology. Government sponsorship acknowledged.10/22/20
Outline
• Search system assessments, an overview of options
• Plug for text extraction assessment
• Duplicates and near duplicates – Case Study
• Exploration: Near duplicates with minhash
• Conclusion
Search System Assessment: 20,000 ft view
• Offline
  • Ground truth queries and expected docs
• Online
  • User behavior
  • User feedback
  • Surveys, interviews
• Technical review
  • System
  • Data
System Assessment
Udo Kruschwitz and Charlie Hull, “Searching the Enterprise”, Foundations and Trends® in Information Retrieval 11(1):1-142, July 2017, p. 16.
System Assessment
• Crawler configurations
• Text extraction configurations
• Schema and field configuration
• Query Parser configuration
  • Default Boolean operator
  • Fields, field boosts
  • …
Data Assessment
• File types, parser coverage
• Quality of text extraction (…languages)
• Quality of metadata – dates, duplicate
titles/metadata
• Liveness of documents/URLs/URL redirects
• Duplicates and near duplicates
Plug for Text Extraction Assessment
Out of vocabulary (OOV) – Same file, different extractors
Metric             Tika 1.14            Tika 1.15-SNAPSHOT
Unique Tokens      786                  156
Total Tokens       1603                 272
LangId             zh-ch                de
Common Words       0                    116
Alphabetic Tokens  1603                 250
OOV%               1-(0/1603) = 100%    1-(116/250) = 54%
Top N Tokens, Tika 1.14: 捳敨: 18 | 獴档: 14 | 略獴: 14 | m: 11 | 杮湥: 11 | 瑵捳: 11 | 畬杮: 11 | 档湥: 10 | 搠敩: 9 | 敮浨: 9
Top N Tokens, Tika 1.15-SNAPSHOT: die: 11 | und: 8 | von: 8 | deutschen: 7 | deutsche: 6 | 1: 5 | das: 5 | der: 5 | finanzministerium: 5 | oder: 5
Fixed encoding detection between 1.14 and 1.15
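The OOV% figures follow directly from the counts on this slide. A minimal sketch of that arithmetic, assuming the formula shown (1 minus common words over alphabetic tokens). The function name is hypothetical and this is not tika-eval's own code:

```python
def oov_percent(common_words: int, alphabetic_tokens: int) -> float:
    """Share of alphabetic tokens NOT found in the common-words list, in percent."""
    if alphabetic_tokens == 0:
        raise ValueError("no alphabetic tokens to evaluate")
    return 100.0 * (1 - common_words / alphabetic_tokens)


# Counts taken from the slide for the same file under two extractor versions.
tika_1_14 = oov_percent(common_words=0, alphabetic_tokens=1603)
tika_1_15 = oov_percent(common_words=116, alphabetic_tokens=250)
print(f"Tika 1.14: {tika_1_14:.0f}%  Tika 1.15-SNAPSHOT: {tika_1_15:.0f}%")
```

A very high OOV% alongside an unexpected language id is the signal to look for: the text is probably mis-decoded rather than genuinely in that language.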
Quality of text extraction, an example
https://voyager.jpl.nasa.gov/pdf/sfos2003pdf/03_10_02-03_10_19.sfos.pdf
Language Id: Nepali (Out of Vocabulary 99%)
Unexplained Garbage at Beginning of File(???)
This unexplained garbage at the beginning of a file also occurs in several other PDF files identified as Nepali
From analytics to action
https://aviris.jpl.nasa.gov/proceedings/workshops/02_docs/2002_Ogura_1_web.pdf
Stored text vs. Optical Character Recognition
Text As Stored in File
!"#%$& (') *,+-).' / 0 1,23 *. 457698;:;<>=75?&@78;ACB
D(B7E;FHGJICBK5MLNBKOPBKF;B DJD Q R S.TVU9WNXMY[ZT]^W_S `badc
5KICedFgfh5 cji :;edF;A^5KEk<>Imln:;e[<>EnloedACICe a
lo<p57Eg5Kqsr;E;<jloe[E 8;O 6hedA5Kq adc 57ItedFk:;B c qsICf;B a
Text from Tesseract OCR
Constrained Least Squares Linear Spectral Unmixture by the Hybrid Steepest
Descent Method
Nobuhiko Ogura’ and Isao Yamada”
1 Introduction
A closed polyhedron is the intersection of finite number of closed half
spaces, i.e., the setof points satisfying finite number of lincar
incqualitics, and is widely used as a constraint in various application, for
example specifications or constraints in signal processing or estimation
problems, resource restrictions in financial applications and feasible sets of
Duplicates and Near Duplicates
Experimental Setup
• Development “web_index” (~12.5 million documents)
  • Slightly out of date compared with production, but close enough
  • Covers internal web, but not other “document-heavy” indices
  • Safer to avoid heavy computation on production cluster
  • Small enough to reindex with different field settings on dev cluster
• Use existing tools/metrics – no contrib modules/hand-coded algorithms
Duplicates!
How big of a problem are duplicates?
First “lesson learned” in Oleksiy Kovyrin’s recent “Sprinting to a crawl: Building an effective web crawler” at ElasticON Global 2020
Google has several patents for (near)duplicate detection
https://patents.google.com/?q=%22duplicate+documents%22&assignee=Google%2c+Llc&num=100&oq=assignee:(Google%2c+Llc)+%22duplicate+documents%22&sort=new
Google’s Guidance for Duplicates and Search Engine
Optimization (SEO)
https://support.google.com/webmasters/answer/66359?hl=en
File Types – Top 10 file types in web_index
File Type            Count
text/html            8,894,038
image/gif            1,870,136
image/jpeg           1,094,937
image/png            319,710
text/plain           109,516
application/pdf      105,081
application/x-hdf    64,194
image/x-ms-bmp       26,377
application/xml      8,734
application/msword   7,414
Duplicates, near duplicates
• Digests
  • Literal bytes of a file are the same
• Text Digests
  • Extracted text from a document is the same
• Text Profile Digest (see next slide)
  • Require all words
  • Drop the rarer words in a document (default)
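The difference between a digest and a text digest can be sketched in a few lines. This is an illustration only: SHA-256, the whitespace normalization, and the `toy_extract` tag stripper are assumptions standing in for whatever hash and extractor (for example Tika) a real pipeline uses.

```python
import hashlib
import re


def byte_digest(raw: bytes) -> str:
    """Digest of the literal bytes: any byte-level change yields a new value."""
    return hashlib.sha256(raw).hexdigest()


def text_digest(extracted_text: str) -> str:
    """Digest of the extracted text after collapsing whitespace (an assumption)."""
    normalized = re.sub(r"\s+", " ", extracted_text).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def toy_extract(raw: bytes) -> str:
    """Crude, hypothetical stand-in for a real text extractor: strips tags only."""
    return re.sub(r"<[^>]+>", " ", raw.decode("utf-8"))


# Two made-up pages with the same visible text but different markup whitespace.
page_a = b"<html><body><p>Downlink complete</p></body></html>"
page_b = b"<html><body>\n  <p>Downlink   complete</p>\n</body></html>"

same_bytes = byte_digest(page_a) == byte_digest(page_b)
same_text = text_digest(toy_extract(page_a)) == text_digest(toy_extract(page_b))
print(f"same bytes: {same_bytes}, same text: {same_text}")
```

The two pages differ as files but collapse to one text digest, which is the gap between the first two signature types.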
Nutch’s TextProfile
Data Search | JPL's Earth Science Airborne Program Jump to navigation
Earth Science Airborne Program JPL's Suborbital Earth Science
Instruments & Measurements Home › All Products › Instrument: Fourier
Transform Infrared Spectrometer (FTS) › Product Type: FTS_L2QR ›
Platform: C-23 Sherpa › Parameter: Atmospheric Chemistry › Platform
Type: Airborne › Campaign: Carbon in Arctic Reservoirs Vulnerability
Experiment (CARVE) Data Search Show Advanced search Temporal
Search Start Date Stop Date Free Text Search Enter search text Spatial
Search (Hold Shift to draw bounding box) + - Perform Search Sort By
Popularity (All Time) Popularity (This Month) Popularity (Users) Long
Name (A-Z) Short Name (A-Z) Grid Spatial Resolution Satellite Spatial
Resolution Start Date Stop Date Found 0 matching products(s). Browse
Products Campaign Any campaign Carbon in Arctic Reservoirs
Vulnerability Experiment (CARVE) (261) Parameter Any parameter
Atmospheric Chemistry (261) Instrument Any instrument Fourier
Transform Infrared Spectrometer (FTS) (261) Platform Any platform C-23
Sherpa (261) Platform Type Any platform type Airborne (261) Product
Type Any product type FTS_L2QR (261)
Term       Quantized Count
search     8
261        6
any        6
platform   6
type       6
airborne   4
date       4
Text Profile: “search 261 any platform type airborne date…”
Quantize counts, sort by descending order of frequency, drop terms with a quantized count below a threshold
https://airbornescience.jpl.nasa.gov/data
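The three steps named on this slide (quantize counts, sort by descending frequency, drop low counts) can be sketched as follows. This is a simplified illustration, not Nutch's implementation: the tokenizer, the fixed `quant` step, the alphabetical tie-break, and the MD5 digest are all assumptions.

```python
import hashlib
import re
from collections import Counter


def text_profile(text: str, quant: int = 2, min_token_len: int = 2) -> list[tuple[str, int]]:
    """Return (term, quantized count) pairs, most frequent first."""
    tokens = [t for t in re.findall(r"[^\W_]+", text.lower()) if len(t) >= min_token_len]
    counts = Counter(tokens)
    # Round each count down to a multiple of `quant`, then drop terms below `quant`.
    quantized = {term: (count // quant) * quant for term, count in counts.items()}
    kept = [(term, q) for term, q in quantized.items() if q >= quant]
    # Descending quantized count; alphabetical order breaks ties deterministically.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


def text_profile_digest(text: str, quant: int = 2) -> str:
    """Hash the profile string so near-identical pages share one signature."""
    profile = " ".join(f"{term} {q}" for term, q in text_profile(text, quant))
    return hashlib.md5(profile.encode("utf-8")).hexdigest()


# Made-up texts that differ only in one rare word.
page_a = "search search search platform platform date"
page_b = page_a + " airborne"
print(text_profile(page_a))
print(text_profile_digest(page_a) == text_profile_digest(page_b))
```

Because rare terms are dropped before hashing, pages that differ only in a few infrequent words end up with the same profile digest.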
Different Digest, Different Text Digest, Same Text Profile
Digests vs Text Digests vs Text Profile Digests in non-image documents
• Total non-image documents: 9.2 million
• Distinct digests: 8.6 million
• Distinct text digests: 5.2 million
• Distinct text profile (keep all words): 5.1 million
• Distinct text profile (drop infrequent words): 2.7 million
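To read these counts as a share of redundant documents, one can compute 1 minus distinct signatures over total documents. A small worked sketch using the rounded figures from this slide (the variable names are illustrative):

```python
total_docs = 9.2e6  # total non-image documents

distinct_signatures = {
    "digest": 8.6e6,
    "text digest": 5.2e6,
    "text profile (keep all words)": 5.1e6,
    "text profile (drop infrequent words)": 2.7e6,
}

# Share of documents whose signature was already seen on another document.
redundant_share = {
    name: 1 - distinct / total_docs for name, distinct in distinct_signatures.items()
}

for name, share in redundant_share.items():
    print(f"{name}: {share:.1%} of documents repeat an existing signature")
```

Since the inputs are rounded to one decimal place in millions, the resulting percentages are approximate.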
Number of Non-Image Documents with a Distinct Digest
           Digest   Text Digest   Text Profile Digest
digest1    27,874   2,810,868     2,810,868
digest2    10,089   73,203        489,821
digest3    1,565    27,874        73,225
digest4    1,170    10,089        63,818
digest5    1,166    7,926         58,271
digest6    1,128    2,589         27,874
digest7    1,072    2,557         25,311
digest8    990      1,911         12,222
digest9    933      1,616         11,973
digest10   841      1,573         10,089
2.8 million?!
Yes! On the development index.
In production, there are ONLY 880k!
Error page. The Web Server encountered an unknown runtime error. Cannot display page…
Initial Takeaway
• Some easy fixes
Exploration: Near Duplicates with
MinHash
Experiments with MinHash
• Earlier proof-of-concept implemented by an intern
• Filter available in Elasticsearch to allow for fuzzy hashing/near duplicate detection
• Default settings – digest 5-grams (see next slide), summarize digests into 512 tokens (buckets)
• Run a “MoreLikeThis” query – there is a more efficient algorithm, but it is not built into ES yet*
Reference:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-minhash-tokenfilter.html
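A sketch of index settings matching the defaults described above (5-gram shingles feeding a MinHash filter with 512 buckets), written as a Python dict that could be serialized to JSON. The filter and analyzer names are hypothetical, and the parameter names are as recalled from the referenced page, so check them against the Elasticsearch version in use:

```python
import json

index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Hypothetical name: emits word 5-grams only, no single words.
                "five_gram_shingles": {
                    "type": "shingle",
                    "min_shingle_size": 5,
                    "max_shingle_size": 5,
                    "output_unigrams": False,
                },
                # Hypothetical name: hashes each shingle and keeps one minimum per bucket.
                "minhash_512": {
                    "type": "min_hash",
                    "hash_count": 1,
                    "bucket_count": 512,
                    "hash_set_size": 1,
                    "with_rotation": True,
                },
            },
            "analyzer": {
                "minhash_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["five_gram_shingles", "minhash_512"],
                }
            },
        }
    }
}

print(json.dumps(index_settings, indent=2))
```

With these settings each document is summarized by up to 512 tokens, which is why the later “MoreLikeThis” queries carry hundreds of terms each.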
What’s a 5-gram
• “the quick brown fox jumped over the lazy dog”
  • “the quick brown fox jumped”
  • “quick brown fox jumped over”
  • …
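A 5-gram here is a run of five consecutive words. The sketch below generates them and then shows, under stated assumptions, how MinHash summarizes such a set. It uses the textbook variant with many seeded hash functions (salted BLAKE2b), which is not the bucketed single-hash scheme of the Elasticsearch filter. All function names are hypothetical.

```python
import hashlib


def word_ngrams(text: str, n: int = 5) -> list[str]:
    """Return every run of n consecutive lowercased, whitespace-split words."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def minhash_signature(shingles: set[str], num_hashes: int = 64) -> list[int]:
    """For each seeded hash function, keep the smallest hash over all shingles."""
    if not shingles:
        raise ValueError("need at least one shingle")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(), "big"
            )
            for s in shingles
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching positions approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


sentence = "the quick brown fox jumped over the lazy dog"
grams = word_ngrams(sentence)
print(grams[:2])

variant = "the quick brown fox jumped over the lazy cat"  # made-up near duplicate
similarity = estimated_jaccard(
    minhash_signature(set(grams)), minhash_signature(set(word_ngrams(variant)))
)
print(f"estimated Jaccard: {similarity:.2f}")
```

The estimate is noisy with only 64 hash functions; more hashes tighten it at the cost of a longer signature.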
Experiments with MinHash: Findings
• Worked really well on a toy set of synthetic documents
• Performance is prohibitive on full web_index (even with stored termvectors) – estimate ~1 year to query every document in the index
  • Note: speed was greatly improved by programmatically retrieving termvectors and creating our own terms query, but still not acceptable
Experiments with MinHash: Conclusion
• There may be ways of improving performance with more shards, multithreading, smarter processing, or a different algorithm
• At this point, however, exact duplicates and/or text duplicates are a large enough problem on their own that near duplicates via MinHash do not yet warrant further investigation
But why, why was MinHash SO slow?! Some ideas…
• Elasticsearch is optimized for queries of a few
words, not 512 “words”
• Aside from exact duplicates, how much duplication
do we have in 5-grams?
Index 5-grams
• Intuition: in plagiarism detection, a single 5-gram is indicative of duplication, so shared 5-grams should be extremely rare
• Finding: NOT AT ALL RARE on web_index
  • The 10,000th most common 5-gram appears in 12k files!
  • Most common:
    • “an unknown runtime error cannot” 2.6 million files
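The “appears in N files” figures are document frequencies of 5-grams. A toy sketch of that count, assuming whitespace tokenization and three made-up documents. On a real index this would come from the indexed terms (for example via a top-n-tokens tool), not from Python lists.

```python
from collections import Counter


def five_grams(text: str, n: int = 5) -> list[str]:
    """Word n-grams over lowercased, whitespace-split text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def document_frequency(docs: list[str], n: int = 5) -> Counter:
    """Count, for each n-gram, how many documents contain it at least once."""
    df = Counter()
    for text in docs:
        df.update(set(five_grams(text, n)))  # set(): count each document once
    return df


# Made-up documents: two identical error pages and one log line.
docs = [
    "error page the web server encountered an unknown runtime error cannot display page",
    "error page the web server encountered an unknown runtime error cannot display page",
    "downlink monitor block has completed without an unknown runtime error cannot display",
]

df = document_frequency(docs)
for gram, count in df.most_common(3):
    print(f"{count} docs: {gram}")
```

Sorting by document frequency surfaces the boilerplate and error-page phrases first, which is how the categories on the following slides were found.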
Shared 5-grams – Some Categories of Causes
• Actual duplication or near duplication
• Boilerplate
  • Web-page based (navigation, etc)
  • Legal (copyright, branding)
• Machine generated logs
Actual duplication or near duplication
Boilerplate
• Webpage/Navigational
  • “science technology launch vehicle spacecraft” 1.4 million files for Mars Odyssey pages
  • “content announcements events opportunities people” 500k on techconnect pages
• Legal
  • “research and development center staffed” 640k
Example of Indexed Boilerplate
“science technology launch vehicle spacecraft”
1.4 million files!!!
https://mars.nasa.gov/odyssey/mission/timeline/communicationsrelay/
Pause for relevance check
• If science, technology, “launch vehicle”
and spacecraft appear in 1.4 million documents,
how important will those words be in a user query?!
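One way to put a number on that question is inverse document frequency. The sketch below assumes the index scores with BM25 and uses the IDF form found in Lucene's BM25 similarity; the 12.5 million total comes from the experimental setup, and the 14,000-document term is a hypothetical comparison point.

```python
import math


def bm25_idf(doc_freq: int, num_docs: int) -> float:
    """BM25 inverse document frequency: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


NUM_DOCS = 12_500_000  # approximate size of web_index

boilerplate_idf = bm25_idf(1_400_000, NUM_DOCS)  # term indexed from navigation boilerplate
rarer_idf = bm25_idf(14_000, NUM_DOCS)           # hypothetical term in 14k documents

print(f"boilerplate term IDF: {boilerplate_idf:.2f}")
print(f"rarer term IDF:       {rarer_idf:.2f}")
```

A term that boilerplate pushes into 1.4 million documents carries far less weight in scoring than it would if it appeared only where it is actually the topic.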
Boilerpipe output
Demo: https://boilerpipe-web.appspot.com/
Available as a handler in Tika: BoilerpipeHandler
Available as a python library: https://pypi.org/project/boilerpy3/
Google is removing boilerplate
Machine Generated Logs
“downlink monitor block has completed”
14k documents
Takeaways from MinHash and 5gram
• We have enough to work with for now with digests, text digests and text profile digests
• We can use 5-grams to identify:
  • Boilerplate content that we should remove if boilerpipe isn’t sufficient
  • Content that we might want to demote in relevance or remove from the index (machine generated logs?!)
Categories/causes of (near) duplication
• Exact duplicates
  • Same document, different URL
  • Documents with little or no text
• Near duplicates
  • Different formats: PDF vs HTML of same content
  • Versioning
  • Documents with little text
  • Asymmetric duplicates (A is contained entirely within B, but B is larger), e.g. email included in reply
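Asymmetric duplicates are easy to miss with a symmetric measure. A small sketch contrasting Jaccard similarity with containment over word 5-grams; the email and reply texts are made up and the function names are hypothetical:

```python
def shingles(text: str, n: int = 5) -> set[str]:
    """Set of word n-grams over lowercased, whitespace-split text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Symmetric overlap: shared shingles over all shingles in either document."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def containment(a: set[str], b: set[str]) -> float:
    """Asymmetric overlap: share of A's shingles that also occur in B."""
    return len(a & b) / len(a) if a else 0.0


email = "please review the attached downlink report before the friday meeting"
reply = "sounds good see you then " + email + " sent from my phone"

a, b = shingles(email), shingles(reply)
print(f"Jaccard: {jaccard(a, b):.2f}")
print(f"containment of email in reply: {containment(a, b):.2f}")
```

Here the email is fully contained in the reply, yet the Jaccard score stays well below 1 because the reply adds text of its own.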
Removal of (near) duplicates is problematic if…
• “Duplicate” documents differ in other key features
(same text, but different images)
• Users need to find all versions of a versioned
document
• Small difference in text is important or main point of
page is non-textual (see next slide)
Slightly different photo metadata
Recommendations, step 1
• Experiment with boilerpipe handler vs. top n 5-grams. Confirm that this doesn’t remove desired text; or identify triggers for boilerpipe handler
• Index token count, lang id, digest and text digest
along with documents
• Add major sources of malignant duplicates to “skip
list” at crawling stage
Recommendations, step 2…some options
• Remove duplicates or prevent from insertion
• Add a duplicate identification process and
  • Group by duplicate digest in search results
  • Demote duplicates in search results
  • Allow users to select “include duplicates”
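The “group by duplicate digest” option can be sketched on the client side as below. This is an illustration under assumptions: the hit dictionaries and the `text_digest` field name are hypothetical, and a search engine's own grouping or collapsing feature would normally do this at query time.

```python
def group_by_digest(ranked_hits: list[dict], digest_field: str = "text_digest") -> list[dict]:
    """Keep the best-ranked hit per digest and attach the rest as its duplicates."""
    groups: dict[str, dict] = {}
    for hit in ranked_hits:  # hits are assumed to arrive in ranked order
        key = hit[digest_field]
        if key not in groups:
            groups[key] = {"top": hit, "duplicates": []}
        else:
            groups[key]["duplicates"].append(hit)
    return list(groups.values())  # dicts preserve insertion order, so ranking is kept


# Made-up ranked results: the first and third share a text digest.
hits = [
    {"url": "https://example.org/a", "score": 9.1, "text_digest": "d1"},
    {"url": "https://example.org/b", "score": 8.4, "text_digest": "d2"},
    {"url": "https://example.org/a-copy", "score": 8.0, "text_digest": "d1"},
]

grouped = group_by_digest(hits)
for group in grouped:
    print(group["top"]["url"], f"(+{len(group['duplicates'])} duplicates)")
```

Keeping the duplicates attached rather than discarding them leaves room for an “include duplicates” toggle in the interface.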
Tools
• Quaerite (https://github.com/tballison/quaerite)
  • Copy indices Solr->ES and vice versa
  • List top n tokens (Solr only): TopNTokens
• tika-eval (https://cwiki.apache.org/confluence/display/TIKA/TikaEval)
  • Token counts
  • Language identification
  • Out of vocabulary %
  • Digest, Text digest, Text profile
Conclusion
• It depends™
• There is no easy button, but this analysis and
discovery reveal critical areas for improvement and
get us closer to solutions
Some References
• Manku, G., Jain, A. and Das Sarma, A. “Detecting near-duplicates for web crawling.” WWW’07 https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33026.pdf
• Early patented work at Google: https://www.cs.umd.edu/~pugh/google/Duplicates.pdf
• LSH at Uber for fraudulent trip detection: https://eng.uber.com/lsh/
• Minhash vs. SimHash: http://proceedings.mlr.press/v33/shrivastava14.pdf
Some Other References
• KNN and LSH in Elasticsearch: https://blog.insightdatascience.com/elastik-nearest-neighbors-4b1f6821bd62
• Minhash in Lucene: https://medium.com/@xingzeng/understanding-minhash-in-lucene-elasticsearch-e6799b78c0d7
• ssdeep and elastic: https://www.intezer.com/blog/intezer-analyze-community/intezer-community-tip-ssdeep-comparisons-with-elasticsearch/

Haystack Live tallison_202010_v2

  • 1. Duplicate and Near Duplicate Detection at Scale Tim Allison, Ph.D. Data Scientist/Relevance Engineer Artificial Intelligence, Analytics and Innovative Development Organization © 2020 California Institute of Technology. Government sponsorship acknowledged. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology.
  • 2. jpl.nasa.gov About me • Data scientist (files and search) Jet Propulsion Laboratory, California Institute of Technology • Chair/V.P. Apache Tika • Committer Apache PDFBox, POI, Lucene/Solr, OpenNLP • Member Apache Software Foundation (10/22/20)
  • 3. Outline • Search system assessments, an overview of options • Plug for text extraction assessment • Duplicates and near duplicates – Case Study • Exploration: Near duplicates with minhash • Conclusion
  • 4. Search System Assessment: 20,000 ft view • Offline • Ground truth queries and expected docs • Online • User behavior • User feedback • Surveys, interviews • Technical review • System • Data
  • 5. System Assessment • Udo Kruschwitz and Charlie Hull “Searching the Enterprise”, Foundations and Trends® in Information Retrieval. 11(1):1-142, July 2017. p. 16.
  • 6. System Assessment • Crawler configurations • Text extraction configurations • Schema and field configuration • Query Parser configuration • Default Boolean operator • Fields, field boosts • …
  • 7. Data Assessment • File types, parser coverage • Quality of text extraction (…languages) • Quality of metadata – dates, duplicate titles/metadata • Liveness of documents/URLs/URL redirects • Duplicates and near duplicates
  • 8. Plug for Text Extraction Assessment
  • 9. Out of vocabulary (OOV) – Same file, different extractors • Fixed encoding detection between 1.14 and 1.15
      Metric               Tika 1.14              Tika 1.15-SNAPSHOT
      Unique Tokens        786                    156
      Total Tokens         1603                   272
      LangId               zh-ch                  de
      Common Words         0                      116
      Alphabetic Tokens    1603                   250
      OOV%                 1-(0/1603) = 100%      1-(116/250) = 54%
      Top N Tokens (Tika 1.14): 捳敨: 18 | 獴档: 14 | 略獴: 14 | m: 11 | 杮湥: 11 | 瑵捳: 11 | 畬杮: 11 | 档湥: 10 | 搠敩: 9 | 敮浨: 9
      Top N Tokens (Tika 1.15-SNAPSHOT): die: 11 | und: 8 | von: 8 | deutschen: 7 | deutsche: 6 | 1: 5 | das: 5 | der: 5 | finanzministerium: 5 | oder: 5
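
The OOV% figures on this slide follow from two token counts. A minimal sketch in Python, assuming OOV% is one minus the ratio of common-word tokens to alphabetic tokens (which is what the slide's arithmetic suggests); the function name `oov_percent` is illustrative and is not taken from tika-eval:

```python
def oov_percent(common_words: int, alphabetic_tokens: int) -> float:
    """Return the out-of-vocabulary share of alphabetic tokens as a percentage."""
    if alphabetic_tokens == 0:
        # Assumption: with no alphabetic tokens, treat the text as fully out of vocabulary.
        return 100.0
    return 100.0 * (1 - common_words / alphabetic_tokens)


# Counts copied from the slide.
tika_114 = oov_percent(common_words=0, alphabetic_tokens=1603)
tika_115 = oov_percent(common_words=116, alphabetic_tokens=250)
print(round(tika_114), round(tika_115))
```

A very high OOV% combined with an unexpected language id is the kind of signal that flags a file for manual review of its extracted text.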
  • 10. Quality of text extraction, an example • https://voyager.jpl.nasa.gov/pdf/sfos2003pdf/03_10_02-03_10_19.sfos.pdf • Language Id: Nepali (Out of Vocabulary 99%)
  • 11. Unexplained Garbage at Beginning of File(???) • This unexplained garbage at the beginning of a file also occurs in several other PDF files identified as Nepali
  • 12. From analytics to action • https://aviris.jpl.nasa.gov/proceedings/workshops/02_docs/2002_Ogura_1_web.pdf
  • 13. Stored text vs. Optical Character Recognition
      Text As Stored in File: !"#%$& (') *,+-).' / 0 1,23 *. 457698;:;<>=75?&@78;ACB D(B7E;FHGJICBK5MLNBKOPBKF;B DJD Q R S.TVU9WNXMY[ZT]^W_S `badc 5KICedFgfh5 cji :;edF;A^5KEk<>Imln:;e[<>EnloedACICe a lo<p57Eg5Kqsr;E;<jloe[E 8;O 6hedA5Kq adc 57ItedFk:;B c qsICf;B a
      Text from Tesseract OCR: Constrained Least Squares Linear Spectral Unmixture by the Hybrid Steepest Descent Method Nobuhiko Ogura’ and Isao Yamada” 1 Introduction A closed polyhedron is the intersection of finite number of closed half spaces, i.e., the setof points satisfying finite number of lincar incqualitics, and is widely used as a constraint in various application, for example specifications or constraints in signal processing or estimation problems, resource restrictions in financial applications and feasible sets of
  • 14. Duplicates and Near Duplicates
  • 15. Experimental Setup • Development “web_index” (~12.5 million documents) • Slightly out of date compared with production, but close enough • Covers internal web, but not other “document-heavy” indices • Safer to avoid heavy computation on production cluster • Small enough to reindex with different field settings on dev cluster • Use existing tools/metrics – no contrib modules/hand-coded algorithms
  • 16. Duplicates!
  • 17. How big of a problem are duplicates? • First “lesson learned” in Oleksiy Kovyrin’s recent “Sprinting to a crawl: Building an effective web crawler” at ElasticON Global 2020
  • 18. Google has several patents for (near) duplicate detection • https://patents.google.com/?q=%22duplicate+documents%22&assignee=Google%2c+Llc&num=100&oq=assignee:(Google%2c+Llc)+%22duplicate+documents%22&sort=new
  • 19. Google’s Guidance for Duplicates and Search Engine Optimization (SEO) • https://support.google.com/webmasters/answer/66359?hl=en
  • 20. File Types – Top 10 file types in web_index
      File Type             Count
      text/html             8,894,038
      image/gif             1,870,136
      image/jpeg            1,094,937
      image/png             319,710
      text/plain            109,516
      application/pdf       105,081
      application/x-hdf     64,194
      image/x-ms-bmp        26,377
      application/xml       8,734
      application/msword    7,414
  • 21. Duplicates, near duplicates • Digests • Literal bytes of a file are the same • Text Digests • Extracted text from a document is the same • Text Profile Digest (see next slide) • Require all words • Drop the rarer words in a document (default)
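
The difference between a digest and a text digest can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tika-eval implementation: the choice of SHA-256, the whitespace normalization, and the function names `byte_digest` and `text_digest` are all assumptions, and the extracted text is supplied by hand rather than by a real extractor.

```python
import hashlib
import re


def byte_digest(raw: bytes) -> str:
    """Digest over the literal bytes of a file."""
    return hashlib.sha256(raw).hexdigest()


def text_digest(extracted_text: str) -> str:
    """Digest over extracted text; whitespace is collapsed first (an assumption)."""
    normalized = re.sub(r"\s+", " ", extracted_text).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Two files whose bytes differ but whose visible text is the same.
html_a = b"<html><body><p>Mars Odyssey</p></body></html>"
html_b = b"<html><body>\n<p>Mars   Odyssey</p>\n</body></html>"
text_a = "Mars Odyssey"        # stand-in for what a text extractor would return
text_b = "\nMars   Odyssey\n"

print(byte_digest(html_a) == byte_digest(html_b))  # byte-level comparison
print(text_digest(text_a) == text_digest(text_b))  # text-level comparison
```

The byte digests differ while the text digests match, which is the gap between "distinct digests" and "distinct text digests" reported later in the deck.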
  • 22. Nutch’s TextProfile • Quantize counts, sort by descending order of frequency, drop quantized count below a threshold
      https://airbornescience.jpl.nasa.gov/data
      Extracted text: Data Search | JPL's Earth Science Airborne Program Jump to navigation Earth Science Airborne Program JPL's Suborbital Earth Science Instruments & Measurements Home › All Products › Instrument: Fourier Transform Infrared Spectrometer (FTS) › Product Type: FTS_L2QR › Platform: C-23 Sherpa › Parameter: Atmospheric Chemistry › Platform Type: Airborne › Campaign: Carbon in Arctic Reservoirs Vulnerability Experiment (CARVE) Data Search Show Advanced search Temporal Search Start Date Stop Date Free Text Search Enter search text Spatial Search (Hold Shift to draw bounding box) + - Perform Search Sort By Popularity (All Time) Popularity (This Month) Popularity (Users) Long Name (A-Z) Short Name (A-Z) Grid Spatial Resolution Satellite Spatial Resolution Start Date Stop Date Found 0 matching products(s). Browse Products Campaign Any campaign Carbon in Arctic Reservoirs Vulnerability Experiment (CARVE) (261) Parameter Any parameter Atmospheric Chemistry (261) Instrument Any instrument Fourier Transform Infrared Spectrometer (FTS) (261) Platform Any platform C-23 Sherpa (261) Platform Type Any platform type Airborne (261) Product Type Any product type FTS_L2QR (261)
      Term        Quantized Count
      search      8
      261         6
      any         6
      platform    6
      type        6
      airborne    4
      date        4
      Text Profile: “search 261 any platform type airborne date…”
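
The quantize, sort, and drop steps can be sketched as follows. This is a simplified illustration loosely modeled on the idea behind Nutch's text profile, not a port of it: the tokenizer, the `quant_rate` and `min_token_len` defaults, the alphabetical tie-break, and the use of MD5 are assumptions made for the example.

```python
import hashlib
import re
from collections import Counter


def text_profile(text: str, quant_rate: float = 0.01, min_token_len: int = 2) -> str:
    """Build a profile string: quantize term counts, drop rare terms, sort by count."""
    tokens = [t for t in re.findall(r"\w+", text.lower()) if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return ""
    max_freq = max(counts.values())
    # Quantization step; never let it fall below 2 when any term repeats.
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1
    quantized = {}
    for term, freq in counts.items():
        q = (freq // quant) * quant   # round the count down to a multiple of quant
        if q >= quant:                # terms below the threshold are dropped
            quantized[term] = q
    # Descending quantized count, then alphabetical for a stable order.
    ordered = sorted(quantized.items(), key=lambda kv: (-kv[1], kv[0]))
    return " ".join(term for term, _ in ordered)


def text_profile_digest(text: str) -> str:
    """Digest of the profile string rather than of the full text."""
    return hashlib.md5(text_profile(text).encode("utf-8")).hexdigest()


page_a = "search search search search data data data extra"
page_b = "search search search search data data data other"
print(text_profile(page_a))
print(text_profile_digest(page_a) == text_profile_digest(page_b))
```

The two toy pages differ in one rare word, so their text digests would differ, but the rare word is dropped from the profile and the profile digests collide. That is the "different text digest, same text profile" case.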
  • 23. Different Digest, Different Text Digest, Same Text Profile
  • 24. Digests vs Text Digests vs Text Profile Digests in non-image documents • Total non-image documents: 9.2 million • Distinct digests: 8.6 million • Distinct text digests: 5.2 million • Distinct text profile (keep all words): 5.1 million • Distinct text profile (drop infrequent words): 2.7 million
  • 25. Number of Non-Image Documents with a Distinct Digest
                  Digest     Text Digest    Text Profile Digest
      digest1     27,874     2,810,868      2,810,868
      digest2     10,089     73,203         489,821
      digest3     1,565      27,874         73,225
      digest4     1,170      10,089         63,818
      digest5     1,166      7,926          58,271
      digest6     1,128      2,589          27,874
      digest7     1,072      2,557          25,311
      digest8     990        1,911          12,222
      digest9     933        1,616          11,973
      digest10    841        1,573          10,089
  • 26. 2.8 million?! • Yes! On development index. In production, there are ONLY 880k! • Error page. The Web Server encountered an unknown runtime error. Cannot display page…
  • 27. Initial Takeaway • Some easy fixes
  • 28. Exploration: Near Duplicates with MinHash
  • 29. Experiments with MinHash • Earlier proof-of-concept implemented by intern • Filter available in Elasticsearch to allow for fuzzy hashing/near duplicate detection • Default settings – digest 5-grams (see next slide), summarize digests into 512 tokens (buckets) • Run a “MoreLikeThis” query – there is a more efficient algorithm, but not built into ES yet* • *Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-minhash-tokenfilter.html
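
For orientation, index settings of roughly this shape would wire a 5-gram shingle filter into the `min_hash` token filter with 512 buckets. This is a sketch only: the filter, analyzer, and field names (`five_gram_shingles`, `minhash_512`, `minhash_analyzer`, `content_minhash`) are hypothetical, the deck does not show the configuration that was actually used, and the parameter values should be checked against the Elasticsearch version in use.

```python
import json

# Hypothetical index definition; names are illustrative, not from the talk.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Word 5-grams only, no single-word tokens.
                "five_gram_shingles": {
                    "type": "shingle",
                    "min_shingle_size": 5,
                    "max_shingle_size": 5,
                    "output_unigrams": False,
                },
                # MinHash over the shingles, summarized into 512 buckets.
                "minhash_512": {
                    "type": "min_hash",
                    "hash_count": 1,
                    "bucket_count": 512,
                    "hash_set_size": 1,
                    "with_rotation": True,
                },
            },
            "analyzer": {
                "minhash_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "five_gram_shingles", "minhash_512"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            # Stored term vectors, as mentioned in the findings slide.
            "content_minhash": {
                "type": "text",
                "analyzer": "minhash_analyzer",
                "term_vector": "yes",
            }
        }
    },
}

print(json.dumps(index_settings, indent=2))
```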
  • 30. What’s a 5-gram? • “the quick brown fox jumped over the lazy dog” • “the quick brown fox jumped” • “quick brown fox jumped over” • …
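
The sliding window behind these 5-grams is short enough to show in full. A minimal sketch, assuming whitespace tokenization and lowercasing; the function name `word_ngrams` is illustrative:

```python
def word_ngrams(text: str, n: int = 5) -> list[str]:
    """Return every run of n consecutive words, in document order."""
    words = text.lower().split()
    # A text with fewer than n words yields an empty list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


grams = word_ngrams("the quick brown fox jumped over the lazy dog")
for gram in grams:
    print(gram)
```

A nine-word sentence yields five 5-grams; in general a document of w words yields w - 4 of them, which is why the MinHash filter then summarizes them into a fixed number of buckets.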
  • 31. Experiments with MinHash: Findings • Worked really well on a toy set of synthetic documents • Performance is prohibitive on full web_index (even with stored termvectors) – estimate ~1 year to query every document in the index • Note: speed was greatly improved by programmatically retrieving termvectors and creating own terms query, but still not acceptable
  • 32. Experiments with MinHash: Conclusion • There may be ways of improving performance with more shards, multithreading, smarter processing, or a different algorithm • At this point, however, the problems with exact duplicates and/or text duplicates are large enough that further investigation of near duplicates via minhash is not yet warranted
  • 33. But why, why was MinHash SO slow?! Some ideas… • Elasticsearch is optimized for queries of a few words, not 512 “words” • Aside from exact duplicates, how much duplication do we have in 5-grams?
  • 34. Index 5-grams • Intuition: in plagiarism detection, a single shared 5-gram is indicative of duplication, so any given 5-gram should be extremely rare • Finding: NOT AT ALL RARE on web_index • The 10,000th most common 5-gram appears in 12k files! • Most common: • “an unknown runtime error cannot” 2.6 million files
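
Counting how many files each 5-gram appears in is a document-frequency calculation. The talk did this by indexing 5-grams in the search engine; the following in-memory Python sketch only illustrates the idea on three toy documents, and the function name `five_gram_doc_freq` and the sample texts are assumptions.

```python
from collections import Counter


def five_gram_doc_freq(docs: dict[str, str], n: int = 5) -> Counter:
    """Count, for each word n-gram, the number of documents that contain it."""
    doc_freq = Counter()
    for text in docs.values():
        words = text.lower().split()
        # A set, so that a gram repeated within one document counts once.
        grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        doc_freq.update(grams)
    return doc_freq


docs = {
    "a": "the web server encountered an unknown runtime error cannot display page",
    "b": "error page the web server encountered an unknown runtime error cannot display page",
    "c": "constrained least squares linear spectral unmixture by the hybrid method",
}
doc_freq = five_gram_doc_freq(docs)
print(doc_freq.most_common(3))
```

Sorting by document frequency puts boilerplate and error pages at the top, which is how a phrase such as “an unknown runtime error cannot” surfaces.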
  • 35. Shared 5-grams – Some Categories of Causes • Actual duplication or near duplication • Boilerplate • Web-page based (navigation, etc) • Legal (copyright, branding) • Machine generated logs
  • 36. Actual duplication or near duplication
  • 37. Boilerplate • Webpage/Navigational • “science technology launch vehicle” 1.4 million files for Mars Odyssey pages • “content announcements events opportunities people” 500k on techconnect pages • Legal • “research and development center staffed” 640k
  • 38. Example of Indexed Boilerplate • “science technology launch vehicle spacecraft” 1.4 million files!!! • https://mars.nasa.gov/odyssey/mission/timeline/communicationsrelay/
  • 39. Pause for relevance check • If science, technology, “launch vehicle” and spacecraft appear in 1.4 million documents, how important will those words be in a user query?!
  • 40. Boilerpipe output • Demo: https://boilerpipe-web.appspot.com/ • Available as a handler in Tika: BoilerpipeHandler • Available as a Python library: https://pypi.org/project/boilerpy3/
  • 41. Google is removing boilerplate
  • 42. Machine Generated Logs • “downlink monitor block has completed” 14k documents
  • 43. Takeaways from MinHash and 5-grams • We have enough to work with for now with digests, text digests and text profile digests • We can use 5-grams to identify: • Boilerplate content that we should remove if boilerpipe isn’t sufficient • Content that we might want to demote in relevance or remove from the index (machine generated logs?!)
  • 44. Categories/causes of (near) duplication • Exact duplicates • Same document, different URL • Documents with little or no text • Near duplicates • Different formats: PDF vs HTML of same content • Versioning • Documents with little text • Asymmetric duplicates (A is contained entirely within B, but B is larger), e.g. email included in reply
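
Asymmetric duplicates are easier to see with a containment score than with a symmetric similarity. The sketch below is an illustration that is not taken from the talk: it compares sets of word 5-grams, and the function names `shingles`, `containment`, and `jaccard` and the sample email text are assumptions.

```python
def shingles(text: str, n: int = 5) -> set[str]:
    """Set of word n-grams in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(a: str, b: str, n: int = 5) -> float:
    """Share of a's shingles that also occur in b (not symmetric)."""
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / len(sa) if sa else 0.0


def jaccard(a: str, b: str, n: int = 5) -> float:
    """Symmetric overlap: intersection over union of the shingle sets."""
    sa, sb = shingles(a, n), shingles(b, n)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


email = "please send the telemetry report before the review on friday"
reply = "sure i will send it today original message " + email

print(containment(email, reply))  # the email is fully inside the reply
print(containment(reply, email))  # the reply is only partly inside the email
print(jaccard(email, reply))      # the symmetric score sits in between
```

MinHash estimates the symmetric score, so a short document quoted inside a much longer one can look only weakly similar even though it is entirely duplicated.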
  • 45. Removal of (near) duplicates is problematic if… • “Duplicate” documents differ in other key features (same text, but different images) • Users need to find all versions of a versioned document • Small difference in text is important or main point of page is non-textual (see next slide)
  • 46. Slightly different photo metadata
  • 47. Recommendations, step 1 • Experiment with boilerpipe handler vs. top n 5-grams. Confirm that this doesn’t remove desired text; or identify triggers for boilerpipe handler • Index token count, lang id, digest and text digest along with documents • Add major sources of malignant duplicates to “skip list” at crawling stage
  • 48. Recommendations, step 2…some options • Remove duplicates or prevent from insertion • Add a duplicate identification process and • Group by duplicate digest in search results • Demote duplicates in search results • Allow users to select “include duplicates”
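
Grouping by digest with an “include duplicates” switch could look like the following at the application layer. This is a sketch under stated assumptions: hits are plain dictionaries already sorted by relevance, and the field names `text_digest`, `url`, and `duplicate_count` and the function name `group_by_digest` are hypothetical; a search engine's own grouping or collapsing feature would normally do this server-side.

```python
def group_by_digest(hits: list[dict], include_duplicates: bool = False) -> list[dict]:
    """Keep the best-ranked hit per digest unless duplicates are requested."""
    if include_duplicates:
        return list(hits)
    groups: dict[str, list[dict]] = {}
    for hit in hits:  # hits are assumed to arrive in relevance order
        groups.setdefault(hit["text_digest"], []).append(hit)
    # dict preserves insertion order, so group order follows the best hit's rank.
    return [
        {**members[0], "duplicate_count": len(members) - 1}
        for members in groups.values()
    ]


hits = [
    {"url": "https://example.org/a", "text_digest": "d1"},
    {"url": "https://example.org/a?session=1", "text_digest": "d1"},
    {"url": "https://example.org/b", "text_digest": "d2"},
]
for row in group_by_digest(hits):
    print(row["url"], row["duplicate_count"])
```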
  • 49. Tools • Quaerite (https://github.com/tballison/quaerite) • Copy indices Solr->ES and vice versa • List top n tokens (Solr only): TopNTokens • tika-eval (https://cwiki.apache.org/confluence/display/TIKA/TikaEval) • Token counts • Language identification • Out of vocabulary % • Digest, Text digest, Text profile
  • 50. Conclusion • It depends™ • There is no easy button, but this analysis and discovery reveal critical areas for improvement and get us closer to solutions
  • 51. Some References • Manku, G., Jain, A. and Dash, A. “Detecting near-duplicates for web crawling.” WWW’07 https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33026.pdf • Early patented work at Google: https://www.cs.umd.edu/~pugh/google/Duplicates.pdf • LSH at Uber for fraudulent trip detection: https://eng.uber.com/lsh/ • Minhash vs. SimHash: http://proceedings.mlr.press/v33/shrivastava14.pdf
  • 52. Some Other References • KNN and LSH in Elasticsearch: https://blog.insightdatascience.com/elastik-nearest-neighbors-4b1f6821bd62 • Minhash in Lucene: https://medium.com/@xingzeng/understanding-minhash-in-lucene-elasticsearch-e6799b78c0d7 • ssdeep and elastic: https://www.intezer.com/blog/intezer-analyze-community/intezer-community-tip-ssdeep-comparisons-with-elasticsearch/