Haystack Live tallison_202010_v2
Tim Allison
Duplicate and Near Duplicate Detection at Scale
1.
Duplicate and Near Duplicate Detection at Scale
Tim Allison, Ph.D.
Data Scientist/Relevance Engineer
Artificial Intelligence, Analytics and Innovative Development Organization
© 2020 California Institute of Technology. Government sponsorship acknowledged.
Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology.
2.
About me
• Data scientist (files and search), Jet Propulsion Laboratory, California Institute of Technology
• Chair/V.P. Apache Tika
• Committer Apache PDFBox, POI, Lucene/Solr, OpenNLP
• Member Apache Software Foundation
jpl.nasa.gov
10/22/20
3.
Outline
• Search system assessments, an overview of options
• Plug for text extraction assessment
• Duplicates and near duplicates – Case Study
• Exploration: Near duplicates with minhash
• Conclusion
4.
Search System Assessment: 20,000 ft view
• Offline
• Ground truth queries and expected docs
• Online
• User behavior
• User feedback
• Surveys, interviews
• Technical review
• System
• Data
5.
System Assessment
Udo Kruschwitz and Charlie Hull, “Searching the Enterprise”, Foundations and Trends® in Information Retrieval, 11(1):1-142, July 2017, p. 16.
6.
System Assessment
• Crawler configurations
• Text extraction configurations
• Schema and field configuration
• Query Parser configuration
  • Default Boolean operator
  • Fields, field boosts
  • …
7.
Data Assessment
• File types, parser coverage
• Quality of text extraction (…languages)
• Quality of metadata – dates, duplicate titles/metadata
• Liveness of documents/URLs/URL redirects
• Duplicates and near duplicates
8.
Plug for Text Extraction Assessment
9.
Out of vocabulary (OOV) – Same file, different extractors
                     Tika 1.14            Tika 1.15-SNAPSHOT
Unique Tokens        786                  156
Total Tokens         1603                 272
LangId               zh-ch                de
Common Words         0                    116
Alphabetic Tokens    1603                 250
OOV%                 1-(0/1603) = 100%    1-(116/250) = 54%
Top N Tokens (Tika 1.14): 捳敨: 18 | 獴档: 14 | 略獴: 14 | m: 11 | 杮湥: 11 | 瑵捳: 11 | 畬杮: 11 | 档湥: 10 | 搠敩: 9 | 敮浨: 9
Top N Tokens (Tika 1.15-SNAPSHOT): die: 11 | und: 8 | von: 8 | deutschen: 7 | deutsche: 6 | 1: 5 | das: 5 | der: 5 | finanzministerium: 5 | oder: 5
Fixed encoding detection between 1.14 and 1.15
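The OOV% row follows from 1 - (common words / alphabetic tokens). A minimal sketch of that arithmetic, assuming a hypothetical helper named `oov_percent` (it is not part of Tika or its evaluation tooling):

```python
def oov_percent(common_words: int, alphabetic_tokens: int) -> float:
    """Out-of-vocabulary share, as a percentage.

    OOV% = 1 - (common words / alphabetic tokens), the formula on the slide.
    """
    if alphabetic_tokens == 0:
        raise ValueError("no alphabetic tokens to compare against")
    return 100.0 * (1.0 - common_words / alphabetic_tokens)


# Figures taken from the table above.
tika_1_14 = oov_percent(common_words=0, alphabetic_tokens=1603)
tika_1_15 = oov_percent(common_words=116, alphabetic_tokens=250)
print(round(tika_1_14), round(tika_1_15))
```

A very high OOV% for a language the extractor claims to have detected is the signal that the extracted text is probably garbage, as in the mis-detected encoding case here.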
10.
Quality of text extraction, an example
https://voyager.jpl.nasa.gov/pdf/sfos2003pdf/03_10_02-03_10_19.sfos.pdf
Language Id: Nepali (Out of Vocabulary 99%)
11.
Unexplained Garbage at Beginning of File(???)
This unexplained garbage at the beginning of a file also occurs in several other PDF files identified as Nepali
12.
From analytics to action
https://aviris.jpl.nasa.gov/proceedings/workshops/02_docs/2002_Ogura_1_web.pdf
13.
Stored text vs. Optical Character Recognition
Text As Stored in File:
!"#%$& (') *,+-).' / 0 1,23 *. 457698;:;<>=75?&@78;ACB D(B7E;FHGJICBK5MLNBKOPBKF;B DJD Q R S.TVU9WNXMY[ZT]^W_S `badc 5KICedFgfh5 cji :;edF;A^5KEk<>Imln:;e[<>EnloedACICe a lo<p57Eg5Kqsr;E;<jloe[E 8;O 6hedA5Kq adc 57ItedFk:;B c qsICf;B a
Text from Tesseract OCR:
Constrained Least Squares Linear Spectral Unmixture by the Hybrid Steepest Descent Method Nobuhiko Ogura’ and Isao Yamada” 1 Introduction A closed polyhedron is the intersection of finite number of closed half spaces, i.e., the setof points satisfying finite number of lincar incqualitics, and is widely used as a constraint in various application, for example specifications or constraints in signal processing or estimation problems, resource restrictions in financial applications and feasible sets of
14.
Duplicates and Near Duplicates
15.
Experimental Setup
• Development “web_index” (~12.5 million documents)
• Slightly out of date compared with production, but close enough
• Covers internal web, but not other “document-heavy” indices
• Safer to avoid heavy computation on production cluster
• Small enough to reindex with different field settings on dev cluster
• Use existing tools/metrics – no contrib modules/hand-coded algorithms
16.
Duplicates!
17.
How big of a problem are duplicates?
First “lesson learned” in Oleksiy Kovyrin’s recent “Sprinting to a crawl: Building an effective web crawler” at ElasticON Global 2020
18.
Google has several patents for (near)duplicate detection
https://patents.google.com/?q=%22duplicate+documents%22&assignee=Google%2c+Llc&num=100&oq=assignee:(Google%2c+Llc)+%22duplicate+documents%22&sort=new
19.
Google’s Guidance for Duplicates and Search Engine Optimization (SEO)
https://support.google.com/webmasters/answer/66359?hl=en
20.
File Types – Top 10 file types in web_index
File Type             Count
text/html             8,894,038
image/gif             1,870,136
image/jpeg            1,094,937
image/png             319,710
text/plain            109,516
application/pdf       105,081
application/x-hdf     64,194
image/x-ms-bmp        26,377
application/xml       8,734
application/msword    7,414
21.
Duplicates, near duplicates
• Digests
  • Literal bytes of a file are the same
• Text Digests
  • Extracted text from a document is the same
• Text Profile Digest (see next slide)
  • Require all words
  • Drop the rarer words in a document (default)
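The difference between a digest of the raw bytes and a digest of the extracted text can be sketched as follows. This is an illustration only: the talk does not say which hash function was used, so SHA-256 and the whitespace normalization are assumptions, and `byte_digest` and `text_digest` are hypothetical helper names. Text extraction itself (for example with Tika) is assumed to have already happened.

```python
import hashlib


def byte_digest(raw: bytes) -> str:
    """Digest of the literal bytes of a file."""
    return hashlib.sha256(raw).hexdigest()


def text_digest(extracted_text: str) -> str:
    """Digest of the extracted text, after collapsing whitespace (an assumption)."""
    normalized = " ".join(extracted_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Two files whose markup differs but whose extracted text is the same.
file_a = b"<html><body><p>Mission status report</p></body></html>"
file_b = b"<html><body><div>Mission  status report</div></body></html>"
text_a = "Mission status report"
text_b = "Mission  status report"

same_bytes = byte_digest(file_a) == byte_digest(file_b)  # False
same_text = text_digest(text_a) == text_digest(text_b)   # True
```

Documents like these count as distinct under the byte digest but collapse into one group under the text digest.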
22.
Nutch’s TextProfile
Example page (https://airbornescience.jpl.nasa.gov/data):
Data Search | JPL's Earth Science Airborne Program Jump to navigation Earth Science Airborne Program JPL's Suborbital Earth Science Instruments & Measurements Home › All Products › Instrument: Fourier Transform Infrared Spectrometer (FTS) › Product Type: FTS_L2QR › Platform: C-23 Sherpa › Parameter: Atmospheric Chemistry › Platform Type: Airborne › Campaign: Carbon in Arctic Reservoirs Vulnerability Experiment (CARVE) Data Search Show Advanced search Temporal Search Start Date Stop Date Free Text Search Enter search text Spatial Search (Hold Shift to draw bounding box) + - Perform Search Sort By Popularity (All Time) Popularity (This Month) Popularity (Users) Long Name (A-Z) Short Name (A-Z) Grid Spatial Resolution Satellite Spatial Resolution Start Date Stop Date Found 0 matching products(s). Browse Products Campaign Any campaign Carbon in Arctic Reservoirs Vulnerability Experiment (CARVE) (261) Parameter Any parameter Atmospheric Chemistry (261) Instrument Any instrument Fourier Transform Infrared Spectrometer (FTS) (261) Platform Any platform C-23 Sherpa (261) Platform Type Any platform type Airborne (261) Product Type Any product type FTS_L2QR (261)
Term        Quantized Count
search      8
261         6
any         6
platform    6
type        6
airborne    4
date        4
Text Profile: “search 261 any platform type airborne date…”
Quantize counts, sort by descending order of frequency, drop quantized count below a threshold
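The three steps named above (quantize counts, sort by descending frequency, drop terms below a threshold) can be sketched in a few lines. This is a simplified illustration, not Nutch's actual implementation: the fixed `quant` and `min_quantized` parameters, the tokenization, the alphabetical tie-break, and the use of SHA-256 are all assumptions made for the example.

```python
import hashlib
import re
from collections import Counter


def text_profile(text: str, quant: int = 2, min_quantized: int = 2) -> str:
    """Build a simplified text profile: the frequent terms, most frequent first."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    # Quantize: round each count down to a multiple of `quant`.
    quantized = {term: (count // quant) * quant for term, count in counts.items()}
    # Drop terms whose quantized count falls below the threshold.
    kept = [(term, q) for term, q in quantized.items() if q >= min_quantized]
    # Sort by descending quantized count; ties are broken alphabetically.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return " ".join(term for term, _ in kept)


def text_profile_digest(text: str) -> str:
    """Digest of the profile rather than of the full text."""
    return hashlib.sha256(text_profile(text).encode("utf-8")).hexdigest()


# Two pages that differ only in a term that occurs once.
page_a = "search search any any platform platform airborne"
page_b = "search search any any platform platform sherpa"
same_profile = text_profile_digest(page_a) == text_profile_digest(page_b)
```

Because rare terms are dropped before hashing, pages that differ only in a handful of infrequent words end up with the same profile digest, which is why this digest groups far more documents than the plain text digest.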
23.
Different Digest, Different Text Digest, Same Text Profile
24.
Digests vs Text Digests vs Text Profile Digests in non-image documents
• Total non-image documents: 9.2 million
• Distinct digests: 8.6 million
• Distinct text digests: 5.2 million
• Distinct text profile (keep all words): 5.1 million
• Distinct text profile (drop infrequent words): 2.7 million
25.
Number of Non-Image Documents with a Distinct Digest
Rank        Digest    Text Digest    Text Profile Digest
digest1     27,874    2,810,868      2,810,868
digest2     10,089    73,203         489,821
digest3     1,565     27,874         73,225
digest4     1,170     10,089         63,818
digest5     1,166     7,926          58,271
digest6     1,128     2,589          27,874
digest7     1,072     2,557          25,311
digest8     990       1,911          12,222
digest9     933       1,616          11,973
digest10    841       1,573          10,089
26.
2.8 million?!
Yes! On development index. In production, there are ONLY 880k!
Error page. The Web Server encountered an unknown runtime error. Cannot display page…
27.
Initial Takeaway
• Some easy fixes
28.
Exploration: Near Duplicates with MinHash
29.
Experiments with MinHash
• Earlier proof-of-concept implemented by intern
• Filter available in Elasticsearch to allow for fuzzy hashing/near duplicate detection
• Default settings – digest 5-grams (see next slide), summarize digests into 512 tokens (buckets)
• Run a “MoreLikeThis” query – there is a more efficient algorithm, but not built into ES yet*
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-minhash-tokenfilter.html
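The idea behind MinHash can be sketched independently of Elasticsearch. The version below is a textbook-style illustration, not the Elasticsearch token filter: it uses a number of independently salted hash functions (`num_hashes`, an assumption) instead of the bucket scheme mentioned above, and the function names are hypothetical. The share of signature positions on which two sets agree estimates their Jaccard similarity.

```python
import hashlib


def minhash_signature(items: set, num_hashes: int = 128) -> list:
    """For each salted hash function, keep the minimum hash over all items."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # one salt per simulated hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions on which the two signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Synthetic shingle sets: `near` shares 90 of 110 distinct shingles with `base`.
base = {f"shingle-{i}" for i in range(100)}
near = {f"shingle-{i}" for i in range(10, 110)}
other = {f"other-{i}" for i in range(100)}

sim_near = estimated_jaccard(minhash_signature(base), minhash_signature(near))
sim_other = estimated_jaccard(minhash_signature(base), minhash_signature(other))
```

In a search engine the signature values are indexed as tokens, so finding near duplicates becomes a query for documents that share many of those tokens.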
30.
What’s a 5-gram?
• “the quick brown fox jumped over the lazy dog”
• “the quick brown fox jumped”
• “quick brown fox jumped over”
• …
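A word 5-gram is a sliding window of five consecutive words. A minimal sketch, assuming simple whitespace tokenization and a hypothetical helper named `word_ngrams`:

```python
def word_ngrams(text: str, n: int = 5) -> list:
    """Return every run of n consecutive whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


grams = word_ngrams("the quick brown fox jumped over the lazy dog")
for gram in grams:
    print(gram)
```

A sentence of nine words yields five 5-grams, the first two of which are the ones listed on the slide.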
31.
Experiments with MinHash: Findings
• Worked really well on a toy set of synthetic documents
• Performance is prohibitive on full web_index (even with stored termvectors) – estimate ~1 year to query every document in the index
• Note: speed was greatly improved by programmatically retrieving termvectors and creating own terms query, but still not acceptable
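To put the ~1 year estimate in perspective, the arithmetic below works out the average time per document that it implies. This is a back-of-the-envelope sketch: it assumes one query per document, run sequentially, over the ~12.5 million documents mentioned in the experimental setup. The talk does not report a measured per-query latency.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000
DOCUMENTS = 12_500_000                 # approximate size of web_index

# Average time per "MoreLikeThis" query implied by a one-year total.
seconds_per_query = SECONDS_PER_YEAR / DOCUMENTS
print(f"{seconds_per_query:.2f} s per document")
```

Under these assumptions, a query that takes only a couple of seconds is already too slow once it has to be repeated for every document in the index.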
32.
Experiments with MinHash: Conclusion
• There may be ways of improving performance with more shards, multithreading, smarter processing, different algorithm
• At this point, however, the problems with exact duplicates and/or text duplicates are sufficient so as not to warrant further investigation of near duplicates via minhash
33.
But why, why was MinHash SO slow?! Some ideas…
• Elasticsearch is optimized for queries of a few words, not 512 “words”
• Aside from exact duplicates, how much duplication do we have in 5-grams?
34.
Index 5-grams
• Intuition: in plagiarism detection, a single 5-gram is indicative of duplication…should be extremely rare
• Finding: NOT AT ALL RARE on web_index
• The 10,000th most common appears in 12k files!
• Most common:
  • “an unknown runtime error cannot” 2.6 million files
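Counting how many documents each 5-gram appears in is a document-frequency computation. The talk did this by indexing 5-grams in the search engine; the in-memory version below is only a small-scale illustration with a hypothetical helper name and made-up sample documents, and it would not scale to millions of documents.

```python
from collections import Counter


def ngram_document_frequency(docs: list, n: int = 5) -> Counter:
    """Count, for every word n-gram, the number of documents that contain it."""
    doc_freq = Counter()
    for text in docs:
        words = text.lower().split()
        # A set, so each n-gram is counted at most once per document.
        grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        doc_freq.update(grams)
    return doc_freq


# Made-up sample: two copies of an error page and one unrelated page.
sample_docs = [
    "the web server encountered an unknown runtime error cannot display page",
    "the web server encountered an unknown runtime error cannot display page",
    "mission timeline for the communications relay phase of the orbiter",
]
doc_freq = ngram_document_frequency(sample_docs)
most_shared = doc_freq.most_common(3)
```

Sorting 5-grams by document frequency is what surfaces error pages, boilerplate, and machine-generated logs at the top of the list.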
35.
Shared 5-grams – Some Categories of Causes
• Actual duplication or near duplication
• Boilerplate
  • Web-page based (navigation, etc)
  • Legal (copyright, branding)
• Machine generated logs
36.
Actual duplication or near duplication
37.
Boilerplate
• Webpage/Navigational
  • “science technology launch vehicle” 1.4 million files for Mars Odyssey pages
  • “content announcements events opportunities people” 500k on techconnect pages
• Legal
  • “research and development center staffed” 640k
38.
Example of Indexed Boilerplate
“science technology launch vehicle spacecraft” 1.4 million files!!!
https://mars.nasa.gov/odyssey/mission/timeline/communicationsrelay/
39.
Pause for relevance check
• If science, technology, “launch vehicle” and spacecraft appear in 1.4 million documents, how important will those words be in a user query?!
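One way to make the question concrete is inverse document frequency (IDF): the more documents contain a term, the less it contributes to a relevance score. The sketch below uses the BM25-style IDF formula as commonly implemented in Lucene-based engines; whether the system in the talk scores with exactly this formula is an assumption, the index size is the ~12.5 million figure from the experimental setup, and the 1,000-document term is a hypothetical comparison point.

```python
import math


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """BM25-style inverse document frequency."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


INDEX_SIZE = 12_500_000  # approximate size of web_index

boilerplate_idf = bm25_idf(INDEX_SIZE, 1_400_000)  # term found in 1.4 million documents
rare_idf = bm25_idf(INDEX_SIZE, 1_000)             # hypothetical term found in 1,000 documents
print(f"{boilerplate_idf:.2f} vs {rare_idf:.2f}")
```

Words inflated by indexed boilerplate get a much lower weight than they would if they occurred only in the documents that are actually about them.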
40.
Boilerpipe output
Demo: https://boilerpipe-web.appspot.com/
Available as a handler in Tika: BoilerpipeHandler
Available as a Python library: https://pypi.org/project/boilerpy3/
41.
Google is removing boilerplate
42.
Machine Generated Logs
“downlink monitor block has completed” 14k documents
43.
Takeaways from MinHash and 5-grams
• We have enough to work with for now with digests, text digests and text profile digests
• We can use 5-grams to identify:
  • Boilerplate content that we should remove if boilerpipe isn’t sufficient
  • Content that we might want to demote in relevance or remove from the index (machine generated logs?!)
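As a rough illustration of the second takeaway, the sketch below drops every token covered by a 5-gram from a list of known boilerplate 5-grams. It assumes that list has already been built (for example, from the most frequent 5-grams in the index); `strip_common_ngrams` is a hypothetical helper, not part of boilerpipe or Tika, and it returns lowercased tokens rather than the original formatting.

```python
import re


def strip_common_ngrams(text, common_ngrams, n=5):
    """Remove every token that is covered by a known boilerplate n-gram."""
    tokens = re.findall(r"\w+", text.lower())
    drop = [False] * len(tokens)
    for i in range(len(tokens) - n + 1):
        if " ".join(tokens[i:i + n]) in common_ngrams:
            # Mark all n tokens of the matching window for removal.
            for j in range(i, i + n):
                drop[j] = True
    return " ".join(tok for tok, dropped in zip(tokens, drop) if not dropped)


# Boilerplate 5-gram taken from the navigation example in this talk.
common = {"science technology launch vehicle spacecraft"}
# Hypothetical page text, for illustration only.
page = "Science Technology Launch Vehicle Spacecraft Odyssey relays data from the rovers"
cleaned = strip_common_ngrams(page, common)
```

Before removing anything from an index this way, it is worth checking a sample of the flagged 5-grams by hand, since frequent phrases can also be legitimate content.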
44.
Categories/causes of (near) duplication
• Exact duplicates
  • Same document, different URL
  • Documents with little or no text
• Near duplicates
  • Different formats: PDF vs HTML of same content
  • Versioning
  • Documents with little text
• Asymmetric duplicates (A is contained entirely within B, but B is larger), e.g. email included in reply
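Asymmetric duplicates are easy to miss with a symmetric similarity score such as Jaccard, because the larger document dilutes the overlap. A containment score, the share of A's shingles that also occur in B, captures the "A inside B" case. The sketch below is a minimal illustration with hypothetical helper names and sample text.

```python
import re


def shingles(text, n=5):
    """Set of word n-gram shingles for a text (lowercased)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Symmetric overlap: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """Asymmetric overlap: fraction of A's shingles that also occur in B."""
    return len(a & b) / len(a) if a else 0.0


# Hypothetical email and a reply that quotes it in full.
email = "please send the downlink schedule for next week by friday"
reply = "sure will do on monday " + email + " thanks"

a, b = shingles(email), shingles(reply)
jac = jaccard(a, b)          # pulled down by the extra text in the reply
con_ab = containment(a, b)   # the email is fully contained in the reply
con_ba = containment(b, a)   # the reply is only partly contained in the email
```

Comparing the two containment directions also indicates which document is the larger one, which helps when deciding which copy to keep or show.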
45.
Removal of (near) duplicates problematic if…
• “Duplicate” documents differ in other key features (same text, but different images)
• Users need to find all versions of a versioned document
• Small difference in text is important or main point of page is non-textual (see next slide)
46.
Slightly different photo metadata
47.
Recommendations, step 1
• Experiment with boilerpipe handler vs. top n 5-grams. Confirm that this doesn’t remove desired text; or identify triggers for boilerpipe handler
• Index token count, lang id, digest and text digest along with documents
• Add major sources of malignant duplicates to “skip list” at crawling stage
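The fields in the second recommendation could be computed along the following lines. This is a sketch under stated assumptions, not tika-eval's implementation: the choice of SHA-256, the whitespace and case normalization, and the field names are all hypothetical, and language identification is left out because it needs an external library.

```python
import hashlib
import re


def digest(raw_bytes):
    """Digest of the raw file bytes: catches byte-for-byte duplicates."""
    return hashlib.sha256(raw_bytes).hexdigest()


def text_digest(text):
    """Digest of normalized extracted text: catches same text, different bytes."""
    normalized = " ".join(re.findall(r"\w+", text.lower()))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def index_fields(raw_bytes, extracted_text):
    """Extra fields to store alongside each document (hypothetical names)."""
    tokens = re.findall(r"\w+", extracted_text.lower())
    return {
        "digest": digest(raw_bytes),
        "text_digest": text_digest(extracted_text),
        "token_count": len(tokens),
    }


# Two hypothetical pages with different markup but the same extracted text.
page_a = index_fields(b"<html><body><p>Mars Odyssey relay</p></body></html>",
                      "Mars Odyssey relay")
page_b = index_fields(b"<html><body><div>Mars  Odyssey relay</div></body></html>",
                      "Mars  Odyssey relay\n")
```

Storing the token count next to the digests makes it possible to treat documents with little or no text separately, since their text digests collide for uninteresting reasons.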
48.
Recommendations, step 2…some options
• Remove duplicates or prevent from insertion
• Add a duplicate identification process and
  • Group by duplicate digest in search results
  • Demote duplicates in search results
  • Allow users to select “include duplicates”
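The grouping, demotion and “include duplicates” options can be sketched as a post-processing step over ranked hits. This assumes each hit carries a `text_digest` field that was indexed earlier; the function and field names are hypothetical, and a search engine's own field-collapsing or grouping feature would normally do this work at query time.

```python
def collapse_duplicates(hits, include_duplicates=False):
    """Keep the best-ranked hit per digest; optionally append the rest after.

    `hits` is a list of dicts sorted by descending score, each with a
    'text_digest' key (hypothetical field name).
    """
    seen = set()
    primary, demoted = [], []
    for hit in hits:
        key = hit["text_digest"]
        if key in seen:
            demoted.append(hit)   # a duplicate of something ranked higher
        else:
            seen.add(key)
            primary.append(hit)   # first (best-scoring) member of its group
    return primary + demoted if include_duplicates else primary


# Hypothetical ranked results, for illustration only.
hits = [
    {"id": "a.html", "score": 9.1, "text_digest": "d1"},
    {"id": "a.pdf", "score": 8.7, "text_digest": "d1"},
    {"id": "b.html", "score": 7.2, "text_digest": "d2"},
]
default_view = [h["id"] for h in collapse_duplicates(hits)]
full_view = [h["id"] for h in collapse_duplicates(hits, include_duplicates=True)]
```

Demoting rather than deleting keeps every version reachable, which matters when users need to find all versions of a document.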
49.
Tools
• Quaerite (https://github.com/tballison/quaerite)
  • Copy indices Solr->ES and vice versa
  • List top n tokens (Solr only): TopNTokens
• tika-eval (https://cwiki.apache.org/confluence/display/TIKA/TikaEval)
  • Token counts
  • Language identification
  • Out of vocabulary %
  • Digest, Text digest, Text profile
50.
Conclusion
• It depends™
• There is no easy button, but this analysis and discovery reveal critical areas for improvement and get us closer to solutions
51.
Some References
• Manku, G., Jain, A. and Das Sarma, A. “Detecting near-duplicates for web crawling.” WWW’07 https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33026.pdf
• Early patented work at Google: https://www.cs.umd.edu/~pugh/google/Duplicates.pdf
• LSH at Uber for fraudulent trip detection: https://eng.uber.com/lsh/
• MinHash vs. SimHash: http://proceedings.mlr.press/v33/shrivastava14.pdf
52.
Some Other References
• KNN and LSH in Elasticsearch: https://blog.insightdatascience.com/elastik-nearest-neighbors-4b1f6821bd62
• MinHash in Lucene: https://medium.com/@xingzeng/understanding-minhash-in-lucene-elasticsearch-e6799b78c0d7
• ssdeep and Elasticsearch: https://www.intezer.com/blog/intezer-analyze-community/intezer-community-tip-ssdeep-comparisons-with-elasticsearch/