© 2019 The MITRE Corporation. ALL RIGHTS RESERVED.
September 12, 2019
(R)Evolving Relevance Tuning
with Genetic Algorithms
Tim Allison
Activate Conference 2019
Washington, DC
Approved for Public Release;
Distribution Unlimited.
Case Number 18-3138-12
About me
 Chair/V.P. Apache Tika
 Committer/PMC Apache PDFBox
 Committer/PMC Apache POI
 Committer Apache Lucene/Solr
 Member ASF
 Ph.D. Classical Studies
Hat Tip – Simon Hughes
Simon Hughes “Evolving The Optimal Relevancy Scoring Model at Dice.com”
https://www.youtube.com/watch?v=z4c1xU7arhc
https://github.com/DiceTechJobs/RelevancyTuning
Some Open Source Relevance Tools
 Quaerite (focus of this talk): https://github.com/mitre/quaerite
 Quepid (Open Source Connections): https://github.com/o19s/quepid
 Rated Ranking Evaluator RRE (Sease Ltd): https://github.com/SeaseLtd/rated-ranking-evaluator
 Others?!
Outline
 Introduction/Motivation
 Evolution of methods
– Generation 0
– Generation 1
– Generation 2
 Findings
 Next Steps
Search is easy!
Search Engines – A Quick Overview
Martin White, “The Technology of Search”, Search Insights 2018, The Search Network, p. 9.
http://www.flax.co.uk/blog/2018/03/26/search-insights-2018-free-independent-report-search
Figure originally published in “Searching the Enterprise”, Foundations and Trends® in Information Retrieval
Available Parameters
 14 tokenizers https://lucene.apache.org/solr/guide/7_1/tokenizers.html
 ~45 token filters (not including language-specific token filters – see next slide)
https://lucene.apache.org/solr/guide/7_1/filter-descriptions.html
 Query parsers
 Boosting: fields, queries, functions
 Phrasal boosting/shingling
 Query operators, minimum should match, should, must, not
 Token/field based scoring – best_fields, most_fields, cross_fields
 Synonym lists, taxonomies
 Similarity scoring parameters (with BM25)
 Elevate
 External signal enrichment
– manual or automatic (NLP – entity extraction, categorization, etc.)
 Reranking via machine learning (Learning to Rank)
Each Token Filter Can Have Many Parameters
<filter class="solr.WordDelimiterFilterFactory"
protected="protwords.txt"
generateWordParts="1"
generateNumberParts="1"
catenateWords="1"
catenateNumbers="1"
catenateAll="0"
splitOnCaseChange="0"
preserveOriginal="1"/>
What to do, what to do…
“Relevant Search
With applications for Solr and Elasticsearch”
Doug Turnbull and John Berryman
https://www.manning.com/books/relevant-search
Thank you, Doug Turnbull and John Berryman
for permission to use the search engineer in this talk!
Ground-truth based relevance tuning
 Requires ground truth
 Good ground truth
 Overfitting…be careful!
– Please use responsible train/test splits
– LOL: https://www.gwern.net/Tanks
Example Ground Truth
Thank you, Doug Turnbull,
John Berryman and “Open
Source Connections” for
the inspiration for using
tmdb and for generating
and sharing a ground truth
set!
Generation 0: Run Some Experiments
 Assuming a static corpus, results should be reproducible
 Keep track of previous experiments
 Allow standard output and flexibility of scoring metrics
Risk of overfitting
Key components for running experiments
{
"scorers": [
],
"experiments": {
}
}
Basic Experiment Configuration: Scorers and Experiments
{
  "scorers": [
    {
      "class": "NDCG",
      "atN": 10,
      "params": {
        "useForTrain": true,
        "useForTest": true,
        "exportPMatrix": true
      }
    }, …
  ],
  "experiments": {
    "title": {
      "searchServerUrl": "http://.../solr/tmdb",
      "query": {
        "edismax": {
          "qf": [
            "title"
          ]
        }
      }
    }, …
  }
}
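For readers unfamiliar with the NDCG scorer configured above, here is a minimal Python sketch of NDCG at N. It assumes graded relevance judgments, linear gain, and the common log2 discount; the function names are illustrative only, and Quaerite's own scorer may use a different gain formula or ideal ranking.

```python
import math

def dcg(grades):
    """Discounted cumulative gain: the grade at rank i is divided by log2(i + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(grades, start=1))

def ndcg_at_n(ranked_grades, all_judged_grades, n=10):
    """NDCG@n: DCG of the returned ranking divided by DCG of the ideal ranking."""
    ideal = sorted(all_judged_grades, reverse=True)[:n]
    ideal_dcg = dcg(ideal)
    if ideal_dcg == 0:
        return 0.0
    return dcg(ranked_grades[:n]) / ideal_dcg

# Hypothetical judgments for one query: grades of the returned documents in rank order.
returned = [3, 0, 2, 0, 1]
print(round(ndcg_at_n(returned, all_judged_grades=[3, 2, 1]), 3))
```

A perfect ranking scores 1.0, and a ranking that returns no judged-relevant documents scores 0.0.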
Scorers – More Scorers
"scorers": [
  {
    "class": "AtLeastOneAtN",
    "atN": 1
  },
  {
    "class": "AtLeastOneAtN",
    "atN": 5
  },
  {
    "class": "AtLeastOneAtN",
    "atN": 10
  },
  {
    "class": "NDCG",
    "atN": 10,
    "params": {
      "useForTrain": true,
      "useForTest": true,
      "exportPMatrix": true
    }
  },
  {
    "class": "TotalDocsReturned"
  },
  {
    "class": "ZeroResults"
  }
]
Experiments – A Slightly More Interesting Experiment
"title_cast_pf_tie_0_8_mm2": {
  "searchServerUrl": "http://localhost:8983/solr/tmdb",
  "query": {
    "edismax": {
      "qf": [
        "title^10",
        "cast^2"
      ],
      "tie": 0.8,
      "pf": [
        "title^10",
        "cast^2"
      ],
      "q.op": {
        "mm": "2"
      }
    }
  }
}
Output: Per Query/Per Experiment Scores
Output: Per Experiment Summary Analytics
Output: Pairwise P-Value for Diffs in NDCG@10
Generation 1: Automatically Generate Experiments
 If I know the parameters I want to experiment with, why should I have to specify the
combinations?!
 Different analyzer chains
 field boosts and ranges
Risk of overfitting
Generation 1a: Automatically Generate All Experiments
(Brute Force/Grid Search)
 If I know the parameters I want to experiment with, why should I have to specify the
combinations?!
 Different analyzer chains
 field boosts and ranges
 boolean/min should match
 tie
 pf, pf2, pf3, ps, ps2, ps3
 bq, boost
Risk of overfitting
Experiment Features – Scorers and FeatureFactories
{
  "scorers": [
    {
      "class": "NDCG",
      "atN": 10
    }
  ],
  "featureFactories": {
    "urls": [ "http://localhost:8983/solr/tmdb" ],
    "query": {
      "edismax": {
        …
      }
    }
  }
}
Query Features – A Bit More Interesting
"query": {
  "edismax": {
    "qf": {
      "fields": [ "title", "overview", "cast" ],
      "defaultWeights": [ 0.0, 2.0, 10.0 ],
      "minSetSize": 1,
      "maxSetSize": 3
    },
    "tie": [0.0, 0.2, 0.8],
    "q.op": {
      "operators": ["or", "and"],
      "mmInts": [1, 2, 3]
    }
  }
}
Generates 390 experiments!
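One reading of this configuration that reproduces the count of 390 is sketched below in Python. It assumes that a weight of 0.0 means "field not used", and that "q.op" contributes five alternatives (the two operators plus the three minimum-should-match values). That breakdown is an inference from the arithmetic, not something the slide states about how Quaerite enumerates experiments.

```python
from itertools import product

weights = [0.0, 2.0, 10.0]          # "defaultWeights"; 0.0 is read as "field not used"
fields = ["title", "overview", "cast"]

# Every assignment of a weight to each field, minus the one where no field is used.
qf_options = [combo for combo in product(weights, repeat=len(fields))
              if any(w > 0 for w in combo)]

tie_options = [0.0, 0.2, 0.8]
# Assumed reading of "q.op": the two operators plus each minimum-should-match value.
qop_options = ["or", "and", "mm=1", "mm=2", "mm=3"]

experiments = list(product(qf_options, tie_options, qop_options))
print(len(qf_options), len(experiments))  # 26 390
```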
Parameterizable Strings
"boost": [
  "max(recip(ms(NOW/DAY, ds_field_last_modified), 3.16e-11,[1,2,3],[$1]), [0.1, 0.9])"
],
Expands to:
"max(recip(ms(NOW/DAY, ds_field_last_modified), 3.16e-11,1,1), 0.1)"
"max(recip(ms(NOW/DAY, ds_field_last_modified), 3.16e-11,2,2), 0.1)"
…
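The expansion of a parameterizable string can be sketched in Python as follows. This is not Quaerite's parser: it assumes that each bracketed list is a set of alternatives, that `[$1]` repeats whatever value was chosen for the first set, and that `[0.1, 0.9]` is a two-value set (it may equally denote a range in the real tool). The `expand` helper is hypothetical.

```python
import re
from itertools import product

BRACKET = re.compile(r"\[([^\[\]]*)\]")

def expand(template):
    """Expand [a,b,c] value sets; [$n] repeats the value chosen for the n-th set."""
    groups = [m.group(1).strip() for m in BRACKET.finditer(template)]
    # Value sets are the bracket groups that are not back-references.
    sets = [[v.strip() for v in g.split(",")] for g in groups if not g.startswith("$")]
    results = []
    for choice in product(*sets):
        remaining = iter(choice)

        def fill(match):
            body = match.group(1).strip()
            if body.startswith("$"):
                return choice[int(body[1:]) - 1]   # back-reference, 1-based
            return next(remaining)                 # next value set, left to right

        results.append(BRACKET.sub(fill, template))
    return results

template = ("max(recip(ms(NOW/DAY, ds_field_last_modified), "
            "3.16e-11,[1,2,3],[$1]), [0.1, 0.9])")
for line in expand(template):
    print(line)
```

Under these assumptions the template yields six strings, including the two shown on the slide.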
Permutation explosion – Beware!
Number of Fields | Number of Experiments, No Weights* | Number of Experiments, Two Weights*
2 | 3 | 8
3 | 7 | 26
4 | 15 | 80
5 | 31 | 242
6 | 63 | 728
7 | 127 | 2,186
*No Weights: a given field may or may not exist
*Two Weights: a given field may not be used or have one of two weights, e.g. text^2, text^10
And, that’s just field weights!!!
The following are held constant: tie, operator, pf, pf2, ps, ps2, bq, boost
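The counts in the table follow a simple pattern: each field is either unused or takes one of its allowed weights, and the single combination in which every field is unused is dropped. A short Python check (the function name is illustrative):

```python
def experiment_count(num_fields, weights_per_field):
    """Each field is unused or takes one of the weights; drop the all-unused case."""
    return (weights_per_field + 1) ** num_fields - 1

# "No Weights" is one option per field (present); "Two Weights" is two options per field.
for n in range(2, 8):
    print(n, experiment_count(n, 1), experiment_count(n, 2))
```

So the "No Weights" column is 2^n - 1 and the "Two Weights" column is 3^n - 1, before any other parameter is varied.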
Generation 1b: Automatically Generate Random Experiments
(Random Search)
 If I know the parameters I want to experiment with, how about running only SOME
combinations?!
 Different analyzer chains
 field boosts and ranges
 boolean/min should match
 tie
 pf, pf2, pf3, ps, ps2, ps3
 bq, boost
Risk of overfitting
Generation 2: Genetic Algorithm
 Perhaps I could improve on random search? At each generation, only let the top
experiments into the next generation – mutate, crossover, random…repeat…
 Different analyzer chains
 field boosts and ranges
 boolean/min should match
 tie
 pf, pf2, pf3, ps, ps2, ps3
 bq, boost
Risk of overfitting
Genetic Algorithm Terms
 Population
 Generations
 Operations
– Random
– Crossover
– Mutate
Genetic Algorithm Basics
[Figure: each generation is a population of experiments scored by NDCG@10. Generation 0 is followed by Generation 1, which is built with Crossover, Mutate, and Random operations, and so on through Generation X.]
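The loop described on this slide can be sketched as a toy genetic algorithm in Python. Everything here is an assumption made for illustration: the search space, the operators, and especially the `score` function, which stands in for running an experiment against the search server and computing NDCG@10. It is not Quaerite's implementation.

```python
import random

def evolve(score, random_candidate, mutate, crossover,
           population_size=20, generations=5, keep=5, seed=0):
    """Keep the top scorers, then refill by crossover, mutation, and random candidates."""
    rng = random.Random(seed)
    population = [random_candidate(rng) for _ in range(population_size)]
    for _ in range(generations):
        survivors = sorted(population, key=score, reverse=True)[:keep]
        children = list(survivors)
        while len(children) < population_size:
            op = rng.choice(["crossover", "mutate", "random"])
            if op == "crossover":
                a, b = rng.sample(survivors, 2)
                children.append(crossover(a, b, rng))
            elif op == "mutate":
                children.append(mutate(rng.choice(survivors), rng))
            else:
                children.append(random_candidate(rng))
        population = children
    return max(population, key=score)

# Hypothetical search space: one field weight and a tie value.
WEIGHTS, TIES = [0.0, 2.0, 10.0], [0.0, 0.2, 0.8]

def random_candidate(rng):
    return {"title_weight": rng.choice(WEIGHTS), "tie": rng.choice(TIES)}

def mutate(candidate, rng):
    return {**candidate, "tie": rng.choice(TIES)}

def crossover(a, b, rng):
    return {"title_weight": a["title_weight"], "tie": b["tie"]}

def score(candidate):
    # Stand-in for NDCG@10 measured against ground truth.
    return candidate["title_weight"] / 10.0 - abs(candidate["tie"] - 0.2)

best = evolve(score, random_candidate, mutate, crossover)
print(best, score(best))
```

Because the survivors are carried over unchanged, the best score never decreases from one generation to the next.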
Interlude: How Does this Differ from Learning to Rank (LTR)?
 Still need:
– All the sound search engineering decisions (sane analysis chain, etc.)
– Ground truth; good ground truth
 Difference:
– Learns settings for the overall initial search, not a reranking function applied to (typically) a subset of the results
Generation 2: Genetic Algorithm – Cross-fold Validation Built-in
 Perhaps I could improve on random search? At each generation, only let the top
experiments into the next generation – mutate, crossover, random…repeat…
 Different analyzer chains
 field boosts and ranges
 boolean/min should match
 tie
 pf, pf2, pf3, ps, ps2, ps3
 bq, boost
Risk of overfitting
Cross-fold Validation
N-Fold Cross-Validation
[Figure: four folds (Fold 0 to Fold 3), each with a Training portion and a Testing portion. Testing NDCG@10 for the four folds: 0.45, 0.47, 0.50, 0.42. Average testing NDCG@10: 0.46.]
Results Per Fold
FOLD 0 TRAINING
experiment 'train_fold_0_gen_4_exp_2': .678
experiment 'train_fold_0_gen_4_exp_7': .678
experiment 'train_fold_0_gen_0_exp_13': .662
experiment 'train_fold_0_gen_2_exp_13': .640
experiment 'train_fold_0_gen_2_exp_12': .640
experiment 'train_fold_0_gen_1_exp_3': .640
experiment 'train_fold_0_gen_3_exp_4': .640
experiment 'train_fold_0_gen_3_exp_7': .640
experiment 'train_fold_0_gen_3_exp_3': .640
experiment 'train_fold_0_gen_3_exp_2': .640
FOLD 0 TESTING
experiment 'test_fold_0_gen_4_exp_2': .552
Results: Overall, across folds
FINAL RESULTS ON TESTING:
experiment 'test_fold_2_gen_4_exp_1': .790
experiment 'test_fold_0_gen_4_exp_2': .552
experiment 'test_fold_1_gen_3_exp_6': .540
mean: .627
median: .552
stdev: .141
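The summary figures above can be reproduced from the three per-fold testing scores with Python's standard library. Note that the reported stdev matches the sample standard deviation (n - 1 in the denominator); that is an observation from the arithmetic, not a statement about the tool's internals.

```python
import statistics

test_ndcg = [0.790, 0.552, 0.540]   # per-fold testing scores from the output above

mean = statistics.mean(test_ndcg)
median = statistics.median(test_ndcg)
stdev = statistics.stdev(test_ndcg)  # sample standard deviation (n - 1)

print(f"mean: {mean:.3f}")
print(f"median: {median:.3f}")
print(f"stdev: {stdev:.3f}")
```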
Initial Findings
The Good: Boosted NDCG@10 from 0.25 to 0.3
The Bad: Worse than baseline on huge parameter set with insufficient(?) generations
The Great: I can spend more time on feature engineering/signal enrichment.
Initial Findings – L-Value
Next Steps
 Documentation
 Documentation
 Documentation
 Documentation
 Documentation
 Documentation
 Documentation
 Documentation
 Documentation
 Documentation
 Documentation
 Documentation
Next Steps
 Documentation
 Finalize (ish) API for 1.0.0 release
 Add ground truth-free measures (overlap, rank correlation)
 Add descriptors for features so that the results are somewhat interpretable
 Bayesian optimization?!
Questions?
Quaerite
– https://github.com/mitre/quaerite
– https://github.com/mitre/quaerite/blob/master/quaerite-examples/README.md
Contact:
– tallison@apache.org
– @_tallison
On the one hand…on the other
 On the one hand
– This amount of precise control is great
 On the other
– Permutations are mind-boggling
– Defaults used to be abysmal, but they are much better now…generally
 On the third hand
– The lists and commercial support are amazingly responsive and helpful
Debts of Gratitude
 David Smiley
 Nick Burch
 Chris Mattmann
 Tilman Hausherr
 Dominik Stadler
 Fellow devs:
– Apache Lucene/Solr, Apache Commons, Apache POI, Apache PDFBox, Apache Tika
 ASF Community and users!
 Common Crawl and govdocs1
 Rackspace
Overview
 Intro to Tika: content and metadata extraction in the ETL stack
 Motivation for tika-eval: what can go wrong?
 tika-eval overview and workflow – single vm
 tika-eval at scale
ApacheCon 2019, ApacheCon 2017
Content Extraction and HLT
[Figure: content extraction turns bytes into text; traditional human language technologies then operate on that text.]
Bytes: 1001010010010010001001
Text: أرنولد ألويس شوارزنيجر (ولد في 8 أغسطس 1947، في ستيريا، النمسا)
Machine Translation: Arnold Alois Schwarzenegger (born August 8, 1947, in Styria, Austria)
Entity Extraction: أرنولد ألويس شوارزنيجر (ولد في 8 أغسطس 1947، في ستيريا، النمسا)
Search: “أرنولد شوارزنيجر”~2 = Search: “Arnold Schwarzenegger”~2
High Level Components of a Media Processing Stack
[Figure: components shown: Files; Forensics: Carving and Advanced Methods; Text Extraction and Metadata Extraction; Structured Data-store; Search/Entity Extraction/MT, etc.; User Interface]
Let’s not forget Metadata!
 Various formats store useful information
 Who: author (first, last, commenters, editors), digital signature, company, from/to/cc/bcc
(emails)
 What: hardware version/name, software version/name, globally unique file/heritage id (XMP),
title, keywords, description
 Where: geo (latitude, longitude), file location (file paths embedded inside documents)
 When: created, last modified, last printed
 Beyond the standard types…custom metadata
Example Application: Search
When Things Go Wrong with Text Extraction
What the User Sees in a Search System
[Figure: components shown: Content/Metadata Extraction; Indexer/Search System; User Interface; Structured Data-store]
When Things Go Wrong with a Foundation
Photo: W. Lloyd MacKenzie, via Flickr @ http://www.flickr.com/photos/saffron_blaze/
What can go wrong? Basic problems: thrown exceptions
 Parser has a problem with non-corrupt file (and admits it…thank you!!!)
 Password/access protected files
 Format version not handled (add new parser?)
 Corrupt files – can’t be opened by primary application or parsed by other
parsers
 Corrupt files – slight variant from spec/other parsers can handle it
 Truncated files
Note: some text/metadata may or may not be
extracted before the exception is thrown
What can go wrong? Catastrophic problems
 OutOfMemoryError – potentially corrupting the JVM
– Inefficient parsers (DOM vs. SAX) on rare docx (TIKA-2170) and pptx (TIKA-2201). NOTE: with multithreaded garbage collection, a single thread running Tika can cause a quad-core system to grind to a snail’s pace before hitting OOM.
– Four bytes of a compressed file (TIKA-2330)
 Slowly building memory leak
– See above on quad-core, gc and snails (TIKA-2180?)
 Permanent Hang
– TIKA-1132
 Security Vulnerabilities
– XXE (CVE-2016-4334), arbitrary code execution (CVE-2016-6809)
These are extremely rare, and we try to
fix them when we’re aware of them!
What can go wrong? Hidden problems (no exceptions!)…
 Garbled text
– From slightly to…fully
 Missing text/metadata
– From missing some text to … no text at all
 Missing attachments
 Silently swallowed exceptions of embedded documents
– Classic Tika xhtml silently swallows embedded exceptions!!!
Corrupt Text (Upgrade from PDFBox 1.8.6->1.8.7)
Missing Text (TIKA-1130)
Document available: https://issues.apache.org/jira/browse/TIKA-1130
When Things Go Not as Well as They Might with Content Extraction – OCR
Image: [not reproduced]
Text Extracted:
I9 There was documcntation of calibration but not ofobscrvation of
tlic actual iiionitoring of tlic critical limits during production.
Search Results: [not reproduced]
Take-away #1
 If you don’t evaluate content extraction…
You don’t know what you can’t find
Take-away #2
A small problem for me can be a big problem for you!
TIKA-1302: The Dream
 Motivation
– All of the above
– We have only roughly 1,000 test files in unit tests in Apache POI, Apache PDFBox and
Apache Tika
– Apache POI/PDFBox/Tika mistakenly made me a committer
 Run Tika on much larger corpus nightly/weekly
 Automatically recognize regressions
Available since Apache Tika 1.15
tika-eval
High-level overview
 tika-eval’s scope
– Single vm, file share to file share (with embedded H2 db), ~few million files is a reasonable
size
– Not currently cloud-scale
 Random sampling – should be good enough
 Our Jira is open and committers are standing by!
 tika-eval’s two modes
– Profile single extraction run
– Compare two extraction runs
 Ground truth vs. particular tool
 Tool A vs. tool B
 Tool A with settings X vs. Tool A with settings Y
Definitions
 “original documents” or “container documents” – the original binary documents from
which you’d like to extract text, whether or not they actually have attachments.
 “embedded documents” – any document contained within another document,
including those that only ever exist as embedded docs: emf/wmf/xmp/xfa.
 “extract” – .txt or .json representation of the extracted text/metadata.
– tika-eval was designed for .json
 RecursiveParserWrapper via API
 (-J) for tika-app
 /rmeta for tika-server
– tika-eval can handle .txt files – details on our wiki
Why the RecursiveParserWrapper?
Classic XHTML
<?xml version="1.0" encoding="UTF-8"?>
<meta name="Content-Type" .../>
…
<p>embed_0 </p>
<p><div class="embedded" id="rId7"/>
<p>embed1.zip</p>
<div class="embedded" id="embed1/embed1a.txt"/>
<div class="package-entry">
<p>embed_1a</p>
</div>
...
• Metadata from embedded docs is lost
• Exceptions from embedded docs are swallowed
• Metadata from the container document may be
incomplete
RecursiveParserWrapper
[
{
"Application-Name": "Microsoft Office Word",
"Content-Length": "27082",
"Content-Type": "application/....wordprocessingml.document",
"X-TIKA:content": "embed_0 ",
...
},
{
"Content-Type": "text/plain; charset=ISO-8859-1",
"Last-Modified": "2014-06-04T04:08:28Z",
"X-TIKA:content": "embed_1a",
"X-TIKA:embedded_resource_path": "/embed1.zip/embed1a.txt",
...
},
{
"Content-Type": "application/zip",
"Last-Modified": "2014-06-04T04:09:40Z",
"X-TIKA:content": "embed4.txt",
"X-TIKA:embedded_resource_path": "/embed1.zip/embed2.zip/embed3.zip/embed4.zip"
...
}, ...]
• Embedded metadata (e.g. mime/author/lat-long, etc.) is retained
• Embedded exceptions are stored in a metadata key
• All metadata is extracted and stored
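To make the shape of this output concrete, here is a minimal Python sketch that reads an extract of this form and lists one row per document. The sample is abbreviated from the example above; the `summarize` helper and the "(container)" label are illustrative names, not part of Tika.

```python
import json

# Abbreviated extract modeled on the example above; real extracts carry many more keys.
extract = json.loads("""
[
  {"Content-Type": "application/....wordprocessingml.document",
   "X-TIKA:content": "embed_0 "},
  {"Content-Type": "text/plain; charset=ISO-8859-1",
   "X-TIKA:content": "embed_1a",
   "X-TIKA:embedded_resource_path": "/embed1.zip/embed1a.txt"}
]
""")

def summarize(metadata_list):
    """The first entry is the container document; later entries are embedded documents."""
    rows = []
    for md in metadata_list:
        path = md.get("X-TIKA:embedded_resource_path", "(container)")
        rows.append((path, md.get("Content-Type"), len(md.get("X-TIKA:content", ""))))
    return rows

for row in summarize(extract):
    print(row)
```

Because every embedded document gets its own metadata object, its mime type and extracted text length stay visible instead of being folded into the container's text.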
Workflow – Profile
1. Generate extracts into a directory (say, my_extracts) that mirrors the directory structure of the original documents, appending “.txt” or “.json” to each file name
2. Run the profiler to populate the in-process H2 DB
java -jar tika-eval.jar Profile -extracts my_extracts -db my_db
3. Dump reports
java -jar tika-eval.jar Report -db my_db
Excel reports will be dumped to the reports directory
Workflow – Compare
1. Generate two sets of extracts into directories (say, my_extractsA and my_extractsB) that mirror the directory structure of the original documents, appending “.txt” or “.json” to each file name
2. Run Compare to populate the in-process H2 DB
java -jar tika-eval.jar Compare -extractsA my_extractsA -extractsB my_extractsB -db my_db
3. Dump reports
java -jar tika-eval.jar Report -db my_db
Excel reports will be dumped to the reports directory
Workflow – StartDB
 Start db:
java -jar tika-eval.jar StartDB
 Open browser to localhost:8082
 Select db (full path!):
– jdbc:h2:/C:/data/my_db
 Notes on db structure: https://wiki.apache.org/tika/TikaEvalDbDesign
Reports (Profile)
 Metadata – count of metadata values
 Attachments – counts
 Mimes – mime counts for containers and embedded docs
 Exceptions
– Counts by type (e.g. password vs. actual exception)
– Counts by mime
– Counts by normalized stacktrace
– All stack traces
 Content
– Language id
– Word count
– Common words count
– Word length stats
– Page count
Reports (Compare)*
 Metadata – comparison counts A to B
 Attachments – comparison counts A to B
 Mimes
– Comparison mime counts for containers and embedded docs
– Counts of mime changes mimeA->mimeB
 Exceptions
– Comparisons of counts by mime
– Counts by mime
– Counts by normalized stacktrace
– All stack traces
 Content
– Language id
– Word count
– Word length stats
– Page count
* Includes Profile data for both A and B and then also some comparison reports
Content – “Common words” and their Utility in Profile
 Top 30k most common words per language* in Leipzig Corpus**
 To find PDFs that are mostly image only:
– number of words/number of pages
 To find very corrupt text***:
– “In vocabulary %”:(number of common words/number of
alphabetic words)
– “Out of vocabulary (OOV)%”: 1-“in vocabulary %”
* Many thanks, Apache Lucene!
** http://wortschatz.uni-leipzig.de/en/download/ and Apache
OpenNLP’s: https://svn.apache.org/repos/bigdata/opennlp/
*** Metric was recommended by Tilman Hausherr
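The two heuristics on this slide reduce to simple ratios. The Python sketch below uses illustrative function names, and its sample numbers (116 common words out of 250 alphabetic tokens, and 0 out of 1603) are taken from the comparison example later in this deck.

```python
def oov_percent(common_words, alphabetic_tokens):
    """Out-of-vocabulary % = 1 - (common words / alphabetic words), as a percentage."""
    if alphabetic_tokens == 0:
        return None  # no alphabetic text to judge
    return 100.0 * (1 - common_words / alphabetic_tokens)

def words_per_page(word_count, page_count):
    """Low values hint at PDFs that are mostly images."""
    return word_count / page_count if page_count else None

print(round(oov_percent(116, 250)))  # 54
print(round(oov_percent(0, 1603)))   # 100
```

A high OOV% flags likely junk text, and a very low words-per-page value flags PDFs that may need OCR.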
Content comparisons
 Similarity metrics between A and B
– how many words in common/total number of words (with counts normalized to
0/1 per doc)
– how many words in common/total number of words (with actual counts)
 Improvement in “common words”
– number of Common Words in B – number of Common Words in A
– Per mime
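One possible reading of these metrics, sketched in Python: the slide does not define "total number of words", so this sketch assumes it is the combined token total of A and B, with shared tokens counted on both sides (a Dice-style ratio). The function names are illustrative, and tika-eval's actual formula may differ.

```python
from collections import Counter

def overlap(tokens_a, tokens_b, binary=False):
    """Shared tokens over total tokens; binary=True counts each distinct word once."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    if binary:
        a, b = Counter(set(a)), Counter(set(b))
    shared = sum((a & b).values())          # per-word minimum of the two counts
    total = sum(a.values()) + sum(b.values())
    return 2 * shared / total if total else 0.0

def common_word_gain(common_words_a, common_words_b):
    """Improvement in "common words": count in B minus count in A."""
    return common_words_b - common_words_a

a = ["the", "the", "cat"]
b = ["the", "cat", "cat"]
print(overlap(a, b), overlap(a, b, binary=True))
```

With actual counts the two extracts above share two of three tokens each; with counts normalized to 0/1 they share both distinct words and the overlap is 1.0.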
Content Comparison Example – Junk -> Better
                     Tika 1.14            Tika 1.15-SNAPSHOT
Unique Tokens        786                  156
Total Tokens         1603                 272
LangId               zh-ch                de
Common Words         0                    116
Alphabetic Tokens    1603                 250
OOV%                 1-(0/1603) = 100%    1-(116/250) = 54%
Top N Tokens (Tika 1.14): 捳敨: 18 | 獴档: 14 | 略獴: 14 | m: 11 | 杮湥: 11 | 瑵捳: 11 | 畬杮: 11 | 档湥: 10 | 搠敩: 9 | 敮浨: 9
Top N Tokens (Tika 1.15-SNAPSHOT): die: 11 | und: 8 | von: 8 | deutschen: 7 | deutsche: 6 | 1: 5 | das: 5 | der: 5 | finanzministerium: 5 | oder: 5
Overlap: 0%
Increase in Common Words: 116
Content Comparison Example – Small Regression
                     Tika 1.14                Tika 1.15-SNAPSHOT
Unique Tokens        1916                     1995
Total Tokens         14187                    14302
LangId               en                       en
Common Words         7498                     7409
Alphabetic Tokens    13472                    13587
OOV%                 1-(7498/13472) = 44%     1-(7409/13587) = 45%
Top 10 Unique Tokens (Tika 1.14): applicant's: 8 | 1.69: 1 | arbitrary: 1 | collecting: 1 | constitution: 1 | e112: 1 | ei.b: 1 | equating: 1 | magnetically: 1 | o: 1
Top 10 Unique Tokens (Tika 1.15-SNAPSHOT): ss: 106 | applicantis: 8 | ssss: 7 | iactsi: 4 | ithe: 4 | imeansi: 3 | iprocessi: 3 | calculations.i: 2 | iabstract: 2 | idata: 2
Overlap: 95.5%
Increase in Common Words: -89
Taking tika-eval public
 Rackspace kindly hosts a vm for ongoing evals (TIKA-1302)
 1 TB (~3 million files) from Common Crawl and govdocs1
 Collaborating with Apache PDFBox and Apache POI to run evals as part of the release
process
 Critical to identifying regressions and building new parsers
 Stacktraces created by public documents are critical for the hey-I’m-getting-this-parse-exception-but-can’t-share-the-document-with-you problem
 See Dominik Stadler’s Common Crawl download tool:
https://github.com/centic9/CommonCrawlDocumentDownload
Community collaboration
Thank you, Tilman Hausherr!
Limits of Automated Metrics without Ground Truth
 More exceptions – We have a problem! Wait…
– New parser, we were entirely skipping those file types before
– Parser was yielding junk before on this file, now it is letting us know there’s a problem
 Fewer exceptions – Great! Wait…
– Mime detection not working – skipping files that we used to parse (theoretical)
– Now we’re getting junk
 More Common Words – Great! Wait…
– Serious bug that duplicates worksheets in some xlsx files (TIKA-2356…my fault…ugh!)
– More non-html markup/xml tags incorrectly getting through
 Fewer Common Words – Problem! Wait…
 More attachments, fewer attachments (Your turn!)
TIKA-1302: The Ticket is Grown; the Dream is Gone
 Without ground truth, humans need to interpret differences
 This only makes building a gui more important!!! (TIKA-1334)
 Collaborative tagging? As a human reviews diffs, flag document as “hopeless” or a given extract as “great” or “awful” (Again, thanks to Tilman Hausherr)
 Dream of TIKA-1302 ran into reality, but we’re far better than where we were
Scaling tika-eval
tika-eval at Scale (with tika-eval >= 1.22)
 Motivation
– Running into unacceptably long processing times for analysis on ~3 million documents in H2. Current workaround: Postgres…still not speedy (~40 minutes)
– Big data frameworks are built for analysis and scale! Use them!
 General Process – two steps
1. Calculating content statistics (as of Tika 1.22: decoupled tika-eval text stats calculator)
2. Rollups/aggregations
 Limitations
– Moving beyond the file share leads to an explosion of big data frameworks…no single
solution
– There is no one answer…must be customized per framework…
Step 1: Calculating Content Statistics – An Example with SolrJ
https://github.com/tballison/tika-addons/tree/master/tika-eval-solrj
Average OOV% in English PDFs
That document all the way on the left in the previous slide
Which PDFs had little to no content?
Most Common Stacktraces for Epubs?
http://localhost:8983/solr/tika-eval/select?facet.pivot=mime_facet,stacktrace_facet&facet=on&fq=mime:epub&q=*:*
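The same request can be assembled programmatically so that the parameters are URL-encoded correctly. This Python sketch only builds the URL; the host, collection, and field names are the ones shown on the slide, and nothing is sent to a server.

```python
from urllib.parse import urlencode

params = {
    "q": "*:*",
    "fq": "mime:epub",
    "facet": "on",
    "facet.pivot": "mime_facet,stacktrace_facet",
}
url = "http://localhost:8983/solr/tika-eval/select?" + urlencode(params)
print(url)
```

The pivot facet groups the matching documents by mime type and, within each mime type, by stack trace.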
From Analytics to Action
 PDFs – when to run OCR
 Charset detection – which detector to trust…building a better charset detector
Prioritizing OCR via “sort oov desc”
Stored text vs. OCR’d text
Text as stored in PDF (OOV 84%):
GO Obpermtmn of the Establishment
I : S I . 11.Itmpacudo~+r~ A pn~ccssing1C:ontinent:tl. San l'edro S~tla.I-Iondurus; April
7, 2UO8 ( h ~ c l ' s l a t ~ g h ~ ~ r Ilurillg optraticlnol sanitation insprutiol~in
tlre slaughter rrjom, carcuss i ; ~ 5icr.c c~l~scl-isl1~1t1k3 r'd
contacting visccrri cart wllecls. 'I'his .as cc>rrcctcc[immcdiiiiuly by tllc
csli~blishnicn~I T C . I S I ~ H I I ~ ~ . 1Kcgu1atc11.yrulirencc: 9CIf:li 41 (1.1 3(c)(
51 NAME OF AUDlTOR
Text extracted from Tesseract (OOV 30%):
60 Observation of the Establishment
Est. 12. Empacadora Continental. San Pedro Sula, Honduras; April 7, 2008 (beet aughter &
processing) 10 During operational sanitation inspection in the slaughter room, carcass fore
shanks were observed contacting viscera cart wheels. This was corrected immediately by the
establishment personne! [Regulatory reference: 9CFR 416.13(¢)|
AND DATE 51 NAME OF AUDITOR
WWGD*?
*WWGD: “What would Google Do?” (h/t Grant Ingersoll)
Text in Google’s Cache:
60 Observation of the Establishment Est. 12. Empacadora Continental, San
Pedro Sula, Honduras; April 7, 2008 (heel slaughter & processing) 10 During
operational sanitation inspection in the slaughter room, carcass fore shanks
Tiere observedcontacung viscera cart wheels. This was corrected immediately
by the establishment personnel.Regulatory reference: 9CFR 416.13(0
See also: Popat, Ashok. (2009). A panlingual anomalous text detector. 201-204. 10.1145/1600193.1600237.
Charset detection – HTML meta charset (HTMLDefault) vs. Mozilla’s chardet (Universal)
HTMLDefault | Universal | HTMLDefault Sum Common Tokens | Universal Sum Common Tokens | Difference in Sums
UTF-8 | EUC-JP | 4,437 | 481,919 | 477,482
EUC-JP | Shift_JIS | 1,512 | 391,126 | 389,614
UTF-16 | windows-1252 | 1,240 | 368,496 | 367,256
UTF-16 | UTF-8 | 2,563 | 321,717 | 319,154
EUC-JP | UTF-8 | 764,957 | 1,047,029 | 282,072
windows-1255 | UTF-8 | 17,450 | 246,271 | 228,821
windows-1256 | UTF-8 | 52,185 | 249,105 | 196,920
EUC-KR | UTF-8 | 1,081,986 | 1,274,249 | 192,263
UTF-8 | Shift_JIS | 2,040 | 191,757 | 189,717
windows-1252 | UTF-8 | 427,997 | 554,311 | 126,314
See initial charset study draft:
https://github.com/tballison/share/blob/master/slides/Tika_charset_detector_study_201909.docx
Next Steps
Streaming expressions: histograms and ?
Visualizations…please help!
–Zeppelin?
“Compare” mode at scale?
Community feedback – please help!
To conclude
 Text extraction is critical
 A small problem for us could be a big problem for you…please evaluate!
 Seriously, please evaluate – you don’t know what you can’t find!
 Join the Apache Tika community and its evaluation efforts!
 Email: tallison@apache.org
 Twitter: @_tallison
OOV% by Language, Mean and 1 StdDev on 1.5 million text-based files
Some Resources
 Nick Burch’s talk on Tika:
http://events.linuxfoundation.org/sites/events/files/slides/WhatsNewWithApacheTika_2.pdf
 tika-eval wiki: https://cwiki.apache.org/confluence/display/tika/TikaEval
 Fellow traveler Ryan Baumann’s “Automatic evaluation of OCR”:
https://ryanfb.github.io/etc/2015/03/16/automatic_evaluation_of_ocr_quality.html
 Ted Underwood’s earlier post: https://tedunderwood.com/2012/04/26/the-obvious-thing-were-lacking/
| 95 |
Extras
| 96 |
Apache Tika
File Type Identification
PDF MSOffice HTML JPEG MP3 WAV
Metadata and Text Content
…
Uniform Metadata and Text Content for text
processing and/or ingestion into search engine
| 97 |
(R)evolving Relevance Tuning with Genetic Algorithms

  • 1. © 2019 The MITRE Corporation. ALL RIGHTS RESERVED. September 12, 2019 (R)Evolving Relevance Tuning with Genetic Algorithms Tim Allison Activate Conference 2019 Washington, DC Approved for Public Release; Distribution Unlimited. Case Number 18-3138-12 © 2019 The MITRE Corporation. ALL RIGHTS RESERVED.
  • 2. About me
      • Chair/V.P. Apache Tika
      • Committer/PMC Apache PDFBox
      • Committer/PMC Apache POI
      • Committer Apache Lucene/Solr
      • Member ASF
      • Ph.D. Classical Studies
  • 3. Hat Tip – Simon Hughes
      Simon Hughes, “Evolving The Optimal Relevancy Scoring Model at Dice.com”
      https://www.youtube.com/watch?v=z4c1xU7arhc
      https://github.com/DiceTechJobs/RelevancyTuning
  • 4. Some Open Source Relevance Tools
      • Quaerite (focus of this talk): https://github.com/mitre/quaerite
      • Quepid (Open Source Connections): https://github.com/o19s/quepid
      • Rated Ranking Evaluator RRE (Sease Ltd): https://github.com/SeaseLtd/rated-ranking-evaluator
      • Others?!
  • 5. Outline
      • Introduction/Motivation
      • Evolution of methods
          – Generation 0
          – Generation 1
          – Generation 2
      • Findings
      • Next Steps
  • 6. Search is easy!
  • 7. Search Engines – A Quick Overview
      Martin White, “The Technology of Search”, Search Insights 2018, The Search Network, p. 9. http://www.flax.co.uk/blog/2018/03/26/search-insights-2018-free-independent-report-search
      Figure originally published in “Searching the Enterprise”, Foundations and Trends® in Information Retrieval
  • 8. Available Parameters
      • 14 tokenizers https://lucene.apache.org/solr/guide/7_1/tokenizers.html
      • ~45 token filters (not including language-specific token filters – see next slide) https://lucene.apache.org/solr/guide/7_1/filter-descriptions.html
      • Query parsers
      • Boosting: fields, queries, functions
      • Phrasal boosting/shingling
      • Query operators, minimum should match, should, must, not
      • Token/field based scoring – best_fields, most_fields, cross_fields
      • Synonym lists, taxonomies
      • Similarity scoring parameters (with BM25)
      • Elevate
      • External signal enrichment – manual or automatic (NLP – entity extraction, categorization, etc.)
      • Reranking via machine learning (Learning to Rank)
  • 9. Each Token Filter Can Have Many Parameters
      <filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
  • 10. What to do, what to do…
      “Relevant Search: With applications for Solr and Elasticsearch”, Doug Turnbull and John Berryman, https://www.manning.com/books/relevant-search
      Thank you, Doug Turnbull and John Berryman, for permission to use the search engineer in this talk!
  • 11. Ground-truth based relevance tuning
      • Requires ground truth
      • Good ground truth
      • Overfitting…be careful!
          – Please use responsible train/test splits
          – LOL: https://www.gwern.net/Tanks
  • 12. Example Ground Truth
      Thank you, Doug Turnbull, John Berryman and “Open Source Connections” for the inspiration for using tmdb and for generating and sharing a ground truth set!
  • 13. Generation 0: Run Some Experiments
      • Assuming a static corpus, results should be reproducible
      • Keep track of previous experiments
      • Allow standard output and flexibility of scoring metrics
      [Graphic: Risk of overfitting]
  • 14. Key components for running experiments
      { "scorers": [ ], "experiments": { } }
  • 15. Basic Experiment Configuration: Scorers and Experiments
      {
        "scorers": [
          { "class": "NDCG", "atN": 10,
            "params": { "useForTrain": true, "useForTest": true, "exportPMatrix": true } },
          …
        ],
        "experiments": {
          "title": {
            "searchServerUrl": "http://.../solr/tmdb",
            "query": { "edismax": { "qf": [ "title" ] } }
          },
          …
        }
      }
  • 16. Scorers – More Scorers
      "scorers": [
        { "class": "AtLeastOneAtN", "atN": 1 },
        { "class": "AtLeastOneAtN", "atN": 5 },
        { "class": "AtLeastOneAtN", "atN": 10 },
        { "class": "NDCG", "atN": 10,
          "params": { "useForTrain": true, "useForTest": true, "exportPMatrix": true } },
        { "class": "TotalDocsReturned" },
        { "class": "ZeroResults" }
      ]
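The NDCG scorer configured above can be read as the calculation below. This is a minimal sketch, assuming graded relevance judgments, linear gain, and a log2 rank discount; the function names are illustrative, and Quaerite's own implementation may differ in its gain formula or tie handling.

```python
import math


def dcg_at_n(relevances, n):
    """Discounted cumulative gain over the first n results (log2 rank discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:n]))


def ndcg_at_n(ranked_relevances, all_judged_relevances, n=10):
    """NDCG@n: DCG of the returned ranking divided by the DCG of the ideal ranking.

    ranked_relevances: judgment for each returned document, in result order
                       (0 for unjudged or irrelevant documents).
    all_judged_relevances: every judgment in the ground truth for this query.
    """
    ideal = sorted(all_judged_relevances, reverse=True)
    ideal_dcg = dcg_at_n(ideal, n)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents are known for this query
    return dcg_at_n(ranked_relevances, n) / ideal_dcg
```

A perfect ranking scores 1.0, and a ranking that returns none of the judged documents scores 0.0; per-query values are then averaged per experiment.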
  • 17. Experiments – A Slightly More Interesting Experiment
      "title_cast_pf_tie_0_8_mm2": {
        "searchServerUrl": "http://localhost:8983/solr/tmdb",
        "query": {
          "edismax": {
            "qf": [ "title^10", "cast^2" ],
            "tie" : 0.8,
            "pf": [ "title^10", "cast^2" ],
            "q.op": { "mm": "2" }
          }
        }
      }
  • 18. Output: Per Query/Per Experiment Scores
  • 19. Output: Per Experiment Summary Analytics
  • 20. Output: Pairwise P-Value for Diffs in NDCG@10
  • 21. Generation 1: Automatically Generate Experiments
      • If I know the parameters I want to experiment with, why should I have to specify the combinations?!
      • Different analyzer chains
      [Graphic: Risk of overfitting]
  • 22. Generation 1: Automatically Generate Experiments
      • If I know the parameters I want to experiment with, why should I have to specify the combinations?!
      • Different analyzer chains
      • field boosts and ranges
      [Graphic: Risk of overfitting]
  • 23. Generation 1a: Automatically Generate All Experiments (Brute Force/Grid Search)
      • If I know the parameters I want to experiment with, why should I have to specify the combinations?!
      • Different analyzer chains
      • field boosts and ranges
      • boolean/min should match
      • tie
      • pf, pf2, pf3, ps, ps2, ps3
      • bq, boost
      [Graphic: Risk of overfitting]
  • 24. Experiment Features – Scorers and FeatureFactories
      {
        "scorers": [ { "class": "NDCG", "atN": 10 } ],
        "featureFactories": {
          "urls": [ "http://localhost:8983/solr/tmdb" ],
          "query": { "edismax": { … } }
        }
      }
  • 25. Query Features – A Bit More Interesting
      "query": {
        "edismax": {
          "qf": {
            "fields": [ "title", "overview", "cast" ],
            "defaultWeights": [ 0.0, 2.0, 10.0 ],
            "minSetSize": 1,
            "maxSetSize": 3
          },
          "tie": [0.0, 0.2, 0.8],
          "q.op": { "operators" : ["or", "and"], "mmInts" : [1,2,3] }
        }
      }
      Generates 390 experiments!
  • 26. Parameterizable Strings
      "boost": [ "max(recip(ms(NOW/DAY, ds_field_last_modified), 3.16e-11,[1,2,3],[$1]), [0.1, 0.9])" ],
      expands to:
      "max(recip(ms(NOW/DAY, ds_field_last_modified), 3.16e-11,1,1), 0.1)"
      "max(recip(ms(NOW/DAY, ds_field_last_modified), 3.16e-11,2,2), 0.1)"
      …
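The expansion of a parameterizable string can be sketched as follows. This is an illustration only, not Quaerite's parser: it assumes that each bracketed list is a set of alternatives, that `[$1]` repeats whatever was chosen for the first bracketed set (1-based, inferred from the two expansions shown on the slide), and that every combination is generated. The names `expand`, `SET_PATTERN` and `REF_PATTERN` are hypothetical.

```python
import itertools
import re

SET_PATTERN = re.compile(r"\[([^\[\]$]+)\]")  # a value set, e.g. [1,2,3]
REF_PATTERN = re.compile(r"\[\$(\d+)\]")      # a back-reference, e.g. [$1]


def expand(template):
    """Return every concrete string obtained by filling in the bracketed sets."""
    sets = [[v.strip() for v in m.group(1).split(",")]
            for m in SET_PATTERN.finditer(template)]
    results = []
    for combo in itertools.product(*sets):
        values = iter(combo)
        # Fill each value set, left to right, with this combination's choice.
        filled = SET_PATTERN.sub(lambda m: next(values), template)
        # Resolve back-references; [$1] is assumed to mean the first value set.
        filled = REF_PATTERN.sub(lambda m: combo[int(m.group(1)) - 1], filled)
        results.append(filled)
    return results


template = ("max(recip(ms(NOW/DAY, ds_field_last_modified), "
            "3.16e-11,[1,2,3],[$1]), [0.1, 0.9])")
for concrete in expand(template):
    print(concrete)
```

Under these assumptions the slide's template yields six strings (three values for the first set times two for the last), including both expansions listed on the slide.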
  • 27. Permutation explosion – Beware!
      Number of Fields | Number of Experiments, No Weights* | Number of Experiments, Two Weights*
      2                | 3                                  | 8
      3                | 7                                  | 26
      4                | 15                                 | 80
      5                | 31                                 | 242
      6                | 63                                 | 728
      7                | 127                                | 2,186
      *No Weights: a given field may or may not exist
      *Two Weights: a given field may not be used or have one of two weights, e.g. text^2, text^10
      And, that’s just field weights!!! The following are held constant: tie, operator, pf, pf2, ps, ps2, bq, boost
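The counts in the table follow a simple closed form, sketched below with a hypothetical helper name. Each field is either unused or takes one of w weights, giving (w + 1)^n combinations for n fields, minus the one combination that uses no field at all.

```python
def count_field_combinations(num_fields, weights_per_field):
    """Number of non-empty field configurations.

    Each field is either absent or takes one of `weights_per_field` weights;
    the all-absent combination is excluded because it searches no fields.
    """
    return (weights_per_field + 1) ** num_fields - 1


for n in range(2, 8):
    no_weights = count_field_combinations(n, 1)   # present or absent
    two_weights = count_field_combinations(n, 2)  # absent, or one of two weights
    print(n, no_weights, two_weights)
```

The "No Weights" column is therefore 2^n - 1 and the "Two Weights" column is 3^n - 1, which is why adding a single field roughly triples the grid in the second case.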
  • 28. Generation 1b: Automatically Generate Random Experiments (Random Search)
      • If I know the parameters I want to experiment with, how about running only SOME combinations?!
      • Different analyzer chains
      • field boosts and ranges
      • boolean/min should match
      • tie
      • pf, pf2, pf3, ps, ps2, ps3
      • bq, boost
      [Graphic: Risk of overfitting]
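Random search over the same parameter space can be sketched in a few lines. The grid below reuses values from the earlier query-features slide; the dictionary layout and the function name `sample_experiments` are assumptions made for illustration and do not reflect Quaerite's configuration format.

```python
import random


def sample_experiments(param_grid, num_experiments, seed=0):
    """Draw experiments by picking one value per parameter, uniformly at random."""
    rng = random.Random(seed)  # fixed seed so a run can be reproduced
    return [
        {name: rng.choice(values) for name, values in param_grid.items()}
        for _ in range(num_experiments)
    ]


param_grid = {
    "tie": [0.0, 0.2, 0.8],
    "q.op": ["or", "and"],
    "mm": [1, 2, 3],
}
for experiment in sample_experiments(param_grid, 5):
    print(experiment)
```

The cost is fixed by `num_experiments` rather than by the size of the grid, which is the point of running only some combinations.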
  • 29. Generation 2: Genetic Algorithm
      • Perhaps I could improve on random search? At each generation, only let the top experiments into the next generation – mutate, crossover, random…repeat…
      • Different analyzer chains
      • field boosts and ranges
      • boolean/min should match
      • tie
      • pf, pf2, pf3, ps, ps2, ps3
      • bq, boost
      [Graphic: Risk of overfitting]
  • 30. Genetic Algorithm Terms
      • Population
      • Generations
      • Operations
          – Random
          – Crossover
          – Mutate
  • 31. Genetic Algorithm Basics
      [Figure: experiments in Generation 0 and Generation 1, each labeled with its NDCG@10 score (values between 0.1 and 0.5); the next generation is built from the top experiments via Crossover, Mutate and Random, continuing through Generation X.]
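The loop described on the preceding slides can be sketched as below. This is a generic genetic algorithm under stated assumptions, not Quaerite's implementation: an experiment is a dictionary of parameter values, `score` stands in for running the experiment and computing NDCG@10, and the population size, number of survivors and equal odds for the three operations are arbitrary illustrative choices.

```python
import random


def random_experiment(grid, rng):
    """Random operation: draw every parameter independently."""
    return {name: rng.choice(values) for name, values in grid.items()}


def crossover(a, b, rng):
    """Crossover operation: the child takes each parameter from one parent or the other."""
    return {name: rng.choice((a[name], b[name])) for name in a}


def mutate(parent, grid, rng):
    """Mutate operation: copy the parent and re-draw one randomly chosen parameter."""
    child = dict(parent)
    name = rng.choice(list(grid))
    child[name] = rng.choice(grid[name])
    return child


def evolve(grid, score, generations=5, population_size=12, keep=4, seed=0):
    """Return the best experiment found after the given number of generations."""
    rng = random.Random(seed)
    population = [random_experiment(grid, rng) for _ in range(population_size)]
    for _ in range(generations):
        # Only the top experiments survive into the next generation.
        survivors = sorted(population, key=score, reverse=True)[:keep]
        children = []
        while len(survivors) + len(children) < population_size:
            op = rng.choice(("crossover", "mutate", "random"))
            if op == "crossover":
                children.append(crossover(*rng.sample(survivors, 2), rng))
            elif op == "mutate":
                children.append(mutate(rng.choice(survivors), grid, rng))
            else:
                children.append(random_experiment(grid, rng))
        population = survivors + children
    return max(population, key=score)
```

Because survivors are carried over unchanged, the best score in the population never decreases from one generation to the next; whether it improves on random search depends on the parameter space and the number of generations.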
  • 32. Interlude: How Does this Differ from Learning to Rank (LTR)?
      • Still need:
          – All the sound search engineering decisions (sane analysis chain, etc.)
          – Ground truth; good ground truth
      • Difference:
          – Learns settings for overall initial search, not a reranking function on (typically) a subset
  • 33. Generation 2: Genetic Algorithm – Cross-fold Validation Built-in
      • Perhaps I could improve on random search? At each generation, only let the top experiments into the next generation – mutate, crossover, random…repeat…
      • Different analyzer chains
      • field boosts and ranges
      • boolean/min should match
      • tie
      • pf, pf2, pf3, ps, ps2, ps3
      • bq, boost
      [Graphic: Risk of overfitting, with Cross-fold Validation]
  • 34. N-Fold Cross-Validation
      [Figure: each fold splits the ground truth into a Training portion and a Testing portion]
      Fold 0: Testing NDCG@10 0.45
      Fold 1: Testing NDCG@10 0.47
      Fold 2: Testing NDCG@10 0.50
      Fold 3: Testing NDCG@10 0.42
      Average Testing NDCG@10: 0.46
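A minimal sketch of the fold mechanics, assuming the ground-truth queries are identified by IDs and assigned to folds round-robin (the helper names and the assignment scheme are illustrative; the slide does not say how folds are drawn). The last lines average the four testing scores shown on the slide.

```python
from statistics import mean


def make_folds(query_ids, num_folds):
    """Assign queries round-robin to folds; each fold serves as the test set once."""
    return [query_ids[i::num_folds] for i in range(num_folds)]


def train_test_splits(query_ids, num_folds):
    """Yield (train, test) query lists, one pair per fold."""
    folds = make_folds(query_ids, num_folds)
    for i, test in enumerate(folds):
        train = [q for j, fold in enumerate(folds) if j != i for q in fold]
        yield train, test


for train, test in train_test_splits(list(range(8)), 4):
    print("train:", train, "test:", test)

fold_scores = [0.45, 0.47, 0.50, 0.42]  # testing NDCG@10 per fold, from the slide
print(round(mean(fold_scores), 2))
```

Tuning happens only on the training queries of each fold; the held-out testing queries give the score that is averaged.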
  • 35. Results Per Fold
      FOLD 0 TRAINING
      experiment 'train_fold_0_gen_4_exp_2': .678
      experiment 'train_fold_0_gen_4_exp_7': .678
      experiment 'train_fold_0_gen_0_exp_13': .662
      experiment 'train_fold_0_gen_2_exp_13': .640
      experiment 'train_fold_0_gen_2_exp_12': .640
      experiment 'train_fold_0_gen_1_exp_3': .640
      experiment 'train_fold_0_gen_3_exp_4': .640
      experiment 'train_fold_0_gen_3_exp_7': .640
      experiment 'train_fold_0_gen_3_exp_3': .640
      experiment 'train_fold_0_gen_3_exp_2': .640
      FOLD 0 TESTING
      experiment 'test_fold_0_gen_4_exp_2': .552
  • 36. Results: Overall, across folds
      FINAL RESULTS ON TESTING:
      experiment 'test_fold_2_gen_4_exp_1': .790
      experiment 'test_fold_0_gen_4_exp_2': .552
      experiment 'test_fold_1_gen_3_exp_6': .540
      mean: .627
      median: .552
      stdev: .141
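The summary line can be reproduced from the three per-fold testing scores with Python's standard library. The only assumption is that the reported stdev is the sample standard deviation (n - 1 in the denominator), which is what matches the slide's figure.

```python
from statistics import mean, median, stdev

test_scores = [0.790, 0.552, 0.540]  # best testing score per fold, from the slide

print(f"mean: {mean(test_scores):.3f}")
print(f"median: {median(test_scores):.3f}")
print(f"stdev: {stdev(test_scores):.3f}")  # sample standard deviation
```

With only three folds, one strong fold (.790) accounts for most of the spread, so the median is arguably the more cautious summary here.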
  • 37. Initial Findings
      • The Good: Boosted NDCG@10 from 0.25->0.3
      • The Bad: Worse than baseline on huge parameter set with insufficient(?) generations
      • The Great: I can spend more time on feature engineering/signal enrichment.
  • 38. Initial Findings – L-Value
  • 39. Next Steps
      • Documentation
      • Documentation
      • Documentation
      • Documentation
      • Documentation
      • Documentation
      • Documentation
      • Documentation
      • Documentation
      • Documentation
      • Documentation
      • Documentation
  • 40. Next Steps
      • Documentation
      • Finalize (ish) API for 1.0.0 release
      • Add ground truth-free measures (overlap, rank correlation)
      • Add descriptors for features so that the results are somewhat interpretable
      • Bayesian optimization?!
  • 41. Questions?
      Quaerite
          – https://github.com/mitre/quaerite
          – https://github.com/mitre/quaerite/blob/master/quaerite-examples/README.md
      Contact:
          – tallison@apache.org
          – @_tallison
  • 42. [Image: Stubb’s]
  • 43. On the one hand…on the other
      • On the one hand
          – This amount of precise control is great
      • On the other
          – Permutations are mind-boggling
          – Defaults used to be abysmal, but they are much better now…generally
      • On the third hand
          – The lists and commercial support are amazingly responsive and helpful
  • 44. Debts of Gratitude
      • David Smiley
      • Nick Burch
      • Chris Mattmann
      • Tilman Hausherr
      • Dominik Stadler
      • Fellow devs:
          – Apache Lucene/Solr, Apache Commons, Apache POI, Apache PDFBox, Apache Tika
      • ASF Community and users!
      • Common Crawl and govdocs1
      • Rackspace
  • 45. Overview
      • Intro to Tika: content and metadata extraction in the ETL stack
      • Motivation for tika-eval: what can go wrong?
      • tika-eval overview and workflow – single vm
      • tika-eval at scale
      ApacheCon 2019 ApacheCon 2017
  • 46. © 2019 The MITRE Corporation. ALL RIGHTS RESERVED. Content Extraction and HLT | 46 | 1001010010010010001001 0101001010011010111111 0101010101101101110110 1110110101110110110111 0110111101101101101101 1111100000011010100000 0110010000011010010010 ‫شوارزنيجر‬ ‫ألويس‬ ‫أرنولد‬(‫في‬ ‫ولد‬8‫أغسطس‬1947‫في‬ ،‫ستيريا‬،‫النمسا‬) Bytes Text Machine Translation: Arnold Alois Schwarzenegger (born August 8, 1947, in Styria, Austria) Entity Extraction: ‫شوارزنيجر‬ ‫ألويس‬ ‫أرنولد‬ (‫في‬ ‫ولد‬8‫أغسطس‬1947، ‫في‬،‫ستيريا‬‫النمسا‬) = Search: 2~“‫شوارزنيجر‬ ‫”أرنولد‬ Search: “Arnold Schwarzenegger”~2 Content Extraction Traditional Human Language Technologies
  • 47. High Level Components of a Media Processing Stack
      [Diagram components: Files; Forensics: Carving and Advanced Methods; Text Extraction and Metadata Extraction; Structured Data-store; Search/Entity Extraction/MT, etc.; User Interface]
  • 48. Let’s not forget Metadata!
      • Various formats store useful information
      • Who: author (first, last, commenters, editors), digital signature, company, from/to/cc/bcc (emails)
      • What: hardware version/name, software version/name, globally unique file/heritage id (XMP), title, keywords, description
      • Where: geo (latitude, longitude), file location (file paths embedded inside documents)
      • When: created, last modified, last printed
      • Beyond the standard types…custom metadata
  • 49. Example Application: Search When Things Go Wrong with Text Extraction
  • 50. What the User Sees in a Search System
      [Diagram components: Content/Metadata Extraction; Indexer/Search System; Structured Data-store; User Interface]
  • 51. When Things Go Wrong with a Foundation
      Image: W. Lloyd MacKenzie, via Flickr @http://www.flickr.com/photos/saffron_blaze/
  • 52. What can go wrong? Basic problems: thrown exceptions
      • Parser has a problem with non-corrupt file (and admits it…thank you!!!)
      • Password/access protected files
      • Format version not handled (add new parser?)
      • Corrupt files – can’t be opened by primary application or parsed by other parsers
      • Corrupt files – slight variant from spec/other parsers can handle it
      • Truncated files
      Note: some text/metadata may or may not be extracted before the exception is thrown
  • 53. What can go wrong? Catastrophic problems
      • OutOfMemoryError – potentially corrupting the JVM
          – Inefficient parsers: DOM vs SAX on rare docx (TIKA-2170) and pptx (TIKA-2201). NOTE: with multithreaded garbage collection, a single thread running Tika can cause a quad-core system to grind to a snail’s pace before hitting OOM.
          – Four bytes of a compressed file (TIKA-2330)
      • Slowly building memory leak
          – See above on quad-core, gc and snails (TIKA-2180?)
      • Permanent Hang
          – TIKA-1132
      • Security Vulnerabilities
          – XXE (CVE-2016-4334), arbitrary code execution (CVE-2016-6809)
      These are extremely rare, and we try to fix them when we’re aware of them!
  • 54. What can go wrong? Hidden problems (no exceptions!)…
      • Garbled text
          – From slightly to…fully
      • Missing text/metadata
          – From missing some text to … no text at all
      • Missing attachments
      • Silently swallowed exceptions of embedded documents
          – Classic Tika xhtml silently swallows embedded exceptions!!!
  • 55. Corrupt Text (Upgrade from PDFBox 1.8.6->1.8.7)
  • 56. Missing Text (TIKA-1130)
      Document available: https://issues.apache.org/jira/browse/TIKA-1130
  • 57. When Things Go Not as Well as They Might with Content Extraction – OCR
      Image: [not reproduced]
      Text Extracted: I9 There was documcntation of calibration but not ofobscrvation of tlic actual iiionitoring of tlic critical limits during production.
      Search Results: [not reproduced]
  • 58. Take-away #1
      • If you don’t evaluate content extraction…
      You don’t know what you can’t find
  • 59. Take-away #2
      A small problem for me can be a big problem for you!
  • 60. TIKA-1302: The Dream
      • Motivation
          – All of the above
          – We have only roughly 1,000 test files in unit tests in Apache POI, Apache PDFBox and Apache Tika
          – Apache POI/PDFBox/Tika mistakenly made me a committer
      • Run Tika on much larger corpus nightly/weekly
      • Automatically recognize regressions
  • 61. tika-eval
      Available since Apache Tika 1.15
  • 62. High-level overview
      • tika-eval’s scope
          – Single vm, file share to file share (with embedded H2 db), ~few million files is a reasonable size
          – Not currently cloud-scale
      • Random sampling – should be good enough
      • Our Jira is open and committers are standing by!
      • tika-eval’s two modes
          – Profile single extraction run
          – Compare two extraction runs
              • Ground truth vs. particular tool
              • Tool A vs. tool B
              • Tool A with settings X vs. Tool A with settings Y
  • 63. Definitions
      • “original documents” or “container documents” – the original binary documents from which you’d like to extract text, whether or not they actually have attachments.
      • “embedded documents” – any document contained within another document, including those that only ever exist as embedded docs: emf/wmf/xmp/xfa.
      • “extract” – .txt or .json representation of the extracted text/metadata.
          – tika-eval was designed for .json
              • RecursiveParserWrapper via API
              • (-J) for tika-app
              • /rmeta for tika-server
          – tika-eval can handle .txt files – details on our wiki
  • 64. Why the RecursiveParserWrapper?
  • 65. Classic XHTML
      <?xml version="1.0" encoding="UTF-8"?>
      <meta name="Content-Type" .../>
      …
      <p>embed_0 </p>
      <p><div class="embedded" id="rId7"/>
      <p>embed1.zip</p>
      <div class="embedded" id="embed1/embed1a.txt"/>
      <div class="package-entry">
      <p>embed_1a</p>
      </div>
      ...
      • Metadata from embedded docs is lost
      • Exceptions from embedded docs are swallowed
      • Metadata from the container document may be incomplete
  • 66. RecursiveParserWrapper
      [
        { "Application-Name": "Microsoft Office Word",
          "Content-Length": "27082",
          "Content-Type": "application/....wordprocessingml.document",
          "X-TIKA:content": "embed_0 ",
          ... },
        { "Content-Type": "text/plain; charset=ISO-8859-1",
          "Last-Modified": "2014-06-04T04:08:28Z",
          "X-TIKA:content": "embed_1a",
          "X-TIKA:embedded_resource_path": "/embed1.zip/embed1a.txt",
          ... },
        { "Content-Type": "application/zip",
          "Last-Modified": "2014-06-04T04:09:40Z",
          "X-TIKA:content": "embed4.txt",
          "X-TIKA:embedded_resource_path": "/embed1.zip/embed2.zip/embed3.zip/embed4.zip"
          ... },
        ...
      ]
      • Embedded metadata (e.g. mime/author/lat-long, etc.) are retained
      • Embedded exceptions are stored in a metadata key
      • All metadata is extracted and stored
  • 67. Workflow – Profile
      1. Generate extracts in a directory structure parallel to the original documents, appending “.txt” or “.json” to each file name, in, say, a my_extracts directory
      2. Run profiler to populate in-process H2 DB
         java -jar tika-eval.jar Profile -extracts my_extracts -db my_db
      3. Dump reports
         java -jar tika-eval.jar Report -db my_db
         Excel reports will be dumped to the reports directory
  • 68. Workflow – Compare
      1. Generate extracts in a directory structure parallel to the original documents, appending “.txt” or “.json” to each file name, in, say, my_extractsA and my_extractsB directories
      2. Run profiler to populate in-process H2 DB
         java -jar tika-eval.jar Compare -extractsA my_extractsA -extractsB my_extractsB -db my_db
      3. Dump reports
         java -jar tika-eval.jar Report -db my_db
         Excel reports will be dumped to the reports directory
  • 69. Workflow – StartDB
      • Start db: java -jar tika-eval.jar StartDB
      • Open browser to localhost:8082
      • Select db (full path!):
          – jdbc:h2:/C:/data/my_db
      • Notes on db structure: https://wiki.apache.org/tika/TikaEvalDbDesign
  • 70. Reports (Profile)
      • Metadata – count of metadata values
      • Attachments – counts
      • Mimes – mime counts for containers and embedded docs
      • Exceptions
          – Counts by type (e.g. password vs. actual exception)
          – Counts by mime
          – Counts by normalized stacktrace
          – All stack traces
      • Content
          – Language id
          – Word count
          – Common words count
          – Word length stats
          – Page count
  • 71. Reports (Compare)*
      • Metadata – comparison counts A to B
      • Attachments – comparison counts A to B
      • Mimes
          – Comparison mime counts for containers and embedded docs
          – Counts of mime changes mimeA->mimeB
      • Exceptions
          – Comparisons of counts by mime
          – Counts by mime
          – Counts by normalized stacktrace
          – All stack traces
      • Content
          – Language id
          – Word count
          – Word length stats
          – Page count
      * Includes Profile data for both A and B and then also some comparison reports
  • 72. Content – “Common words” and their Utility in Profile
      • Top 30k most common words per language* in Leipzig Corpus**
      • To find PDFs that are mostly image only:
          – number of words/number of pages
      • To find very corrupt text***:
          – “In vocabulary %”: (number of common words/number of alphabetic words)
          – “Out of vocabulary (OOV)%”: 1-“in vocabulary %”
      * Many thanks, Apache Lucene!
      ** http://wortschatz.uni-leipzig.de/en/download/ and Apache OpenNLP’s: https://svn.apache.org/repos/bigdata/opennlp/
      *** Metric was recommended by Tilman Hausherr
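The OOV% definition above amounts to a one-line calculation. The sketch below uses a hypothetical helper name and takes the two counts as inputs; how tika-eval tokenizes text and decides which tokens count as alphabetic is not shown here. The example values are the Tika 1.14 and 1.15-SNAPSHOT figures from the "Junk -> Better" comparison slide later in the deck.

```python
def oov_percent(common_words, alphabetic_tokens):
    """Out-of-vocabulary percentage: 1 - (common words / alphabetic tokens)."""
    if alphabetic_tokens == 0:
        return None  # undefined when no alphabetic tokens were extracted
    return 100.0 * (1 - common_words / alphabetic_tokens)


print(oov_percent(0, 1603))   # garbled extract: no common words at all
print(oov_percent(116, 250))  # improved extract
```

A high OOV% is a signal to inspect a file rather than proof of corruption: short documents, code listings and rare vocabulary also push it up.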
  • 73. Content comparisons
      • Similarity metrics between A and B
          – how many words in common/total number of words (with counts normalized to 0/1 per doc)
          – how many words in common/total number of words (with actual counts)
      • Improvement in “common words”
          – number of Common Words in B – number of Common Words in A
          – Per mime
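The two similarity variants can be sketched as below. The slide gives only "words in common/total number of words", so the exact normalization is an assumption: this sketch counts the shared words once for each side (a Dice-style ratio), so identical extracts score 1.0. The function name and `use_counts` flag are illustrative.

```python
from collections import Counter


def overlap(tokens_a, tokens_b, use_counts=True):
    """Share of words that extracts A and B have in common.

    use_counts=False reduces every word's count to 0/1 before comparing;
    use_counts=True compares the actual per-word counts.
    """
    a, b = Counter(tokens_a), Counter(tokens_b)
    if not use_counts:
        a, b = Counter(set(a)), Counter(set(b))
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum((a & b).values())  # per-word minimum of the two counts
    return 2 * shared / total


a = ["die", "die", "und"]
b = ["die", "von"]
print(overlap(a, b, use_counts=False))
print(overlap(a, b, use_counts=True))
```

The improvement in common words is then a plain difference, for example 7409 - 7498 = -89 in the small-regression example on a later slide.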
  • 74. Content Comparison Example – Junk -> Better
                          | Tika 1.14          | Tika 1.15-SNAPSHOT
      Unique Tokens       | 786                | 156
      Total Tokens        | 1603               | 272
      LangId              | zh-ch              | de
      Common Words        | 0                  | 116
      Alphabetic Tokens   | 1603               | 250
      OOV%                | 1-(0/1603) = 100%  | 1-(116/250) = 54%
      Top N Tokens, Tika 1.14: 捳敨: 18 | 獴档: 14 | 略獴: 14 | m: 11 | 杮湥: 11 | 瑵捳: 11 | 畬杮: 11 | 档湥: 10 | 搠敩: 9 | 敮浨: 9
      Top N Tokens, Tika 1.15-SNAPSHOT: die: 11 | und: 8 | von: 8 | deutschen: 7 | deutsche: 6 | 1: 5 | das: 5 | der: 5 | finanzministerium: 5 | oder: 5
      Overlap: 0%
      Increase in Common Words: 116
  • 75. Content Comparison Example – Small Regression
                          | Tika 1.14                | Tika 1.15-SNAPSHOT
      Unique Tokens       | 1916                     | 1995
      Total Tokens        | 14187                    | 14302
      LangId              | en                       | en
      Common Words        | 7498                     | 7409
      Alphabetic Tokens   | 13472                    | 13587
      OOV%                | 1-(7498/13472) = 44%     | 1-(7409/13587) = 45%
      Top 10 Unique Tokens, Tika 1.14: applicant's: 8 | 1.69: 1 | arbitrary: 1 | collecting: 1 | constitution: 1 | e112: 1 | ei.b: 1 | equating: 1 | magnetically: 1 | o: 1
      Top 10 Unique Tokens, Tika 1.15-SNAPSHOT: ss: 106 | applicantis: 8 | ssss: 7 | iactsi: 4 | ithe: 4 | imeansi: 3 | iprocessi: 3 | calculations.i: 2 | iabstract: 2 | idata: 2
      Overlap: 95.5%
      Increase in Common Words: -89
  • 76. Taking tika-eval public
      • Rackspace kindly hosts a vm for ongoing evals (TIKA-1302)
      • 1 TB (~3 million files) from Common Crawl and govdocs1
      • Collaborating with Apache PDFBox and Apache POI to run evals as part of the release process
      • Critical to identifying regressions and building new parsers
      • Stacktraces created by public documents are critical for the hey-I’m-getting-this-parse-exception-but-can’t-share-the-document-with-you problem
      • See Dominik Stadler’s Common Crawl download tool: https://github.com/centic9/CommonCrawlDocumentDownload
  • 77. Community collaboration
      Thank you, Tilman Hausherr!
  • 78. Limits of Automated Metrics without Ground Truth
      • More exceptions – We have a problem! Wait…
          – New parser, we were entirely skipping those file types before
          – Parser was yielding junk before on this file, now it is letting us know there’s a problem
      • Fewer exceptions – Great! Wait…
          – Mime detection not working – skipping files that we used to parse (theoretical)
          – Now we’re getting junk
      • More Common Words – Great! Wait…
          – Serious bug that duplicates worksheets in some xlsx files (TIKA-2356…my fault…ugh!)
          – More non-html markup/xml tags incorrectly getting through
      • Fewer Common Words – Problem! Wait…
      • More attachments, fewer attachments (Your turn!)
  • 79. TIKA-1302: The Ticket is Grown; the Dream is Gone
      • Without ground truth, humans need to interpret differences
      • This only makes building a gui more important!!! (TIKA-1334)
      • Collaborative tagging? As a human reviews diffs, flag document as “hopeless” or a given extract as “great” or “awful” (Again, thanks to Tilman Hausherr)
      • Dream of TIKA-1302 ran into reality, but we’re far better than where we were
  • 80. Scaling tika-eval
  • 81. tika-eval at Scale (with tika-eval >= 1.22)
      • Motivation
          – Running into unacceptably long processing times for analysis on ~3 million documents in H2. Current workaround: Postgres…still not speedy (~40 minutes)
          – Big data frameworks are built for analysis and scale! Use them!
      • General Process – two steps
          1. Calculating content statistics (as of Tika 1.22: decoupled tika-eval text stats calculator)
          2. Rollups/aggregations
      • Limitations
          – Moving beyond the file share leads to an explosion of big data frameworks…no single solution
          – There is no one answer…must be customized per framework…
  • 82. Step 1: Calculating Content Statistics – An Example with SolrJ
      https://github.com/tballison/tika-addons/tree/master/tika-eval-solrj
  • 83. Average OOV% in English PDFs
  • 84. That document all the way on the left in the previous slide
  • 85. Which PDFs had little to no content?
  • 86. Most Common Stacktraces for Epubs?
      http://localhost:8983/solr/tika-eval/select?facet.pivot=mime_facet,stacktrace_facet&facet=on&fq=mime:epub&q=*:*
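The pivot-facet request on the slide can be assembled programmatically so that the parameters are escaped correctly. This sketch only builds the URL, using the host, core name (`tika-eval`) and field names exactly as they appear on the slide; it assumes a Solr index populated with those fields and does not send the request.

```python
from urllib.parse import urlencode

# Parameters copied from the slide's query string.
params = {
    "q": "*:*",
    "fq": "mime:epub",
    "facet": "on",
    "facet.pivot": "mime_facet,stacktrace_facet",
}
url = "http://localhost:8983/solr/tika-eval/select?" + urlencode(params)
print(url)
```

`facet.pivot` nests the stacktrace counts under each mime value, which is what turns a flat list of exceptions into "most common stacktraces per format".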
  • 87. From Analytics to Action
      • PDFs – when to run OCR
      • Charset detection – which detector to trust…building a better charset detector
  • 88. Prioritizing OCR via “sort oov desc”
  • 89. Stored text vs. OCR’d text
      Text as stored in PDF (OOV 84%): GO Obpermtmn of the Establishment I : S I . 11.Itmpacudo~+r~ A pn~ccssing1C:ontinent:tl. San l'edro S~tla.I-Iondurus; April 7, 2UO8 ( h ~ c l ' s l a t ~ g h ~ ~ r Ilurillg optraticlnol sanitation insprutiol~in tlre slaughter rrjom, carcuss i ; ~ 5icr.c c~l~scl-isl1~1t1k3 r'd contacting visccrri cart wllecls. 'I'his .as cc>rrcctcc[immcdiiiiuly by tllc csli~blishnicn~I T C . I S I ~ H I I ~ ~ . 1Kcgu1atc11.yrulirencc: 9CIf:li 41 (1.1 3(c)( 51 NAME OF AUDlTOR
      Text extracted from Tesseract (OOV 30%): 60 Observation of the Establishment Est. 12. Empacadora Continental. San Pedro Sula, Honduras; April 7, 2008 (beet aughter & processing) 10 During operational sanitation inspection in the slaughter room, carcass fore shanks were observed contacting viscera cart wheels. This was corrected immediately by the establishment personne! [Regulatory reference: 9CFR 416.13(¢)| AND DATE 51 NAME OF AUDITOR
  • 90. WWGD*?
      Text in Google’s Cache: 60 Observation of the Establishment Est. 12. Empacadora Continental, San Pedro Sula, Honduras; April 7, 2008 (heel slaughter & processing) 10 During operational sanitation inspection in the slaughter room, carcass fore shanks Tiere observedcontacung viscera cart wheels. This was corrected immediately by the establishment personnel.Regulatory reference: 9CFR 416.13(0
      See also: Popat, Ashok. (2009). A panlingual anomalous text detector. 201-204. 10.1145/1600193.1600237.
      *WWGD: “What would Google Do?” (h/t Grant Ingersoll)
  • 91. Charset detection – HTML meta charset (HTMLDefault) vs. Mozilla’s chardet (Universal)
      HTMLDefault  | Universal    | HTMLDefault Sum Common Tokens | Universal Sum Common Tokens | Difference in Sums
      UTF-8        | EUC-JP       | 4,437                         | 481,919                     | 477,482
      EUC-JP       | Shift_JIS    | 1,512                         | 391,126                     | 389,614
      UTF-16       | windows-1252 | 1,240                         | 368,496                     | 367,256
      UTF-16       | UTF-8        | 2,563                         | 321,717                     | 319,154
      EUC-JP       | UTF-8        | 764,957                       | 1,047,029                   | 282,072
      windows-1255 | UTF-8        | 17,450                        | 246,271                     | 228,821
      windows-1256 | UTF-8        | 52,185                        | 249,105                     | 196,920
      EUC-KR       | UTF-8        | 1,081,986                     | 1,274,249                   | 192,263
      UTF-8        | Shift_JIS    | 2,040                         | 191,757                     | 189,717
      windows-1252 | UTF-8        | 427,997                       | 554,311                     | 126,314
      See initial charset study draft: https://github.com/tballison/share/blob/master/slides/Tika_charset_detector_study_201909.docx
  • 92. Next Steps
      • Streaming expressions: histograms and ?
      • Visualizations…please help!
          – Zeppelin?
      • “Compare” mode at scale?
      • Community feedback – please help!
  • 93. To conclude
      • Text extraction is critical
      • A small problem for us could be a big problem for you…please evaluate!
      • Seriously, please evaluate – you don’t know what you can’t find!
      • Join the Apache Tika community and its evaluation efforts!
      • Email: tallison@apache.org
      • Twitter: @_tallison
  • 94. OOV% by Language, Mean and 1 StdDev on 1.5 million text-based files
  • 95. Some Resources
      • Nick Burch’s talk on Tika: http://events.linuxfoundation.org/sites/events/files/slides/WhatsNewWithApacheTika_2.pdf
      • tika-eval wiki: https://cwiki.apache.org/confluence/display/tika/TikaEval
      • Fellow traveler Ryan Bauman’s “Automatic evaluation of OCR”: https://ryanfb.github.io/etc/2015/03/16/automatic_evaluation_of_ocr_quality.html
      • Ted Underwood’s earlier post: https://tedunderwood.com/2012/04/26/the-obvious-thing-were-lacking/
  • 96. Extras
  • 97. Apache Tika
      [Diagram: File Type Identification, then per-format Metadata and Text Content (PDF, MSOffice, HTML, JPEG, MP3, WAV, …), then Uniform Metadata and Text Content for text processing and/or ingestion into search engine]

Editor's notes

  1. Machine translation via Google Translate. Text retrieved from https://ar.wikipedia.org/wiki/%D8%A3%D8%B1%D9%86%D9%88%D9%84%D8%AF_%D8%B4%D9%88%D8%A7%D8%B1%D8%B2%D9%86%D9%8A%D8%AC%D8%B1 on 2/1/2017.
  2. Image from: http://upload.wikimedia.org/wikipedia/commons/6/66/The_Leaning_Tower_of_Pisa_SB.jpeg. Permission details: Outside of Wikimedia Foundation projects, attribution is to be made to: W. Lloyd MacKenzie, via Flickr @http://www.flickr.com/photos/saffron_blaze/