SlideShare a Scribd company logo
1 of 36
Download to read offline
Evaluating Text
Extraction At Scale: A Case Study
from Apache Tika
Tim Allison, Ph.D.
Data Scientist/Relevance Engineer
Artificial Intelligence, Analytics and Innovative
Development Organization(1740)
ITSD
The research was carried out at the NASA (National Aeronautics
and Space Administration) Jet Propulsion Laboratory, California
Institute of Technology under a contract with the Defense
Advanced Research Projects Agency (DARPA) SafeDocs
program. © 2020 California Institute of Technology. Government
sponsorship acknowledged.
Reference herein to any specific commercial product, process,
or service by trade name, trademark, manufacturer, or
otherwise, does not constitute or imply its endorsement by the
United States Government or the Jet Propulsion Laboratory,
California Institute of Technology.
jpl.nasa.gov
About me
• Data scientist (files and search) Jet Propulsion
Laboratory, California Institute of Technology
• Chair/V.P. Apache Tika
• Committer Apache PDFBox, POI, Lucene/Solr,
OpenNLP
• Member Apache Software Foundation
2© 2020 California Institute of Technology. Government sponsorship acknowledged.10/20/20
The research was carried out at the NASA (National Aeronautics and Space Administration)
Jet Propulsion Laboratory, California Institute of Technology under a contract with the
Defense Advanced Research Projects Agency (DARPA) SafeDocs program.
© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Outline
• Apache Tika, an overview
• What can possibly go wrong?
• tika-eval module
• Evaluation at scale and public corpora
10/20/20 3© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Apache Tika, an overview
10/20/20 4
bytes
text and metadata
Framework for file
type detection,
parsing and uniform
output for ~75
parsers, ~100+
formats
https://tika.apache.org/
© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Apache Tika, features
• Easy to add new file types for detection
• Easy to add new parsers
• Works recursively with embedded files/attachments
• Integration with tesseract-ocr
10/20/20 5© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Embedded files/attachments
10/20/20 6© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Embedded files extracted
10/20/20 7© 2020 California Institute of Technology. Government sponsorship acknowledged.
[ {
"Application-Name": "Microsoft Office Word",
"Content-Length": "27082",
"Content-Type": "application/....wordprocessingml.document",
"X-TIKA:content": "embed_0 ",
... },
{ "Content-Type": "text/plain; charset=ISO-8859-1",
"Last-Modified": "2014-06-04T04:08:28Z",
"X-TIKA:content": "embed_1a",
"X-TIKA:embedded_resource_path": "/embed1.zip/embed1a.txt",
... },
{
"Content-Type": "application/zip",
"Last-Modified": "2014-06-04T04:09:40Z",
"X-TIKA:content": "embed4.txt",
"X-TIKA:embedded_resource_path":
"/embed1.zip/embed2.zip/embed3.zip/embed4.zip"
...}, ...]
jpl.nasa.gov
From PDF to …
10/20/20 8© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Metadata
10/20/20 9© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Text
10/20/20 10© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Metadata and Content in JSON
10/20/20 11© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
What could possibly go wrong?
• Basic
• Thrown Exceptions – corrupt files, bad parser?
• Image only – need to have mechanism to determine
when to run OCR
• Password protected files
• Catastrophic
• Infinite loops/out of memory/seg fault
• Security compromises
10/20/20 12© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
What could subtly go wrong?
• Missing text
• Missing attachments
• Garbled text
10/20/20 13© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Missing Text
10/20/20 14
File publicly available: https://issues.apache.org/jira/browse/TIKA-1130
© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Garbled Text
10/20/20 15
Upgrade from
Apache PDFBox
1.8.6 to 1.8.7
© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
tika-eval
10/20/20 16
• Library and commandline tool to calculate these
stats
• Profile a single batch of text extracts
• Compare two batches of extracts
• Different tools
• Different settings
• Coming soon – multi-compare
https://cwiki.apache.org/confluence/display/TIKA/TikaEval
© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Out of vocabulary (OOV) – Same file, different Tika
10/20/20 17
Tika 1.14 Tika 1.15-SNAPSHOT
Unique Tokens 786 156
Total Tokens 1603 272
LangId zh-ch de
Common Words 0 116
Alphabetic Tokens 1603 250
Top N Tokens 捳敨: 18 | 獴档: 14 | 略獴:
14 | m: 11 | 杮湥: 11 | 瑵
捳: 11 | 畬杮: 11 | 档湥:
10 | 搠敩: 9 | 敮浨: 9
die: 11 | und: 8 | von: 8 |
deutschen: 7 | deutsche: 6 |
1: 5 | das: 5 | der: 5 |
finanzministerium: 5 | oder: 5
OOV% 1-(0/1603) = 100% 1-(116/250) = 54%
© 2020 California Institute of Technology. Government sponsorship acknowledged.
Overlap: 0%
Increase in Common Words: 116
jpl.nasa.gov
Out of vocabulary (OOV) – Same file, different Tika
10/20/20 18
Tika 1.14 Tika 1.15-SNAPSHOT
Unique Tokens 1916 1995
Total Tokens 14187 14302
LangId en en
Common Words 7498 7409
Alphabetic Tokens 13472 13587
Top 10 Unique
Tokens
applicant's: 8 | 1.69: 1 | arbitrary:
1 | collecting: 1 | constitution: 1 |
e112: 1 | ei.b: 1 | equating: 1 |
magnetically: 1 | o: 1
ss: 106 | applicantis: 8 | ssss: 7 |
iactsi: 4 | ithe: 4 | imeansi: 3 |
iprocessi: 3 | calculations.i: 2 |
iabstract: 2 | idata: 2
OOV% 1-(7498/13472) = 44% 1-(7409/13587) = 45%
© 2020 California Institute of Technology. Government sponsorship acknowledged.
Overlap: 95.5%
Increase in Common Words: -89
jpl.nasa.gov
OOV/Common Tokens corpus wide
Comparing the HTMLDefault encoding detector with Mozilla’s Universal
Encoding Detector on HTML encoding detection
10/20/20 19
HTMLDefault Universal HTMLDefault Sum
Common Tokens
Universal Sum
Common Tokens
Difference
in Sums
UTF-8 EUC-JP 4,437 481,919 477,482
EUC-JP Shift_JIS 1,512 391,126 389,614
UTF-16 windows-1252 1,240 368,496 367,256
UTF-16 UTF-8 2,563 321,717 319,154
EUC-JP UTF-8 764,957 1,047,029 282,072
© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Scaling OOV in a set of English PDFs
10/20/20 20© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
From analytics to action
10/20/20 21
https://aviris.jpl.nasa.gov/proceedings/workshops/02_docs/2002_Ogura_1_web.pdf
© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Stored text vs. Optical Character Recognition
10/20/20 22
Text As Stored in File
!"#%$& (') *,+-).' / 0 1,23 *. 457698;:;<>=75?&@78;ACB
D(B7E;FHGJICBK5MLNBKOPBKF;B DJD Q R S.TVU9WNXMY[ZT]^W_S `badc
5KICedFgfh5 cji :;edF;A^5KEk<>Imln:;e[<>EnloedACICe a
lo<p57Eg5Kqsr;E;<jloe[E 8;O 6hedA5Kq adc 57ItedFk:;B c qsICf;B a
Text from Tesseract OCR
Constrained Least Squares Linear Spectral Unmixture by the Hybrid Steepest
Descent Method
Nobuhiko Ogura’ and Isao Yamada”
1 Introduction
A closed polyhedron is the intersection of finite number of closed half
spaces, i.e., the setof points satisfying finite number of lincar
incqualitics, and is widely used as a constraint in various application, for
example specifications or constraints in signal processing or estimation
problems, resource restrictions in financial applications and feasible sets of
© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
What would Google do?
10/20/20 23
Text in Google’s Cache
Popat, Ashok. (2009). A panlingual anomalous text detector. 201-204.
10.1145/1600193.1600237.
© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Uses
• Parser developers
• Production – search, Natural Language Processing,
etc…
• Which parsers/extract to use?
• Which encoding detector to trust?
• When to run OCR?
• Do not index/relevance weights
10/20/20 24© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Other indicators
• Low token/page ratio
• Language identification – Catalan in a largely
English corpus?!
• Median/mean word length and stdev
• Number of Unicode code blocks
• Alphabetic/non-alphabetic
10/20/20 25© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Limitations in absence of ground truth
• More exceptions – We have a problem! Wait…
• New parser, we were entirely skipping those file types
before
• Parser was yielding junk before on this file, now it is
letting us know there’s a problem
• Fewer exceptions – Great! Wait…
• Mime detection not working – skipping files that we
used to parse (theoretical)
• Now we’re getting junk
10/20/20 26© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Limitations in absence of ground truth
• More Common Words – Great! Wait…
• Serious bug that duplicates worksheets in some xlsx
files (TIKA-2356…my fault…ugh!)
• Fewer Common Words – Problem! Wait…
• More attachments, fewer attachments (Your turn!)
10/20/20 27© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Multi-compare example…in the works
10/20/20 28
• Set 0: 0 bytes
• Set 1: HÕll World!
• Set 2: H’ll World!
• Set 3: HÕllÿ World!
file1
© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Taking tika-eval public
• Maruan Sahyoun kindly hosts a vm for ongoing
evals (TIKA-1302)
• 1 TB (~3 million files) from govdocs1, Common
Crawl and bug trackers
• Collaborating with Apache PDFBox and Apache POI
to run evals as part of the release process
10/20/20 29© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Taking tika-eval public
• Critical to identifying regressions and building new
parsers
• Stacktraces created by public documents are critical
for the hey-I’m-getting-this parse-
exception-but-can’t-share-the-
document-with-you problem
10/20/20 30© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Corpora
• Govdocs1 (https://digitalcorpora.org/corpora/files)
• Common Crawl (https://commoncrawl.org/)
• Bug trackers (see Peter Wyatt’s article:
https://www.pdfa.org/a-new-stressful-pdf-corpus/)
• Coming soon(?): Unit-test files from github parser
projects
10/20/20 31© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Corpora
10/20/20 32
https://corpora.tika.apache.org/base/docs/
© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Bug-trackers
10/20/20 33
https://corpora.tika.apache.org/base/docs/bug_trackers/
© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Please join the fun!
• https://tika.apache.org/
• https://cwiki.apache.org/confluence/display/TIKA/Tik
aEval
• corpora-dev@tika.apache.org
• https://issues.apache.org/jira/projects/TIKA
• @ApacheTika
10/20/20 34© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov10/20/20 35
Questions?
timothy.b.allison@jpl.nasa.gov
@_tallison
© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
10/20/20 36© 2020 California Institute of Technology. Government sponsorship acknowledged.

More Related Content

Similar to Evaluating Text Extraction at Scale: A case study from Apache Tika

S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionS. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
Flink Forward
 

Similar to Evaluating Text Extraction at Scale: A case study from Apache Tika (20)

Haystack Live tallison_202010_v2
Haystack Live tallison_202010_v2Haystack Live tallison_202010_v2
Haystack Live tallison_202010_v2
 
Embedded Files: Risks, Challenges and Options
Embedded Files: Risks, Challenges and OptionsEmbedded Files: Risks, Challenges and Options
Embedded Files: Risks, Challenges and Options
 
"Building a File Observatory: Making Sense of PDFs in the Wild"
"Building a File Observatory: Making Sense of PDFs in the Wild""Building a File Observatory: Making Sense of PDFs in the Wild"
"Building a File Observatory: Making Sense of PDFs in the Wild"
 
Visualising the Australian open data and research data landscape
Visualising the Australian open data and research data landscapeVisualising the Australian open data and research data landscape
Visualising the Australian open data and research data landscape
 
Australian Open government and research data pilot survey 2017
Australian Open government and research data pilot survey 2017Australian Open government and research data pilot survey 2017
Australian Open government and research data pilot survey 2017
 
The State of Linked Government Data
The State of Linked Government DataThe State of Linked Government Data
The State of Linked Government Data
 
A novel programmable attenuator based low Gm-OTA for biomedical applications
A novel programmable attenuator based low Gm-OTA for biomedical applicationsA novel programmable attenuator based low Gm-OTA for biomedical applications
A novel programmable attenuator based low Gm-OTA for biomedical applications
 
Tds — big science dec 2021
Tds — big science dec 2021Tds — big science dec 2021
Tds — big science dec 2021
 
Oscon 2011 wylie
Oscon 2011 wylieOscon 2011 wylie
Oscon 2011 wylie
 
Conrad PDES Spring 2006
Conrad PDES Spring 2006Conrad PDES Spring 2006
Conrad PDES Spring 2006
 
OpenAIRE and the Case of Irish Repositories
OpenAIRE and the Case of Irish RepositoriesOpenAIRE and the Case of Irish Repositories
OpenAIRE and the Case of Irish Repositories
 
OpenAIRE and the case of Irish Repositories, by Jochen Schirrwagen (RIAN Work...
OpenAIRE and the case of Irish Repositories, by Jochen Schirrwagen (RIAN Work...OpenAIRE and the case of Irish Repositories, by Jochen Schirrwagen (RIAN Work...
OpenAIRE and the case of Irish Repositories, by Jochen Schirrwagen (RIAN Work...
 
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionS. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
 
RDAP13 Lorrie Johnson: Facilitating Access to Scientific Data
RDAP13 Lorrie Johnson: Facilitating Access to Scientific DataRDAP13 Lorrie Johnson: Facilitating Access to Scientific Data
RDAP13 Lorrie Johnson: Facilitating Access to Scientific Data
 
Open Data and Web API
Open Data and Web APIOpen Data and Web API
Open Data and Web API
 
Information Extraction and Linked Data Cloud
Information Extraction and Linked Data CloudInformation Extraction and Linked Data Cloud
Information Extraction and Linked Data Cloud
 
Patent scope ompi
Patent scope ompiPatent scope ompi
Patent scope ompi
 
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSABetter Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
 
DTL Partners Event - FAIR Data Tech overview - Day 1
DTL Partners Event - FAIR Data Tech overview - Day 1DTL Partners Event - FAIR Data Tech overview - Day 1
DTL Partners Event - FAIR Data Tech overview - Day 1
 
dcat: An RDF vocabulary for interoperability of data catalogues
dcat: An RDF vocabulary for interoperability of data cataloguesdcat: An RDF vocabulary for interoperability of data catalogues
dcat: An RDF vocabulary for interoperability of data catalogues
 

Recently uploaded

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Recently uploaded (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 

Evaluating Text Extraction at Scale: A case study from Apache Tika

  • 1. Evaluating Text Extraction At Scale: A Case Study from Apache Tika Tim Allison, Ph.D. Data Scientist/Relevance Engineer Artificial Intelligence, Analytics and Innovative Development Organization(1740) ITSD The research was carried out at the NASA (National Aeronautics and Space Administration) Jet Propulsion Laboratory, California Institute of Technology under a contract with the Defense Advanced Research Projects Agency (DARPA) SafeDocs program. © 2020 California Institute of Technology. Government sponsorship acknowledged. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology.
  • 2. jpl.nasa.gov About me • Data scientist (files and search) Jet Propulsion Laboratory, California Institute of Technology • Chair/V.P. Apache Tika • Committer Apache PDFBox, POI, Lucene/Solr, OpenNLP • Member Apache Software Foundation 2© 2020 California Institute of Technology. Government sponsorship acknowledged.10/20/20 The research was carried out at the NASA (National Aeronautics and Space Administration) Jet Propulsion Laboratory, California Institute of Technology under a contract with the Defense Advanced Research Projects Agency (DARPA) SafeDocs program. © 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 3. jpl.nasa.gov Outline • Apache Tika, an overview • What can possibly go wrong? • tika-eval module • Evaluation at scale and public corpora 10/20/20 3© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 4. jpl.nasa.gov Apache Tika, an overview 10/20/20 4 bytes text and metadata Framework for file type detection, parsing and uniform output for ~75 parsers, ~100+ formats https://tika.apache.org/ © 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 5. jpl.nasa.gov Apache Tika, features • Easy to add new file types for detection • Easy to add new parsers • Works recursively with embedded files/attachments • Integration with tesseract-ocr 10/20/20 5© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 6. jpl.nasa.gov Embedded files/attachments 10/20/20 6© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 7. jpl.nasa.gov Embedded files extracted 10/20/20 7© 2020 California Institute of Technology. Government sponsorship acknowledged. [ { "Application-Name": "Microsoft Office Word", "Content-Length": "27082", "Content-Type": "application/....wordprocessingml.document", "X-TIKA:content": "embed_0 ", ... }, { "Content-Type": "text/plain; charset=ISO-8859-1", "Last-Modified": "2014-06-04T04:08:28Z", "X-TIKA:content": "embed_1a", "X-TIKA:embedded_resource_path": "/embed1.zip/embed1a.txt", ... }, { "Content-Type": "application/zip", "Last-Modified": "2014-06-04T04:09:40Z", "X-TIKA:content": "embed4.txt", "X-TIKA:embedded_resource_path": "/embed1.zip/embed2.zip/embed3.zip/embed4.zip" ...}, ...]
  • 8. jpl.nasa.gov From PDF to … 10/20/20 8© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 9. jpl.nasa.gov Metadata 10/20/20 9© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 10. jpl.nasa.gov Text 10/20/20 10© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 11. jpl.nasa.gov Metadata and Content in JSON 10/20/20 11© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 12. jpl.nasa.gov What could possibly go wrong? • Basic • Thrown Exceptions – corrupt files, bad parser? • Image only – need to have mechanism to determine when to run OCR • Password protected files • Catastrophic • Infinite loops/out of memory/seg fault • Security compromises 10/20/20 12© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 13. jpl.nasa.gov What could subtly go wrong? • Missing text • Missing attachments • Garbled text 10/20/20 13© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 14. jpl.nasa.gov Missing Text 10/20/20 14 File publicly available: https://issues.apache.org/jira/browse/TIKA-1130 © 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 15. jpl.nasa.gov Garbled Text 10/20/20 15 Upgrade from Apache PDFBox 1.8.6 to 1.8.7 © 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 16. jpl.nasa.gov tika-eval 10/20/20 16 • Library and commandline tool to calculate these stats • Profile a single batch of text extracts • Compare two batches of extracts • Different tools • Different settings • Coming soon – multi-compare https://cwiki.apache.org/confluence/display/TIKA/TikaEval © 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 17. jpl.nasa.gov Out of vocabulary (OOV) – Same file, different Tika 10/20/20 17 Tika 1.14 Tika 1.15-SNAPSHOT Unique Tokens 786 156 Total Tokens 1603 272 LangId zh-ch de Common Words 0 116 Alphabetic Tokens 1603 250 Top N Tokens 捳敨: 18 | 獴档: 14 | 略獴: 14 | m: 11 | 杮湥: 11 | 瑵 捳: 11 | 畬杮: 11 | 档湥: 10 | 搠敩: 9 | 敮浨: 9 die: 11 | und: 8 | von: 8 | deutschen: 7 | deutsche: 6 | 1: 5 | das: 5 | der: 5 | finanzministerium: 5 | oder: 5 OOV% 1-(0/1603) = 100% 1-(116/250) = 54% © 2020 California Institute of Technology. Government sponsorship acknowledged. Overlap: 0% Increase in Common Words: 116
  • 18. jpl.nasa.gov Out of vocabulary (OOV) – Same file, different Tika 10/20/20 18 Tika 1.14 Tika 1.15-SNAPSHOT Unique Tokens 1916 1995 Total Tokens 14187 14302 LangId en en Common Words 7498 7409 Alphabetic Tokens 13472 13587 Top 10 Unique Tokens applicant's: 8 | 1.69: 1 | arbitrary: 1 | collecting: 1 | constitution: 1 | e112: 1 | ei.b: 1 | equating: 1 | magnetically: 1 | o: 1 ss: 106 | applicantis: 8 | ssss: 7 | iactsi: 4 | ithe: 4 | imeansi: 3 | iprocessi: 3 | calculations.i: 2 | iabstract: 2 | idata: 2 OOV% 1-(7498/13472) = 44% 1-(7409/13587) = 45% © 2020 California Institute of Technology. Government sponsorship acknowledged. Overlap: 95.5% Increase in Common Words: -89
  • 19. jpl.nasa.gov OOV/Common Tokens corpus wide Comparing the HTMLDefault encoding detector with Mozilla’s Universal Encoding Detector on HTML encoding detection 10/20/20 19 HTMLDefault Universal HTMLDefault Sum Common Tokens Universal Sum Common Tokens Difference in Sums UTF-8 EUC-JP 4,437 481,919 477,482 EUC-JP Shift_JIS 1,512 391,126 389,614 UTF-16 windows-1252 1,240 368,496 367,256 UTF-16 UTF-8 2,563 321,717 319,154 EUC-JP UTF-8 764,957 1,047,029 282,072 © 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 20. jpl.nasa.gov Scaling OOV in a set of English PDFs 10/20/20 20© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 21. jpl.nasa.gov From analytics to action 10/20/20 21 https://aviris.jpl.nasa.gov/proceedings/workshops/02_docs/2002_Ogura_1_web.pdf © 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 22. jpl.nasa.gov Stored text vs. Optical Character Recognition 10/20/20 22 Text As Stored in File !"#%$& (') *,+-).' / 0 1,23 *. 457698;:;<>=75?&@78;ACB D(B7E;FHGJICBK5MLNBKOPBKF;B DJD Q R S.TVU9WNXMY[ZT]^W_S `badc 5KICedFgfh5 cji :;edF;A^5KEk<>Imln:;e[<>EnloedACICe a lo<p57Eg5Kqsr;E;<jloe[E 8;O 6hedA5Kq adc 57ItedFk:;B c qsICf;B a Text from Tesseract OCR Constrained Least Squares Linear Spectral Unmixture by the Hybrid Steepest Descent Method Nobuhiko Ogura’ and Isao Yamada” 1 Introduction A closed polyhedron is the intersection of finite number of closed half spaces, i.e., the setof points satisfying finite number of lincar incqualitics, and is widely used as a constraint in various application, for example specifications or constraints in signal processing or estimation problems, resource restrictions in financial applications and feasible sets of © 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 23. jpl.nasa.gov What would Google do? 10/20/20 23 Text in Google’s Cache Popat, Ashok. (2009). A panlingual anomalous text detector. 201-204. 10.1145/1600193.1600237. © 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 24. jpl.nasa.gov Uses • Parser developers • Production – search, Natural Language Processing, etc… • Which parsers/extract to use? • Which encoding detector to trust? • When to run OCR? • Do not index/relevance weights 10/20/20 24© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 25. jpl.nasa.gov Other indicators • Low token/page ratio • Language identification – Catalan in a largely English corpus?! • Median/mean word length and stdev • Number of Unicode code blocks • Alphabetic/non-alphabetic 10/20/20 25© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 26. jpl.nasa.gov Limitations in absence of ground truth • More exceptions – We have a problem! Wait… • New parser, we were entirely skipping those file types before • Parser was yielding junk before on this file, now it is letting us know there’s a problem • Fewer exceptions – Great! Wait… • Mime detection not working – skipping files that we used to parse (theoretical) • Now we’re getting junk 10/20/20 26© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 27. jpl.nasa.gov Limitations in absence of ground truth • More Common Words – Great! Wait… • Serious bug that duplicates worksheets in some xlsx files (TIKA-2356…my fault…ugh!) • Fewer Common Words – Problem! Wait… • More attachments, fewer attachments (Your turn!) 10/20/20 27© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 28. jpl.nasa.gov Multi-compare example…in the works 10/20/20 28 • Set 0: 0 bytes • Set 1: HÕll World! • Set 2: H’ll World! • Set 3: HÕllÿ World! file1 © 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 29. jpl.nasa.gov Taking tika-eval public • Maruan Sahyoun kindly hosts a vm for ongoing evals (TIKA-1302) • 1 TB (~3 million files) from govdocs1, Common Crawl and bug trackers • Collaborating with Apache PDFBox and Apache POI to run evals as part of the release process 10/20/20 29© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 30. jpl.nasa.gov Taking tika-eval public • Critical to identifying regressions and building new parsers • Stacktraces created by public documents are critical for the hey-I’m-getting-this parse- exception-but-can’t-share-the- document-with-you problem 10/20/20 30© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 31. jpl.nasa.gov Corpora • Govdocs1 (https://digitalcorpora.org/corpora/files) • Common Crawl (https://commoncrawl.org/) • Bug trackers (see Peter Wyatt’s article: https://www.pdfa.org/a-new-stressful-pdf-corpus/) • Coming soon(?): Unit-test files from github parser projects 10/20/20 31© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 32. jpl.nasa.gov Corpora 10/20/20 32 https://corpora.tika.apache.org/base/docs/ © 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 33. jpl.nasa.gov Bug-trackers 10/20/20 33 https://corpora.tika.apache.org/base/docs/bug_trackers/ © 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 34. jpl.nasa.gov Please join the fun! • https://tika.apache.org/ • https://cwiki.apache.org/confluence/display/TIKA/Tik aEval • corpora-dev@tika.apache.org • https://issues.apache.org/jira/projects/TIKA • @ApacheTika 10/20/20 34© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 35. jpl.nasa.gov10/20/20 35 Questions? timothy.b.allison@jpl.nasa.gov @_tallison © 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 36. jpl.nasa.gov 10/20/20 36© 2020 California Institute of Technology. Government sponsorship acknowledged.