SlideShare ist ein Scribd-Unternehmen logo
1 von 77
Downloaden Sie, um offline zu lesen
Apache Tika
What’s new with 2.0?
Apache Tika
What’s new with 2.0?
Nick Burch
CTO, Quanticate
Nick Burch
CTO, Quanticate
“small, yellow and leech-like, and
probably the oddest thing in the
Universe”
• Like a Babel Fish for content!
• Helps you work out what sort of thing
your content (1s & 0s) is
• Helps you extract the metadata from it,
in a consistent way
• Lets you get a plain text version of your
content, eg for full text indexing
• Provides a rich (XHTML) version too
Tika, in a nutshellTika, in a nutshell
Tika in the newsTika in the news
• Panama Papers – Tika used to extract content from most of
the files before indexing in Apache SOLR
https://source.opennews.org/en-US/articles/people-and-
tech-behind-panama-papers/
• MEMEX – DARPA funded project
https://nakedsecurity.sophos.com/2015/02/16/memex-
darpas-search-engine-for-the-dark-web/
A bit of historyA bit of historyA bit of historyA bit of history
Before TikaBefore Tika
• In the early 2000s, everyone was building a search engine /
search system for their CMS / web spider / etc
• Lucene mailing list and wiki had lots of code snippets for
using libraries to extract text
• Lots of bugs, people using old versions, people missing out
on useful formats, confusion abounded
• Handful of commercial libraries, generally expensive and
aimed at large companies and/or computer forensics
• Everyone was re-inventing the wheel, and doing it badly....
Tika's History (in brief)Tika's History (in brief)
• The idea from Tika first came from the Apache Nutch
project, who wanted to get useful things out of all the
content they were spidering and indexing
• The Apache Lucene project (which Nutch used) were also
interested, as lots of people there had the same problems
• Ideas and discussions started in 2006
• Project founded in 2007, in the Apache Incubator
• Initial contributions from Nutch, Lucene and Lius
• Graduated in 2008, v1.0 in 2011
Tika ReleasesTika Releases
10/06 02/08 07/09 11/10 04/12 08/13 12/14 05/16 09/17
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
Tika ReleasesTika Releases
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.10 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 1.101.111.121.13
01/07
05/08
09/09
02/11
06/12
11/13
03/15
08/16
12/17
A (brief) introduction to TikaA (brief) introduction to TikaA (brief) introduction to TikaA (brief) introduction to Tika
(Some) Supported Formats(Some) Supported Formats
• HTML, XHTML, XML
• Microsoft Office – Word, Excel, PowerPoint, Works,
Publisher, Visio – Binary and OOXML formats
• OpenDocument (OpenOffice)
• iWorks – Keynote, Pages, Numbers
• PDF, RTF, Plain Text, CHM Help
• Compression / Archive – Zip, Tar, Ar, 7z, bz2, gz etc
• Atom, RSS, ePub Lots of Scientific formats
• Audio – MP3, MP4, Vorbis, Opus, Speex, MIDI, Wav
• Image – JPEG, TIFF, PNG, BMP, GIF, ICO
DetectionDetection
• Work out what kind of file something is
• Based on a mixture of things
• Filename
• Mime magic (first few hundred bytes)
• Dedicated code (eg containers)
• Some combination of all of these
• Can be used as a standalone – what is this thing?
• Can be combined with parsers – figure out what this is,
then find a parser to work on it
MetadataMetadata
• Describes a file
• eg Title, Author, Creation Date, Location
• Tika provides a way to extract this (where present)
• However, each file format tends to have its own kind of
metadata, which can vary a lot
• eg Author, Creator, Created By, First Author, Creator[0]
• Tika tries to map file format specific metadata onto
common, consistent metadata keys
• “Give me the thing that closest represents what Dublin
Core defines as Creator”
Plain TextPlain Text
• Most file formats include at least some text
• For a plain text file, that's everything in it!
• For others, it's only part
• Lots of libraries out there which can extract text, but how
you call them varies a lot
• Tika wraps all that up for you, and gives consistentency
• Plain Text is ideal for things like Full Text Indexing, eg to
feed into SOLR, Lucene or ElasticSearch
XHTMLXHTML
• Structured Text extraction
• Outputs SAX events for the tags and text of a file
• This is actually the Tika default, Plain Text is implemented
by only catching the Text parts of the SAX output
• Isn't supposed to be the “exact representation”
• Aims to give meaningful, semantic but simple output
• Can be used for basic previews
• Can be used to filter, eg ignore header + footer then give
remainder as plain text
What's New?What's New?What's New?What's New?
Formats and ParsersFormats and Parsers
Supported FormatsSupported Formats
• HTML
• XML
• Microsoft Office
• Word
• PowerPoint
• Excel (2,3,4,5,97+)
• Visio
• Outlook
Supported FormatsSupported Formats
• Open Document Format (ODF)
• iWorks
• PDF
• ePUB
• RTF
• Tar, RAR, AR, CPIO, Zip, 7Zip, Gzip, BZip2, XZ and Pack200
• Plain Text
• RSS and Atom
Supported FormatsSupported Formats
• IPTC ANPA Newswire
• CHM Help
• Wav, MIDI
• MP3, MP4 Audio
• Ogg Vorbis, Speex, FLAC, Opus, Theora
• PNG, JPG, BMP, TIFF, BPG, ICNS, PSD, WebP
• FLV, MP4 Video – Metadata and video histograms
• Java classes
Supported FormatsSupported Formats
• Source Code
• Mbox, RFC822, Outlook PST, Outlook MSG, TNEF
• DWG CAD
• DIF, GDAL, ISO-19139, Grib, HDF, ISA-Tab, NetCDF, Matlab
• Executables (Windows, Linux, Mac)
• Pkcs7
• SQLite
• Microsoft Access
OCROCR
OCROCR
• What if you don't have a text file, but instead a photo of
some text? Or a scan of some text?
• OCR (Optical Character Recognition) to the rescue!
• Tesseract is an Open Source OCR tool
• Tika has a parser which can use Tesseract for found images
• Tesseract is detected, and used if found on your path
• Explicit path can be given, or can be disabled
• TODO: Better combining of OCR + normal, or eg PDF only
Container FormatsContainer Formats
DatabasesDatabases
DatabasesDatabases
• A surprising number of Database and “database” systems
have a single-file mode
• If there's a single file, and a suitable library or program,
then Tika can get the data out!
• Main ones so far are MS Access & SQLite
• Panama Papers dump may inspire some more!
• How best to represent the contents in XHTML?
• One HTML table per Database Table best we have, so far...
Tika Config XMLTika Config XML
Tika Config XMLTika Config XML
• Using Config, you can specify what Parsers, Detectors,
Translator, Service Loader and Mime Types to use
• You can do it explicitly
• You can do it implicitly (with defaults)
• You can do “default except”
• Tools available to dump out a running config as XML
• Use the Tika App to see what you have + save it
Tika Config XML exampleTika Config XML example
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<parsers>
<parser class="org.apache.tika.parser.DefaultParser">
<mime-exclude>image/jpeg</mime-exclude>
<mime-exclude>application/pdf</mime-exclude>
<parser-exclude
class="org.apache.tika.parser.executable.ExecutableParser"/>
</parser>
<parser class="org.apache.tika.parser.EmptyParser">
<mime>application/pdf</mime>
</parser>
</parsers>
</properties>
Embedded ResourcesEmbedded Resources
Tika AppTika App
Tika ServerTika Server
OSGiOSGi
Tika BatchTika Batch
Named Entity RecognitionNamed Entity Recognition
Geo Entity LookupGeo Entity Lookup
Geo Entity LookupGeo Entity Lookup
• Augmenting “This was written in Vancouver BC in May” with
details of where that is (lat, long, country etc)
• Apache Lucene Gazetter provides fast lookup of place names
to geographic details
• Geonames.org dataset used to feed Gazetter
• Apache OpenNLP identifies places in text to lookup
• Needs custom NLP model for place name identification
• GeoTopicParser saves results as metadata, best & alternate
Apache cTAKESApache cTAKES
TranslationTranslation
Language DetectionLanguage Detection
TroubleshootingTroubleshooting
TroubleshootingTroubleshooting
• Finally, we have a troubleshooting guide!
http://wiki.apache.org/tika/Troubleshooting%20Tika
• Covers most of the major queries
• Why wasn’t the right parser used
• Why didn’t detection work
• What parsers do I really have etc!
What's New & Coming Soon?What's New & Coming Soon?What's New & Coming Soon?What's New & Coming Soon?
Apache Tika 1.11 – 1.13Apache Tika 1.11 – 1.13
Tika 1.11Tika 1.11
• Library upgrades for bug fixes (POI, PDFBox etc)
• Tika Config XML enhancements
• Tika Config XML output / dumping
• Apache Commons IO used more widely
• Java 7 nio.Path as an alternative to File
• GROBID – GeneRation Of Bibliographic Data – Support for
decorating parsing with machine-learnt bibliographic
information of scholarly documents
Tika 1.12Tika 1.12
• More consistent and better HTML between PPT and PPTX
• NamedEntity Parser, using both OpenNLP and Stanford
NER, outputting text and metadata
• GeoTopic Parser speedup via using new Lucene Geo
Gazetter REST server
• Pooled Time Series parser for video – motion properties
from videos to text to allow comparisons
• Bug fixes
Tika 1.13Tika 1.13
• Lots of library upgrades – Apache POI, Apache PDFBox 2.0,
Apache SIS and half a dozen others!
• Lots of new mimetypes and magic patterns, especially for
scientific-related formats
• NamedEntity Parser add support for Python NLTK and MIT-
NLP (MITRE)
• Tika Config XML dumping moved to core, and the app can
now dump your running config for you
• Bug fixes
• Being voted on right now!
Apache Tika 1.14+Apache Tika 1.14+
Tika 1.14+Tika 1.14+
• Commons IO in Core? TBD
• PDFBox 2.0 further updates, along with a 1.8 fallback
• Language Detector improvements – N-Gram, Optimaize
Lang Detector, MIT Text.jl, pluggable and pickable
• More NLP enhancement / augmentation
• Metadata aliasing
• Plus preparations for Tika 2
Tika 2.0Tika 2.0Tika 2.0Tika 2.0
Why no Tika v2 yet?Why no Tika v2 yet?
• Apache Tika 0.1 – December 2007
• Apache Tika 1.0 – November 2011
• Shouldn't we have had a v2 by now?
• Discussions started several years ago, on the list
• Plans for what we need on the wiki for ~1 year
• Largely though, every time someone came up with a breaking
feature for 2.0, a compatible way to do it was found!
Deprecated PartsDeprecated Parts
• Various parts of Tika have been deprecated over the years
• All of those will go!
• Main ones that might bite you:
• Parser parse with no ParseContext
• Old style Metadata keys
Metadata StorageMetadata Storage
• Currently, Metadata in Tika is String Key/Value Lists
• Many Metadata types have Properties, which provide
typing, conversions, sanity checks etc
• But all still stored as String Key + Value(s)
• Some people think we need a richer storage model
• Others want to keep it simple!
• JSON, XML DOM, XMP being debated
• Richer string keys also proposed
Metadata for Video etcMetadata for Video etc
• Video file might have 2 video streams, 4 audio streams, a
metadata stream and some subtitles
• Some of those you want to treat as embedded resources
• Some of those “belong” together
• How should we return the number of channels for the 1st
audio stream in a video?
• Should it change if there’s one or many?
Java Packaging of TikaJava Packaging of Tika
• Maven Packages of Tika are
• Tika Core
• Tika Parsers
• Tika Bundle
• Tika XMP
• Tika Java 7
• For just some parsers, in Tika 1.x, you need to exclude
maven dependencies + re-test
• In Tika 2, more fine-grained parser collections
Tika 2.x Parser SetsTika 2.x Parser Sets
• Available today in Git on the 2.x branch
• Advanced CAD Code
• Crypto Database eBook
• Journal Multimedia Office
• Package PDF Scientific
• Text Web XMP-Commons
• May change some more, but broadly in place now
Fallback/Preference ParsersFallback/Preference Parsers
• If we have several parsers that can handle a format
• Preferences?
• If one fails, how about trying others?
Multiple ParsersMultiple Parsers
• If we have several parsers that can handle a format
• What about running all of them?
• eg extract image metadata
• then OCR it
• then try the regular image parser for more metadata
• Or maybe for calling multiple different NER parsers
Parser Discovery/Loading?Parser Discovery/Loading?
• Currently, Tika uses a Service Loader mechanism to find
and load available Parsers (and Detectors+Translators)
• This allows you to drop a new Tika parser jar onto the
classpath, and have it automatically used
• Also allows you to miss one or two jars out, and not get any
content back with no warnings / errors...
• You can set the Service Loader to Warn, or even Error
• But most people don't, and it bites them!
• Change the default in 2? Or change entirely how we do it?
What we still need help with...What we still need help with...What we still need help with...What we still need help with...
Content Handler Reset/AddContent Handler Reset/Add
• Tika uses the SAX Content Handler interface for supplying
plain text along with semantically meaningful XHTML
• Streaming, write once
• How does that work with multiple parsers?
• How about if one parser fails and we want to try parsing
with a different one?
• How about if one parser works, then you want to run a
second?
• Language Detection / NER – how to mark up previous text?
Content EnhancementContent Enhancement
• How can we post-process the content to “enhance” it in
various ways?
• For example, how can we mark up parts of speach?
• Pull out information into the Metadata?
• Translate it, retaining the original positions?
• For just some formats, or for all?
• For just some documents in some formats?
• While still keeping the Streaming SAX-like contract?
Metadata StandardsMetadata Standards
• Currently, Tika works hard to map file-format-specific
metadata onto general metadata standards
• Means you don't have to know each standard in depth, can
just say “give me the closest to dc:subject you have, no
matter what file format or library it comes from”
• What about non-File-format metadata, such as content
metadata (Table of Contents, Author information etc)?
• What about combining things?
Richer MetadataRicher Metadata
• See Metadata Storage slides!
Bonus!Bonus! Apache Tika at ScaleApache Tika at ScaleBonus!Bonus! Apache Tika at ScaleApache Tika at Scale
Lots of Data is JunkLots of Data is Junk
• At scale, you're going to hit lots of edge cases
• At scale, you're going to come across lots of junk or
corrupted documents
• 1% of a lot is still a lot...
• 1% of the internet is a huge amount!
• Bound to find files which are unusual or corrupted enough
to be mis-identified
• You need to plan for failures!
Unusual TypesUnusual Types
• If you're working on a big data scale, you're bound to come
across lots of valid but unusual + unknown files
• You're never going to be able to add support for all of them!
• May be worth adding support for the more common
“uncommon” unsupported types
• Which means you'll need to track something about the files
you couldn't understand
• If Tika knows the mimetype but has no parser, just log the
mimetype
• If mimetype unknown, maybe log first few bytes
Failure at ScaleFailure at Scale
• Tika will sometimes mis-identify something, so sometimes
the wrong parser will run and object
• Some files will cause parsers or their underlying libraries to
do something silly, such as use lots of memory or get
into loops with lots to do
• Some files will cause parsers or their underlying libraries to
OOM, or infinite loop, or something else bad
• If a file fails once, will probably fail again, so blindly just re-
running that task again won't help
Failure at Scale, continuedFailure at Scale, continued
• You'll need approaches that plan for failure
• Consider what will happen if a file locks up your JVM, or
kills it with an OOM
• Forked Parser may be worth using
• Running a separate Tika Server could be good
• Depending on work needed, could have a smaller pool of
Tika Server instances for big data code to call
• Think about failure modes, then think about retries (or not)
• Track common problems, report and fix them!
Bonus!Bonus! Tika Batch, Eval & HadoopTika Batch, Eval & HadoopBonus!Bonus! Tika Batch, Eval & HadoopTika Batch, Eval & Hadoop
Tika Batch – TIKA-1330Tika Batch – TIKA-1330
• Aiming to provide a robust Tika wrapper, that handles
OOMs, permanent hangs, out of file handles etc
• Should be able to use Tika Batch to run Tika against a wide
range of documents, getting either content or an error
• First focus was on the Tika App, with adisk-to-disk wrapper
• Now looking at the Tika Server, to have it log errors,
provide a watchdog to restart after serious errors etc
• Once that's all baked in, refactor and fully-hadoop!
• Accept there will always be errors! Work with that
Tika Batch HadoopTika Batch Hadoop
• Now we have the basic Tika Batch working – Hadoop it!
• Aiming to provide a full Hadoop Tika Batch implementation
• Will process a large collection of files, providing either
Metadata+Content, or a detailed error of failure
• Failure could be machine/enivornment, so probably need to
retry a failure incase it isn't a Tika issue!
• Will be partly inspired by the work Apache Nutch does
• Tika will “eat our own dogfood” with this, using it to test for
regressions / improvements between versions
Tika Eval – TIKA-1332Tika Eval – TIKA-1332
• Building on top of Tika Batch, to work out how well / badly a
version of Tika does on a large collection of documents
• Provide comparible profiling of a run on a corpus
• Number of different file types found, number of exceptions,
exceptions by type and file type, attachements etc
• Also provide information on langauge stats, and junk text
• Identify file types to look at supporting
• Identify file types / exceptions which have regressed
• Identify exceptions / problems to try to fix
• Identify things for manual review, eg TIKA-1442 PDFBox bug
Batch+Eval+Public DatasetsBatch+Eval+Public Datasets
• When looking at a new feature, or looking to upgrade a
dependency, we want to know if we have broken anything
• Unit tests provide a good first-pass, but only so many files
• Running against a very large dataset and comparing
before/after the best way to handle it
• Initially piloting + developing against the Govdocs1 corpus
http://digitalcorpora.org/corpora/govdocs
• Using donated hosting from Rackspace for trying this
• Need newer + more varied corpuses as well! Know of any?
Reports include: Detection diffsReports include: Detection diffs
Any Questions?Any Questions?Any Questions?Any Questions?
Nick Burch
@Gagravarr
nick@apache.org
Nick Burch
@Gagravarr
nick@apache.org

Weitere ähnliche Inhalte

Kürzlich hochgeladen

Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfCionsystems
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 

Kürzlich hochgeladen (20)

Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 

Empfohlen

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by HubspotMarius Sescu
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTExpeed Software
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 

Empfohlen (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Apache Tika - what's new with 2.0?

  • 1. Apache Tika What’s new with 2.0? Apache Tika What’s new with 2.0?
  • 2. Nick Burch CTO, Quanticate Nick Burch CTO, Quanticate
  • 3. “small, yellow and leech-like, and probably the oddest thing in the Universe” • Like a Babel Fish for content! • Helps you work out what sort of thing your content (1s & 0s) is • Helps you extract the metadata from it, in a consistent way • Lets you get a plain text version of your content, eg for full text indexing • Provides a rich (XHTML) version too Tika, in a nutshellTika, in a nutshell
  • 4. Tika in the newsTika in the news • Panama Papers – Tika used to extract content from most of the files before indexing in Apache SOLR https://source.opennews.org/en-US/articles/people-and- tech-behind-panama-papers/ • MEMEX – DARPA funded project https://nakedsecurity.sophos.com/2015/02/16/memex- darpas-search-engine-for-the-dark-web/
  • 5. A bit of historyA bit of historyA bit of historyA bit of history
  • 6. Before TikaBefore Tika • In the early 2000s, everyone was building a search engine / search system for their CMS / web spider / etc • Lucene mailing list and wiki had lots of code snippets for using libraries to extract text • Lots of bugs, people using old versions, people missing out on useful formats, confusion abounded • Handful of commercial libraries, generally expensive and aimed at large companies and/or computer forensics • Everyone was re-inventing the wheel, and doing it badly....
  • 7. Tika's History (in brief)Tika's History (in brief) • The idea from Tika first came from the Apache Nutch project, who wanted to get useful things out of all the content they were spidering and indexing • The Apache Lucene project (which Nutch used) were also interested, as lots of people there had the same problems • Ideas and discussions started in 2006 • Project founded in 2007, in the Apache Incubator • Initial contributions from Nutch, Lucene and Lius • Graduated in 2008, v1.0 in 2011
  • 8. Tika ReleasesTika Releases 10/06 02/08 07/09 11/10 04/12 08/13 12/14 05/16 09/17 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
  • 9. Tika ReleasesTika Releases 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.10 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 1.101.111.121.13 01/07 05/08 09/09 02/11 06/12 11/13 03/15 08/16 12/17
  • 10. A (brief) introduction to TikaA (brief) introduction to TikaA (brief) introduction to TikaA (brief) introduction to Tika
  • 11. (Some) Supported Formats(Some) Supported Formats • HTML, XHTML, XML • Microsoft Office – Word, Excel, PowerPoint, Works, Publisher, Visio – Binary and OOXML formats • OpenDocument (OpenOffice) • iWorks – Keynote, Pages, Numbers • PDF, RTF, Plain Text, CHM Help • Compression / Archive – Zip, Tar, Ar, 7z, bz2, gz etc • Atom, RSS, ePub Lots of Scientific formats • Audio – MP3, MP4, Vorbis, Opus, Speex, MIDI, Wav • Image – JPEG, TIFF, PNG, BMP, GIF, ICO
  • 12. DetectionDetection • Work out what kind of file something is • Based on a mixture of things • Filename • Mime magic (first few hundred bytes) • Dedicated code (eg containers) • Some combination of all of these • Can be used as a standalone – what is this thing? • Can be combined with parsers – figure out what this is, then find a parser to work on it
  • 13. MetadataMetadata • Describes a file • eg Title, Author, Creation Date, Location • Tika provides a way to extract this (where present) • However, each file format tends to have its own kind of metadata, which can vary a lot • eg Author, Creator, Created By, First Author, Creator[0] • Tika tries to map file format specific metadata onto common, consistent metadata keys • “Give me the thing that closest represents what Dublin Core defines as Creator”
  • 14. Plain TextPlain Text • Most file formats include at least some text • For a plain text file, that's everything in it! • For others, it's only part • Lots of libraries out there which can extract text, but how you call them varies a lot • Tika wraps all that up for you, and gives consistentency • Plain Text is ideal for things like Full Text Indexing, eg to feed into SOLR, Lucene or ElasticSearch
  • 15. XHTMLXHTML • Structured Text extraction • Outputs SAX events for the tags and text of a file • This is actually the Tika default, Plain Text is implemented by only catching the Text parts of the SAX output • Isn't supposed to be the “exact representation” • Aims to give meaningful, semantic but simple output • Can be used for basic previews • Can be used to filter, eg ignore header + footer then give remainder as plain text
  • 16. What's New?What's New?What's New?What's New?
  • 18. Supported FormatsSupported Formats • HTML • XML • Microsoft Office • Word • PowerPoint • Excel (2,3,4,5,97+) • Visio • Outlook
  • 19. Supported FormatsSupported Formats • Open Document Format (ODF) • iWorks • PDF • ePUB • RTF • Tar, RAR, AR, CPIO, Zip, 7Zip, Gzip, BZip2, XZ and Pack200 • Plain Text • RSS and Atom
  • 20. Supported FormatsSupported Formats • IPTC ANPA Newswire • CHM Help • Wav, MIDI • MP3, MP4 Audio • Ogg Vorbis, Speex, FLAC, Opus, Theora • PNG, JPG, BMP, TIFF, BPG, ICNS, PSD, WebP • FLV, MP4 Video – Metadata and video histograms • Java classes
  • 21. Supported FormatsSupported Formats • Source Code • Mbox, RFC822, Outlook PST, Outlook MSG, TNEF • DWG CAD • DIF, GDAL, ISO-19139, Grib, HDF, ISA-Tab, NetCDF, Matlab • Executables (Windows, Linux, Mac) • Pkcs7 • SQLite • Microsoft Access
  • 23. OCROCR • What if you don't have a text file, but instead a photo of some text? Or a scan of some text? • OCR (Optical Character Recognition) to the rescue! • Tesseract is an Open Source OCR tool • Tika has a parser which can use Tesseract for found images • Tesseract is detected, and used if found on your path • Explicit path can be given, or can be disabled • TODO: Better combining of OCR + normal, or eg PDF only
  • 26. DatabasesDatabases • A surprising number of Database and “database” systems have a single-file mode • If there's a single file, and a suitable library or program, then Tika can get the data out! • Main ones so far are MS Access & SQLite • Panama Papers dump may inspire some more! • How best to represent the contents in XHTML? • One HTML table per Database Table best we have, so far...
  • 27. Tika Config XMLTika Config XML
  • 28. Tika Config XMLTika Config XML • Using Config, you can specify what Parsers, Detectors, Translator, Service Loader and Mime Types to use • You can do it explicitly • You can do it implicitly (with defaults) • You can do “default except” • Tools available to dump out a running config as XML • Use the Tika App to see what you have + save it
  • 29. Tika Config XML exampleTika Config XML example <?xml version="1.0" encoding="UTF-8"?> <properties> <parsers> <parser class="org.apache.tika.parser.DefaultParser"> <mime-exclude>image/jpeg</mime-exclude> <mime-exclude>application/pdf</mime-exclude> <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/> </parser> <parser class="org.apache.tika.parser.EmptyParser"> <mime>application/pdf</mime> </parser> </parsers> </properties>
  • 35. Named Entity RecognitionNamed Entity Recognition
  • 36. Geo Entity LookupGeo Entity Lookup
  • 37. Geo Entity LookupGeo Entity Lookup • Augmenting “This was written in Vancouver BC in May” with details of where that is (lat, long, country etc) • Apache Lucene Gazetter provides fast lookup of place names to geographic details • Geonames.org dataset used to feed Gazetter • Apache OpenNLP identifies places in text to lookup • Needs custom NLP model for place name identification • GeoTopicParser saves results as metadata, best & alternate
  • 42. TroubleshootingTroubleshooting • Finally, we have a troubleshooting guide! http://wiki.apache.org/tika/Troubleshooting%20Tika • Covers most of the major queries • Why wasn’t the right parser used • Why didn’t detection work • What parsers do I really have etc!
  • 43. What's New & Coming Soon?What's New & Coming Soon?What's New & Coming Soon?What's New & Coming Soon?
  • 44. Apache Tika 1.11 – 1.13Apache Tika 1.11 – 1.13
  • 45. Tika 1.11Tika 1.11 • Library upgrades for bug fixes (POI, PDFBox etc) • Tika Config XML enhancements • Tika Config XML output / dumping • Apache Commons IO used more widely • Java 7 nio.Path as an alternative to File • GROBID – GeneRation Of Bibliographic Data – Support for decorating parsing with machine-learnt bibliographic information of scholarly documents
  • 46. Tika 1.12Tika 1.12 • More consistent and better HTML between PPT and PPTX • NamedEntity Parser, using both OpenNLP and Stanford NER, outputting text and metadata • GeoTopic Parser speedup via using new Lucene Geo Gazetter REST server • Pooled Time Series parser for video – motion properties from videos to text to allow comparisons • Bug fixes
  • 47. Tika 1.13Tika 1.13 • Lots of library upgrades – Apache POI, Apache PDFBox 2.0, Apache SIS and half a dozen others! • Lots of new mimetypes and magic patterns, especially for scientific-related formats • NamedEntity Parser add support for Python NLTK and MIT- NLP (MITRE) • Tika Config XML dumping moved to core, and the app can now dump your running config for you • Bug fixes • Being voted on right now!
  • 49. Tika 1.14+Tika 1.14+ • Commons IO in Core? TBD • PDFBox 2.0 further updates, along with a 1.8 fallback • Language Detector improvements – N-Gram, Optimaize Lang Detector, MIT Text.jl, pluggable and pickable • More NLP enhancement / augmentation • Metadata aliasing • Plus preparations for Tika 2
  • 50. Tika 2.0Tika 2.0Tika 2.0Tika 2.0
  • 51. Why no Tika v2 yet?Why no Tika v2 yet? • Apache Tika 0.1 – December 2007 • Apache Tika 1.0 – November 2011 • Shouldn't we have had a v2 by now? • Discussions started several years ago, on the list • Plans for what we need on the wiki for ~1 year • Largely though, every time someone came up with a breaking feature for 2.0, a compatible way to do it was found!
  • 52. Deprecated PartsDeprecated Parts • Various parts of Tika have been deprecated over the years • All of those will go! • Main ones that might bite you: • Parser parse with no ParseContext • Old style Metadata keys
  • 53. Metadata StorageMetadata Storage • Currently, Metadata in Tika is String Key/Value Lists • Many Metadata types have Properties, which provide typing, conversions, sanity checks etc • But all still stored as String Key + Value(s) • Some people think we need a richer storage model • Others want to keep it simple! • JSON, XML DOM, XMP being debated • Richer string keys also proposed
  • 54. Metadata for Video etcMetadata for Video etc • Video file might have 2 video streams, 4 audio streams, a metadata stream and some subtitles • Some of those you want to treat as embedded resources • Some of those “belong” together • How should we return the number of channels for the 1st audio stream in a video? • Should it change if there’s one or many?
  • 55. Java Packaging of TikaJava Packaging of Tika • Maven Packages of Tika are • Tika Core • Tika Parsers • Tika Bundle • Tika XMP • Tika Java 7 • For just some parsers, in Tika 1.x, you need to exclude maven dependencies + re-test • In Tika 2, more fine-grained parser collections
  • 56. Tika 2.x Parser SetsTika 2.x Parser Sets • Available today in Git on the 2.x branch • Advanced CAD Code • Crypto Database eBook • Journal Multimedia Office • Package PDF Scientific • Text Web XMP-Commons • May change some more, but broadly in place now
  • 57. Fallback/Preference ParsersFallback/Preference Parsers • If we have several parsers that can handle a format • Preferences? • If one fails, how about trying others?
  • 58. Multiple ParsersMultiple Parsers • If we have several parsers that can handle a format • What about running all of them? • eg extract image metadata • then OCR it • then try the regular image parser for more metadata • Or maybe for calling multiple different NER parsers
  • 59. Parser Discovery/Loading?Parser Discovery/Loading? • Currently, Tika uses a Service Loader mechanism to find and load available Parsers (and Detectors+Translators) • This allows you to drop a new Tika parser jar onto the classpath, and have it automatically used • Also allows you to miss one or two jars out, and not get any content back with no warnings / errors... • You can set the Service Loader to Warn, or even Error • But most people don't, and it bites them! • Change the default in 2? Or change entirely how we do it?
  • 60. What we still need help with...What we still need help with...What we still need help with...What we still need help with...
  • 61. Content Handler Reset/AddContent Handler Reset/Add • Tika uses the SAX Content Handler interface for supplying plain text along with semantically meaningful XHTML • Streaming, write once • How does that work with multiple parsers? • How about if one parser fails and we want to try parsing with a different one? • How about if one parser works, then you want to run a second? • Language Detection / NER – how to mark up previous text?
  • 62. Content EnhancementContent Enhancement • How can we post-process the content to “enhance” it in various ways? • For example, how can we mark up parts of speach? • Pull out information into the Metadata? • Translate it, retaining the original positions? • For just some formats, or for all? • For just some documents in some formats? • While still keeping the Streaming SAX-like contract?
  • 63. Metadata StandardsMetadata Standards • Currently, Tika works hard to map file-format-specific metadata onto general metadata standards • Means you don't have to know each standard in depth, can just say “give me the closest to dc:subject you have, no matter what file format or library it comes from” • What about non-File-format metadata, such as content metadata (Table of Contents, Author information etc)? • What about combining things?
  • 64. Richer MetadataRicher Metadata • See Metadata Storage slides!
  • 65. Bonus!Bonus! Apache Tika at ScaleApache Tika at ScaleBonus!Bonus! Apache Tika at ScaleApache Tika at Scale
  • 66. Lots of Data is JunkLots of Data is Junk • At scale, you're going to hit lots of edge cases • At scale, you're going to come across lots of junk or corrupted documents • 1% of a lot is still a lot... • 1% of the internet is a huge amount! • Bound to find files which are unusual or corrupted enough to be mis-identified • You need to plan for failures!
  • 67. Unusual TypesUnusual Types • If you're working on a big data scale, you're bound to come across lots of valid but unusual + unknown files • You're never going to be able to add support for all of them! • May be worth adding support for the more common “uncommon” unsupported types • Which means you'll need to track something about the files you couldn't understand • If Tika knows the mimetype but has no parser, just log the mimetype • If mimetype unknown, maybe log first few bytes
  • 68. Failure at ScaleFailure at Scale • Tika will sometimes mis-identify something, so sometimes the wrong parser will run and object • Some files will cause parsers or their underlying libraries to do something silly, such as use lots of memory or get into loops with lots to do • Some files will cause parsers or their underlying libraries to OOM, or infinite loop, or something else bad • If a file fails once, will probably fail again, so blindly just re- running that task again won't help
  • 69. Failure at Scale, continuedFailure at Scale, continued • You'll need approaches that plan for failure • Consider what will happen if a file locks up your JVM, or kills it with an OOM • Forked Parser may be worth using • Running a separate Tika Server could be good • Depending on work needed, could have a smaller pool of Tika Server instances for big data code to call • Think about failure modes, then think about retries (or not) • Track common problems, report and fix them!
  • 70. Bonus!Bonus! Tika Batch, Eval & HadoopTika Batch, Eval & HadoopBonus!Bonus! Tika Batch, Eval & HadoopTika Batch, Eval & Hadoop
  • 71. Tika Batch – TIKA-1330Tika Batch – TIKA-1330 • Aiming to provide a robust Tika wrapper, that handles OOMs, permanent hangs, out of file handles etc • Should be able to use Tika Batch to run Tika against a wide range of documents, getting either content or an error • First focus was on the Tika App, with adisk-to-disk wrapper • Now looking at the Tika Server, to have it log errors, provide a watchdog to restart after serious errors etc • Once that's all baked in, refactor and fully-hadoop! • Accept there will always be errors! Work with that
  • 72. Tika Batch HadoopTika Batch Hadoop • Now we have the basic Tika Batch working – Hadoop it! • Aiming to provide a full Hadoop Tika Batch implementation • Will process a large collection of files, providing either Metadata+Content, or a detailed error of failure • Failure could be machine/enivornment, so probably need to retry a failure incase it isn't a Tika issue! • Will be partly inspired by the work Apache Nutch does • Tika will “eat our own dogfood” with this, using it to test for regressions / improvements between versions
  • 73. Tika Eval – TIKA-1332Tika Eval – TIKA-1332 • Building on top of Tika Batch, to work out how well / badly a version of Tika does on a large collection of documents • Provide comparible profiling of a run on a corpus • Number of different file types found, number of exceptions, exceptions by type and file type, attachements etc • Also provide information on langauge stats, and junk text • Identify file types to look at supporting • Identify file types / exceptions which have regressed • Identify exceptions / problems to try to fix • Identify things for manual review, eg TIKA-1442 PDFBox bug
  • 74. Batch+Eval+Public DatasetsBatch+Eval+Public Datasets • When looking at a new feature, or looking to upgrade a dependency, we want to know if we have broken anything • Unit tests provide a good first-pass, but only so many files • Running against a very large dataset and comparing before/after the best way to handle it • Initially piloting + developing against the Govdocs1 corpus http://digitalcorpora.org/corpora/govdocs • Using donated hosting from Rackspace for trying this • Need newer + more varied corpuses as well! Know of any?
  • 75. Reports include: Detection diffsReports include: Detection diffs
  • 76. Any Questions?Any Questions?Any Questions?Any Questions?