Content Extraction with Apache Tika
     Jukka Zitting | Tika committer, co-author of Tika in Action




Content Extraction with Apache Tika

    Introduction to Apache Tika
    Full text extraction with Tika
    Tika and Solr - the ExtractingRequestHandler
    Tika and Lucene - direct feeding of the index
         forked parsing
         link extraction




Introduction to Apache Tika (section 1 / 4)




Introduction to Apache Tika




                          The Apache Tika™ toolkit
                          - detects and extracts
                          - metadata and structured text content
                          - from various documents
                          - using existing parser libraries.

Problem domain




The Tika solution



    [Figure: a Document, for example Pride and Prejudice, goes into Tika and comes
    out as Content ("It is a truth universally acknowledged, that a single man in
    possession of a good fortune, must be in want of a wife...") plus Metadata
    (dc:title=Pride and Prejudice, dc:creator=Jane Austen, dc:date=1813).]
Project background

    Brief history
         2007           Tika       started in the Apache Incubator
         2008           Tika       graduates into a Lucene subproject
         2010           Tika       becomes a standalone TLP
         2011           Tika       1.0 released
         2011           Tika       in Action published
    Latest release is Apache Tika 1.2
         thousands of known media types
             most with associated type detection patterns
         dozens of supported document formats
             including all major office formats
         basic language detection
         etc.
    For more information, see http://tika.apache.org/


Ohloh summary (http://www.ohloh.net/p/tika)




Full text extraction with Tika (section 2 / 4)




Demo: tika-app-1.2.jar

    https://github.com/jukka/tika-demo
    java -jar tika-app-1.2.jar




tika-app as a command line tool

$ java -jar tika-app-1.2.jar --xhtml /path/to/document.doc

$ java -jar tika-app-1.2.jar --text http://example.com/document

$ java -jar tika-app-1.2.jar --metadata < document.doc

$ cat document.doc | java -jar tika-app-1.2.jar --text | grep foo

$ java -jar tika-app-1.2.jar --help



Tika’s Java API

    Divided into two layers
         The Tika facade: org.apache.tika.Tika
         Lower-level interfaces like Parser, Detector, etc.

    Use the Tika facade by default
         Provides simple support for most common use cases
          Example: new Tika().parseToString(new File("/path/to/document.doc"))

    Use the lower-level interfaces for more power or flexibility
         Allows fine-grained control of Tika functionality
         More complicated programming model
             Parsed content handled as XHTML SAX events
         Not all functionality is exposed through the Tika facade
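
    As a rough sketch, the two layers can be used like this (the class name and file
    path are placeholders; AutoDetectParser and BodyContentHandler are typical
    lower-level building blocks):

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaApiLevels {
    public static void main(String[] args) throws Exception {
        File file = new File("/path/to/document.doc");

        // Facade: one call in, plain text out
        String text = new Tika().parseToString(file);
        System.out.println(text);

        // Lower level: explicit Parser, ContentHandler and Metadata,
        // with parsed content delivered as XHTML SAX events
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        try (InputStream stream = new FileInputStream(file)) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }
        System.out.println(metadata.get(Metadata.CONTENT_TYPE));
        System.out.println(handler.toString());
    }
}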




Tika and Solr - the ExtractingRequestHandler (section 3 / 4)




ExtractingRequestHandler

    aka Solr Cell
    http://wiki.apache.org/solr/ExtractingRequestHandler

     “Solr's ExtractingRequestHandler uses Tika
     to allow users to upload binary files to Solr and
     have Solr extract text from it and then index it.”

    For example:
      $ curl "http://localhost:8983/solr/update/extract?literal.id=document&commit=true" \
           -F "file=@document.doc"

    Supports both text and metadata extraction
         with plenty of configurable options




ExtractingRequestHandler parameters

    Helping Tika do its job
         resource.name=document.doc - Helps Tika’s automatic type detection
         resource.password=secret - Allows Tika to read encrypted documents
         passwordsFile=/path/to/password-file - Resource name to password mappings
             for example: .*.pdf$ = pdf-secret
    Capturing special content
         xpath=//a - Capture only content inside elements that match the specified query
         capture=h1 - Capture content inside specific elements to a separate field
         captureAttr=true - Capture attributes into separate fields named after the element
    Mapping field names
         lowernames=true - Normalize metadata field names to “content_type”, etc.
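
    For instance, several of these parameters can be combined on a single request, in
    the style of the earlier curl example (the id and file name are placeholders):

$ curl "http://localhost:8983/solr/update/extract?literal.id=report&resource.name=report.pdf&lowernames=true&captureAttr=true&capture=h1&commit=true" \
     -F "file=@report.pdf"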

Tika and Lucene - direct feeding of the index (section 4 / 4)




Using the Tika facade to feed Lucene

// Index the first part of the document
String text = new Tika().parseToString(new File("/path/to/document.doc"));
document.add(new TextField("text", text, Field.Store.NO));


// Index the full document
Reader reader = new Tika().parse(new File("/path/to/document.doc"));
document.add(new TextField("text", reader));


// Index also some metadata
Metadata metadata = new Metadata();
Reader reader = new Tika().parse(
   new FileInputStream("/path/to/document.doc"), metadata);
document.add(new TextField("text", reader));
document.add(new StringField("type",
   metadata.get(Metadata.CONTENT_TYPE), Field.Store.YES));

Things to consider

    What if the document is larger than your memory?
         Index only first N bytes/characters?
         WriteOutContentHandler supports an explicit write limit
             Enabled by default in the Tika facade, see get/setMaxStringLength()
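
             A minimal sketch of that limit via the facade, in the same snippet style as
             the earlier code slides (the 1 MB cap and file path are illustrative values):

             Tika tika = new Tika();
             tika.setMaxStringLength(1024 * 1024); // cap parseToString() output at ~1M characters
             String text = tika.parseToString(new File("/path/to/huge-document.doc"));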

    What if the document is malformed or intentionally broken?
         Could cause denial of service problems
             Might even crash the entire JVM due to bugs in native libraries in the JDK!
         SecureContentHandler monitors parsing and terminates it if things look bad
             Enabled by default in the Tika facade

    Ultimate solution: forked parsing and the Tika server
         Parse documents in separate, sandboxed JVM processes
         A document could fail to parse, but your application won’t crash
         Code is already there, but still a bit tricky to set up
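
    For example, forked parsing can be set up roughly like this (a sketch using
    org.apache.tika.fork.ForkParser; MyIndexer, the pool size and the file path are
    placeholders, and exception handling is omitted as in the earlier snippets):

ForkParser parser = new ForkParser(
    MyIndexer.class.getClassLoader(),  // classloader used to bootstrap the forked JVMs
    new AutoDetectParser());
parser.setPoolSize(4);                 // allow up to 4 forked parser processes
try (InputStream stream = new FileInputStream("/path/to/document.doc")) {
    BodyContentHandler handler = new BodyContentHandler();
    parser.parse(stream, handler, new Metadata(), new ParseContext());
    System.out.println(handler.toString());
} finally {
    parser.close();
}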

Link extraction for web crawlers

    The LinkContentHandler class can be used to extract all links from a document
         Also works with links in formats such as PDF, MS Word, and email documents
         Use TeeContentHandler to combine it with other ways of capturing content

// for example
LinkContentHandler lch = new LinkContentHandler();
BodyContentHandler bch = new BodyContentHandler();
new Tika().getParser().parse(..., new TeeContentHandler(lch, bch), ...);

System.out.println("Content: " + bch);
for (Link link : lch.getLinks()) {
   System.out.println("Link: " + link);
}


Questions?
                                                          http://tika.apache.org/




© 2012 Adobe Systems Incorporated. All Rights Reserved.
