Scientific data curation and
processing with Apache Tika
Chris A. Mattmann
Senior Computer Scientist, NASA Jet Propulsion Laboratory
Adjunct Assistant Professor, Univ. of Southern California
Member, Apache Software Foundation
Roadmap
• 1st part of the talk
– Why Tika?
– What is Tika?
– What are the current versions of Tika?
– What can it do?
• 2nd part of the talk
– NASA Earth Science Data Systems
– Data System Needs and Requirements
– How does Tika help?
And you are?
• Apache Member involved in
– Tika (VP,PMC), Nutch (PMC), Incubator (PMC),
OODT (Mentor), SIS (Mentor), Lucy (Mentor) and
Gora (Champion)
• Architect/Developer at NASA JPL in Pasadena, CA
• Software Architecture/Engineering Prof at USC
The Information Landscape
Proliferation of content types
available
• By some accounts, 16K to 51K content
types*
• What to do with content types?
– Parse them
• How?
• Extract their text and structure
– Index their metadata
• In an indexing technology like Lucene, Solr, or in
Google Appliance
– Identify what language they belong to
• Ngrams
*http://filext.com/
Importance of content types
Importance of content type
detection
Search Engine Architecture
Goals
• Identify and classify file types
– MIME detection
• Glob pattern
– *.txt
– *.pdf
• URL
– http://…pdf
– ftp://myfile.txt
• Magic bytes
• Combination of
the above means
• Classification means
reaction can be targeted
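The combination of signals listed above can be sketched in plain Java. This is only an illustration under stated assumptions, not Tika's implementation: the class name `SimpleMimeDetector`, the two-entry glob table, and the rule that recognised magic bytes override the name-based guess are all hypothetical choices made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical, JDK-only sketch of MIME detection; not the Tika API. */
final class SimpleMimeDetector {

    // Glob-style suffix patterns (*.txt, *.pdf) mapped to MIME types.
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();
    static {
        GLOBS.put(".txt", "text/plain");
        GLOBS.put(".pdf", "application/pdf");
    }

    // Magic bytes: a PDF file starts with the ASCII characters "%PDF".
    private static final byte[] PDF_MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    /** Guess from a file name or URL by matching its suffix against the glob table. */
    static String detectByName(String nameOrUrl) {
        String lower = nameOrUrl.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> glob : GLOBS.entrySet()) {
            if (lower.endsWith(glob.getKey())) {
                return glob.getValue();
            }
        }
        return "application/octet-stream"; // fallback for unknown names
    }

    /** Guess from the leading bytes of the content, or null if nothing matches. */
    static String detectByMagic(byte[] head) {
        if (head.length < PDF_MAGIC.length) {
            return null;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (head[i] != PDF_MAGIC[i]) {
                return null;
            }
        }
        return "application/pdf";
    }

    /** Combine both signals: trust the content when it is recognised, else the name. */
    static String detect(String nameOrUrl, byte[] head) {
        String byMagic = detectByMagic(head);
        return byMagic != null ? byMagic : detectByName(nameOrUrl);
    }
}
```

With this sketch, a PDF that was saved under a `.txt` name is still classified by its content, while an empty or unrecognised byte prefix falls back to the glob match on the name or URL.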
Tika is…
• A content analysis and detection toolkit
• A set of Java APIs providing MIME type
detection, language identification,
integration of various parsing libraries
• A rich Metadata API for representing
different Metadata models
• A command line interface to the
underlying Java code
• A GUI interface to the Java code
Tika’s (Brief) History
• Original idea for Tika came from Chris Mattmann
and Jerome Charron in 2006
• Proposed as Lucene sub-project
– Others interested, didn’t gain much traction
• Went the Incubator route in 2007 when Jukka
Zitting found that there was a need for Tika
capabilities in Apache Jackrabbit
– A Content Management System
• Graduated from the Incubator to Lucene sub-project in 2008
• Graduated to Apache TLP in April 2010
• Over 90 issues shipping in latest release (0.8)
Community
• Mailing lists
– User: 153 peeps
– Dev: 114 peeps
• Committers/PMC
– 10 peeps
– Probably 5-6 active
• Releases
– 7 releases so far
– Working on 0.8
Credit: svnsearch.org
Getting started rapidly…like
now!
• Download Tika from:
– http://tika.apache.org/download.html
• Grab tika-app-0.7.jar
• alias tika="java -jar tika-app-0.7.jar"
• tika < somefile.doc > extracted-text.xhtml
• tika -m < somefile.doc > extracted.met
• Works on Windows too (alias only on UNIX)
Detecting MIME types from
Java
• String type = new Tika().detect(…)
– java.io.InputStream
– java.io.File
– java.net.URL
– java.lang.String
Adding new MIME types
• Got XML?
• Based on freedesktop.org spec (loosely)
Many custom applications and
tools
• You need this: to read this:
Third-party parsing libraries
• Most of the custom applications come with
software libraries and tools to read/write
these files
– Rather than re-invent the wheel, figure out a
way to take advantage of them
• Parsing text and structure is a difficult
problem
– Not all libraries parse text in equivalent
manners
– Some are faster than others
– Some are more reliable than others
Parsing
• String content = new Tika().parseToString(…)
– InputStream
– File
– URL
Streaming Parsing
• Reader reader = new Tika().parse(…)
– InputStream
– File
– URL
Extraction of Metadata
• Important to follow common Metadata models
– Dublin Core – any electronic resource
– XMP – also general like Dublin Core
– Word Metadata – specific to .doc, .ppt, etc.
– EXIF – image related
• Lots of standards and models out there
– The use and extraction of common models allows for
content intercomparison
– Common models standardize mechanisms for searching
– You always know that for file type X, field Y is present and of
type String, Int, or Date
Cancer Research Example
(Figure: attributes and relationships)
Metadata
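• Metadata met = new Metadata();
// Dublin Core
met.set(Metadata.FORMAT, "text/html");
// multi-valued: add() appends, whereas a second set() would overwrite
met.add(Metadata.FORMAT, "text/plain");
// getValues() returns a String[], so format it before printing
System.out.println(
    java.util.Arrays.toString(met.getValues(Metadata.FORMAT)));
• Other met models supported (HTTP Headers, Word, Creative Commons,
Climate Forecast, etc.)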
– New in Tika 0.8! run: tika --list-met-models
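The difference between replacing a value and keeping several values under one key can be sketched with a small JDK-only container. This is a hypothetical stand-in written for illustration (the class name `MultiValuedMetadata` and its internals are assumptions), not Tika's `Metadata` class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a multi-valued metadata container; not Tika's Metadata class. */
final class MultiValuedMetadata {
    private final Map<String, List<String>> values = new HashMap<>();

    /** Replace any existing values for the key with a single value. */
    void set(String key, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(key, single);
    }

    /** Append a value, keeping the ones already stored under the key. */
    void add(String key, String value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    /** All values for the key, in insertion order; empty if the key is unset. */
    List<String> getValues(String key) {
        List<String> found = values.get(key);
        return found == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(found);
    }
}
```

Calling `set("format", "text/html")` and then `add("format", "text/plain")` leaves two values under `format`; calling `set` a second time instead would leave only the last one.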
Methods for language
identification
• N-grams
– Method of detecting next character or
set of characters in a sequence
– Useful in determining whether small
snippets of text come from a particular
language or character set
• Non-computational approaches
– Tagging
– Looking for common words or characters
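One common way to apply N-grams to language identification is to compare the character trigram counts of a snippet against a profile built from sample text in each language. The sketch below assumes that approach; it is not the algorithm shipped with Tika or Nutch, and the class name `NgramLanguageGuesser`, the trigram size, and the cosine-similarity scoring are all choices made for the example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical trigram-profile language guesser; not Tika's LanguageIdentifier. */
final class NgramLanguageGuesser {
    private static final int N = 3; // character trigrams
    private final Map<String, Map<String, Integer>> profiles = new HashMap<>();

    /** Count every overlapping N-character window in the lower-cased text. */
    static Map<String, Integer> ngrams(String text) {
        String normalized = text.toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + N <= normalized.length(); i++) {
            counts.merge(normalized.substring(i, i + N), 1, Integer::sum);
        }
        return counts;
    }

    /** Build a language profile from sample text. */
    void train(String language, String sampleText) {
        profiles.put(language, ngrams(sampleText));
    }

    /** Cosine similarity between two N-gram count vectors (0 when either is empty). */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Return the best-matching trained language, or null when nothing overlaps. */
    String guess(String snippet) {
        Map<String, Integer> target = ngrams(snippet);
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Map<String, Integer>> p : profiles.entrySet()) {
            double score = cosine(target, p.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = p.getKey();
            }
        }
        return best;
    }
}
```

Real profiles are built from far larger corpora than a single sentence; with tiny training samples like those a test would use, the guess is only reliable for snippets that share many trigrams with the sample.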
Language Detection
• LanguageIdentifier lang =
    new LanguageIdentifier(new LanguageProfile(
        FileUtils.readFileToString(new File(filename))));
• System.out.println(lang.getLanguage());
• Uses N-gram analysis included with Tika
– Originating from Nutch
– Can be improved
Running Tika in GUI form
• tika --gui
<html xmlns:html="…">
<body>
…
</body>
</html>
Integrating Tika into your
App
• Maven
• Ant
• Eclipse
• It’s just a set of jars
– tika-core
– tika-parsers
– tika-app
– tika-bundle
Some really great stuff in 0.8
• Container aware detection and MIME
improvements
• “Drop in” Parsers
– Compressed RTF / TNEF / LZFU
parsing available via external plugin at
Github
• New Parsers
– RSS
– Scientific files: NetCDF, HDF
Improvements to Tika
• Adding more parsers for content
types
– Omnigraffle?
• Expanding ability to handle random
access file parsing
– Scientific data file formats, some work
on this
• Improving language and charset
detection
Part 2
Science Data
Systems at NASA
NASA Ground Data Systems
Credit: D. Woollard
Context
• NASA develops science data processing systems
for multiple earth science missions
• These systems convert the instrument telemetry
delivered to earth from space into useful data for
scientific research
• Typical characteristics
– Remote sensing instruments that orbit the Earth multiple
times daily
– Data are acquired constantly
– Complex algorithms convert instrument measurements to
geophysical quantities
The Square Kilometer Array
• 1 sq. km of antennas
• Never-before-seen resolution looking into the sky
• 700 TB
– Per second!
NASA DESDynI Mission
• 16 TB/day
• Geographically distributed
• 10s of 1000s of jobs per day
• Tier 1 Earth Science Decadal Mission
Some Considerations
• Scale
– Data throughput rates
– # of data types
– # of metadata types
– # of users to send the data to
• Federation
– Must leave the data where it is
– Socio/Economic/Political
• Heterogeneity
– Technology, data formats, skills!
Apache OODT
• We’ve got some components to deal with
these issues
How are we building these systems now?
• Allow for push/pull of data over arbitrary protocols
• Ingestion builds std catalog and archive
• Deliver product metadata to search, portal or GIS
• Plug in arbitrary met extractors
How are we building these systems now?
• Separation of file management from workflow management
• Allow for heterogeneous computing resources
• Easily integrate PGEs
• Leverages same ingestion crawler
What does this have to do with Tika?
• Metadata Ext: TIKA!
• MIME identification: TIKA!
Science Data File Formats
• Hierarchical Data Format (HDF)
– http://www.hdfgroup.org
– Versions 4 and 5
– Lots of NASA data is in 4, newer NASA data in 5
– Encapsulates
• Observation (Scalars, Vectors, Matrices, NxMxZ…)
• Metadata (Summary info, date/time ranges, spatial
ranges)
– Custom readers/writers/APIs in many languages
• C/C++, Python, Java
Science Data File Formats
• network Common Data Form (netCDF)
– www.unidata.ucar.edu/software/netcdf/
– Versions 3 and 4
– Heavily used in DOE, NOAA, etc.
– Encapsulates
• Observation (Scalars, Vectors, Matrices, NxMxZ…)
• Metadata (Summary info, date/time ranges, spatial
ranges)
– Custom readers/writers/APIs in many languages
• C/C++, Python, Java
– Not a hierarchical representation: all flat
So how does it work?
• Ingestion
– Science data files, ancillary information from
other missions, etc., arrive in NetCDF or HDF
format
– Need to extract their met, catalog and archive
them, etc.
• Can now use Tika to do this! TIKA-399 and TIKA-400 added this
capability into the Apache trunk
• Processing
– Processors (PGEs) generate NetCDF and
HDF, must extract met, catalog and archive
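Before an ingestion step can hand a file to the right metadata extractor, it has to know which science format it is looking at. The sketch below shows one way to tell HDF4, HDF5, and classic NetCDF apart from their leading signature bytes. It is a JDK-only illustration, not the code added by TIKA-399 or TIKA-400; the class and enum names are hypothetical, and it only checks offset 0 (an HDF5 signature may also sit at later offsets, and NetCDF-4 files are stored as HDF5, so they would be reported as HDF5 here).

```java
import java.util.Arrays;

/** Hypothetical signature check for science data files; not Tika's detector. */
final class ScienceFormatSniffer {

    enum Format { HDF4, HDF5, NETCDF_CLASSIC, UNKNOWN }

    // HDF5 signature: 0x89 'H' 'D' 'F' '\r' '\n' 0x1a '\n'
    private static final byte[] HDF5_MAGIC =
            {(byte) 0x89, 'H', 'D', 'F', '\r', '\n', 0x1a, '\n'};
    // HDF4 signature: 0x0e 0x03 0x13 0x01
    private static final byte[] HDF4_MAGIC = {0x0e, 0x03, 0x13, 0x01};
    // Classic NetCDF starts with "CDF" followed by a version byte (1 or 2)
    private static final byte[] NETCDF_MAGIC = {'C', 'D', 'F'};

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    /** Classify a file from its first few bytes. */
    static Format sniff(byte[] head) {
        if (startsWith(head, HDF5_MAGIC)) {
            return Format.HDF5;
        }
        if (startsWith(head, HDF4_MAGIC)) {
            return Format.HDF4;
        }
        if (startsWith(head, NETCDF_MAGIC) && head.length > 3
                && (head[3] == 1 || head[3] == 2)) {
            return Format.NETCDF_CLASSIC;
        }
        return Format.UNKNOWN;
    }
}
```

An ingestion crawler could read the first eight bytes of each arriving file, call `sniff`, and route the file to a format-specific extractor, leaving `UNKNOWN` files for a name-based fallback.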
Tool support
• Entire stacks of tools written around
these formats
– OPeNDAP, LAS, readers, writers, custom
NASA mission toolkits
– OGC
• WMS, WCS, etc.
– Unique, one-of-a-kind software built around
these data file formats
• Apache can contribute strongly in this
area!
Besides processing science
files
• …Tika also helps with
• MIME identification
– Useful in remote file acquisition
– Useful in classification (catalog/archive) of
existing content
– Useful in crawling (see my Nutch talk)
• Language identification
– Can be useful when data is coming from
around the world and we need to quickly identify
whether or not we can process it
Big Goal
• More closely link OODT and Tika
– Add new parser to Tika
– Easily get OODT met extractor based on it
• Contribute back some features still baking
in OODT
– Configuration aspects of parsing
– File types and extensions for science data files
• Spatial
– Some work done in my CS572 class on spatial
parser for Tika – would be great to integrate
with Tika, OODT, SIS, and Solr
NASA Geo Challenges
• Sometimes the data isn’t annotated with lat and lon
– How to discover this?
• Even when the data
is annotated with
spatial information,
computation of e.g.,
bounding box around
the poles is difficult
• Efficiency and speed are difficult since data is at
scale
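The slide does not spell out why bounding boxes are hard near the poles. One well-known contributor, assumed here, is that longitude wraps around at ±180°, so a footprint that crosses that line (as polar passes often do) gets a nearly world-wide box if minimum and maximum longitude are taken naively. The sketch below contrasts the naive width with the smallest arc that actually contains the points; the class name `LongitudeSpan` is hypothetical, and inputs are assumed to be degrees in [-180, 180].

```java
import java.util.Arrays;

/** Hypothetical helper contrasting naive and wrap-aware longitude extents. */
final class LongitudeSpan {

    /** Naive width: max minus min, which breaks when points straddle +/-180. */
    static double naiveWidth(double[] lons) {
        double min = Arrays.stream(lons).min().orElse(0);
        double max = Arrays.stream(lons).max().orElse(0);
        return max - min;
    }

    /** Width of the smallest arc containing all longitudes, allowing wrap-around. */
    static double wrappedWidth(double[] lons) {
        if (lons.length == 0) {
            return 0;
        }
        double[] sorted = lons.clone();
        Arrays.sort(sorted);
        // Largest empty gap between neighbours, including the gap across +/-180.
        double largestGap = sorted[0] + 360.0 - sorted[sorted.length - 1];
        for (int i = 1; i < sorted.length; i++) {
            largestGap = Math.max(largestGap, sorted[i] - sorted[i - 1]);
        }
        // Everything outside the largest gap is the covered arc.
        return 360.0 - largestGap;
    }
}
```

For the longitudes 178, 179, and -179, the naive width is 358 degrees while the wrap-aware width is 3 degrees, which is the kind of discrepancy a spatial parser has to guard against.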
Alright, I’ll shut up now
• Any questions?
• THANK YOU!
– mattmann@apache.org
– @chrismattmann on Twitter
Acknowledgements
• Some Tika material inspired by Jukka
Zitting’s talks
– http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika
– http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika-4427630
• NASA Jet Propulsion Laboratory
– OODT Team
Book
• Jukka and I are writing a book on Tika
– Working on Chapters 8 and 9 of 15
• Early Access available through MEAP program
• http://manning.com/mattmann/

Wengines, Workflows, and 2 years of advanced data processing in Apache OODTChris Mattmann
 
Scalable Data Mining and Archiving in the Era of the Square Kilometre Array
Scalable Data Mining and Archiving in the Era of the Square Kilometre ArrayScalable Data Mining and Archiving in the Era of the Square Kilometre Array
Scalable Data Mining and Archiving in the Era of the Square Kilometre ArrayChris Mattmann
 
Teaching NASA to Open Source its Software the Apache Way
Teaching NASA to Open Source its Software the Apache WayTeaching NASA to Open Source its Software the Apache Way
Teaching NASA to Open Source its Software the Apache WayChris Mattmann
 
Supercharging your Apache OODT deployments with the Process Control System
Supercharging your Apache OODT deployments with the Process Control SystemSupercharging your Apache OODT deployments with the Process Control System
Supercharging your Apache OODT deployments with the Process Control SystemChris Mattmann
 
A Look into the Apache OODT Ecosystem
A Look into the Apache OODT EcosystemA Look into the Apache OODT Ecosystem
A Look into the Apache OODT EcosystemChris Mattmann
 
Understanding the Meaningful Use of Open Source Software
Understanding the Meaningful Use of Open Source SoftwareUnderstanding the Meaningful Use of Open Source Software
Understanding the Meaningful Use of Open Source SoftwareChris Mattmann
 
An Open Source Strategy for NASA
An Open Source Strategy for NASAAn Open Source Strategy for NASA
An Open Source Strategy for NASAChris Mattmann
 
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...Chris Mattmann
 

Mehr von Chris Mattmann (8)

Wengines, Workflows, and 2 years of advanced data processing in Apache OODT
Wengines, Workflows, and 2 years of advanced data processing in Apache OODTWengines, Workflows, and 2 years of advanced data processing in Apache OODT
Wengines, Workflows, and 2 years of advanced data processing in Apache OODT
 
Scalable Data Mining and Archiving in the Era of the Square Kilometre Array
Scalable Data Mining and Archiving in the Era of the Square Kilometre ArrayScalable Data Mining and Archiving in the Era of the Square Kilometre Array
Scalable Data Mining and Archiving in the Era of the Square Kilometre Array
 
Teaching NASA to Open Source its Software the Apache Way
Teaching NASA to Open Source its Software the Apache WayTeaching NASA to Open Source its Software the Apache Way
Teaching NASA to Open Source its Software the Apache Way
 
Supercharging your Apache OODT deployments with the Process Control System
Supercharging your Apache OODT deployments with the Process Control SystemSupercharging your Apache OODT deployments with the Process Control System
Supercharging your Apache OODT deployments with the Process Control System
 
A Look into the Apache OODT Ecosystem
A Look into the Apache OODT EcosystemA Look into the Apache OODT Ecosystem
A Look into the Apache OODT Ecosystem
 
Understanding the Meaningful Use of Open Source Software
Understanding the Meaningful Use of Open Source SoftwareUnderstanding the Meaningful Use of Open Source Software
Understanding the Meaningful Use of Open Source Software
 
An Open Source Strategy for NASA
An Open Source Strategy for NASAAn Open Source Strategy for NASA
An Open Source Strategy for NASA
 
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
 

Kürzlich hochgeladen

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 

Kürzlich hochgeladen (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 

Scientific data curation and processing with Apache Tika

  • 1. Scientific data curation and processing with Apache Tika Chris A. Mattmann Senior Computer Scientist, NASA Jet Propulsion Laboratory Adjunct Assistant Professor, Univ. of Southern California Member, Apache Software Foundation
  • 2. Roadmap • 1st part of the talk – Why Tika? – What is Tika? – What are the current versions of Tika? – What can it do? • 2nd part of the talk – NASA Earth Science Data Systems – Data System Needs and Requirements – How does Tika help?
  • 3. And you are? • Apache Member involved in – Tika (VP,PMC), Nutch (PMC), Incubator (PMC), OODT (Mentor), SIS (Mentor), Lucy (Mentor) and Gora (Champion) • Architect/Developer at NASA JPL in Pasadena, CA • Software Architecture/Engineering Prof at USC
  • 5. Proliferation of content types available • By some accounts, 16K to 51K content types* • What to do with content types? – Parse them • How? • Extract their text and structure – Index their metadata • In an indexing technology like Lucene, Solr, or in Google Appliance – Identify what language they belong to • Ngrams *http://filext.com/
  • 7. Importance of content type detection
  • 9. Goals • Identify and classify file types – MIME detection • Glob pattern – *.txt – *.pdf • URL – http://…pdf – ftp://myfile.txt • Magic bytes • Combination of the above means • Classification means reaction can be targeted
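The combination of glob patterns, URLs and magic bytes listed on the Goals slide can be sketched in plain Java. This is a minimal illustration, not Tika's implementation: the class name `SimpleMimeDetector`, its two tiny lookup tables, and the rule that magic bytes win over the name are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Toy detector (illustrative only): magic bytes first, then name suffix, then a fallback. */
final class SimpleMimeDetector {
    // Hypothetical, deliberately tiny lookup tables.
    private static final Map<String, String> MAGIC = new LinkedHashMap<>();
    private static final Map<String, String> SUFFIXES = new LinkedHashMap<>();
    static {
        MAGIC.put("%PDF-", "application/pdf");
        MAGIC.put("<?xml", "application/xml");
        SUFFIXES.put(".pdf", "application/pdf"); // stands in for the glob *.pdf
        SUFFIXES.put(".txt", "text/plain");      // stands in for the glob *.txt
    }

    /** nameOrUrl is a file name or URL; leadingBytes holds the start of the content (may be empty). */
    static String detect(String nameOrUrl, byte[] leadingBytes) {
        // 1. Magic bytes: compare the start of the content with known signatures.
        String head = new String(leadingBytes, StandardCharsets.ISO_8859_1);
        for (Map.Entry<String, String> e : MAGIC.entrySet()) {
            if (head.startsWith(e.getKey())) {
                return e.getValue();
            }
        }
        // 2. Glob on the name; for URLs, drop any query string or fragment first.
        String name = nameOrUrl.toLowerCase(Locale.ROOT).split("[?#]", 2)[0];
        for (Map.Entry<String, String> e : SUFFIXES.entrySet()) {
            if (name.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        // 3. Nothing matched: fall back to the generic binary type.
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdfStart = "%PDF-1.4".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect("report.txt", pdfStart));
        System.out.println(detect("ftp://myfile.txt", new byte[0]));
    }
}
```

Checking the content before the name means a PDF that was saved with a misleading extension is still classified by what it actually contains.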
  • 10. Tika is… • A content analysis and detection toolkit • A set of Java APIs providing MIME type detection, language identification, integration of various parsing libraries • A rich Metadata API for representing different Metadata models • A command line interface to the underlying Java code • A GUI interface to the Java code
  • 11. Tika’s (Brief) History • Original idea for Tika came from Chris Mattmann and Jerome Charron in 2006 • Proposed as Lucene sub-project – Others interested, didn’t gain much traction • Went the Incubator route in 2007 when Jukka Zitting found that there was a need for Tika capabilities in Apache Jackrabbit – A Content Management System • Graduated from the Incubator to Lucene sub-project in 2008 • Graduated to Apache TLP in April 2010 • Over 90 issues shipping in latest release (0.8)
  • 12. Community • Mailing lists – User: 153 peeps – Dev: 114 peeps • Committers/PMC – 10 peeps – Probably 5-6 active • Releases – 7 releases so far – Working on 0.8 Credit: svnsearch.org
  • 13. Getting started rapidly…like now! • Download Tika from: – http://tika.apache.org/download.html • Grab tika-app-0.7.jar • alias tika "java -jar tika-app-0.7.jar" • tika < somefile.doc > extracted-text.xhtml • tika -m < somefile.doc > extracted.met • Works on Windows too (alias only on UNIX)
  • 14. Detecting MIME types from Java • String type = new Tika().detect(…) – java.io.InputStream – java.io.File – java.net.URL – java.lang.String
  • 15. Adding new MIME types • Got XML? • Based on freedesktop.org spec (loosely)
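As a rough sketch of what an XML type definition in the freedesktop.org style looks like and how its glob patterns could be read, the Java below parses a small in-memory sample with the JDK's DOM parser. The class `MimeXmlReader`, the sample entries and their type names are illustrative assumptions; this is not how Tika itself loads its registry.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Reads glob patterns from a shared-mime-info style document (illustrative only). */
final class MimeXmlReader {
    // Hypothetical sample definitions for two science data formats.
    static final String SAMPLE =
          "<mime-info>"
        + "<mime-type type=\"application/x-hdf\">"
        + "<glob pattern=\"*.hdf\"/><glob pattern=\"*.h5\"/>"
        + "</mime-type>"
        + "<mime-type type=\"application/x-netcdf\">"
        + "<glob pattern=\"*.nc\"/>"
        + "</mime-type>"
        + "</mime-info>";

    /** Returns a map from glob pattern to MIME type, in document order. */
    static Map<String, String> globsToTypes(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> result = new LinkedHashMap<>();
            NodeList types = doc.getElementsByTagName("mime-type");
            for (int i = 0; i < types.getLength(); i++) {
                Element type = (Element) types.item(i);
                NodeList globs = type.getElementsByTagName("glob");
                for (int j = 0; j < globs.getLength(); j++) {
                    Element glob = (Element) globs.item(j);
                    result.put(glob.getAttribute("pattern"), type.getAttribute("type"));
                }
            }
            return result;
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers need no checked exceptions.
            throw new IllegalArgumentException("Could not read MIME definitions", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(globsToTypes(SAMPLE));
    }
}
```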
  • 16. Many custom applications and tools • You need this: to read this:
  • 17. Third-party parsing libraries • Most of the custom applications come with software libraries and tools to read/write these files – Rather than re-invent the wheel, figure out a way to take advantage of them • Parsing text and structure is a difficult problem – Not all libraries parse text in equivalent manners – Some are faster than others – Some are more reliable than others
  • 18. Parsing • String content = new Tika().parseToString(…) – InputStream – File – URL
  • 19. Streaming Parsing • Reader reader = new Tika().parse(…) – InputStream – File – URL
  • 20. Extraction of Metadata • Important to follow common Metadata models – Dublin Core – any electronic resource – XMP – also general like Dublin Core – Word Metadata – specific to .doc, .ppt, etc. – EXIF – image related • Lots of standards and models out there – The use and extraction of common models allows for content intercomparison – All standardize mechanisms for searching – You always know for X file type that field Y is there and of type String or Int or Date
  • 23. Metadata • Metadata met = new Metadata(); /* Dublin Core */ met.set(Metadata.FORMAT, "text/html"); /* multi-valued: add() appends, a second set() would replace */ met.add(Metadata.FORMAT, "text/plain"); System.out.println(java.util.Arrays.toString(met.getValues(Metadata.FORMAT))); • Other met models supported (HTTP Headers, Word, Creative Commons, Climate Forecast, etc.) – New in Tika 0.8! run: tika --list-met-models
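The difference between replacing a field and appending to it is what makes a metadata field multi-valued. The stand-alone sketch below models that with a map from field name to a list of values. `MultiValuedMetadata` is a hypothetical stand-in written for this example, not Tika's `Metadata` class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Minimal multi-valued metadata container (illustrative stand-in only). */
final class MultiValuedMetadata {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    /** Replaces every existing value of the field with a single value. */
    void set(String name, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(name, single);
    }

    /** Appends a value, keeping whatever the field already holds. */
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Returns a read-only copy of the field's values; empty if the field is unset. */
    List<String> getValues(String name) {
        List<String> found = values.get(name);
        return found == null
            ? Collections.<String>emptyList()
            : Collections.unmodifiableList(new ArrayList<>(found));
    }

    public static void main(String[] args) {
        MultiValuedMetadata met = new MultiValuedMetadata();
        met.set("format", "text/html");
        met.add("format", "text/plain");
        System.out.println(met.getValues("format"));
    }
}
```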
  • 24. Methods for language identification • N-grams – Method of detecting next character or set of characters in a sequence – Useful in determining whether small snippets of text come from a particular language, or character set • Non-computational approaches – Tagging – Looking for common words or characters
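To make the N-gram idea concrete, here is a toy character-trigram comparison in Java. It counts the trigrams in a snippet and picks the language whose sample text has the most similar counts, using cosine similarity. The class `TrigramLanguageGuesser`, the two one-sentence training samples and the choice of cosine similarity are assumptions for illustration; real profiles are built from far larger corpora, and this is not Tika's algorithm.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Toy language guesser based on character trigram counts (illustrative only). */
final class TrigramLanguageGuesser {

    /** Counts character trigrams; runs of non-letters become a single '_' boundary marker. */
    static Map<String, Integer> profile(String text) {
        String normalized = "_" + text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", "_") + "_";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            counts.merge(normalized.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two trigram count vectors; 0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the key of the most similar sample, or "unknown" if nothing overlaps. */
    static String guess(String snippet, Map<String, String> samplesByLanguage) {
        Map<String, Integer> target = profile(snippet);
        String best = "unknown";
        double bestScore = 0;
        for (Map.Entry<String, String> e : samplesByLanguage.entrySet()) {
            double score = cosine(target, profile(e.getValue()));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    /** Tiny hypothetical training samples, far too small for real use. */
    static Map<String, String> demoSamples() {
        Map<String, String> samples = new LinkedHashMap<>();
        samples.put("en", "the quick brown fox jumps over the lazy dog and the cat sat on the mat");
        samples.put("de", "der schnelle braune fuchs springt weit und der faule hund und die katze sitzen auf der matte");
        return samples;
    }

    public static void main(String[] args) {
        System.out.println(guess("the dog and the cat", demoSamples()));
        System.out.println(guess("der hund und die katze", demoSamples()));
    }
}
```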
  • 25. Language Detection • LanguageIdentifier lang = new LanguageIdentifier(new LanguageProfile( FileUtils.readFileToString(new File(filename)))); • System.out.println(lang.getLanguage()); • Uses Ngram analysis included with Tika – Originating from Nutch – Can be improved
  • 26. Running Tika in GUI form • tika --gui <html xmlns:html="…"> <body> … </body> </html>
  • 27. Integrating Tika into your App • Maven • Ant • Eclipse • It’s just a set of jars – tika-core – tika-parsers – tika-app – tika-bundle
  • 28. Some really great stuff in 0.8 • Container aware detection and MIME improvements • “Drop in” Parsers – Compressed RTF / TNEF / LZFU parsing available via external plugin at Github • New Parsers – RSS – Scientific files: NetCDF, HDF
  • 29. Improvements to Tika • Adding more parsers for content types – Omnigraffle? • Expanding ability to handle random access file parsing – Scientific data file formats, some work on this • Improving language and charset detection
  • 31. NASA Ground Data Systems Credit: D. Woollard
  • 32. Context • NASA develops science data processing systems for multiple earth science missions • These systems convert the instrument telemetry delivered to earth from space into useful data for scientific research • Typical characteristics – Remote sensing instruments that orbit the Earth multiple times daily – Data are acquired constantly – Complex algorithms convert instrument measurements to geophysical quantities
  • 33. The Square Kilometer Array • 1 sq. km of antennas • Never-before seen resolution looking into the sky • 700 TB – Per second!
  • 34. NASA DESDynI Mission • 16 TB/day • Geographically distributed • 10s of 1000s of jobs per day • Tier 1 Earth Science Decadal Mission
  • 35. Some Considerations • Scale – Data throughput rates – # of data types – # of metadata types – # of users to send the data to • Federation – Must leave the data where it is – Socio/Economic/Political • Heterogeneity – Technology, data formats, skills!
  • 36. Apache OODT • We’ve got some components to deal with these issues
  • 37. How are we building these systems now? – Allow for push/pull of data over arbitrary protocols – Ingestion builds std catalog and archive – Deliver product metadata to search, portal or GIS – Plug in arbitrary met extractors
  • 38. How are we building these systems now? – Separation of file management from workflow management – Allow for heterogeneous computing resources – Easily integrate PGEs – Leverages same ingestion crawler
  • 39. What does this have to do with Tika? Metadata Ext: TIKA! MIME identification: TIKA!
  • 40. What does this have to do with Tika? Metadata Ext: TIKA! MIME identification: TIKA!
  • 41. Science Data File Formats • Hierarchical Data Format (HDF) – http://www.hdfgroup.org – Versions 4 and 5 – Lots of NASA data is in 4, newer NASA data in 5 – Encapsulates • Observation (Scalars, Vectors, Matrices, NxMxZ…) • Metadata (Summary info, date/time ranges, spatial ranges) – Custom readers/writers/APIs in many languages • C/C++, Python, Java
  • 42. Science Data File Formats • network Common Data Form (netCDF) – www.unidata.ucar.edu/software/netcdf/ – Versions 3 and 4 – Heavily used in DOE, NOAA, etc. – Encapsulates • Observation (Scalars, Vectors, Matrices, NxMxZ…) • Metadata (Summary info, date/time ranges, spatial ranges) – Custom readers/writers/APIs in many languages • C/C++, Python, Java – Not Hierarchical representation: all flat
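Both format families can be told apart from their first few bytes, which is how magic-byte detection applies to science data. The sketch below checks the published signatures at offset 0 only. The class `ScienceFormatSniffer` and its enum are illustrative names. Two caveats are assumed here and noted in the comments: an HDF5 signature may also sit after a user block at a later offset, and netCDF-4 files are stored as HDF5, so the two cannot be separated by signature alone.

```java
/** Signature check for HDF and classic netCDF files (illustrative only). */
final class ScienceFormatSniffer {
    enum Format { HDF4, HDF5_OR_NETCDF4, NETCDF_CLASSIC, UNKNOWN }

    // HDF5 superblock signature: \211 H D F \r \n \032 \n
    private static final byte[] HDF5 = {(byte) 0x89, 'H', 'D', 'F', '\r', '\n', 0x1A, '\n'};
    // HDF4 magic number
    private static final byte[] HDF4 = {0x0E, 0x03, 0x13, 0x01};
    // Classic netCDF files start with "CDF" followed by a version byte
    private static final byte[] CDF = {'C', 'D', 'F'};

    private static boolean startsWith(byte[] data, byte[] signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if (data[i] != signature[i]) {
                return false;
            }
        }
        return true;
    }

    /** Only offset 0 is inspected; HDF5 files with a user block are not recognised here. */
    static Format sniff(byte[] leadingBytes) {
        if (startsWith(leadingBytes, HDF5)) {
            return Format.HDF5_OR_NETCDF4; // netCDF-4 uses HDF5 as its storage layer
        }
        if (startsWith(leadingBytes, HDF4)) {
            return Format.HDF4;
        }
        if (startsWith(leadingBytes, CDF) && leadingBytes.length >= 4
                && (leadingBytes[3] == 1 || leadingBytes[3] == 2)) {
            return Format.NETCDF_CLASSIC; // version byte 1 = classic, 2 = 64-bit offset
        }
        return Format.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(sniff(new byte[] {'C', 'D', 'F', 1, 0, 0, 0, 0}));
    }
}
```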
  • 43. So how does it work? • Ingestion – Science data files, ancillary information from other missions, etc., arrive in NetCDF or HDF format – Need to extract their met, catalog and archive them, etc. • Can now use Tika to do this! TIKA-399 and TIKA-400 added this capability into the Apache trunk • Processing – Processors (PGEs) generate NetCDF and HDF, must extract met, catalog and archive
  • 44. Tool support • Entire stacks of tools written around these formats – OPeNDAP, LAS, readers, writers, custom NASA mission toolkits – OGC • WMS, WCS, etc. – Unique, one of a kind software built around these data file formats • Apache can contribute strongly in this area!
  • 45. Besides processing science files • …Tika also helps with • MIME identification – Useful in remote file acquisition – Useful in classification (catalog/archive) of existing content – Useful in crawling (see my Nutch talk) • Language identification – Can be useful when data is coming from around the world, but need to quickly identify whether or not we can process it
  • 46. Big Goal • More closely link OODT and Tika – Add new parser to Tika – Easily get OODT met extractor based on it • Contribute back some features still baking in OODT – Configuration aspects of parsing – File types and extensions for science data files • Spatial – Some work done in my CS572 class on spatial parser for Tika – would be great to integrate with Tika, OODT, SIS, and Solr
  • 47. NASA Geo Challenges • Sometimes the data isn’t annotated with lat and lon – How to discover this? • Even when the data is annotated with spatial information, computation of e.g., bounding box around the poles is difficult • Efficiency and speed are difficult since data is at scale
  • 48. Alright, I’ll shut up now • Any questions? • THANK YOU! – mattmann@apache.org – @chrismattmann on Twitter
  • 49. Acknowledgements • Some Tika material inspired by Jukka Zitting’s talks – http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika – http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika-4427630 • NASA Jet Propulsion Laboratory – OODT Team
  • 50. Book • Jukka and I are writing a book on Tika – Working on Chapters 8 and 9 of 15 • Early Access available through MEAP program • http://manning.com/mattmann/