This document introduces Apache Tika, an open source toolkit for detecting and extracting metadata and structured text content from various documents. It discusses Apache Tika's parser interface, which allows client applications to extract content and metadata from different file formats through a single method. It also provides examples of using Apache Tika to extract metadata from a PDF file and to determine which document was last modified from a set of URLs. The document aims to help readers understand and get started with using Apache Tika.
Understanding information content with Apache Tika

Source: https://www.ibm.com/developerworks/opensource/tutorials/os-apache-tika/
Oleg Tikhonov and Chris Mattmann
Published on June 15, 2010
In this tutorial, we introduce the Apache Tika framework and explain its concepts (e.g., N-gram, parsing, mime detection, and content analysis) via illustrative examples that should be applicable to not only seasoned software developers but to beginners to content analysis and programming as well. We assume you have a working knowledge of the Java™ programming language and some content you would like to analyze.
Throughout this tutorial, you will learn:

- Apache Tika's API, most relevant modules, and related functionality
- Apache Nutch (one of the progenitors of Tika) and its NgramProfiler and LanguageIdentifier classes, which have recently been migrated into Tika
- cpdetector, the code page detector project, and its capabilities
What is Apache Tika?

As Apache Tika's site suggests, Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
The parser interface

The org.apache.tika.parser.Parser interface is the key concept of Apache Tika. It hides the complexity of different file formats and parsing libraries while providing a simple and powerful mechanism for client applications to extract structured text content and metadata from all sorts of documents. All this is achieved with a single method:
    void parse(InputStream stream, ContentHandler handler, Metadata metadata)
        throws IOException, SAXException, TikaException;

The parse method takes the document to be parsed and related document metadata as input, and outputs the results as XHTML SAX events and extracted metadata. The main criteria that led to this design are shown in Table 1.

Table 1. Criteria for Tika parsing design
Streamed parsing: The interface should require neither the client application nor the parser implementation to keep the full document content in memory or spooled to disk. This allows even huge documents to be parsed without excessive resource requirements.

Structured content: A parser implementation should be able to include structural information (headings, links, etc.) in the extracted content. A client application can use this information, for example, to better judge the relevance of different parts of the parsed document.

Input metadata: A client application should be able to include metadata like the file name or declared content type with the document to be parsed. The parser implementation can use this information to better guide the parsing process.

Output metadata: A parser implementation should be able to return document metadata in addition to document content. Many document formats contain metadata, such as the name of the author, that may be useful to client applications.
These criteria are reflected in the arguments of the parse method.

Document InputStream

The first argument is an InputStream for reading the document to be parsed. If this document stream cannot be read, parsing stops and the thrown IOException is passed up to the client application. If the stream can be read but not parsed (if the document is corrupted, for example), the parser throws a TikaException.

The parser implementation will consume this stream, but closing the stream is the responsibility of the client application.
Listing 1 shows the recommended pattern for using streams with the parse method.

Listing 1. Recommended pattern for using streams with the parse method

    InputStream stream = ...;      // open the stream
    try {
        parser.parse(stream, ...); // parse the stream
    } finally {
        stream.close();            // close the stream
    }

XHTML SAX events

The parsed content of the document stream is returned to the client application as a sequence of XHTML SAX events. XHTML is used to express the structured content of the document, and SAX events enable streamed processing. Note that
the XHTML format is used here only to convey structural information, not to render the documents for browsing.
The XHTML SAX events produced by the parser implementation are sent to the ContentHandler instance given to the parse method. If the content handler fails to process an event, parsing stops and the thrown SAXException is passed up to the client application.

Listing 2 shows the overall structure of the generated event stream (with indenting added for clarity).
Listing 2. Structure of the generated event stream

    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
        <title>...</title>
      </head>
      <body>
        ...
      </body>
    </html>

Parser implementations typically use the XHTMLContentHandler utility class to generate the XHTML output. Dealing with the raw SAX events can be a bit complex, so Apache Tika (since V0.2) comes with several utility classes that can be used to process and convert the event stream to other representations.

For example, the BodyContentHandler class can be used to extract just the body part of the XHTML output and feed it as SAX events to another content handler or as characters to an output stream, a writer, or a string buffer. The following code snippet parses a document from the standard input stream and outputs the extracted text content to standard output:

    ContentHandler handler = new BodyContentHandler(System.out);
    parser.parse(System.in, handler, ...);

Another useful class is ParsingReader, which uses a background thread to parse
the document and returns the extracted text content as a character stream.
Listing 3. Example of the ParsingReader

    InputStream stream = ...; // the document to be parsed
    Reader reader = new ParsingReader(parser, stream, metadata);
    try {
        ...; // read the document text using the reader
    } finally {
        reader.close(); // the document stream is closed automatically
    }

Document metadata

The final argument to the parse method is used to pass document metadata both into and out of the parser. Document metadata is expressed as a Metadata object.

Table 2 lists some of the more interesting metadata properties.

Table 2. Metadata properties
Metadata.RESOURCE_NAME_KEY: The name of the file or resource that contains the document. A client application can set this property to allow the parser to use file name heuristics to determine the format of the document. The parser implementation may set this property if the file format contains the canonical name of the file (the GZIP format has a slot for the file name, for example).

Metadata.CONTENT_TYPE: The declared content type of the document. A client application can set this property based on, for example, an HTTP Content-Type header. The declared content type may help the parser to correctly interpret the document. The parser implementation sets this property according to the type of the document being parsed.

Metadata.TITLE: The title of the document. The parser implementation sets this property if the document format contains an explicit title field.

Metadata.AUTHOR: The name of the author of the document. The parser implementation sets this property if the document format contains an explicit author field.

Note that metadata handling is still being discussed by the Tika development team, and it is likely that there will be some (backwards incompatible) changes in metadata handling before Tika V1.0.

Parser implementations

Apache Tika comes with a number of parser classes for parsing various document formats, as shown in Table 3.

Table 3. Tika parser classes
Microsoft® Excel® (application/vnd.ms-excel): Excel spreadsheet support is available in all versions of Tika and is based on the HSSF library from POI.

Microsoft Word® (application/msword): Word document support is available in all versions of Tika and is based on the HWPF library from POI.

Microsoft PowerPoint® (application/vnd.ms-powerpoint): PowerPoint presentation support is available in all versions of Tika and is based on the HSLF library from POI.

Microsoft Visio® (application/vnd.visio): Visio diagram support was added in Tika V0.2 and is based on the HDGF library from POI.

Microsoft Outlook® (application/vnd.ms-outlook): Outlook message support was added in Tika V0.2 and is based on the HSMF library from POI.

GZIP compression (application/x-gzip): GZIP support was added in Tika V0.2 and is based on the GZIPInputStream class in Java 5.

bzip2 compression (application/x-bzip): bzip2 support was added in Tika V0.2 and is based on bzip2 parsing code from Apache Ant, which was originally based on work by Keiron Liddle from Aftex Software.

MP3 audio (audio/mpeg): The parsing of ID3v1 tags from MP3 files was added in Tika V0.2. If found, metadata such as the title (TITLE) and subject (SUBJECT) is extracted and set.

MIDI audio (audio/midi): Tika uses the javax.sound.midi classes to parse MIDI sequence files. Many karaoke file formats are based on MIDI and contain lyrics in embedded text tracks that Tika knows how to extract.

Wave audio (audio/basic): Tika supports sampled wave audio through the javax.sound.sampled package, and parses wave audio metadata.

Extensible Markup Language (XML) (application/xml): Tika uses the javax.xml classes to parse XML files.

HyperText Markup Language (HTML) (text/html): Tika uses the CyberNeko library to parse HTML files.

Images (image/*): Tika uses the javax.imageio classes to extract metadata from image files.

Java class files: The parsing of Java class files is based on the ASM library and work by Dave Brosius in JCR-1522.

Java Archive files: The parsing of JAR files is performed using a combination of the ZIP and Java class file parsers.

OpenDocument (application/vnd.oasis.opendocument.*): Tika uses the built-in XML features of the Java language to parse the OpenDocument formats used by OpenOffice V2.0 and higher. The older OpenOffice V1.0 formats are also supported, though they are currently not auto-detected as well as the newer formats.

Plain text (text/plain): Tika uses the International Components for Unicode for Java (ICU4J) library to parse plain text.

Portable Document Format (PDF) (application/pdf): Tika uses the PDFBox library to parse PDF documents.

Rich Text Format (RTF) (application/rtf): Tika uses Java's built-in Swing library to parse RTF documents.

TAR (application/x-tar): Tika uses a modified version of the TAR parsing code from Apache Ant to parse TAR files. The TAR code is based on work by Timothy Gerard Endres.

ZIP (application/zip): Tika uses Java's ZIP classes to parse ZIP files.
You can also extend Apache Tika with your own parsers, and contributions to Tika are welcome. The goal of Tika is to reuse existing parser libraries like PDFBox or Apache POI as much as possible, so most of the parser classes in Tika are adapters to such external libraries.

Apache Tika also contains some general-purpose parser implementations that are not targeted at any specific document formats. The most notable of these is the AutoDetectParser class, which encapsulates all Tika functionality into a single parser that can handle any type of document. This parser will automatically determine the type of the incoming document based on various heuristics and will then parse the document accordingly.
Now it's time for hands-on activities. Here are the classes we use throughout our tutorial:

1. BudgetScramble: Shows how to use Apache Tika metadata to determine which document has been changed most recently.
2. TikaMetadata: Shows how to get all Apache Tika metadata of a specific document, even if there is no data (just to display all metadata properties).
3. TikaMimeType: Shows how to use Apache Tika's mimetypes to detect the mimetype of a particular document.
4. TikaExtractText: Shows Apache Tika's text-extraction capabilities and saves extracted text as an appropriate file.
5. LanguageDetector: Introduces the Nutch language-identification library to
identify the language of particular content.
6. Summary: Sums up Tika features, such as mimetype detection, language detection, and metadata. In addition, it introduces cpdetector, which can determine a file's charset encoding. Finally, it shows language identification in process.
Requirements

- Ant V1.7 or higher
- Java V1.6 SE or higher
Lesson 1: Extracting metadata from a PDF file
So you've got Apache Tika downloaded and installed locally. Now what do you do with it? We suggest taking advantage of its command-line utility to extract some metadata from your favorite PDF file. We recommend the FY2010 budget for the U.S. National Aeronautics and Space Administration (NASA).

Let's begin with some basic preparatory steps:

1. Build yourself a copy of tika-app. The easiest way to do so is to grab a copy of apache-tika-X.Y-src.zip and change directory into the unpacked source directory. From there, type mvn package.
2. Ensure that everything built correctly. Type java -jar tika-app/target/tika-app-X.Y.jar -h. If you see output similar to Listing 4, you are good to go.
Listing 4. Output from Java command

    java -jar tika-app/target/tika-app-X.Y.jar -h
    usage: tika [option] [file]
    Options:
        -? or --help          Print this usage message
        -v or --verbose       Print debug level messages
        -g or --gui           Start the Apache Tika GUI
        -eX or --encoding=X   Use output encoding X
        -x or --xml           Output XHTML content (default)
        -h or --html          Output HTML content
        -t or --text          Output plain text content
        -m or --metadata      Output only metadata
    Description:
        Apache Tika will parse the file(s) specified on the
        command line and output the extracted text content
        or metadata to standard output.
        Instead of a file name you can also specify the URL
        of a document to be parsed.
        If no file name or URL is specified, then the
        standard input stream is parsed.
        Use the "--gui" (or "-g") option to start the Apache
        Tika GUI. You can drag and drop files from a normal
        file explorer to the GUI window to extract text
        content and metadata from the files.

Determining what metadata is available

Before delving too deeply into Apache Tika's rich Java API, let's first figure out what and how much metadata is available from the file. Metadata, a term used to refer to "data about data," is a description of a particular resource (in this case, the PDF file), typically consisting of a set of named fields, each of which contains metadata values. As an example, a PDF file may have a metadata description that includes an author field, with a value of John Doe. We can use the aforementioned command-line utility of Tika to see what metadata is available from the PDF file, as in Listing 5.

Listing 5. Reading PDF metadata

    java -jar tika-app/target/tika-app-X.Y.jar -m \
        ./National_Aeronautics_and_Space_...
    Content-Type: application/pdf
    Last-Modified: Tue Feb 24 04:56:17 PST 2009
    created: Sat Feb 21 07:38:41 PST 2009
    creator: Adobe InDesign CS4 (6.0)
    producer: Adobe PDF Library 9.0
    resourceName: National_Aeronautics_and_Space_...
The above output gives us a preview of what metadata is available from the downloaded PDF file. Unfortunately, outside of the last modified date and time, there isn't a lot of interesting metadata available. Looking at the PDF files available on the Whitehouse budget site (http://www.whitehouse.gov/omb/budget/Overview/), we wondered: Which budget was uploaded (or modified) last? And was it because there was so much indecisiveness on it? Perhaps there were budget increases for NASA that needed to be factored in at the last minute. In any case, the first question can easily be answered by whipping together a quick Tika-based program (OK, the rationale behind the budget increases can't, but the dates can).

Lesson 1 helps you determine which document has been modified most recently. It is important to remember that the documents are retrieved from the web.

Listing 6. determineLast.java
    public void determineLast() throws Exception {
        Tika tika = new Tika();
        Date lastDate = new Date();
        lastDate.setYear(lastDate.getYear() - 1); // start the comparison one year back
        String lastUrl = null;
        for (String budgetUrl : URLs) {
            Metadata met = new Metadata();
            try {
                tika.parse(new URL(budgetUrl).openStream(), met);
                Date docDate = BudgetScramble.toDate(met.get("Last-Modified"));
                log.info(System.getProperty("line.separator") + budgetUrl + ": " + docDate);
                if (docDate.after(lastDate)) {
                    lastDate = docDate;
                    lastUrl = budgetUrl;
                }
            } catch (Exception e) {
                log.error(e.getLocalizedMessage());
            }
        }
    }
You can run this example by typing ant budgetscramble. Listing 7 shows the result of running this program.

Listing 7. Result of Ant command

    budgetscramble:
        [java] 09/12/23 09:29:08 INFO ... the URL that finished
        is...[http://www.whitehouse.gov/omb/...] ... Feb 26 15:55:07 IST 2009

Listing 8 shows another use of Apache Tika, in which we determine to what document this URL maps.

Listing 8. Tika mapping example

    java -jar tika-app/target/tika-app-X.Y.jar \
        -x "http://www.whitehouse.gov/omb/..."
    <a href="http://www.whitehouse.gov/om...">...Technical Changes</a>

Interestingly enough, a set of new changes and appropriations made it into this year's budget.

The above example serves to illustrate the ease with which metadata can be extracted from content using Apache Tika. Of course, while Tika strives to extract as much provided metadata as possible, that does not limit its ability to extract derived metadata. The org.apache.tika.metadata.Metadata class provides the ability for merging and for easily adding new metadata key-value pairs, as well as amending those that are already extracted. To date, Tika supports more than 20 common formats, including Microsoft Word and Excel, though it is best to check http://lucene.apache.org/tika/formats.html for the most up-to-date list (or the earlier part of this article where we expressly call them out).

As Lesson 1 illustrates, Tika is a facade class for accessing Tika functionality. This class hides much of the underlying complexity of the lower-level Tika API and
provides simple methods for many common parsing and type-detection operations.

Metadata is a multi-valued metadata container. The most important method here is parse, which takes two parameters: an InputStream and a Metadata object.
Lesson 2: Automatic metadata extraction from any file type

Despite using only PDF files in Lesson 1, Apache Tika allows you to extract metadata from arbitrary file types. You'll learn more about how it does this in the coming lessons (if you can't wait, jump to Lessons 3 and 4). To demonstrate this, we'll take an arbitrary OpenOffice document template (.odt) file and print some of its metadata to the console automatically. The same code works for any file or content type in general, regardless of whether Tika explicitly understands what type it is. Tika's goal is to extract as much textual and metadata information as possible from the underlying file type, as you'll see below.

Listing 9. Extracting metadata with Tika

    List<File> list =
        Utils.getFiles(new File(Messages.getProperty(...)), new ArrayList<File>());
    for (File f : list) {
        try {
            TikaMetadata tm = new TikaMetadata(f);
            tm.showMe();
        } catch (Exception e) {
            log.error(e.getLocalizedMessage());
        }
    }

First, we get a list of files with which we're going to work; then we define a TikaMetadata object for each file and show a metadata summary. Type the following and see what happens: ant tikametadata. The output appears in Listing 10.
Listing 10. ant listing of TikaMetaData

    [java] thai_odt.odt
    [java] nbObject=0 nbPara=5 nbImg=... ...ux
    OpenOffice.org_project/310m19$Buil... nbPage=1
    Content-Type=application/vnd.oasis.opendocument.text nbCharacter=2031

Note that the above presents a set of metadata keys (on the left side of each = sign in the above output), with associated values (after the = sign) associated with the file type. Since Tika has the ability to understand .odt files, it was able to extract more comprehensive metadata (nbWord(s), nbPage(s), etc.).

Lesson 3: Understanding mimetypes

So, how did Apache Tika figure out how to extract text and metadata from the PDF budget files in Lesson 1? Tika comes with a comprehensive mimetype repository. A mimetype repository is a set of definitions of Internet Assigned Numbers Authority (IANA) mimetypes where, for each mimetype defined, an entry is recorded containing:

- Its names (including aliases)
- Its parent and child mimetypes
- Mime MAGIC, a set of control bytes used to compare against the contents of a file for detection
- URL patterns, matching the file extension or file name
- XML root characters and namespaces

Apache Tika uses the mimetype repository and a set of schemes (any combination of mime MAGIC, URL patterns, XML root characters, or file extensions) to determine if a particular file, URL, or piece of content matches one of its known types. If the content does match, Tika has
and can proceed to select the appropriate parser. In this lesson, we'll inspect some of the properties of a mimetype for a particular file and print those properties out. Often when manipulating files, we need to know the file's mimetype (e.g., a TXT file, HTML, or PDF) before we decide how to read it: you can't read binary data as plain text, and simply printing binary output gives nothing. According to the mimetype, we choose an appropriate parser or whatever else relates to our needs.
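Before looking at the listing, it helps to see what matching "mime MAGIC" against a file's leading bytes means in practice. The following is a toy, self-contained sketch (our own class name, and only three hard-coded patterns); Tika's real repository also handles offsets, masks, priorities, URL patterns, and XML root elements.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class MagicSniffer {
    // A toy magic-byte table mapping leading bytes to a mimetype.
    private static final Map<byte[], String> MAGIC = new LinkedHashMap<>();
    static {
        MAGIC.put("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf");
        MAGIC.put(new byte[] {0x50, 0x4B, 0x03, 0x04}, "application/zip");
        MAGIC.put(new byte[] {0x1F, (byte) 0x8B}, "application/x-gzip");
    }

    static String detect(byte[] head) {
        for (Map.Entry<byte[], String> e : MAGIC.entrySet()) {
            byte[] magic = e.getKey();
            // Compare the prefix of the file against each magic pattern.
            if (head.length >= magic.length
                    && Arrays.equals(Arrays.copyOf(head, magic.length), magic)) {
                return e.getValue();
            }
        }
        return "application/octet-stream"; // fallback when nothing matches
    }

    public static void main(String[] args) {
        System.out.println(detect("%PDF-1.4 ...".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(detect(new byte[] {0x50, 0x4B, 0x03, 0x04, 0, 0}));
    }
}
```

In Tika itself you never write this by hand; the MimeTypes repository performs the matching for you, as the listing below shows.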
Listing 11. Working with mimetypes

    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata();
        MimeTypes mimeTypes = TikaConfig.getDefaultConfig().getMimeRepository();
        List<File> list = Utils.getFiles(new File(Messages.getProperty(...)),
            new ArrayList<File>());
        String mime = null;
        for (File f : list) {
            URL url;
            try {
                url = new URL("file:" + f.getAbsolutePath());
                InputStream in = url.openStream();
                mime = mimeTypes.detect(in, metadata);
                log.info("Mime: " + mime + " for file: " + f.getName());
            } catch (Exception e) {
                log.error(e.getLocalizedMessage());
            } // try-catch
        } // for each
    } // main

Type the following to run it:

    ant tikamimetype

Lesson 4: Automatic text extraction from any file type

Besides having the ability to extract metadata, Apache Tika can extract the textual content, independent of other extraneous information (formatting, binary garble, and other miscellaneous information typically present in binary files), for any file type, so long as it can parse it. Tika's parsers provide
a basic means for stripping out the text from a particular file type that it knows how to parse. Textual content is useful, as it can be sent to search engines and content-management systems and used to show summaries of particular pieces of content. In the example below, we'll show you how easy it is in Tika to extract textual content from any file type. Unlike the normal disclaimers seen on TV, we do want you to try this at home!
Listing 12. Extracting textual content from a file type

    TikaConfig tc = TikaConfig.getDefaultConfig();
    List<File> list = Utils.getFiles(new File(Messages.getProperty(...)),
        new ArrayList<File>());
    Utils.deleteFiles(new File(Messages.getProperty(...)));
    for (File f : list) {
        try {
            String txt = ParseUtils.getStringContent(f, tc);
            Utils.writeTxtFile(new File(Messages.getProperty(...)
                + File.separator + Utils.getFileName(f.getName())), txt);
            log.info(Messages.getProperty(...)
                + Messages.getProperty("m004"));
        } catch (TikaException e) {
            log.error(e.getLocalizedMessage());
        } catch (IOException e) {
            log.error(e.getLocalizedMessage());
        } catch (Exception e) {
            log.error(e.getLocalizedMessage());
        }
    }

The magic is done by the ParseUtils getStringContent method. ParseUtils contains utility methods for parsing documents; these methods provide simple entry points into the Tika framework. One parameter is the file to parse, and the second is TikaConfig, which parses the XML configuration file. It's simple and powerful.

Are you burning with curiosity to know how this magic turns out? Type ant tikaextracttext.

Lesson 5: Language identification
To have content is a good start, but it's not enough. The language of the content is still missing. Did you ever think about how to identify content's language? Usually, the Natural Language Processing approach deals with this kind of problem; unfortunately, you would need to be well acquainted with that field. But there is good news: Apache Nutch has developed a module called LanguageIdentifier, which we are going to use. Let's see how it works.

Listing 13. Example of using LanguageIdentifier

    List<File> list = Utils.getFiles(new File(Messages.getProperty(...)),
        new ArrayList<File>());
    LanguageDetector ld = null;
    for (Iterator<File> iterator = list.iterator(); iterator.hasNext();) {
        File file = (File) iterator.next();
        ld = new LanguageDetector(file);
        log.info(Messages.getProperty(...) +
            Messages.getProperty("m073") + ld.getLanguage());
    } // for

See how easy it is? Just call the getLanguage() function on the detector. Don't hesitate to run this example: ant languagedetector.

That's it. If you're interested to know how language identification works, how to add a new language profiler, and even more, read further.

All languages have been identified properly except Chinese. Why? Because the system doesn't have the capability to recognize a new language out of the box. Therefore, let's start to create an N-gram profiler. Before jumping into deep water, we would like to explain what an N-gram is and how to create a training set.

What is an N-gram?

N-grams are sequences of characters or words extracted from text or documents. They can be divided into two groups: character-based and word-
based. An N-gram is a set of N consecutive characters extracted from a word or, in our case, a string. The motivation behind this is that similar words will share a high proportion of N-grams. The most common values for N are 2 and 3, termed bigrams and trigrams respectively. For instance, the word TIKA results in the generation of the bigrams *T, TI, IK, KA, A* and the trigrams **T, *TI, TIK, IKA, KA*, A**. The "*" denotes a padding space. Character-based N-grams are used in measuring the similarity of character strings. Some applications of character-based N-grams are spelling checkers, stemming, and OCR.

As you can guess, word N-grams are sequences of N consecutive words extracted from text. This approach is also language-independent. The similarity between two strings is measured by Dice's coefficient (a similarity measure): s = 2|X ∩ Y| / (|X| + |Y|), where X and Y are the sets of N-grams in the two strings and ∩ means the intersection of two sets. If we take bigrams as the measure, the coefficient may be calculated for two strings x and y in terms of bigrams: s = 2Nt / (Nx + Ny), where Nt is the number of bigrams found in both strings, Nx is the number of bigrams in string x, and Ny is the number of bigrams in string y. For example, to calculate the similarity between TIKA and TECA, we would find the set of bigrams in each word: {TI, IK, KA} and {TE, EC, CA}. Each set has three elements, and the intersection of the two sets has zero elements. Putting this into the formula gives s = (2 × 0)/(3 + 3) = 0, so the two words are totally dissimilar for bigrams. (You'll get other results for trigrams.)

A large text corpus (training corpus) is used to estimate N-gram frequencies. In Nutch's language identification, the profile for each language comes in a file with an .ngp extension. It's a file that contains N-grams and their scores; for example, "ant 17376" is a trigram with score 17376.

One of the major problems of N-gram modeling is its size, but fortunately you only have to fulfill the process once. Another interesting example of N-gram usage is extracting features for clustering large sets of satellite earth images and determining what part of the Earth a particular image came from.
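The bigram arithmetic above is easy to verify in a few lines of Java. This sketch (our own helper class, using sets of bigrams without padding, exactly as in the TIKA/TECA example) computes Dice's coefficient s = 2Nt/(Nx + Ny):

```java
import java.util.HashSet;
import java.util.Set;

public class NgramSimilarity {

    // Extract the set of character bigrams from a word, without padding,
    // matching the worked TIKA/TECA example in the text.
    static Set<String> bigrams(String word) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= word.length(); i++) {
            grams.add(word.substring(i, i + 2));
        }
        return grams;
    }

    // Dice's coefficient: s = 2*Nt / (Nx + Ny), where Nt is the number
    // of bigrams shared by both strings.
    static double dice(String x, String y) {
        Set<String> bx = bigrams(x);
        Set<String> by = bigrams(y);
        Set<String> shared = new HashSet<>(bx);
        shared.retainAll(by);
        int nx = bx.size(), ny = by.size();
        return nx + ny == 0 ? 0.0 : (2.0 * shared.size()) / (nx + ny);
    }

    public static void main(String[] args) {
        System.out.println(dice("TIKA", "TECA")); // no shared bigrams
        System.out.println(dice("TIKA", "TIKA")); // identical strings
    }
}
```

Running it confirms that TIKA and TECA share no bigrams, so s = 0, while identical strings score 1.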
How is it possible to identify language?

Generally speaking, when a new document arrives whose language needs to be identified, we first create an N-gram profile of the document and calculate the distance between the new document's profile and each language profile. The distance is calculated according to the "out-of-place measure" between the two profiles. The shortest distance is chosen, and it is predicted that the document belongs to that language. A threshold value has to be set so that if every distance goes above the threshold, the system reports that the language of the document cannot be determined or might be mistakenly identified. Before we created zh.ngp, our system identified Chinese documents incorrectly.

By adding a new N-gram language profile, we can get the language identified correctly. Apache Tika V0.5 has a LanguageIdentifier integrated into the framework. It works fine unless a document belongs to a language that the LanguageIdentifier can't recognize as one of its known languages. For our tutorial, we've separated it into different packages. Now you can create a new N-gram profile, add any language that is still unsupported, and use a call to the getLanguage() function from your code.
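The "out-of-place measure" can likewise be sketched in a few lines. The following toy implementation (our own class; Nutch's real LanguageIdentifier works over full ranked profiles) compares a document's ranked N-gram list against a language profile by summing rank displacements:

```java
import java.util.Arrays;
import java.util.List;

public class OutOfPlace {
    // Out-of-place distance between two ranked N-gram lists (most frequent
    // first): for each gram in the document profile, add how far its rank
    // is from its rank in the language profile; grams missing from the
    // language profile pay a fixed maximum penalty.
    static int distance(List<String> docProfile, List<String> langProfile) {
        int maxPenalty = langProfile.size();
        int total = 0;
        for (int i = 0; i < docProfile.size(); i++) {
            int j = langProfile.indexOf(docProfile.get(i));
            total += (j < 0) ? maxPenalty : Math.abs(i - j);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("th", "he", "an");
        List<String> en  = Arrays.asList("th", "he", "in", "an");
        List<String> xx  = Arrays.asList("de", "en", "er");
        System.out.println(distance(doc, en)); // small: ranks nearly agree
        System.out.println(distance(doc, xx)); // large: nothing matches
    }
}
```

The language profile with the smallest total distance wins; a threshold on that distance is what flags documents whose language cannot be determined.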
One of the parameters the NgramProfiler main function expects is a TXT file. In our case, it should be a text file containing Chinese text taken from Wikipedia. The amount of text ought to be large in order to build a language profile that can predict with high probability what language a given piece of content belongs to. In addition, the text needs to be taken from different subject areas: the topics might be geography, mathematics, astronautics, etc. Also try to reduce the noise (for example, exclude links, image names, and numbers). The data should avoid overlap and redundancy, to improve identification accuracy.

Create a TXT file, such as chines4ngram.txt. Go to Wikipedia, and copy and paste text into chines4ngram.txt. Try to avoid leaving blank lines; skip the links and gather prose. More is better in this case. It's a boring process, but it's important; 5,000-6,000 lines of text will do.

Note: This process could be automated by using Nutch.
NgramProfiler's main function expects parameters of the form <name_of_gram_profile> <text_file>.

Using Ant, type:

    ant createngram -Dngpname=/home/olegt/zh.ngp \
        -Dfile="/home/olegt/chines4ngram.txt"

After a while, copy zh.ngp to the org.apache.analysis.lang package and rerun TikaLanguageIdentifier by typing ant TikaLanguageIdentifier. Look at the output: all content from Chinese files has been identified as Chinese.

In this tutorial, we have also used an additional framework called cpdetector to determine a file's charset encoding. The name cpdetector stands for code page detector and has nothing to do with Java classpaths. It is a framework for configurable code page detection of documents, for example, to detect the code page of documents retrieved from remote hosts. Code page detection is needed whenever it is not known which encoding a document belongs to; therefore, it is a core requirement for any application dealing with information mining or just information retrieval.
Downloadable resources

Related topics

- Visit Apache.org/tika to learn more.
- Learn more about Nutch.
- Be sure to check out cpdetector.
- Follow developerWorks on Twitter.