Apache Tika end-to-end

From the Fast Feather Track at ApacheCon NA 2010 in Atlanta

This quick talk provides an overview of Apache Tika and looks at new features and supported file formats. It then shows how to create a new parser, and finishes with using Tika from your own application.

  1. Apache Tika End-to-End: An introduction to Apache Tika, and integrating it into your application
  2. Nick Burch, Software Engineer, Alfresco
  3. Apache Tika (http://tika.apache.org/)
     • Project which started in 2006
     • Grew out of the Lucene community, now widely used
     • Provides detection of files – e.g. this binary blob is really a Word file, that one is UTF-8 plain text
     • Plain text, HTML and XHTML versions of a wide range of different file formats
     • Consistent metadata from different files
     • Tika hides the complexity of the …
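The detection idea on slide 3 can be illustrated with a deliberately simplified sketch that uses only the JDK. It is not Tika's detector: the class name `MagicSniffer` and its return labels are hypothetical, and it only checks two well-known signatures (the OLE2 header used by legacy Office files and the Zip local file header) before falling back to a UTF-8 validity test. Telling a Word file from an Excel file inside an OLE2 container needs a look inside the container, which this sketch does not attempt.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

/** Simplified illustration of magic-byte detection; hypothetical, not Tika's detector. */
final class MagicSniffer {
    // First 8 bytes of an OLE2 compound document (legacy .doc, .xls, .ppt, .msg).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // First 4 bytes of a Zip local file header (OOXML, ODF and Epub are Zip-based).
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    /** Returns a coarse label for the leading bytes of a file. */
    static String sniff(byte[] data) {
        if (startsWith(data, OLE2_MAGIC)) return "ole2";
        if (startsWith(data, ZIP_MAGIC)) return "zip";
        if (data.length > 0 && isValidUtf8(data)) return "utf-8 text";
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }

    private static boolean isValidUtf8(byte[] data) {
        try {
            // A fresh decoder reports malformed input by throwing.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Calling `MagicSniffer.sniff(bytes)` on the first few kilobytes of a file is enough for this kind of check; a real detector combines many more signatures with file name and container hints.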
  4. What's new?
     • Lots of new parsers – text, office formats, publishing formats, images, audio, CAD, fonts etc
     • Long-standing parsers improved – better HTML from Word, for example
     • Embedded resources and containers
     • Usage expanding – used by many Solr users, Alfresco, lots of people crunching masses of data on Hadoop
  5. Supported Formats, Page 1
     • Audio – WAV, RIFF, MIDI
     • DWG (CAD)
     • EPUB
     • RSS and Atom feeds
     • TrueType fonts
     • HTML
     • Images – JPEG, GIF, PNG, TIFF, Bitmap (including EXIF where found)
     • iWork (Keynote, Pages etc)
     • RFC822 mbox Mail
  6. Supported Formats, Page 2
     • Microsoft Outlook .msg Email
     • Microsoft Office (Binary) – Word, PowerPoint, Excel, Visio, Publisher, Works
     • Microsoft Office (OOXML) – Word, PowerPoint, Excel
     • MP3 (ID3 v1 and v2)
     • CDF (Scientific Data)
     • Open Document Format (OpenOffice)
     • Old-style OpenOffice (.sxw etc)
  7. Supported Formats, Page 3
     • Zip and Tar archives
     • RDF
     • Plain Text
     • FLV Video
     • XML
     • Java class files
     And I probably forgot one...!
  8. Metadata
     • Tika provides consistent metadata across the range of parsers
     • No need to know if it's “Last Author”, “Last Editor” or “Previous Author” in a file format; they all come back with the same metadata key
     • Keys and values are strings, but strongly typed metadata entries provide converters to dates, ints etc
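The two metadata ideas on slide 8 (one consistent key regardless of what the file format calls a field, and string values with typed converters on top) can be sketched with plain JDK collections. Everything here is hypothetical and for illustration only: `MetadataSketch`, `TypedKey`, the key names and the alias table are assumptions, not Tika's `Metadata` API.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of consistent keys plus typed converters; not Tika's Metadata class. */
final class MetadataSketch {
    /** A typed key: the string name under which the value is stored, plus a converter. */
    static final class TypedKey<T> {
        final String name;
        final Function<String, T> converter;

        TypedKey(String name, Function<String, T> converter) {
            this.name = name;
            this.converter = converter;
        }
    }

    static final String LAST_AUTHOR = "last-author";
    static final TypedKey<Integer> PAGE_COUNT = new TypedKey<>("page-count", Integer::valueOf);
    static final TypedKey<Instant> MODIFIED = new TypedKey<>("modified", Instant::parse);

    // Format-specific field names (from the slide) mapped onto one consistent key.
    private static final Map<String, String> ALIASES = Map.of(
        "Last Author", LAST_AUTHOR,
        "Last Editor", LAST_AUTHOR,
        "Previous Author", LAST_AUTHOR);

    // Keys and values are stored as plain strings.
    private final Map<String, String> values = new HashMap<>();

    /** Stores a value under the consistent key for the given format-specific name. */
    void setRaw(String formatSpecificName, String value) {
        values.put(ALIASES.getOrDefault(formatSpecificName, formatSpecificName), value);
    }

    /** Returns the raw string value, or null if absent. */
    String get(String key) {
        return values.get(key);
    }

    /** Returns the value converted by the typed key, or null if absent. */
    <T> T get(TypedKey<T> key) {
        String raw = values.get(key.name);
        return raw == null ? null : key.converter.apply(raw);
    }
}
```

A caller that stores `"Last Editor"` from one format and `"Previous Author"` from another reads both back through `LAST_AUTHOR`, and `get(PAGE_COUNT)` hands back an `Integer` without the caller parsing anything.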
  9. Text Content
     • Tika generates HTML-like SAX events as it parses
     • Uses Java SAX API
     • Events can be captured or transformed
     • Body Content Handler used for plain text
     • HTML and XHTML available
     • Can customise with your own handler, with XSLT or with E4X from JavaScript
     • e.g. HTML Table → CSV
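The "HTML Table → CSV" idea from slide 9 can be sketched as a custom SAX handler. To keep the example runnable without Tika on the classpath, the JDK's own SAX parser stands in as the source of events; the class name `TableToCsvHandler` and its `convert` helper are hypothetical. The same handler could in principle be handed to any producer of XHTML SAX events.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that turns table rows in a SAX event stream into CSV lines. */
final class TableToCsvHandler extends DefaultHandler {
    private final StringBuilder csv = new StringBuilder();
    private final StringBuilder cell = new StringBuilder();
    private boolean inCell = false;
    private boolean firstCellInRow = true;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = elementName(localName, qName);
        if (name.equals("tr")) {
            firstCellInRow = true;
        } else if (name.equals("td") || name.equals("th")) {
            inCell = true;
            cell.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inCell) cell.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = elementName(localName, qName);
        if (name.equals("td") || name.equals("th")) {
            if (!firstCellInRow) csv.append(',');
            csv.append(quote(cell.toString().trim()));
            firstCellInRow = false;
            inCell = false;
        } else if (name.equals("tr")) {
            csv.append('\n');
        }
    }

    String csv() {
        return csv.toString();
    }

    // Namespace-aware sources fill localName; others leave it empty and only set qName.
    private static String elementName(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    // Quote a field if it contains a comma, a quote or a line break.
    private static String quote(String value) {
        if (value.contains(",") || value.contains("\"") || value.contains("\n")) {
            return "\"" + value.replace("\"", "\"\"") + "\"";
        }
        return value;
    }

    /** Convenience wrapper: parses well-formed XHTML with the JDK SAX parser. */
    static String convert(String xhtml) throws Exception {
        TableToCsvHandler handler = new TableToCsvHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.csv();
    }
}
```

The design choice is the usual SAX one: the handler keeps only the current cell in memory, so it copes with large documents as long as individual cells are small.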
  10. Calling Tika
  11. // Get a content detector, and an auto-selecting Parser
      TikaConfig config = TikaConfig.getDefaultConfig();
      ContainerAwareDetector detector = new ContainerAwareDetector(
          config.getMimeRepository()
      );
      Parser parser = new AutoDetectParser(detector);

      // We’ll only want the plain text contents
      ContentHandler handler = new …
  12. // Plain text only content handler
      ContentHandler handler = new BodyContentHandler();
      String text = handler.toString();

      // XHTML content handler (newInstance() returns a TransformerFactory, so a cast is needed)
      SAXTransformerFactory factory = (SAXTransformerFactory)
          SAXTransformerFactory.newInstance();
      TransformerHandler handler = factory.newTransformerHandler();
      handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
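The XHTML content handler from slide 12 can be tried out with the JDK alone. In the sketch below the SAX events are fired by hand as a stand-in for a parser writing into the handler; the class name `XhtmlHandlerSketch`, the `StringWriter` target and the omitted XML declaration are assumptions made for the example, not part of the slides.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical demo: serialises SAX events to XML text with a TransformerHandler. */
final class XhtmlHandlerSketch {
    static String serialize(String paragraphText) throws Exception {
        // TransformerFactory.newInstance() is what actually runs here, hence the cast.
        SAXTransformerFactory factory =
            (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        // The handler needs somewhere to write; a StringWriter keeps the demo in memory.
        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        // Stand-in for a parser: emit one paragraph as SAX events.
        AttributesImpl noAttributes = new AttributesImpl();
        char[] chars = paragraphText.toCharArray();
        handler.startDocument();
        handler.startElement("", "p", "p", noAttributes);
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "p", "p");
        handler.endDocument();

        return out.toString();
    }
}
```

Because `TransformerHandler` is itself a SAX `ContentHandler`, it can be passed wherever a content handler is expected, and the serialised markup is then read back from the writer rather than from `handler.toString()`.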
  13. Tika Parsers
  14. Parser Interface
      • Two key methods – what mime types are supported, and do the parsing

      public interface Parser {
          Set<MediaType> getSupportedTypes(ParseContext context);

          void parse(InputStream stream, ContentHandler handler,
                     Metadata metadata, ParseContext context)
              throws IOException, SAXException, …
  15. public class HelloWorldParser implements Parser {
          public Set<MediaType> getSupportedTypes(ParseContext context) {
              Set<MediaType> types = new HashSet<MediaType>();
              types.add(MediaType.parse("hello/world"));
              return types;
          }

          public void parse(InputStream stream, …
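The slide's `parse` method is cut off, so its body is not known. As a rough illustration of what a parse method typically does (read the stream, then describe the content to the handler as XHTML-style SAX events), here is a JDK-only sketch. It does not implement Tika's `Parser` interface: `HelloWorldEvents`, its two-argument `parse` and the `extractText` helper are hypothetical, and treating the input as UTF-8 text is an assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a parser body: stream in, XHTML SAX events out. */
final class HelloWorldEvents {
    private static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Reads the whole stream as UTF-8 (an assumption) and emits it as one paragraph. */
    static void parse(InputStream stream, ContentHandler handler)
            throws IOException, SAXException {
        char[] chars = new String(stream.readAllBytes(), StandardCharsets.UTF_8).toCharArray();
        AttributesImpl noAttributes = new AttributesImpl();

        handler.startDocument();
        handler.startElement(XHTML_NS, "html", "html", noAttributes);
        handler.startElement(XHTML_NS, "body", "body", noAttributes);
        handler.startElement(XHTML_NS, "p", "p", noAttributes);
        handler.characters(chars, 0, chars.length);
        handler.endElement(XHTML_NS, "p", "p");
        handler.endElement(XHTML_NS, "body", "body");
        handler.endElement(XHTML_NS, "html", "html");
        handler.endDocument();
    }

    /** Captures only the character events, which is the plain-text view of the document. */
    static String extractText(InputStream stream) throws IOException, SAXException {
        StringBuilder text = new StringBuilder();
        parse(stream, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }
}
```

The point of the split is that the parsing side never decides on the output format: the caller's choice of handler determines whether the same events end up as plain text, XHTML or something else.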
  16. Demo: Tika-App
  17. Demo: Geo-Tagged Images in Alfresco Share via Tika
  18. Any Questions?
