Content analysis for ECM with Apache Tika

Content analysis for ECM with Apache Tika Paolo Mottadelli -

A real world challenge ? ? ? Searching .docx .xlsx .pptx in Alfresco ECM

What is Tika? Another Indian Lucene project? No.

What is Tika? It is a Toolkit

A brief history of Tika Sponsored by the Apache Lucene PMC

Tika organization Changing after graduation

Getting Tika … and contributing

The Parser interface ,[object Object]

XHTML SAX events ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Why XHTML? ,[object Object],[object Object],[object Object]

ContentHandler (CH) and Decorators (CHD)

The AutoDetectParser ,[object Object],[object Object]

Type Detection MimeType type = types.getMimeType(…);

tika-mimetypes.xml ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

A really simple example ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

ECM: Manage ,[object Object],[object Object],* *

Don’t do it on your own Tika shields ECM from using many single components

Alfresco Repository JSR-170 Level2 Compatible

Repository Architecture Services Components Storage Hibernate Content Lucene Content Index Database Search Node Node Content Query Index

Step 2 ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],ContentTransformerRegistry Provides the most appropriate ContentTransformer

Step 2 (explained) Too many different ContentTransformer implementations

Step 3 Transform public void transformInternal(ContentReader reader, ContentWriter writer, TransformationOptions options) throws Exception { ... HSSFWorkbook workbook = new HSSFWorkbook(is); ... for (int i = 0; i < sheetCount; i++) { HSSFSheet sheet = workbook.getSheetAt(i); String sheetName = workbook.getSheetName(i); writeSheet(os, sheet, encoding); } ... } Example: PoiHssfContentTransformer

Step 3 (explained) Too many different ContentTransformer implementations ... again !?!

Step 4 Lucene index creation ContentReader reader = contentService.getReader(nodeRef, propertyName); ContentTransformer transformer = contentService.getTransformer(reader.getMimetype(), MimetypeMap. MIMETYPE_TEXT_PLAIN); transformer.transform(reader, writer); reader = writer.getReader(); . . . . . . . . doc.add(new Field(attributeName, reader, Field.TermVector. NO));

Step 1 + Step 2 + Step 3 String name = “resource.doc” InputStream input = getResourceAsStream(name); Metadata metadata = new Metadata(); ContentHandler handler = new BodyContentHandler(); new AutoDetectParser().parse(input, handler, metadata); String title = metadata.get(Metadata. TITLE); String content = handler.toString();

Step 1 to 4 (compressed) String name = “resource.doc” InputStream input = getResourceAsStream(name); Reader reader = new ParsingReader (input, name); . . . . . . doc.add(new Field(attributeName, reader , Field.TermVector. NO));

Extension use case Adding support for Microsoft Office Open XML Documents (Office 2007+)

Apache POI Apache POI provides Text Extraction support for Office OpenXML formats and An advanced coverage of SpreadsheetML specification (WordprocessingML & PresentationML to come)

Apache POI TextExtractors ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Content analysis for ECM with Apache Tika

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Andere mochten auch

Andere mochten auch (20)

Ähnlich wie Content analysis for ECM with Apache Tika

Ähnlich wie Content analysis for ECM with Apache Tika (20)

Mehr von Paolo Mottadelli

Mehr von Paolo Mottadelli (11)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Content analysis for ECM with Apache Tika