Metadata Extraction, Content
Transformations and Renditions
Nick Burch   •   Senior Engineer, Alfresco   •   twitter: @Gagravarr
Introduction: 3 Content Related Services
• Metadata Extractor
• Content Transformer
• Renditions

Covering:
• Service Uses
• Interfaces
• Calling the Service
• Java & JS APIs
• Demos
• Configuration
• Extending
• Apache Tika
Why Now? Aren't these old Services?
   The Metadata Extractor and Content
    Transformer are core repository services
   They've been around since the early days
   For a long time, not a lot changed with them,
    “They're boring and just work....”
   In Alfresco 3.4 we added support for
    delegating some of the work to Apache Tika
   This has led to a large improvement in the
    number of file formats that are supported!
   Renditions arrived in Alfresco 3.3
What did Alfresco 3.3 Support?
   PDF
   Word, PowerPoint, Excel
   HTML
   Open Document Formats (OpenOffice)
   RFC822 Email
   Outlook .msg Email


   And that's it...
Supported Formats in Alfresco 4.0
   Audio – WAV, RIFF, MIDI
   DWG and PRT (CAD Formats)
   Epub
   RSS and ATOM Feeds
   True Type Fonts
   HTML
   Images – JPEG, GIF, PNG, TIFF, Bitmap
    Includes EXIF Metadata where present
Alfresco 4.0 Formats - Continued
   iWork (Keynote, Pages, Numbers)
   RFC822 MBox Mail
   Microsoft Outlook .msg Email
   Microsoft Office (Binary) – Word,
    PowerPoint, Excel, Visio, Publisher, Works
   Microsoft Office (OOXML, 2007+) – Word,
    PowerPoint, Excel
   Open Document Format (OpenOffice)
   Old-style OpenOffice (.sxw etc)
Alfresco 4.0 Formats – Still Continued
   MP3 (id3 v1 and v2)
   Ogg Vorbis and FLAC
   CDF, HDF (Scientific Data)
   RDF
   RTF
   PDF
   Adobe Illustrator (PDF based)
   Adobe PSD (expected shortly)
   Plain Text
Alfresco 4.0 Formats – Final Set!
   Zip, Tar, Compress etc (Archive Formats)
   FLV Video
   XML
   Java Class Files
   CHM (Windows Help Files)
   Configurable External Programs


   And probably some others too!
Services Overview
The Metadata Extractor Service
What, How, Why?

    For a given piece of content, returns the Metadata
    held within it

    Document Metadata is converted into the content
    model

    Typically used with uploaded binary files

    Upload a PDF, extract out the Title and Description,
    save these as the properties on the Alfresco Node

    Powered internally by a number of different
    extractors

    Service picks the appropriate extractor for you

    Since Alfresco 3.4, makes heavy use of Apache Tika
The Content Transformer Service
What, How, Why?

    Transforms content from one format to another

    Driven by source and destination mime types

    Used to generate plain text versions for indexing

    Used to generate SWF versions for preview

    Used to generate PDF versions for web download

    Powered by a large number of different
    transformers internally

    Transformers can be chained together, eg .doc →
    .pdf via OpenOffice, then .pdf → .swf via pdf2swf

    Since Alfresco 3.4, makes heavy use of Apache Tika
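The chaining idea can be sketched in plain Java. This is only an illustration under stated assumptions: `TransformerChainSketch` and `Step` are hypothetical names, the "work" is a string function standing in for a real conversion, and none of this is the repository's own transformer API. Each step declares a source and target mimetype, and the chain refuses to link steps whose types do not line up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch of chaining transformers; not the Alfresco API.
final class TransformerChainSketch {

    /** One illustrative transformation step between two mimetypes. */
    static final class Step {
        final String source;
        final String target;
        final UnaryOperator<String> work;

        Step(String source, String target, UnaryOperator<String> work) {
            this.source = source;
            this.target = target;
            this.work = work;
        }
    }

    private final List<Step> steps = new ArrayList<>();

    /** Appends a step, checking that it accepts what the previous step produces. */
    TransformerChainSketch then(Step step) {
        if (!steps.isEmpty()) {
            String previousTarget = steps.get(steps.size() - 1).target;
            if (!previousTarget.equals(step.source)) {
                throw new IllegalArgumentException(
                        "Cannot chain " + previousTarget + " into " + step.source);
            }
        }
        steps.add(step);
        return this;
    }

    /** Runs every step in order, feeding each output into the next step. */
    String transform(String content) {
        String current = content;
        for (Step step : steps) {
            current = step.work.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        TransformerChainSketch chain = new TransformerChainSketch()
                .then(new Step("application/msword", "application/pdf",
                        c -> "pdf(" + c + ")"))
                .then(new Step("application/pdf", "application/x-shockwave-flash",
                        c -> "swf(" + c + ")"));
        System.out.println(chain.transform("report")); // prints swf(pdf(report))
    }
}
```

The type check in `then` mirrors the point on the slide: a chain only makes sense when the intermediate format produced by one transformer is one the next can read.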
The Rendition Service (Alfresco 3.3+)
What, How, Why?

    Can turn content from one kind to another

    Or can just alter some content in the same format

    Used to manipulate images, eg crop and resize

    Used to generate HTML previews from .docx in the
    Web Quick Start

    Often uses the Content Transformation Service to
    do the actual heavy lifting

    The Thumbnail Service has been re-written to use
    the Rendition Service (all thumbnail actions now
    delegate to the Rendition Service)

    Renditions are all Actions
Apache Tika
Apache Tika – http://tika.apache.org/

    Apache Project which started in 2006

    Grew out of the Lucene community, now widely used
    in both Search and Content settings

    Provides detection of files – eg this binary blob is really
    a Word file

    Plain text, HTML and XHTML versions of a wide range
    of different file formats

    Consistent Metadata from different files

    Tika hides the complexity of different formats and their
    libraries, instead it's a simple, powerful API

    Easy to use and extend
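To give a feel for what "detection" means, here is a minimal sketch of magic-number sniffing in plain Java. It is not Tika's implementation, which knows far more formats and also uses file names and container inspection. The class name `MagicDetectionSketch` is hypothetical and the returned type labels are illustrative.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of magic-number detection; not how Tika itself is written.
final class MagicDetectionSketch {

    /** Guesses a type from the first few bytes of the content. */
    static String detect(byte[] data) {
        if (startsWith(data, new byte[] {'%', 'P', 'D', 'F'})) {
            return "application/pdf";
        }
        // OLE2 container: older Word, Excel and PowerPoint files all start like this
        if (startsWith(data, new byte[] {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0})) {
            return "application/x-tika-msoffice";
        }
        // Zip container: also used by OOXML, ODF, iWork and Epub
        if (startsWith(data, new byte[] {'P', 'K', 3, 4})) {
            return "application/zip";
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4 ...".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdf)); // prints application/pdf
    }
}
```

Note that the magic bytes alone only identify the container for Office and Zip based files. Telling a Word file from an Excel file needs a look inside, which is part of what a real detector adds.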

Any questions so far?




    ?
Metadata Extractor
     Service
Metadata Extractor – Java Use

    MetadataExtracterRegistry registry =
        (MetadataExtracterRegistry)
            context.getBean("metadataExtracterRegistry");

    ContentReader reader =
        contentService.getReader(nodeRef,
                                 ContentModel.PROP_CONTENT);

    // Pick the extracter registered for this mimetype
    MetadataExtracter extractor =
        registry.getExtracter(reader.getMimetype());

    Map<QName, Serializable> properties =
        new HashMap<QName, Serializable>();

    // Extracted values are added to the map
    extractor.extract(reader, properties);

    System.err.println(properties);
Metadata Extractor – JavaScript Use

    Full access is not directly available in JS

    You can't get at the extractor registry

    You can't get at the raw properties

    You can, however, easily trigger the extraction on
    a given node

    This is done via the Script Actions service

    JavaScript

    var action = actions.create("extract-metadata");
    action.execute(document);
Calling Apache Tika

    // Get a content detector, and an auto-selecting Parser

    // In Alfresco we already know the type, so we don’t need to Auto Detect!

    TikaConfig config = TikaConfig.getDefaultConfig();

    DefaultDetector detector = new DefaultDetector(
                                         config.getMimeRepository() );

    Parser parser = new AutoDetectParser(detector);


    // We’ll only want the plain text contents

    ContentHandler handler = new BodyContentHandler();



    // Tell the parser what we have

    Metadata metadata = new Metadata();

    metadata.set(Metadata.RESOURCE_NAME_KEY, filename);



    // Have it processed

    parser.parse(input, handler, metadata, new ParseContext());
Metadata Extractor – Mappings

    Mappings control how to turn the metadata an
    extractor produces into node properties

    Maps from extractor names to your content model

    Typically set in a properties file, one per extractor

    Can also be done in Spring when defining the bean

    An OverwritePolicy controls what happens when
    extracting for a 2nd (or subsequent) time

    One output from a metadata extractor can map to
    multiple properties on your node

    Not all outputs need to be mapped, some can be (and
    often are) ignored
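The mapping rules above can be sketched in plain Java. Everything here is a simplified assumption for illustration: property names are plain strings instead of QNames, `MetadataMappingSketch` is a hypothetical class, and the two-value `Overwrite` enum is a stand-in, not the real OverwritePolicy and its semantics.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of applying extractor mappings; not the Alfresco implementation.
final class MetadataMappingSketch {

    /** Simplified stand-in for an overwrite policy. */
    enum Overwrite { ALWAYS, ONLY_IF_MISSING }

    /** Copies mapped extractor outputs onto the node properties. */
    static void apply(Map<String, String> extracted,
                      Map<String, List<String>> mapping,
                      Overwrite policy,
                      Map<String, String> nodeProperties) {
        for (Map.Entry<String, String> entry : extracted.entrySet()) {
            List<String> targets = mapping.get(entry.getKey());
            if (targets == null) {
                continue; // unmapped outputs are simply ignored
            }
            // One extractor output may feed several node properties
            for (String property : targets) {
                if (policy == Overwrite.ALWAYS || !nodeProperties.containsKey(property)) {
                    nodeProperties.put(property, entry.getValue());
                }
            }
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> mapping = new HashMap<>();
        mapping.put("author", Arrays.asList("cm:author"));
        mapping.put("title", Arrays.asList("cm:title", "cm:description"));

        Map<String, String> extracted = new HashMap<>();
        extracted.put("author", "Nick");
        extracted.put("title", "Annual Report");
        extracted.put("producer", "Some PDF Tool"); // not mapped, so dropped

        Map<String, String> node = new HashMap<>();
        node.put("cm:title", "Existing Title");

        apply(extracted, mapping, Overwrite.ONLY_IF_MISSING, node);
        System.out.println(node);
    }
}
```

With `ONLY_IF_MISSING` the existing `cm:title` survives a second extraction, while `cm:description` and `cm:author` are filled in. Switching to `ALWAYS` would replace the title too.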
Geo Content Model (cm:geographic)

    <aspect name="cm:geographic">

       <title>Geographic</title>

       <properties>

          <property name="cm:latitude">

             <title>Latitude</title>

             <type>d:double</type>

          </property>

          <property name="cm:longitude">

             <title>Longitude</title>

             <type>d:double</type>

          </property>

       </properties>

    </aspect>
Metadata Extractor – Geo Mapping

    # Namespaces

    namespace.prefix.cm=http://www.alfresco.org/model/content/1.0


    # Geo Mappings

    # Note – escape : in metadata keys inside properties files!

    geo\:lat=cm:latitude

    geo\:long=cm:longitude


    # Normal Mappings

    author=cm:author

    title=cm:title

    description=cm:description

    created=cm:created
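The note about escaping ":" can be checked with a few lines of plain Java, assuming the mapping file is read with the standard `java.util.Properties` rules. An unescaped colon ends the key, so the key has to be written with a backslash before the colon. The class name `MappingPropertiesSketch` is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Sketch showing why ':' must be escaped in property keys.
final class MappingPropertiesSketch {

    /** Parses properties-file text using the standard java.util.Properties rules. */
    static Properties load(String text) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    public static void main(String[] args) {
        // In the file this line reads: geo\:lat=cm:latitude
        Properties escaped = load("geo\\:lat=cm:latitude\n");
        // In the file this line reads: geo:lat=cm:latitude
        Properties unescaped = load("geo:lat=cm:latitude\n");

        System.out.println(escaped.getProperty("geo:lat"));   // prints cm:latitude
        System.out.println(unescaped.getProperty("geo:lat")); // prints null
        System.out.println(unescaped.getProperty("geo"));     // prints lat=cm:latitude
    }
}
```

Colons in the value, such as `cm:latitude` or the namespace URL, need no escaping. Only the key side is affected.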
Demo

Geo-Tagged Image Upload
Demo

   Tika + Geo-Tagged Images
java -jar tika-app-1.0-SNAPSHOT.jar --metadata geotagged.jpg

date: 2009-08-11T09:09:45
exif:DateTimeOriginal: 2009-08-11T09:09:45
exif:ExposureTime: 6.25E-4
exif:FNumber: 5.6
exif:Flash: false
exif:FocalLength: 194.0
exif:IsoSpeedRatings: 400
geo:lat: 12.54321
geo:long: -54.1234
subject: canon-55-250
tiff:BitsPerSample: 8
tiff:ImageLength: 68
tiff:ImageWidth: 100
tiff:Make: Canon
tiff:Model: Canon EOS 40D
tiff:ResolutionUnit: Inch
tiff:Software: Adobe Photoshop CS3 Macintosh
tiff:XResolution: 240.0
tiff:YResolution: 240.0
Demo

Audio File Upload
Ways to Customise and Extend

    Customise

    Identify already available metadata of interest

    Define a content model for this

    Add mappings

    tika-app.jar can be very helpful here


    Extend

    Locate/Write library or program to read file format

    Write either Tika Plugin, or whole Extractor

    Define mappings

    http://blogs.alfresco.com/wp/nickb/ has more
Content Transformer
      Service
Out-of-the-box Transformations

    These are the main ones, there are others

    Plain Text, HTML & XHTML for all Apache Tika
    supported text and document formats (~30)

    PDF to Image and SWF (thumbnails and previews)

    Office File Formats to PDF (via Open Office direct /
    JODConverter in Enterprise)

    Plain Text and XML to PDF

    Zip listing to Text

    Image to other Images (via ImageMagick)

    With FFMpeg, video transforms and thumbnails

    Can chain transformers together, eg text preview
    via txt -> pdf -> swf
Checking Supported Transformations

    Checking active Transformations and Extractors

    New webscript in 3.4 exposes information on the
    available transformers and extractors

    http://localhost:8080/alfresco/service/mimetypes

    Shows live information as of when the page is
    requested

    As transforms come and go (eg OpenOffice dies),
    the list will show what's currently active

    Only shows the current transformer, not inactive or
    lower preference ones

    Includes information on transformations both from
    and to each mimetype, plus the metadata extractor
Demo

Mimetype Information WebScript
Content Transformer – Java Use

    ContentTransformerRegistry registry =
        (ContentTransformerRegistry)
            context.getBean("contentTransformerRegistry");

    ContentTransformer transformer = registry.
      getTransformer("application/vnd.ms-excel", "text/csv",
                     new TransformationOptions());

    ContentReader reader =
        contentService.getReader(sourceNodeRef,
                                 ContentModel.PROP_CONTENT);

    // true = update the node's content property when the write completes
    ContentWriter writer =
        contentService.getWriter(destNodeRef,
                                 ContentModel.PROP_CONTENT, true);
    // The transformer needs to know the target type of the writer
    writer.setMimetype("text/csv");

    transformer.transform(reader, writer);
Content Transformer – JavaScript Use

    Full access is not directly available in JS

    You can't get at the transformer registry

    You can't control which property is transformed,
    it's always Content

    You can, however, easily trigger the
    transformation of a given node

    This is done via the Script Actions service

    JavaScript

    var action = actions.create("transform");
    // Transform into the same folder
    action.parameters["destination-folder"] = document.parent;
    action.parameters["assoc-type"] =
        "{http://www.alfresco.org/model/content/1.0}contains";
    action.parameters["assoc-name"] = document.name + "transformed";
    action.parameters["mime-type"] = "text/html";
    // Execute
    action.execute(document);
Custom Command Line Transformer
<bean id="transformer.worker.helloWorldCMD"
   class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
  <property name="mimetypeService"><ref bean="mimetypeService"/></property>
  <property name="transformCommand">
    <bean class="org.alfresco.util.exec.RuntimeExec">
      <property name="commandsAndArguments"><map>
         <entry key=".*"><list>
           <value>/bin/bash</value>
           <value>-c</value>
           <value>/bin/echo 'Hello World - ${source}' &gt; ${target}</value>
         </list></entry>
      </map></property>
      <property name="errorCodes"><value>1,127</value></property>
    </bean>
  </property>
  <property name="explicitTransformations">
     <list><bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
        <property name="sourceMimetype"><value>text/plain</value></property>
        <property name="targetMimetype"><value>hello/world</value></property>
     </bean></list>
  </property>
</bean>

<bean id="transformer.helloWorldCMD" 
class="org.alfresco.repo.content.transform.ProxyContentTransformer"
   parent="baseContentTransformer">
  <property name="worker"><ref bean="transformer.worker.helloWorldCMD"/></property>
</bean>
Content Transformers and Tika

    Tika generates HTML-like SAX events as it parses

    Uses Java SAX API

    Events can be captured or transformed

    The Body Content Handler is used for plain text

    Both HTML and XHTML are available

    You can customise with your own handler, with
    XSLT or with E4X from JavaScript

    Text Indexing just uses a Body Content Handler

    The Excel to CSV transformer has a text altering
    SAX handler

    The Web Quick Start Word→HTML transformer alters
    text, tags and embedded resources
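As a rough illustration of capturing the SAX events, the sketch below uses only the JDK's SAX API. It keeps the character data found inside `body` and ignores the rest, which is the general idea behind a body content handler. `BodyTextHandlerSketch` is a hypothetical class, not Tika's BodyContentHandler, and the events come from parsing a small XHTML string, since no Tika parser is involved.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of a SAX handler that keeps only the body text.
final class BodyTextHandlerSketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inBody = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("body".equals(qName)) {
            inBody = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("body".equals(qName)) {
            inBody = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inBody) {
            text.append(ch, start, length); // tags are dropped, text is kept
        }
    }

    String getText() {
        return text.toString();
    }

    /** Feeds XHTML through the JDK SAX parser and returns the captured body text. */
    static String extractBodyText(String xhtml) {
        try {
            BodyTextHandlerSketch handler = new BodyTextHandlerSketch();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getText();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Hello</title></head>"
                + "<body><h1>Hello World!</h1><p>Some text</p></body></html>";
        System.out.println(extractBodyText(xhtml)); // prints Hello World!Some text
    }
}
```

A handler that alters the output, as the Excel to CSV case does, would follow the same shape but react to specific elements, for example by emitting a comma at the end of each table cell.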
Tika Plugins

    Tika ships with Parsers for a wide range of file
    formats as standard

    All of these Parsers depend on libraries that are
    Apache Licensed or similar

    For other Parsers, Tika provides a mechanism for
    having the Parser auto-loaded

    Typically used by GPL or Proprietary plugins

    Great way to have your custom formats handled

    Alfresco will auto-load these if available

    Current list of known third party plugins is:
http://wiki.apache.org/tika/3rd%20party%20parser%20plugins
Custom Tika Plugins

    Writing a new Tika Plugin is very straightforward

    Only 2 methods needed – getSupportedTypes to list
    which mimetypes you support, and parse

    Magic file used for detecting new plugins is
META-INF/services/org.apache.tika.parser.Parser

    With the service file, the Tika Auto-Detect parser
    will load and use the parser

    Without it, you can explicitly configure it into
    Alfresco via
    TikaSpringConfiguredContentTransformer

    Very easy way to add indexing and metadata
    support for custom file formats
Custom Tika Parser – “Hello World”
import java.io.InputStream;
import java.util.HashSet;
import java.util.Set;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AbstractParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class HelloWorldParser extends AbstractParser {
    // The mimetypes this parser claims to handle
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        Set<MediaType> types = new HashSet<MediaType>();
        types.add(MediaType.parse("hello/world"));
        return types;
    }

    // Ignores the input and always emits the same XHTML and metadata
    public void parse(InputStream stream, ContentHandler handler,
            Metadata metadata, ParseContext context) throws SAXException {
        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        // Document Heading
        xhtml.element("h1", "Hello World!");
        // To prove this worked, add some extra text to search for
        xhtml.startElement("p");
        xhtml.characters("To show that this went via the parser, we have ");
        xhtml.characters("some special text that we can search for. ");
        xhtml.characters("BADGER BADGER BADGER BADGER BADGER ");
        xhtml.characters("BADGER BADGER BADGER MUSHROOM MUSHROOM ");
        xhtml.endElement("p");
        // All Done
        xhtml.endDocument();

        metadata.set("hello", "world");
        metadata.set("title", "Hello World!");
        metadata.set("custom1", "Hello, Custom Metadata 1!");
        metadata.set("custom2", "Hello, Custom Metadata 2!");
    }
}
Demo

    “Hello World” Transformer Round-
                   Trip
var action = actions.create("transform");
action.parameters["destination-folder"] = document.parent;
action.parameters["assoc-type"] =
    "{http://www.alfresco.org/model/content/1.0}contains";
action.parameters["assoc-name"] = document.name + "HW";

if(document.mimetype == "hello/world") {
   // It's currently a "Hello World" file
   // Use Apache Tika to create a plain text version
   action.parameters["mime-type"] = "text/plain";
} else {
   // It's a regular new text file
   // Have the command line tool make a "Hello World" version
   action.parameters["mime-type"] = "hello/world";
}

action.execute(document);
Demo

           Excel to HTML, CSV and
                     Text
var nameBase = document.name.substring(0, document.name.lastIndexOf("."));
var action = actions.create("transform");
action.parameters["destination-folder"] = document.parent;
action.parameters["assoc-type"] =
    "{http://www.alfresco.org/model/content/1.0}contains";

action.parameters["assoc-name"] = nameBase + ".txt";
action.parameters["mime-type"] = "text/plain";
action.execute(document);

action.parameters["assoc-name"] = nameBase + ".csv";
action.parameters["mime-type"] = "text/csv";
action.execute(document);

action.parameters["assoc-name"] = nameBase + ".html";
action.parameters["mime-type"] = "text/xml";
action.execute(document);
Rendition Service
Standard Rendition Engines

    Renditions Supported in Alfresco v4.0


    reformat – access to the Content Transformation
    Service

    image – crop, resize, etc

    freemarker – runs a Freemarker Template against the
    content of the node

    html – turns .docx files into clean HTML + images

    xslt – runs an XSLT Transformation against the content
    of the node, XML content nodes only!

    composite – execute several renditions in a series, eg
    reformat followed by image crop
Persisted and Transient Definitions
For Complicated or Simple Renditions
• To run a rendition, first create a Rendition Definition
  for a given Rendering Engine
• Next, set your parameters on the definition
• Finally, execute this against a source node

• For very complicated, or very commonly used
  renditions, you don't want to have to create these
  definitions every time
• Instead, save them to the Data Dictionary, and load
  via the Rendition Service on demand
• Rendition Service provides Save and Load methods
Rendition Service – Call from Java
// Retrieve the existing Rendition Definition
QName renditionName = QName.createQName(
             NamespaceService.CONTENT_MODEL_1_0_URI, "myRendDefn");
RenditionDefinition renditionDef =
             renditionService.loadRenditionDefinition(renditionName);

// Make some changes.
renditionDef.setParameterValue(AbstractRenderingEngine.PARAM_MIME_TYPE,
                                MimetypeMap.MIMETYPE_PDF);
renditionDef.setParameterValue(
                 RenditionService.PARAM_ORPHAN_EXISTING_RENDITION, true);

// Persist the changes.
renditionService.saveRenditionDefinition(renditionDef);

// Run the Rendition
ChildAssociationRef assoc = renditionService.render(
                                      sourceNode, renditionDef);
Renditions from JavaScript
// Crop the image and place in a specified location
var renditionDef =
    renditionService.createRenditionDefinition(
            "cm:cropResize", "imageRenderingEngine");
renditionDef.parameters["destination-path-template"] =
     "/Company Home/Cropped Images/${name}.jpg";
renditionDef.parameters["isAbsolute"] = true;
renditionDef.parameters["xsize"] = 50;
renditionDef.parameters["ysize"] = 50;

renditionService.render(nodeRef, renditionDef);

var renditions = renditionService.getRenditions(nodeRef);
Rendition Service – More Ways to Call
Actions, Rules, CMIS
• Renditions are Actions, but by default hidden
• Don't show up in Share when defining Rules
• Don't show up in Explorer for Run Custom Action

• They are available from Java and JS
• Solution – create JS script to call the Rendition, then
  run that script from your Rule / from Explorer
• No dedicated REST API is available
• Renditions show up in CMIS
• Or you can use standard Action and Node APIs
Custom Rendition Engines
   For when a composite just isn't enough...

• Rendition Engines are just a special kind of Action
  Executor, within the Action Framework
• If you know how to write Custom Actions, you can
  write your own Rendering Engine!
• org.alfresco.repo.rendition.executor.AbstractRenderingEngine
  provides a helpful superclass, with handy methods

   See the Actions talk for more on Custom Actions
    and Custom Action Executors!
Demo

  Crop and Resize an Image

          (Using Share Rules)
var renditionDef = renditionService.createRenditionDefinition(
"cm:cropResize", "imageRenderingEngine");
renditionDef.parameters["destination-path-template"] =
               "/Company Home/Cropped Images/${name}.jpg";
renditionDef.parameters["isAbsolute"] = true;
renditionDef.parameters["xsize"] = 50;
renditionDef.parameters["ysize"] = 50;
renditionDef.parameters["percent_crop"] = true;
renditionDef.parameters["crop-width"] = 75;
renditionDef.parameters["crop-height"] = 60;
renditionDef.parameters["crop_x"] = 20;
renditionDef.parameters["crop_y"] = 150;
renditionDef.execute(document);
Demo

Video Thumbnailing and
       Rendition
Demo

Word .docx → HTML & Images

   (Uses Web Quick Start)
Metadata Extraction, Content
Transformations and Renditions
Any Questions?




?
Learn More
• http://wiki.alfresco.com/wiki/Metadata_Extraction
• http://wiki.alfresco.com/wiki/Content_Transformations
• http://wiki.alfresco.com/wiki/Content_Transformation_and_Metadata_Extraction_with_Apache_Tika
• http://wiki.alfresco.com/wiki/Rendition_Service
• http://blogs.alfresco.com/wp/nickb/
• twitter: @Alfresco, @Gagravarr

Alfresco Day Benelux Hogeschool Inholland Records Management application
 
Alfresco Day BeNelux: Customer Success Showcase - Saxion Hogescholen
Alfresco Day BeNelux: Customer Success Showcase - Saxion HogescholenAlfresco Day BeNelux: Customer Success Showcase - Saxion Hogescholen
Alfresco Day BeNelux: Customer Success Showcase - Saxion Hogescholen
 
Alfresco Day BeNelux: Customer Success Showcase - Gemeente Amsterdam
Alfresco Day BeNelux: Customer Success Showcase - Gemeente AmsterdamAlfresco Day BeNelux: Customer Success Showcase - Gemeente Amsterdam
Alfresco Day BeNelux: Customer Success Showcase - Gemeente Amsterdam
 
Alfresco Day BeNelux: The success of Alfresco
Alfresco Day BeNelux: The success of AlfrescoAlfresco Day BeNelux: The success of Alfresco
Alfresco Day BeNelux: The success of Alfresco
 
Alfresco Day BeNelux: Customer Success Showcase - Credendo Group
Alfresco Day BeNelux: Customer Success Showcase - Credendo GroupAlfresco Day BeNelux: Customer Success Showcase - Credendo Group
Alfresco Day BeNelux: Customer Success Showcase - Credendo Group
 
Alfresco Day BeNelux: Digital Transformation - It's All About Flow
Alfresco Day BeNelux: Digital Transformation - It's All About FlowAlfresco Day BeNelux: Digital Transformation - It's All About Flow
Alfresco Day BeNelux: Digital Transformation - It's All About Flow
 
Alfresco Day Vienna 2016: Activiti – ein Katalysator für die DMS-Strategie be...
Alfresco Day Vienna 2016: Activiti – ein Katalysator für die DMS-Strategie be...Alfresco Day Vienna 2016: Activiti – ein Katalysator für die DMS-Strategie be...
Alfresco Day Vienna 2016: Activiti – ein Katalysator für die DMS-Strategie be...
 
Alfresco Day Vienna 2016: Elektronische Geschäftsprozesse auf Basis von Alfre...
Alfresco Day Vienna 2016: Elektronische Geschäftsprozesse auf Basis von Alfre...Alfresco Day Vienna 2016: Elektronische Geschäftsprozesse auf Basis von Alfre...
Alfresco Day Vienna 2016: Elektronische Geschäftsprozesse auf Basis von Alfre...
 
Alfresco Day Vienna 2016: Alfrescos neue Rest API
Alfresco Day Vienna 2016: Alfrescos neue Rest APIAlfresco Day Vienna 2016: Alfrescos neue Rest API
Alfresco Day Vienna 2016: Alfrescos neue Rest API
 
Alfresco Day Vienna 2016: Support Tools für die Admin-Konsole
Alfresco Day Vienna 2016: Support Tools für die Admin-KonsoleAlfresco Day Vienna 2016: Support Tools für die Admin-Konsole
Alfresco Day Vienna 2016: Support Tools für die Admin-Konsole
 
Alfresco Day Vienna 2016: Entwickeln mit Alfresco
Alfresco Day Vienna 2016: Entwickeln mit AlfrescoAlfresco Day Vienna 2016: Entwickeln mit Alfresco
Alfresco Day Vienna 2016: Entwickeln mit Alfresco
 
Alfresco Day Vienna 2016: Activiti goes enterprise: Die Evolution der BPM Sui...
Alfresco Day Vienna 2016: Activiti goes enterprise: Die Evolution der BPM Sui...Alfresco Day Vienna 2016: Activiti goes enterprise: Die Evolution der BPM Sui...
Alfresco Day Vienna 2016: Activiti goes enterprise: Die Evolution der BPM Sui...
 
Alfresco Day Vienna 2016: Partner Lightning Talk: Westernacher
Alfresco Day Vienna 2016: Partner Lightning Talk: WesternacherAlfresco Day Vienna 2016: Partner Lightning Talk: Westernacher
Alfresco Day Vienna 2016: Partner Lightning Talk: Westernacher
 
Alfresco Day Vienna 2016: Bringing Content & Process together with the App De...
Alfresco Day Vienna 2016: Bringing Content & Process together with the App De...Alfresco Day Vienna 2016: Bringing Content & Process together with the App De...
Alfresco Day Vienna 2016: Bringing Content & Process together with the App De...
 
Alfresco Day Vienna 2016: Partner Lightning Talk - it-novum
Alfresco Day Vienna 2016: Partner Lightning Talk - it-novumAlfresco Day Vienna 2016: Partner Lightning Talk - it-novum
Alfresco Day Vienna 2016: Partner Lightning Talk - it-novum
 
Alfresco Day Vienna 2016: How to Achieve Digital Flow in the Enterprise - Joh...
Alfresco Day Vienna 2016: How to Achieve Digital Flow in the Enterprise - Joh...Alfresco Day Vienna 2016: How to Achieve Digital Flow in the Enterprise - Joh...
Alfresco Day Vienna 2016: How to Achieve Digital Flow in the Enterprise - Joh...
 
Alfresco Day Warsaw 2016 - Czy możliwe jest spełnienie wszystkich regulacji p...
Alfresco Day Warsaw 2016 - Czy możliwe jest spełnienie wszystkich regulacji p...Alfresco Day Warsaw 2016 - Czy możliwe jest spełnienie wszystkich regulacji p...
Alfresco Day Warsaw 2016 - Czy możliwe jest spełnienie wszystkich regulacji p...
 
Alfresco Day Warsaw 2016: Identyfikacja i podpiselektroniczny - Safran
Alfresco Day Warsaw 2016: Identyfikacja i podpiselektroniczny - SafranAlfresco Day Warsaw 2016: Identyfikacja i podpiselektroniczny - Safran
Alfresco Day Warsaw 2016: Identyfikacja i podpiselektroniczny - Safran
 
Alfresco Day Warsaw 2016: Advancing the Flow of Digital Business
Alfresco Day Warsaw 2016: Advancing the Flow of Digital BusinessAlfresco Day Warsaw 2016: Advancing the Flow of Digital Business
Alfresco Day Warsaw 2016: Advancing the Flow of Digital Business
 

Kürzlich hochgeladen

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Kürzlich hochgeladen (20)

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 

PLAT-13 Metadata Extraction and Transformation

  • 1. Metadata Extraction, Content Transformations and Renditions Nick Burch • Senior Engineer, Alfresco • twitter: @Gagravarr
  • 2. Introduction: 3 Content Related Services
      • Metadata Extractor
      • Content Transformer
      • Renditions
      Covering:
      • Service Uses
      • Interfaces
      • Calling the Service
      • Java & JS APIs
      • Demos
      • Configuration
      • Extending
      • Apache Tika
  • 3. Why Now? Aren't these old Services?
      • The Metadata Extractor and Content Transformer are core repository services
      • They've been around since the early days
      • For a long time, not a lot changed with them: "They're boring and just work...."
      • In Alfresco 3.4 we added support for delegating some of the work to Apache Tika
      • This has led to a large improvement in the number of file formats that are supported!
      • Renditions arrived in Alfresco 3.3
  • 4. What did Alfresco 3.3 Support?
      • PDF
      • Word, PowerPoint, Excel
      • HTML
      • Open Document Formats (OpenOffice)
      • RFC822 Email
      • Outlook .msg Email
      • And that's it...
  • 5. Supported Formats in Alfresco 4.0
      • Audio – WAV, RIFF, MIDI
      • DWG and PRT (CAD Formats)
      • Epub
      • RSS and ATOM Feeds
      • True Type Fonts
      • HTML
      • Images – JPEG, GIF, PNG, TIFF, Bitmap (includes EXIF Metadata where present)
  • 6. Alfresco 4.0 Formats – Continued
      • iWork (Keynote, Pages, Numbers)
      • RFC822 MBox Mail
      • Microsoft Outlook .msg Email
      • Microsoft Office (Binary) – Word, PowerPoint, Excel, Visio, Publisher, Works
      • Microsoft Office (OOXML, 2007+) – Word, PowerPoint, Excel
      • Open Document Format (OpenOffice)
      • Old-style OpenOffice (.sxw etc)
  • 7. Alfresco 4.0 Formats – Still Continued
      • MP3 (id3 v1 and v2)
      • Ogg Vorbis and FLAC
      • CDF, HDF (Scientific Data)
      • RDF
      • RTF
      • PDF
      • Adobe Illustrator (PDF based)
      • Adobe PSD (expected shortly)
      • Plain Text
  • 8. Alfresco 4.0 Formats – Final Set!
      • Zip, Tar, Compress etc (Archive Formats)
      • FLV Video
      • XML
      • Java Class Files
      • CHM (Windows Help Files)
      • Configurable External Programs
      • And probably some others too!
  • 10. The Metadata Extractor Service: What, How, Why?
      • For a given piece of content, returns the Metadata held within that
      • Document Metadata is converted into the content model
      • Typically used with uploaded binary files
      • Upload a PDF, extract out the Title and Description, save these as the properties on the Alfresco Node
      • Powered internally by a number of different extractors
      • Service picks the appropriate extractor for you
      • Since Alfresco 3.4, makes heavy use of Apache Tika
  • 11. The Content Transformer Service: What, How, Why?
      • Transforms content from one format to another
      • Driven by source and destination mime types
      • Used to generate plain text versions for indexing
      • Used to generate SWF versions for preview
      • Used to generate PDF versions for web download
      • Powered by a large number of different transformers internally
      • Transformers can be chained together, eg .doc → .pdf via OpenOffice, then .pdf → .swf via pdf2swf
      • Since Alfresco 3.4, makes heavy use of Apache Tika
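
The chaining idea on this slide can be sketched in a few lines of plain Java. This is only an illustration of selecting a direct transformer or a two-step chain by source and destination mime type; the class `TransformerChainSketch`, its `Step` type and `findChain` method are hypothetical names invented here, not part of the Alfresco API, and the real service also weighs preferences and availability.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

/** Hypothetical, simplified model of chaining transformers by mimetype (not Alfresco code). */
final class TransformerChainSketch {

    /** One transformation step, e.g. application/msword -> application/pdf. */
    static final class Step {
        final String name;
        final String source;
        final String target;

        Step(String name, String source, String target) {
            this.name = name;
            this.source = source;
            this.target = target;
        }
    }

    /**
     * Returns a single direct step if one exists, otherwise the first
     * two-step chain through an intermediate mimetype, otherwise empty.
     */
    static Optional<List<Step>> findChain(List<Step> available, String source, String target) {
        // Prefer a direct transformation when one is registered
        for (Step step : available) {
            if (step.source.equals(source) && step.target.equals(target)) {
                return Optional.of(Collections.singletonList(step));
            }
        }
        // Otherwise look for source -> intermediate -> target
        for (Step first : available) {
            if (!first.source.equals(source)) {
                continue;
            }
            for (Step second : available) {
                if (second.source.equals(first.target) && second.target.equals(target)) {
                    return Optional.of(Arrays.asList(first, second));
                }
            }
        }
        return Optional.empty();
    }
}
```

With an "OpenOffice" step (Word to PDF) and a "pdf2swf" step (PDF to SWF) registered, asking for Word to SWF would return both steps in order, mirroring the example on the slide.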
  • 12. The Rendition Service (Alfresco 3.3+): What, How, Why?
      • Can turn content from one kind to another
      • Or can just alter some content in the same format
      • Used to manipulate images, eg crop and resize
      • Used to generate HTML previews from .docx in the Web Quick Start
      • Often uses the Content Transformation Service to do the actual heavy lifting
      • The Thumbnail Service has been re-written to use the Rendition Service (all thumbnail actions now delegate to the Rendition Service)
      • Renditions are all Actions
  • 13. Apache Tika
      Apache Tika – http://tika.apache.org/
      • Apache Project which started in 2006
      • Grew out of the Lucene community, now widely used in both Search and Content settings
      • Provides detection of files – eg this binary blob is really a word file
      • Plain text, HTML and XHTML versions of a wide range of different file formats
      • Consistent Metadata from different files
      • Tika hides the complexity of different formats and their libraries, instead it's a simple, powerful API
      • Easy to use and extend
  • 16. Metadata Extractor – Java Use
      MetadataExtracterRegistry registry = (MetadataExtracterRegistry)
          context.getBean("metadataExtracterRegistry");
      ContentReader reader = contentService.getReader(nodeRef, ContentModel.PROP_CONTENT);
      MetadataExtracter extractor = registry.getExtracter(reader.getMimetype());
      Map<QName, Serializable> properties = new HashMap<QName, Serializable>();
      extractor.extract(reader, properties);
      System.err.println(properties);
  • 17. Metadata Extractor – JavaScript Use
      • Full access is not directly available in JS
      • You can't get at the extractor registry
      • You can't get at the raw properties
      • You can, however, easily trigger the extraction on a given node
      • This is done via the Script Actions service
      JavaScript:
      var action = actions.create("extract-metadata");
      action.execute(document);
  • 18. Calling Apache Tika
      // Get a content detector, and an auto-selecting Parser
      // In Alfresco we already know the type, so we don't need to Auto Detect!
      TikaConfig config = TikaConfig.getDefaultConfig();
      DefaultDetector detector = new DefaultDetector(config.getMimeRepository());
      Parser parser = new AutoDetectParser(detector);
      // We'll only want the plain text contents
      ContentHandler handler = new BodyContentHandler();
      // Tell the parser what we have
      Metadata metadata = new Metadata();
      metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
      // Have it processed
      parser.parse(input, handler, metadata, new ParseContext());
  • 19. Metadata Extractor – Mappings
      • Mappings control how to turn the metadata an extractor produces into node properties
      • Maps from extractor names to your content model
      • Typically set in a properties file, one per extractor
      • Can also be done in Spring when defining the bean
      • An OverwritePolicy controls what happens when extracting for a 2nd (or subsequent) time
      • One output from a metadata extractor can map to multiple properties on your node
      • Not all outputs need to be mapped, some can be (and often are) ignored
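
The mapping rules on this slide (one output to many properties, unmapped outputs ignored, a policy for repeat extractions) can be sketched with plain Java maps. Everything named here is hypothetical: `MetadataMappingSketch`, its `apply` method and the two-value `Overwrite` enum are simplifications made up for illustration, and they do not reproduce the values or behaviour of Alfresco's real OverwritePolicy.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of applying extractor-to-model mappings (not Alfresco code). */
final class MetadataMappingSketch {

    /** Simplified stand-in for an overwrite policy; the names are assumptions. */
    enum Overwrite { ALWAYS, ONLY_IF_MISSING }

    /**
     * Copies raw extractor outputs onto node properties.
     *
     * @param raw      metadata as produced by an extractor, keyed by extractor name
     * @param mappings extractor name to one or more target property names
     * @param existing properties already on the node from an earlier extraction
     * @param policy   whether existing values may be replaced
     */
    static Map<String, String> apply(Map<String, String> raw,
                                     Map<String, List<String>> mappings,
                                     Map<String, String> existing,
                                     Overwrite policy) {
        Map<String, String> result = new LinkedHashMap<>(existing);
        for (Map.Entry<String, String> output : raw.entrySet()) {
            List<String> targets = mappings.get(output.getKey());
            if (targets == null) {
                continue; // unmapped outputs are simply ignored
            }
            for (String target : targets) {
                if (policy == Overwrite.ALWAYS || !result.containsKey(target)) {
                    result.put(target, output.getValue());
                }
            }
        }
        return result;
    }
}
```

For example, mapping `title` to both `cm:title` and `cm:description` fills both properties from a single extractor output, while an output with no mapping never reaches the node.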
  • 20. Geo Content Model (cm:geographic)
      <aspect name="cm:geographic">
        <title>Geographic</title>
        <properties>
          <property name="cm:latitude">
            <title>Latitude</title>
            <type>d:double</type>
          </property>
          <property name="cm:longitude">
            <title>Longitude</title>
            <type>d:double</type>
          </property>
        </properties>
      </aspect>
  • 21. Metadata Extractor – Geo Mapping
      # Namespaces
      namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
      # Geo Mappings
      # Note – escape : in metadata keys inside properties files!
      geo\:lat=cm:latitude
      geo\:long=cm:longitude
      # Normal Mappings
      author=cm:author
      title=cm:title
      description=cm:description
      created=cm:created
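
The note about escaping `:` in keys follows from how `java.util.Properties` parses a line: an unescaped `:` ends the key just as `=` does. The small helper below (the class name `PropertiesEscapeSketch` is made up for this illustration) loads a snippet of properties text so the two spellings can be compared.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Illustrative helper showing why ':' in a properties key needs a backslash. */
final class PropertiesEscapeSketch {

    /** Parses properties text exactly as java.util.Properties would read a file. */
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            // A StringReader should not fail, but the API declares IOException
            throw new IllegalStateException(e);
        }
        return props;
    }
}
```

Loading the line `geo\:lat=cm:latitude` yields the key `geo:lat` with the value `cm:latitude`. Loading `geo:lat=cm:latitude` without the backslash instead yields the key `geo` with the value `lat=cm:latitude`, so the mapping for `geo:lat` would silently go missing.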
  • 23. Demo: Tika + Geo-Tagged Images
      java -jar tika-app-1.0-SNAPSHOT.jar --metadata geotagged.jpg

      date: 2009-08-11T09:09:45
      exif:DateTimeOriginal: 2009-08-11T09:09:45
      exif:ExposureTime: 6.25E-4
      exif:FNumber: 5.6
      exif:Flash: false
      exif:FocalLength: 194.0
      exif:IsoSpeedRatings: 400
      geo:lat: 12.54321
      geo:long: -54.1234
      subject: canon-55-250
      tiff:BitsPerSample: 8
      tiff:ImageLength: 68
      tiff:ImageWidth: 100
      tiff:Make: Canon
      tiff:Model: Canon EOS 40D
      tiff:ResolutionUnit: Inch
      tiff:Software: Adobe Photoshop CS3 Macintosh
      tiff:XResolution: 240.0
      tiff:YResolution: 240.0
  • 25. Ways to Customise and Extend
      Customise
      • Identify already available metadata of interest
      • Define a content model for this
      • Add mappings
      • tika-app.jar can be very helpful here
      Extend
      • Locate/Write library or program to read file format
      • Write either Tika Plugin, or whole Extractor
      • Define mappings
      • http://blogs.alfresco.com/wp/nickb/ has more
  • 27. Out-of-the-box Transformations
      • These are the main ones, there are others
      • Plain Text, HTML & XHTML for all Apache Tika supported text and document formats (~30)
      • PDF to Image and SWF (thumbnails and previews)
      • Office File Formats to PDF (via Open Office direct / JODConverter in Enterprise)
      • Plain Text and XML to PDF
      • Zip listing to Text
      • Image to other Images (via ImageMagick)
      • With FFMpeg, video transforms and thumbnails
      • Can chain transformers together, eg text preview via txt -> pdf -> swf
  • 28. Checking Supported Transformations
      Checking active Transformations and Extractors
      • New webscript in 3.4 exposes information on the available transformers and extractors
      • http://localhost:8080/alfresco/service/mimetypes
      • Shows live information as of when the page is requested
      • As transforms come and go (eg OpenOffice dies), the list will show what's currently active
      • Only shows the current transformer, not inactive or lower preference ones
      • Includes information on transformations both from and to each mimetype, plus the metadata extractor
  • 30. Content Transformer – Java Use
      ContentTransformerRegistry registry = (ContentTransformerRegistry)
          context.getBean("contentTransformerRegistry");
      ContentTransformer transformer = registry.getTransformer(
          "application/vnd.ms-excel", "text/csv", new TransformationOptions());
      ContentReader reader = contentService.getReader(sourceNodeRef, ContentModel.PROP_CONTENT);
      ContentWriter writer = contentService.getWriter(destNodeRef, ContentModel.PROP_CONTENT);
      transformer.transform(reader, writer);
  • 31. Content Transformer – JavaScript Use
      • Full access is not directly available in JS
      • You can't get at the transformer registry
      • You can't control which property is transformed, it's always Content
      • You can, however, easily trigger the transformation of a given node
      • This is done via the Script Actions service
      JavaScript:
      var action = actions.create("transform");
      // Transform into the same folder
      action.parameters["destination-folder"] = document.parent;
      action.parameters["assoc-type"] =
          "{http://www.alfresco.org/model/content/1.0}contains";
      action.parameters["assoc-name"] = document.name + "transformed";
      action.parameters["mime-type"] = "text/html";
      // Execute
      action.execute(document);
  • 32. Custom Command Line Transformer
      <bean id="transformer.worker.helloWorldCMD"
            class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
        <property name="mimetypeService"><ref bean="mimetypeService"/></property>
        <property name="transformCommand">
          <bean class="org.alfresco.util.exec.RuntimeExec">
            <property name="commandsAndArguments"><map>
              <entry key=".*"><list>
                <value>/bin/bash</value>
                <value>-c</value>
                <value>/bin/echo 'Hello World - ${source}' &gt; ${target}</value>
              </list></entry>
            </map></property>
            <property name="errorCodes"><value>1,127</value></property>
          </bean>
        </property>
        <property name="explicitTransformations">
          <list><bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
            <property name="sourceMimetype"><value>text/plain</value></property>
            <property name="targetMimetype"><value>hello/world</value></property>
          </bean></list>
        </property>
      </bean>
      <bean id="transformer.helloWorldCMD"
            class="org.alfresco.repo.content.transform.ProxyContentTransformer"
            parent="baseContentTransformer">
        <property name="worker"><ref bean="transformer.worker.helloWorldCMD"/></property>
      </bean>
  • 33. Content Transformers and Tika
      • Tika generates HTML-like SAX events as it parses
      • Uses Java SAX API
      • Events can be captured or transformed
      • The Body Content Handler is used for plain text
      • Both HTML and XHTML are available
      • You can customise with your own handler, with XSLT or with E4X from JavaScript
      • Text Indexing just uses a Body Content Handler
      • The Excel to CSV transformer has a text altering SAX handler
      • The Web Quick Start Word→HTML transformer alters text, tags and embedded resources
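
To make "events can be captured" concrete, here is a minimal sketch that uses only the JDK's SAX API. The class `TextCollectingHandlerSketch` is a made-up, much simplified stand-in for the idea behind Tika's Body Content Handler: it keeps the character events and discards the tags. Unlike the real handler it does not restrict itself to the body element, and here the events come from the JDK's XML parser reading a well-formed XHTML string rather than from a Tika parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Simplified, hypothetical handler that keeps only the text from SAX events. */
final class TextCollectingHandlerSketch extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();

    /** Called by the parser for each run of character data between tags. */
    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    String text() {
        return text.toString();
    }

    /** Feeds well-formed XHTML through the JDK SAX parser and returns its text. */
    static String extractText(String xhtml) {
        TextCollectingHandlerSketch handler = new TextCollectingHandlerSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.text();
    }
}
```

A handler that altered the text as it passed through, as the slide says the Excel to CSV transformer does, would override the same `characters` callback along with `startElement` and `endElement`.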
  • 34. Tika Plugins
      • Tika ships with Parsers for a wide range of file formats as standard
      • All of these Parsers depend on libraries that are Apache Licensed or similar
      • For other Parsers, Tika provides a mechanism for having the Parser auto-loaded
      • Typically used by GPL or Proprietary plugins
      • Great way to have your custom formats handled
      • Alfresco will auto-load these if available
      • Current list of known third party plugins is: http://wiki.apache.org/tika/3rd%20party%20parser%20plugins
  • 35. Custom Tika Plugins
      • Writing a new Tika Plugin is very straightforward
      • Only 2 methods needed – getSupportedTypes to list which mimetypes you support, and parse
      • Magic file used for detecting new plugins is META-INF/services/org.apache.tika.parser.Parser
      • With the service file, the Tika Auto-Detect parser will load and use the parser
      • Without it, you can explicitly configure it into Alfresco via TikaSpringConfiguredContentTransformer
      • Very easy way to add indexing and metadata support for custom file formats
  • 36. Custom Tika Parser – "Hello World"
      import java.io.InputStream;
      import java.util.HashSet;
      import java.util.Set;
      import org.apache.tika.metadata.Metadata;
      import org.apache.tika.mime.MediaType;
      import org.apache.tika.parser.AbstractParser;
      import org.apache.tika.parser.ParseContext;
      import org.apache.tika.sax.XHTMLContentHandler;
      import org.xml.sax.ContentHandler;
      import org.xml.sax.SAXException;

      public class HelloWorldParser extends AbstractParser {
        public Set<MediaType> getSupportedTypes(ParseContext context) {
          Set<MediaType> types = new HashSet<MediaType>();
          types.add(MediaType.parse("hello/world"));
          return types;
        }

        public void parse(InputStream stream, ContentHandler handler,
                          Metadata metadata, ParseContext context) throws SAXException {
          XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
          xhtml.startDocument();
          // Document Heading
          xhtml.element("h1", "Hello World!");
          // To prove this worked, add some extra text to search for
          xhtml.startElement("p");
          xhtml.characters("To show that this went via the parser, we have ");
          xhtml.characters("some special text that we can search for. ");
          xhtml.characters("BADGER BADGER BADGER BADGER BADGER ");
          xhtml.characters("BADGER BADGER BADGER MUSHROOM MUSHROOM ");
          xhtml.endElement("p");
          // All Done
          xhtml.endDocument();
          metadata.set("hello", "world");
          metadata.set("title", "Hello World!");
          metadata.set("custom1", "Hello, Custom Metadata 1!");
          metadata.set("custom2", "Hello, Custom Metadata 2!");
        }
      }
  • 37. Demo: "Hello World" Transformer Round-Trip
      var action = actions.create("transform");
      action.parameters["destination-folder"] = document.parent;
      action.parameters["assoc-type"] =
          "{http://www.alfresco.org/model/content/1.0}contains";
      action.parameters["assoc-name"] = document.name + "HW";
      if (document.mimetype == "hello/world") {
         // It's currently a "Hello World" file
         // Use Apache Tika to create a plain text version
         action.parameters["mime-type"] = "text/plain";
      } else {
         // It's a regular new text file
         // Have the command line tool make a "Hello World" version
         action.parameters["mime-type"] = "hello/world";
      }
      action.execute(document);
  • 38. Demo: Excel to HTML, CSV and Text
      var nameBase = document.name.substring(0, document.name.lastIndexOf("."));
      var action = actions.create("transform");
      action.parameters["destination-folder"] = document.parent;
      action.parameters["assoc-type"] =
          "{http://www.alfresco.org/model/content/1.0}contains";

      action.parameters["assoc-name"] = nameBase + ".txt";
      action.parameters["mime-type"] = "text/plain";
      action.execute(document);

      action.parameters["assoc-name"] = nameBase + ".csv";
      action.parameters["mime-type"] = "text/csv";
      action.execute(document);

      action.parameters["assoc-name"] = nameBase + ".html";
      action.parameters["mime-type"] = "text/xml";
      action.execute(document);
  • 40. Standard Rendition Engines
      Renditions Supported in Alfresco v4.0
      • reformat – access to the Content Transformation Service
      • image – crop, resize, etc
      • freemarker – runs a Freemarker Template against the content of the node
      • html – turns .docx files into clean HTML + images
      • xslt – runs an XSLT Transformation against the content of the node, XML content nodes only!
      • composite – executes several renditions in series, eg reformat followed by image crop
  • 41. Persisted and Transient Definitions
      For Complicated or Simple Renditions
      • To run a rendition, first create a Rendition Definition for a given Rendering Engine
      • Next, set your parameters on the definition
      • Finally, execute this against a source node
      • For very complicated, or very commonly used renditions, you don't want to have to create these definitions every time
      • Instead, save them to the Data Dictionary, and load via the Rendition Service on demand
      • Rendition Service provides Save and Load methods
  • 42. Rendition Service – Call from Java
      // Retrieve the existing Rendition Definition
      QName renditionName = QName.createQName(
          NamespaceService.CONTENT_MODEL_1_0_URI, "myRendDefn");
      RenditionDefinition renditionDef =
          renditionService.loadRenditionDefinition(renditionName);
      // Make some changes.
      renditionDef.setParameterValue(
          AbstractRenderingEngine.PARAM_MIME_TYPE, MimetypeMap.MIMETYPE_PDF);
      renditionDef.setParameterValue(
          RenditionService.PARAM_ORPHAN_EXISTING_RENDITION, true);
      // Persist the changes.
      renditionService.saveRenditionDefinition(renditionDef);
      // Run the Rendition
      ChildAssociationRef assoc = renditionService.render(sourceNode, renditionDef);
  • 43. Renditions from JavaScript
      // Crop the image and place in a specified location
      var renditionDef = renditionService.createRenditionDefinition(
          "cm:cropResize", "imageRenderingEngine");
      renditionDef.parameters["destination-path-template"] =
          "/Company Home/Cropped Images/${name}.jpg";
      renditionDef.parameters["isAbsolute"] = true;
      renditionDef.parameters["xsize"] = 50;
      renditionDef.parameters["ysize"] = 50;
      renditionService.render(nodeRef, renditionDef);
      var renditions = renditionService.getRenditions(nodeRef);
  • 44. Rendition Service – More Ways to Call
      Actions, Rules, CMIS
      • Renditions are Actions, but by default hidden
      • Don't show up in Share when defining Rules
      • Don't show up in Explorer for Run Custom Action
      • They are available from Java and JS
      • Solution – create JS script to call the Rendition, then run that script from your Rule / from Explorer
      • No dedicated REST API is available
      • Renditions show up in CMIS
      • Or you can use standard Action and Node APIs
  • 45. Custom Rendition Engines
      For when a composite just isn't enough...
      • Rendition Engines are just a special kind of Action Executor, within the Action Framework
      • If you know how to write Custom Actions, you can write your own Rendering Engine!
      • org.alfresco.repo.rendition.executor.AbstractRenderingEngine provides a helpful superclass, with handy methods
      • See the Actions talk for more on Custom Actions and Custom Action Executors!
  • 46. Demo: Crop and Resize an Image (Using Share Rules)
      var renditionDef = renditionService.createRenditionDefinition(
          "cm:cropResize", "imageRenderingEngine");
      renditionDef.parameters["destination-path-template"] =
          "/Company Home/Cropped Images/${name}.jpg";
      renditionDef.parameters["isAbsolute"] = true;
      renditionDef.parameters["xsize"] = 50;
      renditionDef.parameters["ysize"] = 50;
      renditionDef.parameters["percent_crop"] = true;
      renditionDef.parameters["crop-width"] = 75;
      renditionDef.parameters["crop-height"] = 60;
      renditionDef.parameters["crop_x"] = 20;
      renditionDef.parameters["crop_y"] = 150;
      renditionDef.execute(document);
  • 48. Demo Word .docx → HTML & Images (Uses Web Quick Start)
  • 51. Learn More
      • http://wiki.alfresco.com/wiki/Metadata_Extraction
      • http://wiki.alfresco.com/wiki/Content_Transformations
      • http://wiki.alfresco.com/wiki/Content_Transformation_and_Metadata_Extraction_with_Apache_Tika
      • http://wiki.alfresco.com/wiki/Rendition_Service
      • http://blogs.alfresco.com/wp/nickb/
      • twitter: @Alfresco, @Gagravarr