In this session, we will look first at the rich metadata that documents in your repository have, how to control the mapping of this on to your content model, and some of the interesting things this can deliver. We’ll then move on to the content transformation and rendition services, and see how you can easily and powerfully generate a wide range of media from the content you already have. Finally, we’ll look at how to extend these services to support additional formats.
2. Introduction: 3 Content Related Services
Covering:
• Metadata Extractor
Service Uses
Interfaces
• Content Transformer
Calling the Service
• Renditions
Java & JS APIs
Demos
Configuration
Extending
Apache Tika
3. Why Now? Aren't these old Services?
The Metadata Extractor and Content
Transformer are core repository services
They've been around since the early days
For a long time, not a lot changed with them,
“They're boring and just work....”
In Alfresco 3.4 we added support for
delegating some of the work to Apache Tika
This has led to a large improvement in the
number of file formats that are supported!
Renditions arrived in Alfresco 3.3
4. What did Alfresco 3.3 Support?
PDF
Word, PowerPoint, Excel
HTML
Open Document Formats (OpenOffice)
RFC822 Email
Outlook .msg Email
And that's it...
5. Supported Formats in Alfresco 4.0
Audio – WAV, RIFF, MIDI
DWG and PRT (CAD Formats)
Epub
RSS and ATOM Feeds
True Type Fonts
HTML
Images – JPEG, GIF, PNG, TIFF, Bitmap
Includes EXIF Metadata where present
6. Alfresco 4.0 Formats - Continued
iWork (Keynote, Pages, Numbers)
RFC822 MBox Mail
Microsoft Outlook .msg Email
Microsoft Office (Binary) – Word,
PowerPoint, Excel, Visio, Publisher, Works
Microsoft Office (OOXML, 2007+) – Word,
PowerPoint, Excel
Open Document Format (OpenOffice)
Old-style OpenOffice (.sxw etc)
7. Alfresco 4.0 Formats – Still Continued
MP3 (id3 v1 and v2)
Ogg Vorbis and FLAC
CDF, HDF (Scientific Data)
RDF
RTF
PDF
Adobe Illustrator (PDF based)
Adobe PSD (expected shortly)
Plain Text
8. Alfresco 4.0 Formats – Final Set!
Zip, Tar, Compress etc (Archive Formats)
FLV Video
XML
Java Class Files
CHM (Windows Help Files)
Configurable External Programs
And probably some others too!
10. The Metadata Extractor Service
What, How, Why?
For a given piece of content, returns the Metadata
held within that
Document Metadata is converted into the content
model
Typically used with uploaded binary files
Upload a PDF, extract out the Title and Description,
save these as the properties on the Alfresco Node
Powered internally by a number of different
extractors
Service picks the appropriate extractor for you
Since Alfresco 3.4, makes heavy use of Apache Tika
11. The Content Transformer Service
What, How, Why?
Transforms content from one format to another
Driven by source and destination mime types
Used to generate plain text versions for indexing
Used to generate SWF versions for preview
Used to generate PDF versions for web download
Powered by a large number of different
transformers internally
Transformers can be chained together, eg .doc →
.pdf via OpenOffice, then .pdf → .swf via pdf2swf
Since Alfresco 3.4, makes heavy use of Apache Tika
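The chaining idea above can be sketched in plain Java. This is a minimal illustration only: `SketchTransformer` and `TransformerChainSketch` are hypothetical names invented for this example and are not part of the Alfresco API. The sketch looks for a direct transformer between two mimetypes first, and falls back to a two-step chain through an intermediate format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a transformer between two mimetypes. */
class SketchTransformer {
    final String name;
    final String source;
    final String target;

    SketchTransformer(String name, String source, String target) {
        this.name = name;
        this.source = source;
        this.target = target;
    }
}

/** Hypothetical registry for illustration only; not the Alfresco API. */
class TransformerChainSketch {
    private final List<SketchTransformer> transformers = new ArrayList<>();

    void register(SketchTransformer t) {
        transformers.add(t);
    }

    /** Returns a direct route if one exists, else a two-step chain, else an empty list. */
    List<SketchTransformer> findRoute(String source, String target) {
        // First preference: a single transformer that does the whole job
        for (SketchTransformer t : transformers) {
            if (t.source.equals(source) && t.target.equals(target)) {
                return Collections.singletonList(t);
            }
        }
        // Otherwise: look for two transformers that meet at an intermediate mimetype
        for (SketchTransformer first : transformers) {
            if (!first.source.equals(source)) {
                continue;
            }
            for (SketchTransformer second : transformers) {
                if (second.source.equals(first.target) && second.target.equals(target)) {
                    return Arrays.asList(first, second);
                }
            }
        }
        return Collections.emptyList();
    }
}
```

Registering a .doc to .pdf transformer and a .pdf to .swf transformer, then asking for .doc to .swf, yields the two-step route described on the slide.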
12. The Rendition Service (Alfresco 3.3+)
What, How, Why?
Can turn content from one kind to another
Or can just alter some content in the same format
Used to manipulate images, eg crop and resize
Used to generate HTML previews from .docx in the
Web Quick Start
Often uses the Content Transformation Service to
do the actual heavy lifting
The Thumbnail Service has been re-written to use
the Rendition Service (all thumbnail actions now
delegate to the Rendition Service)
Renditions are all Actions
13. Apache Tika
Apache Tika – http://tika.apache.org/
Apache Project which started in 2006
Grew out of the Lucene community, now widely used
in both Search and Content settings
Provides detection of files – eg this binary blob is really
a Word file
Plain text, HTML and XHTML versions of a wide range
of different file formats
Consistent Metadata from different files
Tika hides the complexity of different formats and their
libraries, instead it's a simple, powerful API
Easy to use and extend
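To give a feel for what file detection involves, here is a minimal sketch of magic-byte sniffing in plain Java. It is not how Tika is implemented, and `MagicSketch` and its `Kind` enum are hypothetical names for this example. It only checks three well-known signatures; a real detector goes much further, for instance by looking inside an OLE2 or Zip container to tell Word from Excel.

```java
/** Hypothetical illustration of magic-byte detection; not Tika's implementation. */
class MagicSketch {
    enum Kind { PDF, ZIP_CONTAINER, OLE2_CONTAINER, UNKNOWN }

    // Well-known file signatures found at the very start of a file
    private static final byte[] PDF_MAGIC = { '%', 'P', 'D', 'F' };
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };
    private static final byte[] OLE2_MAGIC = { (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0 };

    /** Classifies content by its leading bytes, ignoring any file name or extension. */
    static Kind detect(byte[] head) {
        if (startsWith(head, PDF_MAGIC)) {
            return Kind.PDF;
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return Kind.ZIP_CONTAINER;   // could be OOXML, ODF, Epub, a plain zip...
        }
        if (startsWith(head, OLE2_MAGIC)) {
            return Kind.OLE2_CONTAINER;  // could be binary Word, Excel, PowerPoint, .msg...
        }
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The container cases show why a library is worth having here: the first few bytes only narrow things down, and the rest of the work is format specific.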
17. Metadata Extractor – JavaScript Use
Full access is not directly available in JS
You can't get at the extractor registry
You can't get at the raw properties
You can, however, easily trigger the extraction on
a given node
This is done via the Script Actions service
JavaScript
var action = actions.create("extract-metadata");
action.execute(document);
18. Calling Apache Tika
// Get a content detector, and an auto-selecting Parser
// In Alfresco we already know the type, so we don’t need to Auto Detect!
TikaConfig config = TikaConfig.getDefaultConfig();
DefaultDetector detector = new DefaultDetector(
config.getMimeRepository() );
Parser parser = new AutoDetectParser(detector);
// We’ll only want the plain text contents
ContentHandler handler = new BodyContentHandler();
// Tell the parser what we have
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
// Have it processed
parser.parse(input, handler, metadata, new ParseContext());
19. Metadata Extractor – Mappings
Mappings control how to turn the metadata an
extractor produces into node properties
Maps from extractor names to your content model
Typically set in a properties file, one per extractor
Can also be done in Spring when defining the bean
An OverwritePolicy controls what happens when
extracting for a 2nd (or subsequent) time
One output from a metadata extractor can map to
multiple properties on your node
Not all outputs need to be mapped, some can be (and
often are) ignored
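The mapping rules above can be sketched in plain Java. Everything here is an assumption made for illustration: `MappingSketch`, `SketchOverwritePolicy` and the `name=prop1, prop2` line format are hypothetical, and the real Alfresco OverwritePolicy has its own values and semantics. The sketch shows one extractor output feeding several properties, unmapped outputs being ignored, and a policy deciding whether an existing property is replaced on a second extraction.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical illustration of extractor-to-model mappings; not the Alfresco API. */
class MappingSketch {
    /** Simplified, made-up policy for this sketch only. */
    enum SketchOverwritePolicy { ALWAYS, ONLY_IF_MISSING }

    /** Parses "extractedName=prop1, prop2" lines into extracted name -> target properties. */
    static Map<String, String[]> parseMappings(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String[]> mappings = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            mappings.put(key, props.getProperty(key).trim().split("\\s*,\\s*"));
        }
        return mappings;
    }

    /** Copies mapped values onto the node properties, honouring the overwrite policy. */
    static void apply(Map<String, String> extracted, Map<String, String[]> mappings,
                      Map<String, String> nodeProperties, SketchOverwritePolicy policy) {
        for (Map.Entry<String, String> entry : extracted.entrySet()) {
            String[] targets = mappings.get(entry.getKey());
            if (targets == null) {
                continue; // this extractor output has no mapping, so it is ignored
            }
            for (String target : targets) {
                boolean alreadySet = nodeProperties.containsKey(target);
                if (!alreadySet || policy == SketchOverwritePolicy.ALWAYS) {
                    nodeProperties.put(target, entry.getValue());
                }
            }
        }
    }
}
```

With a mapping of `author=cm:author, my:writer` and `title=cm:title`, an extracted author lands on both properties, an extracted page count is dropped, and an existing title survives unless the policy is ALWAYS.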
25. Ways to Customise and Extend
Customise
Identify already available metadata of interest
Define a content model for this
Add mappings
tika-app.jar can be very helpful here
Extend
Locate/Write library or program to read file format
Write either Tika Plugin, or whole Extractor
Define mappings
http://blogs.alfresco.com/wp/nickb/ has more
27. Out-of-the-box Transformations
These are the main ones, there are others
Plain Text, HTML & XHTML for all Apache Tika
supported text and document formats (~30)
PDF to Image and SWF (thumbnails and previews)
Office File Formats to PDF (via Open Office direct /
JODConverter in Enterprise)
Plain Text and XML to PDF
Zip listing to Text
Image to other Images (via ImageMagick)
With FFMpeg, video transforms and thumbnails
Can chain transformers together, eg text preview
via txt -> pdf -> swf
28. Checking Supported Transformations
Checking active Transformations and Extractors
New webscript in 3.4 exposes information on the
available transformers and extractors
http://localhost:8080/alfresco/service/mimetypes
Shows live information as of when the page is
requested
As transforms come and go (eg OpenOffice dies),
the list will show what's currently active
Only shows the current transformer, not inactive or
lower preference ones
Includes information on transformation both from
and to each mimetype, plus metadata extractor
31. Content Transformer – JavaScript Use
Full access is not directly available in JS
You can't get at the transformer registry
You can't control which property is transformed,
it's always Content
You can, however, easily trigger the
transformation of a given node
This is done via the Script Actions service
JavaScript
var action = actions.create("transform");
// Transform into the same folder
action.parameters["destination-folder"] = document.parent;
action.parameters["assoc-type"] =
    "{http://www.alfresco.org/model/content/1.0}contains";
action.parameters["assoc-name"] = document.name + "transformed";
action.parameters["mime-type"] = "text/html";
// Execute
action.execute(document);
33. Content Transformers and Tika
Tika generates HTML-like SAX events as it parses
Uses Java SAX API
Events can be captured or transformed
The Body Content Handler is used for plain text
Both HTML and XHTML are available
You can customise with your own handler, with
XSLT or with E4X from JavaScript
Text Indexing just uses a Body Content Handler
The Excel to CSV transformer has a text altering
SAX handler
The Web Quick Start Word→HTML transformer
alters text, tags and embedded resources
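The "capture the SAX events" idea can be tried without Tika at all, since the events are ordinary Java SAX. The sketch below is an illustration under that assumption: `BodyTextSketch` is a hypothetical handler written for this example (it is not Tika's BodyContentHandler), and it is fed by the JDK's own XML parser reading a small XHTML string rather than by a Tika parser. It keeps only the text inside the body element, which is roughly what a plain text transformation needs.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical SAX handler that collects body text; not Tika's BodyContentHandler. */
class BodyTextSketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inBody = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("body".equals(qName)) {
            inBody = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("body".equals(qName)) {
            inBody = false;
        } else if (inBody && ("p".equals(qName) || "h1".equals(qName))) {
            text.append('\n'); // end block-level elements with a line break
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inBody) {
            text.append(ch, start, length); // text outside the body (eg the title) is skipped
        }
    }

    String getText() {
        return text.toString();
    }

    /** Parses an XHTML string with the JDK SAX parser and returns the body text. */
    static String extract(String xhtml) {
        try {
            BodyTextSketch handler = new BodyTextSketch();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getText();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }
}
```

A text-altering handler, such as one turning table cells into comma separated values, follows the same pattern: override the element callbacks and decide what to emit for each event.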
34. Tika Plugins
Tika ships with Parsers for a wide range of file
formats as standard
All of these Parsers depend on libraries that are
Apache Licensed or similar
For other Parsers, Tika provides a mechanism for
having the Parser auto-loaded
Typically used by GPL or Proprietary plugins
Great way to have your custom formats handled
Alfresco will auto-load these if available
Current list of known third party plugins is:
http://wiki.apache.org/tika/3rd%20party%20parser%20plugins
35. Custom Tika Plugins
Writing a new Tika Plugin is very straightforward
Only 2 methods needed – getSupportedTypes to list
which mimetypes you support, and parse
Magic file used for detecting new plugins is
META-INF/services/org.apache.tika.parser.Parser
With the service file, the Tika Auto-Detect parser
will load and use the parser
Without it, you can explicitly configure it into
Alfresco via
TikaSpringConfiguredContentTransformer
Very easy way to add indexing and metadata
support for custom file formats
36. Custom Tika Parser – “Hello World”
import java.io.InputStream;
import java.util.HashSet;
import java.util.Set;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AbstractParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class HelloWorldParser extends AbstractParser {
    // List the mimetypes this parser supports
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        Set<MediaType> types = new HashSet<MediaType>();
        types.add(MediaType.parse("hello/world"));
        return types;
    }

    // Ignores the input stream, and always emits the same text and metadata
    public void parse(InputStream stream, ContentHandler handler,
            Metadata metadata, ParseContext context) throws SAXException {
        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        // Document Heading
        xhtml.element("h1", "Hello World!");
        // To prove this worked, add some extra text to search for
        xhtml.startElement("p");
        xhtml.characters("To show that this went via the parser, we have ");
        xhtml.characters("some special text that we can search for. ");
        xhtml.characters("BADGER BADGER BADGER BADGER BADGER ");
        xhtml.characters("BADGER BADGER BADGER MUSHROOM MUSHROOM ");
        xhtml.endElement("p");
        // All Done
        xhtml.endDocument();
        metadata.set("hello", "world");
        metadata.set("title", "Hello World!");
        metadata.set("custom1", "Hello, Custom Metadata 1!");
        metadata.set("custom2", "Hello, Custom Metadata 2!");
    }
}
37. Demo
“Hello World” Transformer Round-Trip
var action = actions.create("transform");
action.parameters["destination-folder"] = document.parent;
action.parameters["assoc-type"] =
    "{http://www.alfresco.org/model/content/1.0}contains";
action.parameters["assoc-name"] = document.name + "HW";
if (document.mimetype == "hello/world") {
    // It's currently a "Hello World" file
    // Use Apache Tika to create a plain text version
    action.parameters["mime-type"] = "text/plain";
} else {
    // It's a regular new text file
    // Have the command line tool make a "Hello World" version
    action.parameters["mime-type"] = "hello/world";
}
action.execute(document);
38. Demo
Excel to HTML, CSV and Text
var nameBase = document.name.substring(0, document.name.lastIndexOf("."));
var action = actions.create("transform");
action.parameters["destination-folder"] = document.parent;
action.parameters["assoc-type"] =
    "{http://www.alfresco.org/model/content/1.0}contains";
// Plain text version
action.parameters["assoc-name"] = nameBase + ".txt";
action.parameters["mime-type"] = "text/plain";
action.execute(document);
// CSV version
action.parameters["assoc-name"] = nameBase + ".csv";
action.parameters["mime-type"] = "text/csv";
action.execute(document);
// HTML version
action.parameters["assoc-name"] = nameBase + ".html";
action.parameters["mime-type"] = "text/xml";
action.execute(document);
40. Standard Rendition Engines
Renditions Supported in Alfresco v4.0
reformat – access to the Content Transformation
Service
image – crop, resize, etc
freemarker – runs a Freemarker Template against the
content of the node
html – turns .docx files into clean HTML + images
xslt – runs an XSLT Transformation against the content
of the node, XML content nodes only!
composite – execute several renditions in a series, eg
reformat followed by image crop
41. Persisted and Transient Definitions
For Complicated or Simple Renditions
To run a rendition, first create a Rendition Definition
for a given Rendering Engine
Next, set your parameters on the definition
Finally, execute this against a source node
For very complicated, or very commonly used
renditions, you don't want to have to create these
definitions every time
Instead, save them to the Data Dictionary, and load
via the Rendition Service on demand
Rendition Service provides Save and Load methods
42. Rendition Service – Call from Java
// Retrieve the existing Rendition Definition
QName renditionName = QName.createQName(
        NamespaceService.CONTENT_MODEL_1_0_URI, "myRendDefn");
RenditionDefinition renditionDef =
        renditionService.loadRenditionDefinition(renditionName);
// Make some changes.
renditionDef.setParameterValue(AbstractRenderingEngine.PARAM_MIME_TYPE,
        MimetypeMap.MIMETYPE_PDF);
renditionDef.setParameterValue(
        RenditionService.PARAM_ORPHAN_EXISTING_RENDITION, true);
// Persist the changes.
renditionService.saveRenditionDefinition(renditionDef);
// Run the Rendition
ChildAssociationRef assoc = renditionService.render(
        sourceNode, renditionDef);
43. Renditions from JavaScript
// Crop the image and place in a specified location
var renditionDef =
    renditionService.createRenditionDefinition(
        "cm:cropResize", "imageRenderingEngine");
renditionDef.parameters["destination-path-template"] =
    "/Company Home/Cropped Images/${name}.jpg";
renditionDef.parameters["isAbsolute"] = true;
renditionDef.parameters["xsize"] = 50;
renditionDef.parameters["ysize"] = 50;
renditionService.render(nodeRef, renditionDef);
var renditions = renditionService.getRenditions(nodeRef);
44. Rendition Service – More Ways to Call
Actions, Rules, CMIS
Renditions are Actions, but by default hidden
Don't show up in Share when defining Rules
Don't show up in Explorer for Run Custom Action
They are available from Java and JS
Solution – create JS script to call the Rendition, then
run that script from your Rule / from Explorer
No dedicated REST API is available
Renditions show up in CMIS
Or you can use standard Action and Node APIs
45. Custom Rendition Engines
For when a composite just isn't enough...
Rendition Engines are just a special kind of Action
Executor, within the Action Framework
If you know how to write Custom Actions, you can
write your own Rendering Engine!
org.alfresco.repo.rendition.executor.
AbstractRenderingEngine provides a helpful
superclass, with handy methods
See the Actions talk for more on Custom Actions
and Custom Action Executors!