The document provides an overview of Apache ManifoldCF, an open source framework for crawling content repositories and pushing their contents to search servers. It describes what repositories and search servers are and why ManifoldCF is useful for connecting them. It details the key components of ManifoldCF, such as the Pull Agent Daemon, jobs, connectors, and the monitoring UI. The document also outlines ManifoldCF's history and major releases, from its incubation at Apache to becoming a top-level project.
2. About me
● Open Source ECM Specialist at Sourcesence
● Author and Technical Reviewer at Packt Publishing
○ Alfresco 3 Web Services (2010)
○ GateIn Cookbook (2012)
● Alfresco Community (nickname OpenPj)
○ Alfresco Wiki Gardener
○ Top 10 supporter (english and italian)
○ Moderator of the italian forum
● PMC Member at the Apache Software Foundation
● JBoss Community
○ Content editor for jboss.org
○ Project Leader and Committer for PortletSwap
3. Overview
● The story
● What is ManifoldCF?
○ What is a repository?
○ What is a search server?
● Why ManifoldCF?
● Architecture
● The growing path
○ The 0.3-incubating version
○ The 0.4-incubating version
○ The 0.5-incubating version
○ The 0.6 version (graduated ^__^)
○ What's new in the 1.0.1 version
● The book: ManifoldCF in Action
● Demo
● Resources
4. The story
The original ManifoldCF code base was granted by MetaCarta
Inc., to the Apache Software Foundation in December 2009.
The MetaCarta effort represented more than five years of
successful development and testing in multiple, challenging
enterprise environments.
The project graduated as an Apache Top Level Project in July
2012.
^__^
5. What is ManifoldCF?
● Open Source crawler
○ schedule jobs to create indexes
■ get contents from repositories
■ push contents on search servers
[Diagram: Repository 1, 2 and 3 feed Apache ManifoldCF, which pushes contents to Search Server 1, 2 and 3]
6. What is ManifoldCF?
● Out of the box, it is distributed as J2EE web apps
○ REST API
○ Authority Service
○ Crawler UI
● Can be embedded in any Java application
8. What is a repository?
● a central place to put and get contents
● contents are kept in an organized way
○ ER model is the old way
○ Node graph
■ properties (metadata)
■ associations
■ renditions
● base component of Enterprise Content Management (ECM)
systems
● the word comes from the Latin repositorium
○ table of service
○ vessel
○ chamber
○ where to keep and find your things!!!
9. Enterprise Content Management
Enterprise content management (ECM) is a formalized means of
organizing and storing an organization's documents, and other
content, that relate to the organization's processes. The term
encompasses strategies, methods, and tools used throughout the
lifecycle of the content.
Wikipedia
http://en.wikipedia.org/wiki/Enterprise_content_management
10. Enterprise Content Management
[Diagram: ECM is made of Enterprise services built on top of a Repository, Users + groups (LDAP, IDM), and Workflows and processes (BPM)]
11. What is a repository? - You use it!!!
● Some simple examples:
○ SMTP servers
○ Google Drive
○ Dropbox
● Some Open Source repository implementations:
○ eXo JCR
○ Apache Jackrabbit
● Some Open Source ECM systems for critical usage:
○ Alfresco
○ Nuxeo
○ Hippo
12. What is a repository? - Decoration
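[Diagram: a Repository and its Indexes, surrounded by the ways to access it]
● Protocols and APIs: CMIS, JCR, REST, SOAP, IMAP, EMAIL, FTP
● Operations: apply metadata, retrieve content using metadata
● Query languages: CMIS, JCR SQL, XPath, Lucene, Full Text (Google style)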
13. What is a repository? - Architecture
[Diagram: layered architecture of a repository]
● APIs (CMIS, REST, FTP, WebDAV, IMAP)
● Model
● Storage
○ Content Store
○ Indexes
14. What is a repository? - Repo Model
● a different point of view on how to manage data
○ no more relational databases (ER)
● repositories offer you an API!
● based on the JCR Repository Model (JSR-283)
○ workspaces
○ identifiers
○ users
○ nodes and node types (contents)
■ properties and property types
■ associations (shared nodes)
15. What is a repository? - Repo Model
● A node is a generic piece of content stored in a repository
○ type
○ properties
○ associations
○ binary streams (optional)
■ renditions
■ text document
■ Video
■ Image
■ ...
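The node model above can be sketched in a few lines of Python. This is only an illustration, not a JCR or CMIS API: the `Node` class, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical model of a repository node: type, properties,
    associations and optional binary streams (renditions)."""
    node_type: str
    properties: dict = field(default_factory=dict)    # metadata
    associations: list = field(default_factory=list)  # links to other nodes
    renditions: dict = field(default_factory=dict)    # name -> bytes


# A document node with metadata and one binary rendition.
report = Node("document", {"name": "report.pdf", "mimetype": "application/pdf"})
report.renditions["thumbnail"] = b"fake-image-bytes"

# A folder node associated with the document.
folder = Node("folder", {"name": "Reports"})
folder.associations.append(report)
```

Each node gets its own dictionaries and lists thanks to `default_factory`, so adding a rendition to one node never leaks into another.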
16. What is a repository? - Repo Model
[Diagram: anatomy of a node]
● Node
○ Type
○ Properties (metadata)
■ name
■ description
■ mimetype
■ tags
■ categories
○ Renditions
■ Binary 1, Binary 2, Binary 3
17. What is a repository? - Repo Model
[Diagram: a Repository contains Workspace 1, 2 and 3; a workspace has a root node at the top of a tree of nodes (A, B, C, D, E, G)]
18. Why use a repository?
● adding new node types only requires adding configuration
● you can scale out easily
● storing very large amounts of data
● storing simple data structures, such as simple JSON
documents
● looking up data by keys rather than using queries
● searching for data based upon relevance rather than criteria
● evolving schemas and/or data structures
● caching data in-memory for performance
● giving up consistency guarantees for increased availability
19. Why use a repository?
● Standard API
○ Content Management Interoperability Services (CMIS)
○ Java Content Repository (JCR)
● Hierarchical structure
● Transaction support
● Versioning
● Locking
● Observation
● References
● Navigation services
○ parents
○ children
○ associated
● Search services
21. What is a search server?
A search server is an application that allows users to
find repository contents quickly using:
● keywords (full text search)
● content fields
● tags
● categories
● ranking
The information it keeps is stored in indexes.
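To make the idea of an index concrete, here is a minimal sketch of keyword (full text) search over an inverted index. It is a toy, assuming whitespace tokenization and no ranking; real search servers such as Apache Solr do far more. The function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


docs = {
    "a": "Apache ManifoldCF crawler",
    "b": "Apache Solr search server",
}
index = build_index(docs)
matches = search(index, "apache solr")
```

The lookup never scans the documents themselves: answering a query only touches the index, which is why search servers can respond quickly.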
22. What is a search server?
[Diagram: a REST API on top of Storage, which holds the Indexes]
24. Why ManifoldCF? - Reliability
Jobs scheduling and configuration are stored in the
database to maintain the state of all the executions
[Diagram: Repository → Pull Agent Daemon → Search Server; the Pull Agent Daemon keeps configuration and scheduling in the Database]
25. Why ManifoldCF? - Incremental
Jobs can be optionally configured to re-visit contents
incrementally
[Diagram: Repository nodes N1, N2 and N4 linked to Apache ManifoldCF]
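One way to picture incremental re-visiting is to remember a version marker per document and re-index only what changed. The sketch below is an assumption about the general technique, not ManifoldCF's actual implementation; the function name, the version strings, and the node N3 are hypothetical.

```python
def incremental_crawl(repository, seen_versions):
    """Compare the repository with the versions seen on the previous run.

    repository:    dict of doc id -> (version, content)
    seen_versions: dict of doc id -> version, updated in place
    Returns (ids to index, ids to delete from the search server).
    """
    to_index = []
    for doc_id, (version, _content) in repository.items():
        if seen_versions.get(doc_id) != version:
            to_index.append(doc_id)          # new or changed document
            seen_versions[doc_id] = version
    to_delete = [d for d in seen_versions if d not in repository]
    for doc_id in to_delete:
        del seen_versions[doc_id]            # document no longer exists
    return to_index, to_delete


seen = {}
repo = {"N1": ("v1", "one"), "N2": ("v1", "two"), "N3": ("v1", "three")}
first_run = incremental_crawl(repo, seen)

# N2 changes, N3 is removed, N4 is added; N1 is untouched.
repo = {"N1": ("v1", "one"), "N2": ("v2", "two!"), "N4": ("v1", "four")}
second_run = incremental_crawl(repo, seen)
```

The first run indexes everything; the second run touches only N2 and N4 and reports N3 for deletion, so unchanged content is never sent to the search server again.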
26. Why ManifoldCF? - Multi repositories
Jobs can retrieve contents from the following repositories:
● CMIS-compliant
● Alfresco
● IBM FileNet
● EMC Documentum
● Microsoft SharePoint
● OpenText LiveLink
● Autonomy Meridio
● Memex Patriarch
● Windows Share/DFS
● Generic JDBC
● Generic Filesystem
● Generic RSS and Web
27. Why ManifoldCF? - Multi repositories
Jobs can ingest contents into the following search
servers:
● Apache Solr
● ElasticSearch
● OpenSearchServer
● MetaCarta GTS
28. Why ManifoldCF? - Security model
[Diagram: security model]
● Authority 1, 2 and 3 → Authority Service → user access tokens
● Repository 1, 2 and 3 → Pull Agent Daemon (retrieve per-content ACLs) → doc access tokens
● Search Server → user-specific search results
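The matching of user access tokens against doc access tokens can be sketched as a simple filter on search hits. This is a simplified illustration: the `allow`/`deny` layout of each hit, the token names, and the function name are assumptions, not the actual search server plugin.

```python
def visible_results(hits, user_tokens):
    """Keep only the hits the user may see.

    A hit is visible when the user holds at least one of its allow
    tokens and none of its deny tokens (hypothetical hit layout).
    """
    return [
        hit["id"]
        for hit in hits
        if user_tokens & hit["allow"] and not (user_tokens & hit["deny"])
    ]


hits = [
    {"id": "doc1", "allow": {"sales"}, "deny": set()},
    {"id": "doc2", "allow": {"hr"}, "deny": set()},
    {"id": "doc3", "allow": {"sales"}, "deny": {"contractors"}},
]

# Tokens as the Authority Service might return them for one user.
results = visible_results(hits, {"sales", "contractors"})
```

Because doc access tokens are indexed together with the content, the filter runs at query time without calling the original repositories.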
29. Why ManifoldCF? - Monitoring
The Crawler UI allows you to:
● configure jobs and connectors
● monitor job execution
● monitor contents ingestion
○ status reports
■ document status
■ queue status
○ history reports
■ simple history
■ maximum activity
■ maximum bandwidth
■ result histogram
31. Architecture
● Pull Agent Daemon (the core service)
○ Jobs (execute the ingestion tasks)
■ Repository Connectors (retrieve contents)
■ Output Connectors (ingest contents)
■ Authority Connectors (retrieve ACLs)
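The roles listed above can be sketched as three small classes and a job runner. This is a conceptual toy, not the ManifoldCF connector framework: every class, method, and sample value below is hypothetical, and the real connector interfaces are Java and considerably richer.

```python
class ListRepositoryConnector:
    """Hypothetical repository connector: retrieves contents."""
    def __init__(self, documents):
        self.documents = documents

    def fetch(self):
        yield from self.documents.items()


class StaticAuthorityConnector:
    """Hypothetical authority connector: retrieves ACLs."""
    def __init__(self, acls):
        self.acls = acls

    def acl_for(self, doc_id):
        return self.acls.get(doc_id, [])


class InMemoryOutputConnector:
    """Hypothetical output connector: ingests contents."""
    def __init__(self):
        self.indexed = {}

    def ingest(self, doc_id, content, acl):
        self.indexed[doc_id] = {"content": content, "acl": acl}


def run_job(repository, output, authority=None):
    """Pull every document from the repository and push it to the output."""
    count = 0
    for doc_id, content in repository.fetch():
        acl = authority.acl_for(doc_id) if authority else []
        output.ingest(doc_id, content, acl)
        count += 1
    return count


repository = ListRepositoryConnector({"doc1": "hello", "doc2": "world"})
authority = StaticAuthorityConnector({"doc1": ["sales"]})
output = InMemoryOutputConnector()
ingested = run_job(repository, output, authority)
```

Keeping the three roles behind separate objects is what lets one job combine any repository with any search server.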
32. Architecture
[Diagram: Repository 1, 2 and 3 → Pull Agent Daemon → Search Server 1, 2 and 3, with the Authority Service and the Database alongside]
33. Architecture - Job
A job is an ingestion task that consists of:
○ verbal description
○ repository connection
■ authority connection (optional)
○ metadata mapping
○ output connection (search server)
○ crawling model
○ scheduling information (on demand or time ranges)
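The parts of a job listed above map naturally onto a small record type. The sketch below is only a way to visualize them: the `Job` class, its field names, and the sample values (connection names, the "incremental" model, the time range format) are hypothetical and do not reflect ManifoldCF's stored job format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Job:
    """Hypothetical record holding the parts of a job definition."""
    description: str
    repository_connection: str
    output_connection: str                      # the search server
    authority_connection: Optional[str] = None  # optional
    metadata_mapping: dict = field(default_factory=dict)
    crawling_model: str = "incremental"         # assumed example value
    schedule: list = field(default_factory=list)  # time ranges; empty = on demand

    def is_on_demand(self):
        return not self.schedule


job = Job(
    description="Index the company documents",
    repository_connection="my-cmis-repository",
    output_connection="my-solr-server",
    metadata_mapping={"cmis:name": "title"},
)
nightly = Job(
    description="Nightly crawl",
    repository_connection="my-cmis-repository",
    output_connection="my-solr-server",
    schedule=[("01:00", "05:00")],
)
```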
35. The 0.3-incubating version
● CMIS Repository Connector
● OpenSearchServer Output Connector
● Scripting Language
● New Maven build process
● Several bug fixes
36. The 0.4-incubating version
● Alfresco Connector
● JDBC Connector now supports MySQL
● CMIS Connector upgraded to OpenCMIS 0.5.0
● Several bug fixes
37. The 0.5-incubating version
● Apache Velocity for connectors UI templates
● ElasticSearch Output Connector
● CMIS Connector upgraded to OpenCMIS 0.6.0
● Prebuild connector support: just add jars and go!
● New Japanese localization
● Several bug fixes
38. The 0.6 version
● Project moved to JDK 1.6
● Many improvements for all the connectors
● Updated the Apache Solr plugin
● Several bug fixes
39. What's new in 1.0.1 version
● Microsoft SharePoint 2010 support
● JDBC Connector now manages metadata
● CMIS Connector upgraded to OpenCMIS 0.7.0
● Several bug fixes
40. The book: ManifoldCF in Action
ManifoldCF in Action
by Karl Wright
published by Manning
Karl is the original developer and the
principal committer of Apache ManifoldCF
The book is available at the following site:
http://www.manning.com/wright