Aurélien Mazoyer from France Labs gives users a step-by-step breakdown of everything ManifoldCF is capable of. He also works through a scenario that shows what can happen when using Apache ManifoldCF with Apache Solr.
3. Apache ManifoldCF
o Agenda
• Overview of ManifoldCF
• Our scenario : find files on a file share
• In real life
4. Apache ManifoldCF
o Overview
• Connector Framework
• Incremental crawling
• Handle authorization
• Configuration via REST API and UI
5. Apache ManifoldCF
o History
• Based on the "Connector Framework" developed by Karl Wright for the MetaCarta Appliance
• Donated to the Apache Software Foundation in 2009
• May 2012 : out of incubation
• Current version : 2.2 (August 2015)
6. Connectors gone wild
o Different connectors for :
• Content repositories
• Web, Wiki, DB, Email, RSS, CMIS, Alfresco…
• But also Windows Share, SharePoint, Dropbox…
• Authorities
• LDAP, AD, CMIS…
• Output
• Solr, Elasticsearch, OSS…
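7. Big picture
[Architecture diagram. ManifoldCF is made of a daemon agent, an authority service, the ManifoldCF UI and the ManifoldCF API. The daemon agent runs repository connectors (Conn. 1 … Conn. N) that read from repositories (Wiki, DB, … Repository N) and output connectors (Conn. 1, Conn. 2 … Conn. N) that write to outputs (Solr, Elasticsearch, …). The authority service runs authority connectors (Conn. 1 … Conn. N) that talk to authorities (OpenLDAP, … Authority N).]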
8. Roles of components
o Daemon agent
• Java process
• Runs repository and output connectors
• Runs data crawling jobs
9. Roles of components
o Authority service
• Web application
• Runs authority connectors
• Gets security tokens for a specific user
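A minimal sketch of how a client might ask the authority service for a user's tokens and read the reply. The `UserACLs` path, port 8345 and the line-oriented `TOKEN:` / `AUTHORIZED:` reply format are assumptions based on the single-process example setup and should be checked against your deployment. The sample reply and the user name are made up.

```python
from urllib.parse import urlencode

# Assumed base URL of the authority service (not stated on the slide).
AUTHORITY_BASE = "http://localhost:8345/mcf-authority-service"


def user_acls_url(username, base=AUTHORITY_BASE):
    """Build the URL that asks the authority service for a user's tokens."""
    return f"{base}/UserACLs?{urlencode({'username': username})}"


def parse_tokens(reply_text):
    """Collect the values of all 'TOKEN:' lines from a line-oriented reply."""
    tokens = []
    for line in reply_text.splitlines():
        line = line.strip()
        if line.startswith("TOKEN:"):
            tokens.append(line[len("TOKEN:"):])
    return tokens


# Made-up reply for illustration: one status line and two tokens.
sample_reply = "AUTHORIZED:ad\nTOKEN:ad:S-1-5-32-544\nTOKEN:ad:S-1-5-18\n"
print(user_acls_url("jdoe@example.com"))
print(parse_tokens(sample_reply))
```

With Active Directory the tokens would typically be the SID of the user and the SIDs of the groups the user belongs to.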
12. Test it!
o For testing purposes:
• java -jar start.jar
• All-in-one process
• Embedded database (HSQL)
13. Taking MCF to production
Multi-process deployment
o 3 web applications in a servlet container
• mcf-crawler-ui
• mcf-authorization-service
• mcf-api-service
o Daemon agent
o Database
• PostgreSQL
o Synchronization via the filesystem (local) or ZooKeeper (distributed)
14. Search files with security: Solr + MCF
o Our scenario
• File share using Active Directory
• Search with Solr
• With security constraints
15. Security model: Solr + MCF
o Authorization
• Early Binding
• Index documents with ACLs
• Compute authorization at runtime
o Authentication
• Not handled by Solr/ManifoldCF
• Front-end application should authenticate user
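The early-binding idea can be sketched as follows: each indexed document carries allow and deny tokens, and at query time they are compared with the tokens of the authenticated user. This is a simplified illustration, not the plugin's actual logic. The field names, token values and the single ACL level are assumptions.

```python
def is_visible(doc, user_tokens):
    """Return True if the user may see the document.

    Simplified rule: at least one allow token matches and no deny token matches.
    """
    tokens = set(user_tokens)
    allowed = bool(tokens & set(doc["allow"]))
    denied = bool(tokens & set(doc["deny"]))
    return allowed and not denied


# Hypothetical indexed documents with their ACL tokens.
docs = [
    {"id": "budget.xlsx", "allow": ["S-1-finance"], "deny": []},
    {"id": "memo.docx", "allow": ["S-1-staff"], "deny": ["S-1-interns"]},
]

alice = ["S-1-staff"]                 # hypothetical tokens of one user
bob = ["S-1-staff", "S-1-interns"]    # same group, plus a denied group

print([d["id"] for d in docs if is_visible(d, alice)])
print([d["id"] for d in docs if is_visible(d, bob)])
```

The front end still has to authenticate the user first; the filter only decides what an already identified user may see.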
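16. Search files with security: Solr + MCF
Phase 1 : Indexing
[Diagram. Components: the ManifoldCF daemon agent with the JCIFS connector (repositories) and the Solr connector (output connector); the ManifoldCF authority service with the AD connector (authorities); the Windows Share; AD; Solr with the Extracting Handler and the MCF plugin. Flows: the JCIFS connector crawls documents with ACLs from the Windows Share, and the Solr output connector sends docs and ACLs to the Solr Extracting Handler.]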
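17. Search files with security: Solr + MCF
Phase 2 : Searching
[Diagram. Components: the front end; Solr with the Extracting Handler and the MCF plugin; the ManifoldCF authority service with the AD connector (authorities); the ManifoldCF daemon agent with the JCIFS connector (repositories) and the Solr connector (output connector); the Windows Share; AD. Flows: the front end sends an authenticated search to Solr; the Solr MCF plugin gets the user access token from the ManifoldCF authority service; docs are filtered based on ACLs and user info; authorized results go back to the front end.]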
18. Configure Solr + MCF
o ManifoldCF side
o 4 connections and 1 job
• Create Windows Share connection
• Create Solr connection
• Create Active Directory connection
• Create Authority Group connection
• Create a crawling Job
21. Configure Solr + MCF
o Front end side
o Authentication
• For Tomcat
• JNDI Tomcat Realm
• TomcatSPNEGO
22. Configure Solr + MCF
o Solr side
o Modify schema.xml
• Add fields for security tokens
o Modify solrconfig.xml
• Add MCF Solr Plugin (query parser)
o And don’t forget to protect the Solr instance :-P
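Once the query parser is declared and applied as a filter in the search handler, the front end only has to pass the authenticated user name with each query. The sketch below builds such a request URL. The parameter name `AuthenticatedUserName`, the core name `files`, the port and the user name are assumptions to verify against the plugin documentation for your version.

```python
from urllib.parse import urlencode


def secured_search_url(solr_base, query, username):
    """Build a Solr select URL that carries the authenticated user name."""
    params = {
        "q": query,
        # Assumed name of the parameter read by the MCF query parser plugin.
        "AuthenticatedUserName": username,
        "wt": "json",
    }
    return f"{solr_base}/select?{urlencode(params)}"


# Hypothetical Solr core and user.
print(secured_search_url("http://localhost:8983/solr/files", "budget 2015", "jdoe@example.com"))
```

Because any caller could put another user's name in that parameter, the Solr instance should only be reachable by the front end.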
23. Configure Solr + MCF
o Leverage Solr Extracting handler
• Based on Apache Tika
• Mime type detection
• Embeds parsing libraries
• Supported formats:
• MS Office (OLE2 and OOXML)
• OpenDocument
• PDF
• Audio/video/image files
• Now does OCR thanks to Tika 1.7 (and Tesseract)
o Now, can be done directly in MCF!
25. Crawling principle
o Crawling model
• Incremental model
• Continuous model
[Figure from ManifoldCF in Action, Chapter 1 (Karl Wright), with the labels Phase 1 and Phase 2]
26. Incremental crawling of file share
o Incremental crawling is not so easy with some repositories:
[Cartoon: the Windows Share Connector (JCIFS) asks the Windows Share: "Uhuuu, file share, what's new since last time we met?" The Windows Share answers: "Errkkk…"]
27. Incremental crawling of file share: Solr + MCF
o Phase 1 : Discovery/Indexing, depth first
For each start path entry (on the Windows Share)
  Fetch SMB file attributes
  If file is a directory and if matches inclusion regex
    List files in SMB directory
    For each file: repeat from "Fetch SMB file attributes"
  If file is a regular file and if matches inclusion regex
    Check ingeststatus entry in crawler DB
    If no entry or the version attribute is different
      Fetch file content
      Update ingeststatus entry in DB
      Push file to Solr
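The discovery phase can be sketched in a few lines over an in-memory stand-in for the file share. This is an illustration of the idea only, not the connector's code. The `tree` structure, the single inclusion regex applied to both files and directories, and the `pushed` dictionary standing in for Solr are all assumptions.

```python
import re


def crawl(tree, start_paths, include, ingest_status, pushed):
    """Depth-first discovery. Returns the set of file paths that were reached."""
    seen = set()
    stack = list(start_paths)
    while stack:
        path = stack.pop()
        node = tree[path]                    # stands in for "fetch SMB file attributes"
        if not include.search(path):
            continue
        if node["type"] == "dir":
            stack.extend(node["children"])   # stands in for "list files in SMB directory"
            continue
        seen.add(path)
        # Only fetch and push when there is no entry or the version changed.
        if ingest_status.get(path) != node["version"]:
            pushed[path] = node["content"]   # stands in for "push file to Solr"
            ingest_status[path] = node["version"]
    return seen


tree = {
    "/share": {"type": "dir", "children": ["/share/a.txt", "/share/sub"]},
    "/share/sub": {"type": "dir", "children": ["/share/sub/b.txt"]},
    "/share/a.txt": {"type": "file", "version": "v1", "content": "A"},
    "/share/sub/b.txt": {"type": "file", "version": "v1", "content": "B"},
}
include = re.compile(r"^/share")
ingest_status = {}

first_push = {}
crawl(tree, ["/share"], include, ingest_status, first_push)    # everything is new

tree["/share/a.txt"] = {"type": "file", "version": "v2", "content": "A2"}
second_push = {}
seen = crawl(tree, ["/share"], include, ingest_status, second_push)
print(sorted(first_push), sorted(second_push))
```

On the second run the whole tree is still walked, but only the file whose version changed is fetched and pushed again.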
28. Incremental crawling of file share
o What is an ingeststatus database entry?
o Simplified version :
DOCURI                            LAST_INGEST           LAST_VERSION
protocol://REPO_HOST/Doc1.docx    10.09.2015 18:21:04   Doc1_Version1
protocol://REPO_HOST/Doc2.docx    10.09.2015 19:21:04   Doc2_Version1
o LastVersion?
• Here, computed from lastModified and ACLs on the file
• Example: +S-1-5-18+S-1-5-21-3380247023-2036360560-1108467148-1118+S-1-5-21-3380247023-2036360560-1108467148-500+S-1-5-32-544+1+DEAD_AUTHORITY+-file://///52.30.17.184/ShareFolder/TestFile.txt+1444462827664:16Y
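To illustrate why the version is computed from both lastModified and the ACLs, here is a small sketch that builds a comparable version string. The layout of the string is made up for this example and is not the connector's real format, which is the one shown on the slide.

```python
def version_string(last_modified_ms, allow_sids, deny_sids):
    """Join sorted ACL tokens and the modification time into one comparable string."""
    parts = [
        "+".join(sorted(allow_sids)),
        "-".join(sorted(deny_sids)),
        str(last_modified_ms),
    ]
    return "|".join(parts)


base = version_string(1444462827664, ["S-1-5-18", "S-1-5-32-544"], [])
same_acl_other_order = version_string(1444462827664, ["S-1-5-32-544", "S-1-5-18"], [])
acl_changed = version_string(1444462827664, ["S-1-5-18"], [])
content_changed = version_string(1444462900000, ["S-1-5-18", "S-1-5-32-544"], [])

print(base == same_acl_other_order, base == acl_changed, base == content_changed)
```

A change of permissions alone produces a different version, so the document is re-indexed with its new ACLs even when its content did not change.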
29. Incremental crawling of file share: Solr + MCF
o Phase 2 : Deleting unreachable documents
For each crawler DB entry
  Update crawler database
  Send delete command to Solr
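A sketch of the cleanup phase, under the assumption that the crawler remembers which documents it reached during discovery: every tracked entry that was not reached is removed from the index and from the crawler database. The dictionaries stand in for Solr and for the crawler database.

```python
def delete_unreachable(ingest_status, seen, index):
    """Remove every tracked document that was not reached during the last discovery pass."""
    removed = sorted(set(ingest_status) - set(seen))
    for uri in removed:
        index.pop(uri, None)      # stands in for "send delete command to Solr"
        del ingest_status[uri]    # stands in for "update crawler database"
    return removed


# Hypothetical state after a discovery pass in which old.txt was no longer found.
ingest_status = {"/share/a.txt": "v2", "/share/old.txt": "v1"}
index = {"/share/a.txt": "A2", "/share/old.txt": "OLD"}
seen = {"/share/a.txt"}

print(delete_unreachable(ingest_status, seen, index))
print(sorted(index))
```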
30. How to see what happened
o Search History
o Monitoring
• Job Status
• Notification Connections
31. How to see what happened
o Search History
o History
• Simple History
• Maximum Activity
• Maximum Bandwidth
• Result Histogram
o Status
• Document Status
• Queue Status
33. Handle performance issues
o Connector-specific configuration
• Throttling
• Max JVM connections
o Can improve speed / limit impact on crawled repository
o Very specific to the repository
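The effect of throttling can be illustrated with a small rate limiter that enforces a minimum interval between fetches. This is a generic sketch, not ManifoldCF's implementation. The `max_per_minute` parameter is hypothetical, and a fake clock is used so the example runs instantly.

```python
import time


class Throttle:
    """Enforce a minimum interval between fetches."""

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self.clock = clock
        self.sleep = sleep
        self.last = None

    def wait(self):
        """Block until the next fetch is allowed, then record its time."""
        if self.last is not None:
            remaining = self.min_interval - (self.clock() - self.last)
            if remaining > 0:
                self.sleep(remaining)
        self.last = self.clock()


class FakeTime:
    """Clock that only advances when 'sleep' is called."""

    def __init__(self):
        self.now = 0.0
        self.slept = []

    def clock(self):
        return self.now

    def sleep(self, seconds):
        self.slept.append(seconds)
        self.now += seconds


fake = FakeTime()
throttle = Throttle(max_per_minute=12, clock=fake.clock, sleep=fake.sleep)
for _ in range(3):
    throttle.wait()      # a real crawler would fetch one document here
print(fake.slept)
```

A lower rate protects the crawled repository at the price of a slower crawl, which is the trade-off the slide describes.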
34. Handle performance issues
o Job settings
o Size limit of ingested documents
o Use regex to remove some extensions from crawl
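As an illustration of these two job settings, the sketch below combines an exclusion regex on extensions with a size limit. The extension list and the 50 MB limit are arbitrary values chosen for the example. In ManifoldCF itself these rules are set in the job configuration rather than in code.

```python
import re

# Hypothetical exclusion list and size limit.
EXCLUDED = re.compile(r"\.(avi|mkv|mp4|iso)$", re.IGNORECASE)
MAX_BYTES = 50 * 1024 * 1024


def should_ingest(path, size_bytes):
    """Keep a file only if it is small enough and its extension is not excluded."""
    return size_bytes <= MAX_BYTES and not EXCLUDED.search(path)


print(should_ingest("/share/report.docx", 120_000))
print(should_ingest("/share/movie.MKV", 120_000))
print(should_ingest("/share/dump.bin", 900 * 1024 * 1024))
```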
38. References
o ManifoldCF documentation
https://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html
o ManifoldCF in Action (K. Wright)
https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs
o Securing Solr document with MCF (K. Wright)
http://fr.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011
o France Labs blog posts :
http://www.francelabs.com/blog/tutorial-for-combining-manifoldcf-and-solr-for-files-search/
http://www.francelabs.com/blog/tutorial-on-authorizations-for-manifold-cf-and-solr/
39. Datafari
o Intranet “ready to play” search solution
• Apache License
o Embeds:
o Solr
o ManifoldCF
o And other cool stuff:
• Admin and responsive search UI
• User Management
• Banana for user behavior analysis
• Tesseract OCR
• A funny zebra
• Etc…
www.datafari.com
Start 0:00 to 1:05
Hi, thank you.
Me: Aurélien MAZOYER
Co-founder of France Labs, an open source company based in France. We offer consulting on search technologies and their ecosystem.
Datafari, our intranet search solution. I will say a few words about Datafari at the end of the talk.
Topic,
Survey: how many of you have ever used ManifoldCF?
Start 1:05 to 1:40
3 parts
Overview of MCF.
A case study on integrating MCF with Solr in order to search files on a file share
What happens with MCF in real life
Start 1:40 to 2:45
CF stands for Connector Framework.
That means that it is a tool that helps you to connect
heterogeneous
Push the data to your favorite search engine
Keep it synchronized
Take access rights into account to perform authenticated
Provides a Complete UI and REST API
Start 2:45 to 3:15
Karl Wright when he worked for
Apache Top Level Project since 2012.
Active project : last release this summer.
Start 3:15 to 4:25
Plenty of connectors included in ManifoldCF
What is called
You can write your own (ManifoldCF in Action gives you the best practices for writing your own connector)
Domain controller, such as an Active Directory
Search engine
Start 4:25 to 5:03
Contains different components
You can see the different connectors for the interaction with the external world
Administration interface
Talk about it in a few slides.
Not shown here: the underlying database, the backbone of the solution
Start 5:03 to 5:30
Actually does the crawling job
Start 5:30 to 6:26
You pass the username as a parameter and it provides the security tokens for that specific user.
It gives
For example the SID of the user in Active Directory, and the SIDs of all groups that the user belongs to
Start 6:26 to 7:14
Also a web application.
Administers MCF.
To begin, you will have to create
Crawl Job. Start the job.
Once you are done, you can start your crawl
Start 7:14 to 8:00
What can be done in the admin
Put new config
Send command
Respects REST standards
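A hedged sketch of what driving ManifoldCF through its REST API could look like from a script. The base URL, the `jobs` and `start` resource names and the job id are assumptions taken from the single-process example setup and must be checked against the API documentation. The `call` helper is defined but not invoked here, so the example runs without a ManifoldCF instance.

```python
import json
from urllib.request import Request, urlopen

# Assumed base URL of the API service (not stated in the talk).
API_BASE = "http://localhost:8345/mcf-api-service/json"


def api_url(resource, item_id=None, base=API_BASE):
    """Build the URL of an API resource, optionally for one item."""
    return f"{base}/{resource}" if item_id is None else f"{base}/{resource}/{item_id}"


def call(method, url, payload=None):
    """Send one JSON request and return the decoded JSON reply, if any."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    request = Request(url, data=data, method=method,
                      headers={"Content-Type": "application/json"})
    with urlopen(request) as response:
        body = response.read().decode("utf-8")
    return json.loads(body) if body else None


# Assumed usage against a running instance:
#   call("GET", api_url("jobs"))              read the job definitions
#   call("PUT", api_url("start", job_id))     send a "start" command to a job
print(api_url("jobs"))
print(api_url("start", "1444462827664"))   # hypothetical job id
```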
Start 8:00 to 8:49
Very simple to test it. Extract the binary distribution, open the example directory
Not unfamiliar to Solr users
Not the recommended way to run it in production (mainly because of the HSQL database)
Start 8:49 to 10:00
The components we described run in different processes.
The database is very important.
PostgreSQL is one of the recommended databases.
Synchronize via a local folder on the machine or with ZooKeeper.
Start 10:00 to 11:07
Here is our scenario
Let’s imagine
An intranet whose users authenticate against AD
They put their files in shared folders.
You have access rights on folders based on the user, but also specific permissions for some users.
Of course it is a mess, so they need a good search engine to find their documents
Quite simple and not very unusual, but it can be a nightmare if you don't have the right tool
We are here in a fully proprietary environment.
But we will see that MCF and Solr can deal with it.
Start 11:07 to 12:00
A few words
Authorization, when a user runs a Solr query
Authentication: neither Solr nor MCF will do this job
Up to the front end application
Start 12:00 to 12:34
Go back to the big picture
Step 1
The JCIFS connector fetches documents with their access control lists and pushes them to the Solr Extracting Handler
Start 12:34 to 13:00
Step 2 :
Frontend sends an authenticated query
Retrieves the security tokens linked to the current user
Then, runs a normal search and filters the result set with the help of the document access control lists and the user security tokens
Start 13:00 to 13:44
How can we actually implement that.
On the MCF side
Windows Share connection.
A few steps to do
(download the latest version of the JCIFS library, uncomment the Windows Share line in the connectors config file)
Start 13:44 to 14:03
An authority connector must belong to an authority group
Start 14:03 to 14:18
That's it for ManifoldCF
Start 14:18 to 15:00
Told you
Front end is in charge of the authentication
LDAP protocol to authenticate
TomcatSPNEGO (Active Directory).
SPNEGO: uses single sign-on
Start 15:00 to 15:57
Add fields that will contain the access control list of the document
Declare the MCF plugin
Configure the endpoint of the authority service
Add a filter query that uses this plugin in your search handler
This is for the search handler
Start 15:57 to 16:30
For the update handler: it is a default extracting handler that integrates Apache Tika.
As a reminder, since Solr 5, the extracting handler can run Tesseract to extract content from images.
Solr can do this job.
Start 16:30 to 17:22
In recent versions of ManifoldCF, it can also be done
In fact, processing pipeline.
You can do field mapping but also Tika extraction.
Perfect if you don’t want to send big files over the network
Start 17:22 to 18:11
Now we will try to understand what is going on under the hood during our crawl
These two crawling models are available with ManifoldCF.
To avoid indexing
Discover new documents, remove old ones.
Start 18:11 to 18:33
Some repositories work well with incremental crawling
Others don't
Unfortunately our Windows share won't be able to answer
Start 18:33 to 20:00
Therefore JCIFS connector
How does the Windows Share connector handle incremental crawling
If it is a file
Next slide: the version attribute
Fetch from the windows share
Start 20:00 to 20:50
For each document
Last version
Depends on the repository
Start 20:50 to 21:04
This was for step 1. more
We can repeat these 2 steps in order to keep our data synchronized.
We have covered how to configure this and we have described how it works under the hood.
Now it is in production mode and you want to be sure of what is going on
Start 21:04 to 22:33
A lot of information
UI or API
Send an alert if something went wrong or just when the crawl is finished
Start 22:33 to 23:00
You also have a tab that shows you a history of all the different activities.
Document Status, for example if you want to see if a document has already been ingested in the current crawl
Maximum Bandwidth will give you information on crawling performance
Start 23:00 to 23:16
Unfortunately, you sometimes face
obvious
A crawled repository that is overloaded
It can be because of the network. You should capture packets with Wireshark
Solr server: for example if the autocommit frequency is too high.
The MCF database is an important component; be sure that you followed the best practices in the documentation
Start 23:16 to 24:25
Maybe it is because of the configuration of your connector
Two main parameters that can have an impact on performance
Throttling: sets a hard limit on document fetching (useful if you are doing web crawling and don't want to be banned by the webmaster)
Max connections that will be opened to the system.
It can be a good idea to increase this value if we want to do web crawling
But a Windows share won't work very well with a lot of connections, so in our scenario we should use a small value
Start 24:25 to 25:15
In the job settings, you can filter the documents that you want to index
For an intranet file share, you probably don't want to index the latest Star Wars movie that an employee wanted to share with their colleagues
Start 25:15 to 26:00
Those were some examples of performance issues.
But unfortunately, it can be even worse: you can face errors
If you are facing errors
A thread dump can give you information on
Start 26:00 to 28:08
One common problem is when the account you use for the crawl doesn't
It must be able to read everything and to read the ACLs for each file
It may need special rights, such as Print Operator.
As we just saw, we can use an exclusion regex or a size limit
Also be sure to tell Solr to ignore Tika exceptions
JCIFS errors, linked or not to network issues (timeouts).
Sometimes they can be solved by increasing the JCIFS timeout
Sometimes you may have to increase the Solr timeout
Big processing
Start 28:08 to 28:45
What can happen in real life
To conclude.
Massive web crawling : Nutch is the best tool for you
Then, go for it.
Start 28:45 to 29:18
Here are some references
That is now freely available
You can have a look at our blog posts, which show you how to run through the different steps that I covered in the file search scenario I described
Start 29:18 to 29:40
If you are too lazy to integrate Solr and ManifoldCF by yourself
Start 29:40 to 30:00
Thank you very much for your attention,
I will be pleased to answer any questions you may have