Integrate ManifoldCF with Solr

OCTOBER 13-16, 2016 • AUSTIN, TX

Properly integrate ManifoldCF with Solr
Aurélien MAZOYER
Search Expert, Co-founder, France Labs

Apache Manifold CF
o Agenda
• Overview of ManifoldCF
• Our scenario : find files on a file share
• In real life

Apache Manifold CF
o Overview
• Connector Framework
• Incremental crawling
• Handle authorization
• Configuration via REST API and UI
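
As an illustration of the REST API bullet, here is a minimal Python sketch that reads job statuses. It assumes the API web application is reachable at http://localhost:8345/mcf-api-service (adjust to your deployment) and that the "jobstatuses" resource returns a "jobstatus" element whose entries carry "job_id" and "status" fields; check these names against the ManifoldCF API documentation for your version.

```python
import json
from urllib.request import urlopen

# Assumed base URL of the API web application; not taken from the slides.
BASE_URL = "http://localhost:8345/mcf-api-service/json"


def job_statuses_url(base_url=BASE_URL):
    """Build the URL of the job status resource."""
    return f"{base_url.rstrip('/')}/jobstatuses"


def parse_job_statuses(payload):
    """Return {job_id: status} from a decoded response (field names assumed)."""
    entries = payload.get("jobstatus", [])
    if isinstance(entries, dict):
        # Defensive: tolerate a single object where a list is expected.
        entries = [entries]
    return {entry["job_id"]: entry["status"] for entry in entries}


def fetch_job_statuses(base_url=BASE_URL):
    """Query a running ManifoldCF instance (needs network access)."""
    with urlopen(job_statuses_url(base_url)) as response:
        return parse_job_statuses(json.load(response))
```

Parsing is kept separate from the HTTP call so it can be checked against a saved response without a running ManifoldCF instance.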

Apache Manifold CF
o History
• Based on the "Connector Framework" developed by Karl Wright for the MetaCarta Appliance
• Donated to the Apache Software Foundation in 2009
• May 2012 : out of incubation
• Current version : 2.2 (August 2015)

Connectors gone wild
o Different connectors for :
• Content repositories
• Web, Wiki, DB, Email, RSS, CMIS, Alfresco…
• But also Windows Share, Sharepoint, Dropbox…
• Authorities
• LDAP, AD, CMIS…
• Output
• Solr, Elasticsearch, OSS…

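Big picture
[Diagram: ManifoldCF sits between repositories, outputs and authorities]
o ManifoldCF
• Daemon agent: runs repository connectors (Conn. 1 … Conn. N) and output connectors (Conn. 1 … Conn. N)
• ManifoldCF authority service: runs authority connectors (Conn. 1 … Conn. N)
• ManifoldCF UI
• ManifoldCF API
o Repositories
• Wiki, DB, … Repository N
o Outputs
• Solr, Elasticsearch, …
o Authorities
• OpenLDAP, … Authority N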

Roles of components
o Daemon agent
• Java process
• Run repository and output connectors
• Run data crawling jobs

Roles of components
o Authority service
• Web application
• Run authority connectors
• Get security tokens for a specific user
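
To make "get security tokens for a specific user" concrete, here is a hedged Python sketch of a client for the authority service. It assumes the service exposes a UserACLs resource taking a username parameter and answering with one "TOKEN:…" line per access token; the base URL and the sample user name are placeholders, and the response format should be verified against your ManifoldCF version.

```python
from urllib.parse import urlencode

# Assumed location of the authority service web application.
AUTHORITY_BASE = "http://localhost:8345/mcf-authority-service"


def user_acls_url(username, base=AUTHORITY_BASE):
    """Build the token lookup URL for one user (resource name assumed)."""
    return f"{base}/UserACLs?{urlencode({'username': username})}"


def parse_tokens(body):
    """Extract access tokens from a plain-text response, one 'TOKEN:' line each."""
    prefix = "TOKEN:"
    return [line[len(prefix):] for line in body.splitlines()
            if line.startswith(prefix)]
```

The front end (or the Solr plugin) would then use the returned tokens to filter search results for that user.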

Component
[Diagram: Output Connection, Repo Connection and Crawl Job, linked with multiplicities 1…1, 1…*, 1…*, 1…*]

o ManifoldCF UI
That’s it.

Test it!
o For testing purposes:
• java -jar start.jar
• All-in-one process
• Embedded database (HSQL)

Taking MCF to production
Multi-process deployment
o 3 web applications in a servlet container
• mcf-crawler-ui
• mcf-authority-service
• mcf-api-service
o Daemon agent
o Database
• PostgreSQL
o Synchronization through the filesystem (local) or distributed (ZooKeeper)

Search files with Security : Solr + MCF
o Our scenario
• File share using Active Directory
• Search with Solr
• With security constraints

Security model : Solr + MCF
o Authorization
• Early Binding
• Index documents with ACLs
• Compute authorization at runtime
o Authentication
• Not handled by Solr/ManifoldCF
• Front-end application should authenticate user
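
The early-binding model above can be sketched in a few lines of Python: each indexed document carries allow and deny tokens, and at query time the user's tokens decide visibility. The field names allow_token_document and deny_token_document are assumptions used for illustration. The real Solr plugin applies this logic as a query filter and may also check share-level and parent-level tokens, which this sketch ignores.

```python
def is_visible(doc, user_tokens):
    """A document is visible if the user holds an allow token and no deny token."""
    tokens = set(user_tokens)
    allowed = tokens & set(doc.get("allow_token_document", []))
    denied = tokens & set(doc.get("deny_token_document", []))
    return bool(allowed) and not denied


def authorized_results(docs, user_tokens):
    """Keep only the documents the user is entitled to see."""
    return [doc for doc in docs if is_visible(doc, user_tokens)]
```

Because the ACLs are stored with each document at index time, filtering needs no call back to the file share at query time; only the user's tokens must be fetched.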

Search files with security : Solr + MCF
o Phase 1 : Indexing
[Diagram: Windows Share (repository), AD (authority), ManifoldCF (daemon agent, authority service), Solr (output)]
• The daemon agent's JCIFS connector crawls documents with ACLs from the Windows Share
• The Solr output connector sends docs and ACLs to Solr (Extracting Handler)
• The Solr MCF Plugin gets the user access token from the ManifoldCF authority service (AD connector)

Search files with security : Solr + MCF
o Phase 2 : Searching
[Diagram: same components as Phase 1, plus the front end]
• The front end sends an authenticated search to Solr
• Docs are filtered based on ACLs and users info
• Authorized results are returned to the front end

Configure Solr + MCF
o ManifoldCF side
o 4 connections and 1 job
• Create Windows Share connection
• Create Solr connection
• Create Active Directory connection
• Create Authority Group connection
• Create a crawling Job

Component
[Diagram: Authority Group and Authority Connection added to the component model; multiplicities shown: 0…1, 1…*, 1…1, 1…*, 1…1, 1…*, 1…*, 1…*]

Component
[Diagram: AD Group, AD Connection, Windows Share Connection, Crawl Job, Solr Connection]

o Front-end side
o Authentication
• For Tomcat
• JNDI Tomcat Realm
• Tomcat SPNEGO

o Solr side
o Modify schema.xml
• Add fields for security tokens
o Modify solrconfig.xml
• Add MCF Solr Plugin (query parser)
o And don’t forget to protect the Solr instance :-P

o Leverage the Solr Extracting Handler
• Based on Apache Tika
• MIME type detection
• Embeds parsing libraries
• Supported formats:
• MS Office (OLE2 and OOXML)
• OpenDocument
• PDF
• Audio/video/image files
• OCR now available thanks to Tika 1.7 (and Tesseract)
o Extraction can now be done directly in MCF!

Component
[Diagram: component model with Authority Group, Authority Connection and Transformation Connection; multiplicities shown: 0…1, 1…*, 1…1, 1…*, 1…1, 1…*, 1…*, 1…*, 0…*, 1…*]

Crawling principle
o Crawling model
• Incremental model
• Continuous model
[Figure: crawling Phase 1 and Phase 2. Source: ManifoldCF in Action, Chapter 1 (Karl Wright)]

Incremental crawling of file share
o Incremental crawling is not so easy with some repositories:
• Windows Share Connector (JCIFS): "Uhuuu, file share, what's new since last time we met?"
• Windows Share: "Errkkk…"

Incremental crawling of file share : Solr + MCF
o Phase 1 : Discovery/Indexing (depth first) on the Windows Share
For each start path entry
    Fetch SMB file attributes
    If file is a directory and if matches inclusion regex
        List files in SMB directory
        For each file: repeat from "Fetch SMB file attributes"
    If file is a regular file and if matches inclusion regex
        Check ingeststatus entry in crawler DB
        If no entry or the version attribute is different
            Fetch file content
            Push file to Solr
            Update ingeststatus entry in DB
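
The Phase 1 steps can be sketched in Python as follows. This is an illustration under stated assumptions, not ManifoldCF's implementation: Node is a hypothetical in-memory stand-in for an SMB file or directory, ingeststatus is a plain dict standing in for the crawler database, and push stands in for sending the file to Solr.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for an SMB file or directory."""
    path: str
    version: str = ""            # e.g. derived from lastModified and ACLs
    content: bytes = b""
    is_dir: bool = False
    children: list = field(default_factory=list)


def crawl(start_nodes, include, ingeststatus, push):
    """Depth-first discovery; returns the set of file paths reached in this run."""
    seen = set()
    stack = list(start_nodes)
    while stack:
        node = stack.pop()
        if not include.search(node.path):
            continue                              # fails the inclusion regex
        if node.is_dir:
            stack.extend(node.children)           # list files in the directory
            continue
        seen.add(node.path)
        if ingeststatus.get(node.path) != node.version:
            push(node.path, node.content)         # fetch content, push to Solr
            ingeststatus[node.path] = node.version
    return seen
```

On a second run with unchanged versions nothing is pushed, which is the point of the incremental model: only attributes are fetched, not file contents.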

Incremental crawling of file share
o What is an ingeststatus database entry?
o Simplified version :
DOCURI                            LAST_INGEST           LAST_VERSION
protocol://REPO_HOST/Doc1.docx    10.09.2015 18:21:04   Doc1_Version1
protocol://REPO_HOST/Doc2.docx    10.09.2015 19:21:04   Doc2_Version1
o LastVersion?
• Here, computed from lastModified and ACLs on the file
• Example :
+S-1-5-18+S-1-5-21-3380247023-2036360560-1108467148-1118+S-1-5-21-3380247023-2036360560-1108467148-500+S-1-5-32-544+1+DEAD_AUTHORITY+-file://///52.30.17.184/ShareFolder/TestFile.txt+1444462827664:16Y
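
To show why a version built from lastModified and ACLs is enough to detect changes, here is a small Python sketch. The string layout is invented for illustration and is not the format the Windows Share connector produces; only the idea (any change in timestamp or ACLs yields a different string) is taken from the slide.

```python
def version_string(last_modified_ms, allow_sids, deny_sids=()):
    """Build a comparable version string from lastModified and ACLs.

    SIDs are sorted so that the same ACL set always yields the same string.
    """
    allow = "+".join(sorted(allow_sids))
    deny = "+".join(sorted(deny_sids))
    return f"allow={allow};deny={deny};modified={last_modified_ms}"
```

Comparing this string with the stored LAST_VERSION tells the crawler whether the file must be fetched and re-indexed, including when only its permissions changed.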

Incremental crawling of file share : Solr + MCF
o Phase 2 : Deleting unreachable documents
For each crawler DB entry not reached during Phase 1
    Send delete command to Solr
    Update crawler database
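
A minimal Python sketch of this cleanup phase, assuming the crawler kept the set of document URIs it reached during discovery (the hypothetical reached argument), that ingeststatus is a dict standing in for the crawler database, and that send_delete stands in for the delete command sent to Solr:

```python
def delete_unreachable(ingeststatus, reached, send_delete):
    """Remove entries for documents that were not reached; return their URIs."""
    deleted = []
    for doc_uri in list(ingeststatus):       # copy keys: the dict is mutated
        if doc_uri not in reached:
            send_delete(doc_uri)             # delete command to the index
            del ingeststatus[doc_uri]        # update the crawler database
            deleted.append(doc_uri)
    return deleted
```

Documents that were moved, removed, or excluded since the previous run disappear from the index this way.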

How to see what happened
o Search History
o Monitoring
• Job Status
• Notification Connections

How to see what happened
o Search History
o History
• Simple History
• Maximum Activity
• Maximum Bandwidth
• Result Histogram
o Status
• Document Status
• Queue Status

Performance issue
o Find bottleneck
• Crawled repository
• Network
• Solr
• MCF database
• MCF configuration

Handle performance issue
o Specific connector’s configuration
• Throttling
• Max JVM connections
o Can improve speed / limit impact on crawled repository
o Very specific to the repository

Handle performance issue
o Job settings
o Size limit of ingested documents
o Use regex to remove some extensions from crawl
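
The two job settings above amount to a simple pre-ingestion filter. In ManifoldCF they are configured in the job and connection settings rather than in code; the Python sketch below only illustrates the effect, with a hypothetical extension list and size limit.

```python
import re

# Hypothetical values: tune the extension list and the limit for your share.
EXCLUDED_EXTENSIONS = re.compile(r"\.(iso|zip|exe)$", re.IGNORECASE)
MAX_BYTES = 10 * 1024 * 1024  # 10 MiB


def should_ingest(path, size_bytes,
                  excluded=EXCLUDED_EXTENSIONS, max_bytes=MAX_BYTES):
    """Skip files with an excluded extension or above the size limit."""
    if excluded.search(path):
        return False
    return size_bytes <= max_bytes
```

Filtering before the content is fetched saves both network bandwidth on the crawled repository and parsing time on the Solr side.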

Investigate errors
• Increase connector’s log level
• Read MCF simple history
• Thread Dump

Common errors in file crawling
o Crawler account rights
o Exotic files
o Very biiiiiiig files
o JCIFS errors
o Solr connector timeout

When to use ManifoldCF?
q = crawled_environment:heterogeneous
OR scenario:intranet
OR security:mandatory

References
o ManifoldCF documentation
https://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html
o ManifoldCF in Action (K. Wright)
https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs
o Securing Solr document with MCF (K. Wright)
http://fr.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011
o France Labs blog posts :
http://www.francelabs.com/blog/tutorial-for-combining-manifoldcf-and-solr-for-files-search/
http://www.francelabs.com/blog/tutorial-on-authorizations-for-manifold-cf-and-solr/

Datafari
Search
Admin
o Intranet “ready to play” search solution
• Apache License
o Embeds:
o Solr
o ManifoldCF
o And other cool stuff:
• Admin and responsive search UI
• User Management
• Banana for user behavior analysis
• Tesseract OCR
• A funny zebra
• Etc…
www.datafari.com

aurelien.mazoyer@francelabs.com
@francelabs
www.francelabs.com
