Taking eZ Find beyond full-text search
- 1. Taking eZ Find beyond full-text search
Paul Borgermans
eZ Summer Conference
London, June 16-17, 2011
© 2011 Paul Borgermans
- 2. About me
● 10 years in the eZ ecosystem
– eZ Lucene → eZ Solr → eZ Find
– 3.5 years with eZ Systems (2007-2010)
– Independent consultant since 2011
● Fancying
– Apache projects (mainly Solr, Hadoop, Stanbol, Zeta, ..)
– NoSQL (Not only SQL) and scalable architectures
– eZ Publish & CMS systems in general
– Semantic web
- 8. eZ Find is your friend!
Although sometimes more like a rough diamond
- 10. eZ Find
RESTful
- 11. Solr in a nutshell
● State of the art, advanced full text search and
information retrieval engine
● Fast, scalable with native replication features
● Flexible configuration
● Extensible
● Document oriented storage
● Geospatial search (Solr 3.1+)
● Native cloud features*
* under active development, almost complete (Solr 4.0)
- 12. Solr
[Figure: Solr architecture diagram. HTTP Request Servlet and Update Servlet on top; Admin Interface; Standard, Disjunction Max and Custom Request Handlers; XML/PHP/JSON/... Response Writers; XML Update Interface and Update Handler; Solr Core with Config, Schema, Caching, Analysis, Concurrency and Replication; Lucene underneath.]
Figure credit: Yonik Seeley
- 13. Performance!
● The backend Solr employs intelligent caches
– filters
– queries
– internal indexes
● Optimized for search/retrieval
– Slower writing
● When updates are done, caches are
reconstructed on the fly in the background
● Horizontal & vertical scaling
- 15. eZ Find alter egos
● eZ Find/Solr as a scalable IR engine/layer
– Remove the burden on your DB
– Significant speedups also for regular content
– Clustering built-in
● eZ Find/Solr as a content and integration
engine
– Document oriented storage system
(hello NoSQL)
– Archive use-case
– External content
- 16. eZ Find alter egos (...)
● Alternate navigation interfaces
– Facets, filtering, sorting
– Function queries (!)
● Document clustering
– More Like This
– Tag based
(and more semantic stuff coming up)
– Carrot2 based
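Facets, filters and sorting all map to plain Solr request parameters. A minimal sketch of building such a query string, assuming a local Solr at the default port; the field names year_i and published_dt are hypothetical, and only tags_lk appears elsewhere in this deck:

```php
<?php
// Build a Solr query string. Repeated parameters (fq, facet.field) are
// emitted as repeated keys, which is the form Solr expects.
function buildSolrQuery(array $params)
{
    $pairs = array();
    foreach ($params as $name => $values) {
        // A scalar becomes a one-element array, so both cases share one loop.
        foreach ((array) $values as $value) {
            $pairs[] = rawurlencode($name) . '=' . rawurlencode((string) $value);
        }
    }
    return implode('&', $pairs);
}

$query = buildSolrQuery(array(
    'q'           => 'conference',
    'fq'          => array('tags_lk:London', 'year_i:2011'), // filters (hypothetical year_i)
    'sort'        => 'published_dt desc',                    // hypothetical date field
    'facet'       => 'true',
    'facet.field' => array('tags_lk', 'year_i'),
));

// Assumed host, port and handler path.
echo 'http://localhost:8983/solr/select?' . $query . "\n";
```

The helper avoids http_build_query() for the repeated keys, because that function would encode them as fq[0], fq[1], which Solr does not interpret as repeated parameters.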
- 17. Provisions in eZ Find
● Attribute storage (serialized content)
– Less DB queries
● Multi-core setup
● Distributed search in fetch(ezfind, search)
– Query parameters
– Filter parameters
– Fields to return (for rendering)
- 20. Tools
● Solr Data Import Handler (DIH)
● Apache Manifold Connector framework
● Using APIs
– eZ Find
– Zeta Components Search
- 21. Integrating external data: Solr DIH
http://wiki.apache.org/solr/DataImportHandler
● Goals
– Read data residing in relational databases and
XML files
– Build Solr documents according to configuration
(joins, views, ...)
– Update Solr with such documents
– Provide ability to do full imports ..
– .. as well as delta imports
- 22. Configuring DIH
● Need a more complete Solr: add DIH jars
● solrconfig.xml:
<requestHandler name="/dataimport"
class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">/home/username/data-config.xml</str>
</lst>
</requestHandler>
● Configure data sources (RDBMS, XML files)
– data-config.xml with connection and schema
information
- 23. Using DIH
● Send commands to DIH request handler
http://<host>:<port>/solr/dataimport?command=<command>
– full-import
– delta-import
– status
● You can use the eZ Find raw Solr request API
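The commands above are plain HTTP GET requests, so they can be scripted without any client library. A small sketch that only builds the URL, assuming the /dataimport handler path from the previous slide and a local Solr on port 8983; the helper name dihCommandUrl is made up for illustration:

```php
<?php
// Build the URL for a DIH command. Only the three commands listed on the
// slide are accepted here; DIH itself knows a few more.
function dihCommandUrl($host, $port, $command, array $options = array())
{
    $allowed = array('full-import', 'delta-import', 'status');
    if (!in_array($command, $allowed, true)) {
        throw new InvalidArgumentException("Unsupported DIH command: $command");
    }
    // Extra options (for example clean=false) are appended after the command.
    $query = http_build_query(array('command' => $command) + $options);
    return "http://$host:$port/solr/dataimport?$query";
}

echo dihCommandUrl('localhost', 8983, 'delta-import'), "\n";
echo dihCommandUrl('localhost', 8983, 'full-import', array('clean' => 'false')), "\n";
```

Fetching the resulting URL, for instance with file_get_contents(), triggers the command; a cron job calling the delta-import URL is one simple way to keep external data in sync.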
- 24. Apache Manifold CF
http://incubator.apache.org/connectors/
● ManifoldCF is a crawler framework
● Supports:
– File System, Windows Shares
– JDBC, RSS
– Web, LiveLink (OpenText)
– Documentum (EMC)
– SharePoint (MSFT)
– Meridio (Autonomy)
– FileNet (IBM)
- 25. With eZ Find API...
<?php
$solr = new eZSolrBase('http://localhost:8983/solr');
$documents = array( array( 'id' => '1135',...
'tags_lk' =>
array('London','2011')));
foreach ($documents as $doc){
ezfSolrUtils::addDocument($solr, $doc);
}
$solr->commit();
?>
- 26. Or With Zeta Components Search
● http://incubator.apache.org/zetacomponents/
<?php
require_once 'tutorial_autoload.php';
// on localhost with the default port
$handler = new ezcSearchSolrHandler;
// on another host with a different port
$handler = new ezcSearchSolrHandler
( '10.0.2.184', 9123 );
?>
- 27. Indexing workflow
● Assemble documents in the correct XML format
● Send one or more documents at a time
● Commit => it becomes searchable
● Optional parameters
– Boosting at the document level
– Boosting at the field level
– Auto-commit heartbeat interval
(commitWithin, millisecs)
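For those who want to see what "the correct XML format" looks like, here is a sketch that assembles a Solr XML add message with PHP's DOM extension, including the optional document boost, field boost and commitWithin. The input array layout and the field name title_t are assumptions made for this example, and the boost attributes reflect Solr releases of this era; newer Solr versions may no longer accept them.

```php
<?php
// Build a Solr XML <add> message.
// Each document is array('fields' => array(name => value or list of values),
//                        'boost' => optional float,
//                        'fieldBoosts' => optional array(name => float)).
function buildAddXml(array $docs, $commitWithinMs = null)
{
    $dom = new DOMDocument('1.0', 'UTF-8');
    $add = $dom->appendChild($dom->createElement('add'));
    if ($commitWithinMs !== null) {
        $add->setAttribute('commitWithin', (string) $commitWithinMs);
    }
    foreach ($docs as $doc) {
        $docNode = $add->appendChild($dom->createElement('doc'));
        if (isset($doc['boost'])) {
            $docNode->setAttribute('boost', (string) $doc['boost']);
        }
        foreach ($doc['fields'] as $name => $values) {
            // Multi-valued fields are sent as repeated <field> elements.
            foreach ((array) $values as $value) {
                $field = $dom->createElement('field');
                $field->setAttribute('name', $name);
                if (isset($doc['fieldBoosts'][$name])) {
                    $field->setAttribute('boost', (string) $doc['fieldBoosts'][$name]);
                }
                // createTextNode takes care of escaping &, < and >.
                $field->appendChild($dom->createTextNode((string) $value));
                $docNode->appendChild($field);
            }
        }
    }
    return $dom->saveXML($add);
}

echo buildAddXml(array(array(
    'fields'      => array(
        'id'      => '1135',
        'title_t' => 'eZ Summer Conference',   // hypothetical field name
        'tags_lk' => array('London', '2011'),
    ),
    'boost'       => 2.5,
    'fieldBoosts' => array('title_t' => 1.5),
)), 10000), "\n";
```

The resulting string is what gets POSTed to Solr's update handler; with commitWithin set, no explicit commit is needed for the documents to become searchable within the given number of milliseconds.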
- 28. Indexing workflow: important properties (...)
● Update = Add with same global id
● Deleting
– An individual document (id)
– A collection of documents (using a Solr query
expression)
– Needs a commit() to really disappear from
search results
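Both delete variants are small XML messages sent to the same update handler as adds. A sketch of generating them, using a made-up helper name (buildDeleteXml); the id 1135 and the tags_lk field are borrowed from the eZ Find API example in this deck:

```php
<?php
// Build a Solr XML delete message, either by document id or by query.
function buildDeleteXml($type, $value)
{
    if ($type !== 'id' && $type !== 'query') {
        throw new InvalidArgumentException("Delete type must be 'id' or 'query'");
    }
    $dom = new DOMDocument('1.0', 'UTF-8');
    $delete = $dom->appendChild($dom->createElement('delete'));
    $child = $delete->appendChild($dom->createElement($type));
    $child->appendChild($dom->createTextNode((string) $value));
    return $dom->saveXML($delete);
}

echo buildDeleteXml('id', '1135'), "\n";              // one document
echo buildDeleteXml('query', 'tags_lk:London'), "\n"; // everything matching a query
echo '<commit/>', "\n";                               // deletes only show after a commit
```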
- 29. Indexing: performance considerations
● Commits can become expensive
– Use them wisely: in batches where you can
– Delay options
● cron job
● commitWithin parameter
● From time to time, you also need an optimize() command
– Deletes leave “holes”
– File fragmentation with adding/updating
– Daily, weekly for very large indexes (multi GB)
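Batching can be sketched independently of any Solr client: send the documents in chunks and commit once at the end. In the sketch below the two callables stand in for whatever actually talks to Solr (eZ Find, Zeta Components or raw HTTP); the function name and the batch size of 100 are arbitrary choices for illustration.

```php
<?php
// Send documents in batches and commit once, instead of once per document.
// $send receives one batch (array of documents); $commit takes no arguments.
function indexInBatches(array $documents, $batchSize, $send, $commit)
{
    $batches = array_chunk($documents, $batchSize);
    foreach ($batches as $batch) {
        call_user_func($send, $batch);
    }
    // A single commit makes all batches searchable at once.
    call_user_func($commit);
    return count($batches);
}

// Stand-in callables that only record what would happen.
$log = array();
$count = indexInBatches(
    range(1, 250),
    100,
    function (array $batch) use (&$log) { $log[] = 'sent ' . count($batch); },
    function () use (&$log) { $log[] = 'commit'; }
);
echo $count, ' batches: ', implode(', ', $log), "\n";
```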
- 30. But you will also need to configure Solr
- 31. Field definitions: schema.xml
● Field types
– text
– numerical
– dates
– location
– … (about 25 in total)
● Actual fields (name, definition, properties)
● Dynamic fields
● Copy fields (as aggregators)
- 32. schema.xml: simple field type examples
<fieldType name="string" class="solr.StrField"
sortMissingLast="true" omitNorms="true"/>
<!-- boolean type: "true" or "false" -->
<fieldType name="boolean" class="solr.BoolField"
sortMissingLast="true" omitNorms="true"/>
<!-- A Trie based date field for faster date range
queries and date faceting. -->
<fieldType name="tdate" class="solr.TrieDateField"
omitNorms="true" precisionStep="6"
positionIncrementGap="0"/>
<!-- A text field that only splits on whitespace for exact matching
of words -->
<fieldType name="text_ws" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
- 33. Analysis
● Solr does not really search your text, but rather
the terms that result from the analysis of text
● Typically a chain of
– Character filter(s)
– Tokenisation
– Filter A
– Filter B
– …
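To make the chain concrete, here is a toy version in PHP. It is not how Solr implements analysis (that happens in Java, configured in schema.xml); the stop word list and the choice of filters are assumptions, picked only to show why the indexed terms differ from the raw text.

```php
<?php
// Toy analysis chain: character filter, tokeniser, then two token filters.
function analyze($text, array $stopWords = array('the', 'a', 'of'))
{
    // Character filter: strip markup before tokenisation.
    $text = strip_tags($text);
    // Tokenisation: split on anything that is not a letter or a digit.
    $tokens = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    // Filter A: lowercase every token (strtolower only handles ASCII).
    $tokens = array_map('strtolower', $tokens);
    // Filter B: drop stop words.
    $tokens = array_filter($tokens, function ($token) use ($stopWords) {
        return !in_array($token, $stopWords, true);
    });
    // Reindex so the result is a plain list of terms.
    return array_values($tokens);
}

print_r(analyze('<p>The State of the Art: Full-Text Search</p>'));
```

A query has to go through a compatible chain, otherwise a search for "Full-Text" would never match the indexed terms "full" and "text".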
- 34. Solr comes with many tokenizers and filters
● Some are language specific
● Others are very specialised
● It is very important to get this right,
otherwise you may not get what you expect!
Best practice: do like eZ Find and provide multiple incarnations of a field to
suit facet, filter and search needs
- 36. Semantic aspects: using an annotation engine
● Main use cases for CMS systems
– Suggest tags for editors to use
– Enhance search engine relevancy
– Enhance clustering (related content)
● Based on
– Domain specific ontologies
– Publicly available databases and (RESTful) services
- 38. eZ Publish / eZ Find integration
● Personal initiative
– Joined an EC funded project as “early adopter”
● Initial goals:
– eZ Find relevancy optimisation
– Annotation suggestions from public data
● More ambitious
– eZ Publish based, domain specific ontology
definition
– TBD, as Apache Stanbol evolves
- 40. The eZ Publish content model
● One of the main strengths
● But
– Do you need versioning in all cases?
– Translations: quite tightly coupled
– Hard to run workflows independently of the published version
– Variability in objects: sometimes too rigid
– Want traveling objects (UUID)
– ...
● And of course: scalability of the implementation
is limited too
- 41. So, a call for participation ….
- 42. A new content repository project
● Provide a very powerful content model
– adaptable to various scenarios and use-cases
● Exposes a rich service layer, including an
optional security model
– Role / policy based
● Exposes its content through a variety of ways
– Simple-to-use PHP API
– REST-style
– Later: various standards (PHPCR, CMIS)
- 43. A new content repository ...
● Builds on top of an IR (information retrieval)
layer
– initially Solr based
● Pluggable persistence layer
– Traditional RDBMS
– Highly scalable NoSQL stores (HBase, MongoDB, CouchDB, ..)
- 44. Connects to eZ Publish through ..
● eZ Find
● Dedicated modules
and after refactoring of the kernel
● Use it as a content store for eZ Publish itself
- 45. Thank you!
Questions?
http://joind.in/3443
paul.borgermans@gmail.com
@paulborgermans