Taking eZ Find beyond full-text search

Paul Borgermans

eZ Summer Conference

London, June 16-17, 2011

© 2011 Paul Borgermans

About me
● 10 years in the eZ ecosystem
– eZ Lucene → eZ Solr → eZ Find
– 3.5 years with eZ Systems (2007-2010)
– Independent consultant since 2011
● Fancying
– Apache projects (mainly Solr, Hadoop, Stanbol, Zeta, ..)
– NoSQL (Not only SQL) and scalable architectures
– eZ Publish & CMS systems in general
– Semantic web


Large sites?


Lots of traffic?


Per-user navigation needs?


Complex pages?

Slow attribute filters?


Need to integrate data from other
sources (ERP, DB)?


eZ Find is your friend!

Although sometimes more like a rough diamond


Prelude

Meet the beast ….


eZ Find

RESTful


Solr in a nutshell
● State of the art, advanced full text search and
information retrieval engine
● Fast, scalable with native replication features
● Flexible configuration
● Extensible
● Document oriented storage
● Geospatial search (Solr 3.1+)
● Native cloud features*
* under active development, almost complete (Solr 4.0)

[Figure: Solr architecture. HTTP Request Servlet and Update Servlet on top; Admin Interface, Standard / Disjunction Max / Custom Request Handlers, XML/PHP/JSON/... Response Writer and XML Update Interface below them; the Solr Core (Config, Schema, Caching, Update Handler, Analysis, Concurrency) and Replication sit on top of Lucene.]

Figure credit: Yonik Seeley

Performance!
● The Solr backend employs intelligent caches
– filters
– queries
– internal indexes
● Optimized for search/retrieval
– Slower writing
● When updates are done, caches are
reconstructed on the fly in the background
● Horizontal & vertical scaling


Using eZ Find/Solr beyond
search


eZ Find alter egos
● eZ Find/Solr as a scalable IR engine/layer
– Remove the burden on your DB
– Significant speedups also for regular content
– Clustering built-in
● eZ Find/Solr as a content and integration
engine
– Document oriented storage system
(hello NoSQL)
– Archive use-case
– External content

eZ Find alter egos (...)
● Alternate navigation interfaces
– Facets, filtering, sorting
– Function queries (!)

● Document clustering
– More Like This
– Tag based
(and more semantic stuff coming up)
– Carrot2 based


Provisions in eZ Find
● Attribute storage (serialized content)
– Less DB queries
● Multi-core setup
● Distributed search in fetch(ezfind, search)
– Query parameters
– Filter parameters
– Fields to return (for rendering)


Getting external data into Solr


Tools

● Solr Data Import Handler (DIH)
● Apache ManifoldCF (Manifold Connectors Framework)
● Using APIs
– eZ Find
– Zeta Components Search


Integrating external data: Solr DIH
http://wiki.apache.org/solr/DataImportHandler
● Goals
– Read data residing in relational databases and
XML files
– Build Solr documents according to configuration
(joins, views, ...)
– Update Solr with such documents
– Provide ability to do full imports ..
– .. as well as delta imports


Configuring DIH
● Need a more complete Solr: add DIH jars
● solrconfig.xml:
<requestHandler name="/dataimport"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">/home/username/data-config.xml</str>
  </lst>
</requestHandler>

● Configure data sources (RDBMS, XML files)
– data-config.xml with connection and schema
information


Using DIH
● Send commands to DIH request handler
http://<host>:<port>/solr/dataimport?command=<command>

– full-import
– delta-import
– status
● You can use the eZ Find raw Solr request API
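
The command URLs above can be assembled mechanically. A minimal sketch in Python (standard library only), assuming a Solr instance on localhost:8983 and the /solr/dataimport handler path from the previous slide; the helper name dih_url is made up for illustration, and the optional commit parameter is a standard DIH request parameter:

```python
from urllib.parse import urlencode

# The three DIH commands listed on this slide.
DIH_COMMANDS = {"full-import", "delta-import", "status"}


def dih_url(host, port, command, handler="/solr/dataimport", **params):
    """Build the URL for a Data Import Handler command (hypothetical helper)."""
    if command not in DIH_COMMANDS:
        raise ValueError("unsupported DIH command: %s" % command)
    query = {"command": command}
    query.update(params)  # extra request parameters, e.g. commit="true"
    return "http://%s:%d%s?%s" % (host, port, handler, urlencode(query))


if __name__ == "__main__":
    # Host and port are assumptions; adjust to your own Solr instance.
    print(dih_url("localhost", 8983, "full-import"))
    print(dih_url("localhost", 8983, "delta-import", commit="true"))
    print(dih_url("localhost", 8983, "status"))
```

Fetching the resulting URL with any HTTP client triggers the command; polling the status URL shows import progress.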


Apache ManifoldCF
http://incubator.apache.org/connectors/
● ManifoldCF is a crawler framework
● Supports:
– File System, Windows Shares
– JDBC, RSS
– Web, LiveLink (OpenText)
– Documentum (EMC)
– SharePoint (MSFT)
– Meridio (Autonomy)
– FileNet (IBM)

With the eZ Find API...

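<?php
// Connect to the Solr instance
$solr = new eZSolrBase('http://localhost:8983/solr');

// Each document is an array of field name => value(s); further fields elided
$documents = array( array( 'id' => '1135',...
                           'tags_lk' => array('London','2011')));

foreach ($documents as $doc) {
    ezfSolrUtils::addDocument($solr, $doc);
}

// Commit so the added documents become searchable
$solr->commit();

?>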


Or with Zeta Components Search
● http://incubator.apache.org/zetacomponents/
<?php
require_once 'tutorial_autoload.php';
// on localhost with the default port
$handler = new ezcSearchSolrHandler;
// on another host with a different port
$handler = new ezcSearchSolrHandler( '10.0.2.184', 9123 );
?>


Indexing workflow
● Assemble documents in the correct XML format
● Send one or more documents at a time
● Commit => it becomes searchable
● Optional parameters
– Boosting at the document level
– Boosting at the field level
– Commit deadline: Solr commits automatically within the given time
(commitWithin, in milliseconds)
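
The workflow above can be sketched as follows: a minimal Python example that assembles the XML update message for one or more documents, with optional document boost, field boosts and commitWithin. The function name build_add and the input dictionary layout are assumptions made for this sketch; the element and attribute names (add, doc, field, boost, commitWithin) follow the Solr XML update format of the Solr versions discussed here.

```python
import xml.etree.ElementTree as ET


def build_add(docs, commit_within_ms=None):
    """Assemble a Solr <add> message (hypothetical helper).

    Each doc is a dict with:
      "fields":       field name -> value or list of values
      "boost":        optional document-level boost
      "field_boosts": optional field name -> field-level boost
    """
    add = ET.Element("add")
    if commit_within_ms is not None:
        add.set("commitWithin", str(commit_within_ms))
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        if "boost" in doc:
            doc_el.set("boost", str(doc["boost"]))
        field_boosts = doc.get("field_boosts", {})
        for name, value in doc["fields"].items():
            # Multi-valued fields are sent as repeated <field> elements.
            values = value if isinstance(value, (list, tuple)) else [value]
            for v in values:
                field_el = ET.SubElement(doc_el, "field", name=name)
                if name in field_boosts:
                    field_el.set("boost", str(field_boosts[name]))
                field_el.text = str(v)
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    message = build_add(
        [{"fields": {"id": "1135", "tags_lk": ["London", "2011"]}, "boost": 2.0}],
        commit_within_ms=10000,
    )
    print(message)  # POST this to the Solr update handler
```

With commitWithin set, no explicit commit is needed: Solr makes the documents searchable within the given number of milliseconds.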


Indexing workflow:
important properties (...)
● Update = Add with same global id
● Deleting
– An individual document (id)
– A collection of documents (using a Solr query
expression)
– Needs a commit() to really disappear from
search results
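
A short Python sketch of the two delete forms and the commit that follows them. The helper names are made up for illustration, and the query tags_lk:London simply reuses the field from the earlier eZ Find example; the delete, id, query and commit elements are part of the Solr XML update format.

```python
import xml.etree.ElementTree as ET

# Deletes only disappear from search results after a commit.
COMMIT_MESSAGE = "<commit/>"


def delete_by_id(doc_id):
    """Message that deletes one document by its global id."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")


def delete_by_query(query):
    """Message that deletes every document matching a Solr query expression."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


if __name__ == "__main__":
    # Send each message to the Solr update handler, then the commit.
    for message in (delete_by_id("1135"),
                    delete_by_query("tags_lk:London"),
                    COMMIT_MESSAGE):
        print(message)
```

Building the messages with an XML library rather than string concatenation keeps special characters in ids and queries properly escaped.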


Indexing: performance
considerations
● Commits can become expensive
– Use them wisely: in batches where you can
– Delay options
● cron job
● commitWithin parameter
● From time to time, you also need an optimize()
command
– Deletes leave “holes”
– File fragmentation with adding/updating
– Daily, or weekly for very large indexes (multiple GB)
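
The "commit in batches" advice can be sketched like this in Python. The send and commit callables are placeholders supplied by the caller (for example thin wrappers around whatever Solr client you use); the batch size of 100 is an arbitrary assumption.

```python
from itertools import islice


def batches(documents, size):
    """Yield lists of at most `size` documents from any iterable."""
    iterator = iter(documents)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def index_in_batches(documents, send, commit, batch_size=100):
    """Send documents batch by batch and commit once at the end.

    `send` receives a list of documents; `commit` takes no arguments.
    Both are hypothetical callables provided by the caller.
    Returns the number of documents sent.
    """
    sent = 0
    for chunk in batches(documents, batch_size):
        send(chunk)
        sent += len(chunk)
    if sent:
        commit()  # one commit for the whole run, not one per document
    return sent


if __name__ == "__main__":
    log = []
    total = index_in_batches(
        ({"id": str(i)} for i in range(250)),
        send=lambda chunk: log.append("send %d docs" % len(chunk)),
        commit=lambda: log.append("commit"),
    )
    print(total, log)
```

The same structure works with a cron job: the job collects pending documents, calls index_in_batches once, and Solr rebuilds its caches a single time.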

But you will also need to configure Solr


Field definitions: schema.xml
● Field types
– text
– numerical
– dates
– location
– … (about 25 in total)
● Actual fields (name, definition, properties)
● Dynamic fields
● Copy fields (as aggregators)


schema.xml: simple field type examples
<fieldType name="string" class="solr.StrField"
sortMissingLast="true" omitNorms="true"/>


<fieldType name="boolean" class="solr.BoolField"
sortMissingLast="true" omitNorms="true"/>


<fieldType name="tdate" class="solr.TrieDateField"
omitNorms="true" precisionStep="6"
positionIncrementGap="0"/>


<fieldType name="text_ws" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>

Analysis
● Solr does not really search your text, but rather
the terms that result from the analysis of text
● Typically a chain of
– Character filter(s)
– Tokenisation
– Filter A
– Filter B
– …
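
A toy Python version of such a chain makes the point concrete. It is a simulation only, not Solr code: the character filter, tokenizer, filters and stopword list are simplified stand-ins chosen for this sketch.

```python
import re

STOPWORDS = frozenset({"the", "a", "of"})  # tiny illustrative stopword list


def strip_markup(text):
    """Character filter: replace anything that looks like a tag by a space."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenize(text):
    """Tokenizer: split on whitespace."""
    return text.split()


def lowercase(tokens):
    """Filter A: lowercase every token."""
    return [token.lower() for token in tokens]


def remove_stopwords(tokens):
    """Filter B: drop very common words."""
    return [token for token in tokens if token not in STOPWORDS]


def analyze(text):
    """Run the chain: character filter, tokenizer, then the token filters."""
    tokens = whitespace_tokenize(strip_markup(text))
    for token_filter in (lowercase, remove_stopwords):
        tokens = token_filter(tokens)
    return tokens


if __name__ == "__main__":
    indexed_terms = analyze("<p>The Beast of London</p>")
    query_terms = analyze("beast")
    print(indexed_terms)
    # A match happens on terms, so the query must go through a compatible chain.
    print(all(term in indexed_terms for term in query_terms))
```

The index stores the terms beast and london, not the original text, which is why a query for "Beast" only matches if it is analyzed the same way.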


Solr comes with many tokenizers and
filters

● Some are language specific
● Others are very specialised
● It is very important to get this right

otherwise, you may not get what you expect!

Best practice: do as eZ Find does and provide multiple incarnations of a field to
suit facet, filter and search needs


Semantic aspects


Semantic aspects:
using an annotation engine
● Main use cases for CMS systems
– Suggest tags for editors to use
– Enhance search engine relevancy
– Enhance clustering (related content)
● Based on
– Domain specific ontologies
– Publicly available databases and (RESTful)
services


Annotation engine: “open”
databases


eZ Publish / eZ Find integration
● Personal initiative
– Joined an EC funded project as “early adopter”
● Initial goals:
– eZ Find relevancy optimisation
– Annotation suggestions from public data
● More ambitious
– eZ Publish based, domain specific ontology
definition
– TBD, as Apache Stanbol evolves


Something extra ...


The eZ Publish content model
● One of the main strengths
● But
– Do you need versioning in all cases?
– Translations: quite tightly coupled
– Difficult to have workflows independent of the
published version
– Variability in objects: sometimes too rigid
– Want traveling objects (UUID)
– ...
● And of course: scalability of the implementation
is limited too

So, a call for participation ….


A new content repository project
● Provide a very powerful content model
– adaptable to various scenarios and use-cases
● Exposes a rich service layer, including an
optional security model
– Role / policy based
● Exposes its content through a variety of ways
– Simple-to-use PHP API
– REST-style
– Later: various standards (PHPCR, CMIS)


A new content repository ...
● Builds on top of an IR (information retrieval)
layer
– initially Solr based
● Pluggable persistence layer
– Traditional RDBMS
– Highly scalable NoSQL stores (HBase,
MongoDB, CouchDB, ..)


Connects to eZ Publish through ..
● eZ Find
● Dedicated modules

and after refactoring of the kernel

● Use it as a content store for eZ Publish itself


Thank you!

Questions?

http://joind.in/3443

paul.borgermans@gmail.com
@paulborgermans

