Taking eZ Find beyond full-text search
        Paul Borgermans




    eZ Summer Conference

    London, June 16-17, 2011




                               © 2011 Paul Borgermans
About me
●   10 years in the eZ ecosystem
         –   eZ Lucene → eZ Solr → eZ Find
         –   3.5 years with eZ Systems (2007-2010)
         –   Independent consultant since 2011
●   Fancying
         –   Apache projects (mainly Solr, Hadoop, Stanbol, Zeta, ..)
         –   NoSQL (Not only SQL) and scalable architectures
         –   eZ Publish & CMS systems in general
         –   Semantic web




Large sites?




Lots of traffic?




Per user navigation needs?




Complex pages?

Slow attribute filters?




Need to integrate data from other sources?



    ERP


              DB



eZ Find is your friend!




Although sometimes more like a rough diamond




Prelude

Meet the beast ...
eZ Find

     RESTful




Solr in a nutshell
●   State-of-the-art full-text search and
    information retrieval engine
●   Fast and scalable, with native replication features
●   Flexible configuration
●   Extensible
●   Document-oriented storage
●   Geospatial search (Solr 3.1+)
●   Native cloud features*
    * under active development, almost complete (Solr 4.0)
Solr
[Figure: Solr architecture diagram. Components shown: HTTP Request Servlet, Update Servlet, Admin Interface, Standard Request Handler, Disjunction Max Request Handler, Custom Request Handler, XML/PHP/JSON/... Response Writer, XML Update Interface, Config, Schema, Analysis, Caching, Concurrency, Update Handler, Solr Core, Replication, Lucene. Figure credit: Yonik Seeley]
Performance!
●   Solr, the backend, employs intelligent caches
        –   filters
        –   queries
        –   internal indexes
●   Optimized for search/retrieval
        –   Slower writing
●   When updates are done, caches are
    rebuilt on the fly in the background
●   Horizontal & vertical scaling
Using eZ Find/Solr beyond search
eZ Find alter egos
●   eZ Find/Solr as a scalable IR engine/layer
        –   Remove the burden on your DB
        –   Significant speedups for regular content as well
        –   Clustering built-in
●   eZ Find/Solr as a content and integration
    engine
        –   Document oriented storage system
             (hello NoSQL)
        –   Archive use-case
        –   External content
eZ Find alter egos (...)
●   Alternate navigation interfaces
        –   Facets, filtering, sorting
        –   Function queries (!)

●   Document clustering
        –   More Like This
        –   Tag based
              (and more semantic stuff coming up)
        –   Carrot2 based
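
A minimal sketch of how facets, filtering and sorting map onto plain Solr request parameters (q, fq, facet, facet.field, sort, wt). The helper name and the published_dt field are assumptions for illustration; tags_lk is borrowed from the indexing example later in the deck.

```php
<?php
// Sketch: build the query string for a faceted, filtered, sorted Solr request.
// buildNavigationQuery() and the field published_dt are hypothetical.
function buildNavigationQuery(string $facetField, string $filter, string $sort): string
{
    $params = array(
        'q'           => '*:*',        // match everything; navigation is driven by the filter
        'fq'          => $filter,      // filter query, cached separately from q
        'facet'       => 'true',
        'facet.field' => $facetField,  // field whose value counts are returned
        'sort'        => $sort,
        'wt'          => 'json',       // response writer
    );
    return http_build_query($params);
}

$query = buildNavigationQuery('tags_lk', 'tags_lk:London', 'published_dt desc');
echo urldecode($query), "\n";
```

The resulting string can be appended to the select URL of a Solr core; a function query could be passed through the same sort or q parameters.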


Provisions in eZ Find
●   Attribute storage (serialized content)
        –   Fewer DB queries
●   Multi-core setup
●   Distributed search in fetch(ezfind, search)
        –   Query parameters
        –   Filter parameters
        –   Fields to return (for rendering)



Getting external data into Solr




Tools

●   Solr Data Import Handler (DIH)
●   Apache ManifoldCF connector framework
●   Using APIs
       –   eZ Find
       –   Zeta Components Search




Integrating external data: Solr DIH
    http://wiki.apache.org/solr/DataImportHandler
●   Goals
       –   Read data residing in relational databases and
            XML files
       –   Build Solr documents according to configuration
            (joins, views, ...)
       –   Update Solr with such documents
       –   Support full imports as well as delta imports
Configuring DIH
●   Requires a more complete Solr: add the DIH JARs
●   solrconfig.xml:
    <requestHandler name="/dataimport"
                    class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
        <str name="config">/home/username/data-config.xml</str>
      </lst>
    </requestHandler>

●   Configure data sources (RDBMS, XML files)
         –   data-config.xml with connection and schema
              information
Using DIH
●   Send commands to DIH request handler
            http://<host>:<port>/solr/dataimport?command=<command>



       –   full-import
       –   delta-import
       –   status
●   You can use eZ Find raw Solr request API
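
As a rough sketch, the command URL above can be assembled in PHP as follows. The helper name, host and port are assumptions for illustration, and only the three commands listed on this slide are accepted.

```php
<?php
// Sketch: build the URL for a Data Import Handler command.
// buildDihUrl() is a hypothetical helper, not part of eZ Find.
function buildDihUrl(string $host, int $port, string $command): string
{
    $allowed = array('full-import', 'delta-import', 'status');
    if (!in_array($command, $allowed, true)) {
        throw new InvalidArgumentException("Unsupported DIH command: $command");
    }
    return sprintf(
        'http://%s:%d/solr/dataimport?%s',
        $host,
        $port,
        http_build_query(array('command' => $command))
    );
}

echo buildDihUrl('localhost', 8983, 'delta-import'), "\n";
```

The URL can then be fetched with any HTTP client, for example from a cron job that triggers regular delta imports.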




Apache ManifoldCF
    http://incubator.apache.org/connectors/
●   ManifoldCF is a crawler framework
●   Supports:
       –   File System, Windows Shares
       –   JDBC, RSS
       –   Web, LiveLink (OpenText)
       –   Documentum (EMC)
       –   SharePoint (MSFT)
       –   Meridio (Autonomy)
       –   FileNet (IBM)
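With the eZ Find API...

<?php
// Connect to the Solr instance
$solr = new eZSolrBase('http://localhost:8983/solr');

// Each document is an array of field name => value(s);
// further fields were elided on the slide
$documents = array( array( 'id' => '1135', // ...
                           'tags_lk' => array('London', '2011') ) );

foreach ($documents as $doc) {
    ezfSolrUtils::addDocument($solr, $doc);
}

// Nothing becomes searchable until the commit
$solr->commit();

?>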
Or with Zeta Components Search
●   http://incubator.apache.org/zetacomponents/
       <?php
       require_once 'tutorial_autoload.php';
       // on localhost with the default port
       $handler = new ezcSearchSolrHandler;
       // on another host with a different port
       $handler = new ezcSearchSolrHandler( '10.0.2.184', 9123 );
       ?>
Indexing workflow
●   Assemble documents in the correct XML format
●   Send one or more documents at a time
●   Commit: only then do the documents become searchable
●   Optional parameters
       –   Boosting at the document level
       –   Boosting at the field level
       –   Maximum delay before an automatic commit
              (commitWithin, in milliseconds)
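
To make the XML format concrete, here is a minimal sketch that serializes documents into a Solr <add> message with the optional commitWithin attribute and document/field boosts. The function name and its parameters are assumptions for illustration; the sample document reuses the values from the eZ Find API slide.

```php
<?php
// Sketch: serialize documents into a Solr XML <add> message.
// buildAddMessage() is a hypothetical helper; boosts and commitWithin are optional.
function buildAddMessage(array $docs, ?int $commitWithinMs = null,
                         ?float $docBoost = null, array $fieldBoosts = array()): string
{
    $w = new XMLWriter();
    $w->openMemory();
    $w->startElement('add');
    if ($commitWithinMs !== null) {
        $w->writeAttribute('commitWithin', (string) $commitWithinMs);
    }
    foreach ($docs as $doc) {
        $w->startElement('doc');
        if ($docBoost !== null) {
            $w->writeAttribute('boost', (string) $docBoost);   // document-level boost
        }
        foreach ($doc as $name => $values) {
            // Multi-valued fields are sent as repeated <field> elements.
            foreach ((array) $values as $value) {
                $w->startElement('field');
                $w->writeAttribute('name', (string) $name);
                if (isset($fieldBoosts[$name])) {
                    $w->writeAttribute('boost', (string) $fieldBoosts[$name]); // field-level boost
                }
                $w->text((string) $value);   // XMLWriter escapes special characters
                $w->endElement();
            }
        }
        $w->endElement();
    }
    $w->endElement();
    return $w->outputMemory();
}

$docs = array(array('id' => '1135', 'tags_lk' => array('London', '2011')));
echo buildAddMessage($docs, 10000), "\n";
```

The message would be POSTed to the update handler; with commitWithin set, no explicit commit is needed for these documents.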


Indexing workflow: important properties (...)
●   Update = Add with same global id
●   Deleting
       –   An individual document (id)
       –   A collection of documents (using a Solr query
             expression)
       –   Needs a commit() to really disappear from
            search results
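
The two delete styles and the commit correspond to small XML update messages. A minimal sketch, with hypothetical helper names:

```php
<?php
// Sketch: Solr XML update messages for deleting by id, deleting by query,
// and committing. The helper names are assumptions for illustration.
function buildDeleteById(string $id): string
{
    return '<delete><id>' . htmlspecialchars($id, ENT_XML1) . '</id></delete>';
}

function buildDeleteByQuery(string $query): string
{
    return '<delete><query>' . htmlspecialchars($query, ENT_XML1) . '</query></delete>';
}

function buildCommit(): string
{
    return '<commit/>';
}

echo buildDeleteById('1135'), "\n";
echo buildDeleteByQuery('tags_lk:London'), "\n";
echo buildCommit(), "\n";
```

Each message is sent to the update handler; the deleted documents only disappear from search results once the commit has been processed.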




Indexing: performance considerations
●   Commits can become expensive
       –   Use them wisely: in batches where you can
       –   Delay options
               ●   cron job
               ●   commitWithin parameter
●   From time to time, you also need to run an optimize()
    command
       –   Deletes leave “holes”
       –   File fragmentation with adding/updating
       –   Daily, or weekly for very large indexes (multi GB)
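
A minimal sketch of the batching advice: documents are sent in chunks and committed once at the end. The $send and $commit callables are hypothetical hooks supplied by the caller (for instance wrappers around an HTTP client), so the logic can be tried without a running Solr.

```php
<?php
// Sketch: send documents in batches and commit once at the end.
// indexInBatches() and its callable hooks are assumptions for illustration.
function indexInBatches(array $docs, int $batchSize, callable $send, callable $commit): int
{
    if ($batchSize < 1) {
        throw new InvalidArgumentException('Batch size must be at least 1');
    }
    $batches = 0;
    foreach (array_chunk($docs, $batchSize) as $batch) {
        $send($batch);   // one round trip per batch, no commit yet
        $batches++;
    }
    if ($batches > 0) {
        $commit();       // a single commit makes all batches searchable
    }
    return $batches;
}

$count = indexInBatches(
    range(1, 5),
    2,
    function (array $batch) { echo 'sending ', count($batch), " docs\n"; },
    function () { echo "commit\n"; }
);
echo $count, " batches\n";
```

The same structure works with commitWithin instead of the final commit: drop the $commit call and let Solr schedule it.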
But you will also need to configure Solr




Field definitions: schema.xml
●   Field types
        –   text
        –   numerical
        –   dates
        –   location
        –   … (about 25 in total)
●   Actual fields (name, definition, properties)
●   Dynamic fields
●   Copy fields (as aggregators)
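
To illustrate how dynamic fields work, the sketch below resolves a field name to a type: explicitly declared fields first, then wildcard patterns. All field names, patterns and types are hypothetical, and the sketch simply takes the first matching pattern, which is a simplification of what Solr does.

```php
<?php
// Sketch: resolve a field name via explicit fields, then dynamic field patterns.
// resolveFieldType() and the sample schema below are assumptions for illustration.
function resolveFieldType(string $name, array $fields, array $dynamicFields): ?string
{
    if (isset($fields[$name])) {
        return $fields[$name];
    }
    foreach ($dynamicFields as $pattern => $type) {
        // Dynamic field patterns carry a single leading or trailing "*".
        if ($pattern[0] === '*') {
            $suffix = substr($pattern, 1);
            if ($suffix !== '' && substr($name, -strlen($suffix)) === $suffix) {
                return $type;
            }
        } elseif (substr($pattern, -1) === '*') {
            $prefix = substr($pattern, 0, -1);
            if (strncmp($name, $prefix, strlen($prefix)) === 0) {
                return $type;
            }
        }
    }
    return null;   // no declaration matches: Solr would reject such a field
}

$fields  = array('id' => 'string');
$dynamic = array('*_s' => 'string', '*_dt' => 'tdate', 'attr_*' => 'text');
var_dump(resolveFieldType('published_dt', $fields, $dynamic));
```

Dynamic fields let new content attributes be indexed without touching schema.xml, as long as their names follow the agreed suffix or prefix.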

schema.xml: simple field type examples
    <fieldType name="string" class="solr.StrField"
               sortMissingLast="true" omitNorms="true"/>

    <!-- boolean type: "true" or "false" -->
    <fieldType name="boolean" class="solr.BoolField"
               sortMissingLast="true" omitNorms="true"/>

    <!-- A Trie based date field for faster date range
         queries and date faceting. -->
    <fieldType name="tdate" class="solr.TrieDateField"
               omitNorms="true" precisionStep="6"
               positionIncrementGap="0"/>

    <!-- A text field that only splits on whitespace for exact
         matching of words -->
    <fieldType name="text_ws" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>
Analysis
●   Solr does not really search your text, but rather
    the terms that result from the analysis of text
●   Typically a chain of
        –   Character filter(s)
        –   Tokenisation
        –   Filter A
        –   Filter B
        –   …
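
The chain can be imitated in a few lines of plain PHP to show why the index holds terms rather than the original text. This is only an analogy under simple assumptions (HTML stripping as character filter, whitespace tokenisation, lowercasing, a tiny stop word list); Solr's real analyzers are configured in schema.xml and are far richer.

```php
<?php
// Sketch of an analysis chain: char filter -> tokenizer -> token filters.
// analyze() and the default stop word list are assumptions for illustration.
function analyze(string $text, array $stopWords = array('the', 'a', 'of')): array
{
    $text   = strip_tags($text);                                    // character filter
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);  // tokenisation
    $tokens = array_map('strtolower', $tokens);                     // filter A: lowercase
    $tokens = array_filter($tokens, function ($t) use ($stopWords) {
        return !in_array($t, $stopWords, true);                     // filter B: stop words
    });
    return array_values($tokens);                                   // reindex from 0
}

print_r(analyze('<p>The Quick brown Fox</p>'));
```

A query goes through a comparable chain, which is why a search for "fox" can match "Fox" in the original text.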



Solr comes with many tokenizers and filters

●   Some are language specific
●   Others are very specialised
●   It is very important to get this right;
    otherwise, you may not get what you expect!

       Best practice: do as eZ Find does and provide multiple incarnations
       of a field to suit facet, filter and search needs
Semantic aspects




Semantic aspects: using an annotation engine
●   Main use cases for CMS systems
       –   Suggest tags to editors
       –   Enhance search engine relevancy
       –   Enhance clustering (related content)
●   Based on
       –   Domain-specific ontologies
       –   Publicly available databases and (RESTful)
            services
Annotation engine: “open” databases
eZ Publish / eZ Find integration
●   Personal initiative
        –   Joined an EC funded project as “early adopter”
●   Initial goals:
        –   eZ Find relevancy optimisation
        –   Annotation suggestions from public data
●   More ambitious
        –   eZ Publish based, domain specific ontology
             definition
        –   TBD, as Apache Stanbol evolves

Something extra ...




The eZ Publish content model
●   One of the main strengths
●   But
          –   Do you need versioning in all cases?
          –   Translations: quite tightly coupled
          –   Difficult to run workflows independently of the
                published version
          –   Variability in objects: sometimes too rigid
          –   Want traveling objects (UUID)
          –   ...
●   And of course: the scalability of the implementation
    is limited too
So, a call for participation ….




A new content repository project
●   Provides a very powerful content model
        –   adaptable to various scenarios and use-cases
●   Exposes a rich service layer, including an
    optional security model
        –   Role / policy based
●   Exposes its content in a variety of ways
        –   Simple-to-use PHP API
        –   REST-style
        –   Later: various standards (PHPCR, CMIS)
A new content repository ...
●   Builds on top of an IR (information retrieval)
    layer
        –   initially Solr based
●   Pluggable persistence layer
        –   Traditional RDBMS
        –   Highly scalable NoSQL stores (HBase,
             MongoDB, CouchDB, ..)
Connects to eZ Publish through ..
●   eZ Find
●   Dedicated modules

    and after refactoring of the kernel

●   Use it as a content store for eZ Publish itself




Thank you!

   Questions?


http://joind.in/3443

paul.borgermans@gmail.com
@paulborgermans



