Open Source
Search & Retrieval
Platform

Enterprise Search
EAI
Semantic Web
Marc Teutelink
Date: 21 August 2010
How Apache open source software is used
     during the implementation of an
Enterprise Search and Retrieval Platform

 (Lucene/SOLR, Nutch, Tika, ServiceMix/Camel, Felix/Ace)
Marc Teutelink
   marc.teutelink@luminis.eu
   @mteutelink



•Software architect at Luminis
•15+ years of experience in software development; specialized in
Enterprise Search, Enterprise Application Integration and
Semantic Web technology
•Currently writing “Enterprise Search in Action” for Manning
(Mid-2011)
Agenda



•Enterprise Search
 • What is Enterprise Search: Functions and features
 • Challenges
 • Logical Architecture
•Enterprise Search Solution
 • Technology Stack
    • Collection Process
    • Publication Process
    • Enricher framework
 • Deployment
•Conclusion
What is Enterprise Search?



“Enterprise Search offers a solution for searching,
finding and presenting enterprise related information
in the larger sense of the word”

Enterprise search is all about searching through documents of
any type and format, from any source, located anywhere, with the
utmost flexibility
 • Web search: limited to public documents on the web
 • Desktop search: limited to private documents on the local machine
 • Enterprise search: no limitations on document type and location
Enterprise Search
(features)

•Information Sources and Types
 • Wide range of sources: local and remote filesystems, content repositories,
   e-mail, databases, internet, intranet and extranet
 • Type not limited: any type ranging from structured to unstructured data, text
   and binary formats and compound formats (zip)

•Usage
 • Not limited to interactive use → automated business processes

•Security
 • Integrations with enterprise security infrastructure

•User Interaction and personalization
 • Identity enables more personalized search results
Enterprise Search
(features)

•Extended metadata
 • More metadata → better and more precise search results
 • More control over schema (for example Dynamic Fields)

•Ranking
 • More control over ranking: personalized ranking (group)

•Data extraction and derivation
 • Extract data using various techniques: XPath, XQuery
 • Derive data: using external knowledge models: RDBMS, RDF Store, Web Services
 • Conditional extraction & derivation

•Managing and monitoring
 • On-the-fly management (JMX)
 • Real time monitoring
Enterprise Search
(features)

•User Interfaces
 • Web search
    • All about selling advertisements to the masses
    • Generic & minimalistic screens; focus on ads

 • Enterprise search
    • All about finding: rich navigation; focus on quick find
    • Small targeted audience
        • Specialized and customized screens (use of ontologies, taxonomies
           and classifications)
        • Use of identity (results customized to user) and web 2.0
    • Grouping
        • field collapsing, faceted search & clustering
Enterprise Search
(Challenges)

•Performance and scalability
•Rich functions and features
•Manageability
•Flexibility
•Easy maintenance
•Quick issue and problem solving
•Reduce total cost of ownership
   Commercial Search Engines?
   Apache Based (Open Source)
    Search & Retrieval Platform
Enterprise Search
(Logical Architecture)
Collection Process
• Sources
• Content Inbound
 • Pull (Crawling)
 • Pull (Harvesting)
 • Push (SOAP/REST)
• Content Validation
 • Syntactic
 • Semantic
• Content Enrichment
 • Extraction
 • Enhancement
 • Filtering
• Indexing (Search Engine)

Publication Process
• Actors
• Request Inbound
 • HTTP/Get (URL)
 • HTTP/Post (SOAP/REST)
 • API (Java, Perl, ...)
• Request Validation
 • Syntactic
 • Semantic
• Request Enrichment
 • Redirection (Suggestions)
 • Enhancement (add/remove clauses)
• Searching & Ordering (Search Engine)
 • Filtering
 • Grouping
 • Sorting
• Response Enrichment
 • Redirection (more like this)
 • Enhancement (metadata, editorial)
• Response Outbound
 • Stateless (XSLT, SolrJS)
 • Stateful (Webapp Framework)
Enterprise Search
(Collection Process)
Sources
 • Any document format
 • Any type
    • Structured and unstructured
    • Textual and binary
    • Compound
 • Residing anywhere
 • Security
Enterprise Search
(Collection Process)
Content Inbound
• Pull (Crawling/Spidering)
 • Internet, intranet & extranet
 • Local and remote filesystems
• Pull (Harvesting)
 • Databases
 • Content Repositories / Mgmt Systems
 • Webservices inbound
• Push
 • Webservices (SOAP/REST)
 • Real time indexing
Enterprise Search
(Collection Process)
Content Validation
• Syntactic validation
 • Based on DTD / XML-Schema
 • Structure and limited content
• Semantic validation
 • Based on algorithms:
    • Groovy, XPath, Regex, …
• Think about exception handling
• Placed anywhere in flow
 • During inbound: XML-Schema validation
 • After Enrichment: Validate derived metadata
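The two validation levels can be illustrated with a minimal sketch. It assumes documents arrive as XML. The syntactic step only checks well-formedness, which stands in for DTD / XML-Schema validation because the Python standard library has no schema validator. The semantic step applies a hypothetical regex rule to a hypothetical `zipcode` field.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical semantic rule: a Dutch-style zipcode such as "1234 AB".
ZIPCODE = re.compile(r"^\d{4}\s?[A-Z]{2}$")


def syntactic_errors(xml_text):
    """Well-formedness check; stands in for DTD / XML-Schema validation."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]
    return []


def semantic_errors(xml_text):
    """Rule-based check on field content (assumes well-formed input)."""
    root = ET.fromstring(xml_text)
    zipcode = root.findtext("zipcode", default="")
    if not ZIPCODE.match(zipcode):
        return [f"invalid zipcode: {zipcode!r}"]
    return []


def validate(xml_text):
    """Run syntactic validation first; only then the semantic rules."""
    errors = syntactic_errors(xml_text)
    return errors or semantic_errors(xml_text)
```

A non-empty result would typically send the message to an invalid-message channel instead of on to enrichment.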
Enterprise Search
(Collection Process)
Content Enrichment
• Extraction
 • Metadata
 • Content (free text of document)
• Enhancing
 • Derive new and alter existing metadata
• Filtering
 • Remove (parts of) metadata
• Leverage external knowledge models
• Conditional enrichment
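Extraction, enhancing and filtering can be sketched as three small steps over a document with fields. This is an illustration only: the XML layout, the field names and the `DEPARTMENTS` lookup (standing in for an external knowledge model such as an RDBMS or RDF store) are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical external knowledge model: maps a department code to a name.
DEPARTMENTS = {"hr": "Human Resources", "it": "Information Technology"}


def extract(xml_text):
    """Extraction: pull metadata and free text out of the source document."""
    root = ET.fromstring(xml_text)
    body = root.find("body")
    text = "".join(body.itertext()) if body is not None else ""
    return {
        "title": root.findtext("./meta/title", default=""),
        "dept": root.findtext("./meta/dept", default=""),
        "internal_id": root.findtext("./meta/internal-id", default=""),
        "content": " ".join(text.split()),
    }


def enhance(doc):
    """Enhancing: derive new metadata, only when the condition holds."""
    doc = dict(doc)
    if doc["dept"] in DEPARTMENTS:  # conditional enrichment
        doc["dept_name"] = DEPARTMENTS[doc["dept"]]
    return doc


def filter_fields(doc, drop=("internal_id",)):
    """Filtering: remove metadata that should not reach the index."""
    return {k: v for k, v in doc.items() if k not in drop}


def enrich(xml_text):
    """Extraction, then enhancing, then filtering."""
    return filter_fields(enhance(extract(xml_text)))
```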
Enterprise Search
(Collection Process)
Indexing
• Store in search engine(s)
 • Content based routing
• Document boosting
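Content based routing and document boosting can be sketched as two lookups on document metadata. The index names, the `type` and `source` fields and the boost values below are hypothetical.

```python
# Hypothetical index names and boost values, chosen for illustration only.
ROUTES = {"news": "news-index", "product": "catalog-index"}
DEFAULT_INDEX = "general-index"
BOOSTS = {"editorial": 2.0}


def route(doc):
    """Content based routing: pick the target index from the document type."""
    return ROUTES.get(doc.get("type"), DEFAULT_INDEX)


def boost(doc):
    """Document boosting: index-time weight derived from metadata."""
    return BOOSTS.get(doc.get("source"), 1.0)


def plan_indexing(docs):
    """Group documents per target index together with their boost."""
    plan = {}
    for doc in docs:
        plan.setdefault(route(doc), []).append((doc["id"], boost(doc)))
    return plan
```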
Enterprise Search
                       (Publication Process)


Request Inbound
• HTTP/Get
 • URL based with parameters
 • Response in XML, JSON, …
• HTTP/Post
 • XML (SOAP, REST) request
 • XML (SOAP, REST) response
• API
 • Java, Perl, …
 • Wrappers on HTTP/Get
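A URL based request with parameters might be built as in the sketch below. The host and path are hypothetical; `q`, `fq`, `rows` and `wt` are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Hypothetical host and path; adjust to the actual Solr deployment.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_search_url(query, filters=(), rows=10, fmt="json"):
    """Build an HTTP/Get search URL; `wt` selects the response format."""
    params = [("q", query)]
    params += [("fq", f) for f in filters]  # one fq parameter per filter
    params += [("rows", rows), ("wt", fmt)]
    return SOLR_SELECT + "?" + urlencode(params)
```

An API wrapper in Java, Perl or another language would hide this URL construction behind a method call.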
Enterprise Search
                       (Publication Process)


Request Validation
• Syntactic Validation
 • Correct Query syntax?
• Semantic Validation
 • Correct Field Filters?
 • Based on algorithms: Groovy, Regex
• Placed anywhere in flow
 • @inbound: XML-Schema validation
 • @enrichment: Validate derived request clauses
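The two questions above ("correct query syntax?" and "correct field filters?") can be sketched as follows. The syntax check is deliberately shallow (balanced parentheses and quotes only) and `KNOWN_FIELDS` is a hypothetical schema.

```python
import re

# Hypothetical schema: the fields a request may filter on.
KNOWN_FIELDS = {"title", "author", "type", "zipcode"}
FIELD_CLAUSE = re.compile(r"(\w+):")


def syntactic_errors(query):
    """Correct query syntax? Here: balanced parentheses and quotes only."""
    errors = []
    depth = 0
    for ch in query:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:  # a ")" appeared before its "("
            break
    if depth != 0:
        errors.append("unbalanced parentheses")
    if query.count('"') % 2:
        errors.append("unbalanced quotes")
    return errors


def semantic_errors(query):
    """Correct field filters? Every field:value clause must use a known field."""
    unknown = sorted(set(FIELD_CLAUSE.findall(query)) - KNOWN_FIELDS)
    return [f"unknown field: {name}" for name in unknown]
```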
Enterprise Search
                       (Publication Process)


Request Enrichment
• Redirection
 • Spelling suggestions
 • Metadata suggestions
• Enhancing
 • Add/Remove clauses
 • Stemming, Synonyms, stop words
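Adding and removing clauses can be illustrated at the request level with a small sketch. Solr normally applies stemming, synonyms and stop words in its analysis chain; the code below only shows the idea, and the word lists are hypothetical.

```python
# Hypothetical word lists; a real deployment would load these from configuration.
STOP_WORDS = {"the", "a", "of"}
SYNONYMS = {"tv": ["television"], "laptop": ["notebook"]}


def enhance_query(query):
    """Remove stop-word clauses and add synonym clauses to a simple query."""
    clauses = []
    for term in query.lower().split():
        if term in STOP_WORDS:
            continue  # remove clause
        alternatives = [term] + SYNONYMS.get(term, [])
        if len(alternatives) > 1:  # add clauses
            clauses.append("(" + " OR ".join(alternatives) + ")")
        else:
            clauses.append(term)
    return " AND ".join(clauses)
```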
Enterprise Search
                       (Publication Process)


Searching & Ordering
• Filtering
 • Field Search
• Grouping
 • Add group information
 • Field collapsing, Faceted Search & Clustering
• Sorting
 • Sort on Field
 • Ranking
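Faceted search and field collapsing are performed by the search engine itself; the sketch below only illustrates what the two grouping operations compute, using an in-memory result list with hypothetical `site` and `score` fields.

```python
from collections import Counter


def facet_counts(results, field):
    """Faceted search: number of hits per value of a field."""
    return Counter(doc[field] for doc in results if field in doc)


def collapse(results, field):
    """Field collapsing: keep only the best-scoring hit per field value."""
    best = {}
    for doc in results:
        key = doc[field]
        if key not in best or doc["score"] > best[key]["score"]:
            best[key] = doc
    # Sorting: order the surviving hits by score, highest first.
    return sorted(best.values(), key=lambda d: d["score"], reverse=True)
```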
Enterprise Search
                       (Publication Process)


Response Enrichment
• Redirection
 • Suggestions
 • More like this
• Enhancing
 • Add/Remove response fields
 • Schema information
 • Editorial information
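Enhancing a response can be sketched as removing internal fields from each hit and attaching editorial information. The `EDITORIAL` table and the hidden field name are hypothetical.

```python
# Hypothetical editorial content keyed by query term.
EDITORIAL = {"holiday": "See the HR portal for the leave calendar."}
# Hypothetical internal field that should not be shown to users.
HIDDEN_FIELDS = {"authority"}


def enrich_response(query, docs):
    """Remove internal fields from each hit and attach editorial information."""
    cleaned = [{k: v for k, v in d.items() if k not in HIDDEN_FIELDS} for d in docs]
    response = {"docs": cleaned}
    terms = query.lower().split()
    notes = [text for term, text in EDITORIAL.items() if term in terms]
    if notes:
        response["editorial"] = notes
    return response
```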
Enterprise Search
                       (Publication Process)


Response Outbound
• Stateless
 • No security
 • XSLT, SolrJS
• Stateful
 • Security
 • Web2.0
 • Web Application Framework
Technology Stack
(Collection Process)

•Use ESB for the flow: Apache ServiceMix with Camel
 • Leverage standard ESB components (Transformers, Validation, Splitter,
  Filter, Routers, Scripting)
 • Leverage standard ESB transports (WS, SMTP, JMS, JCR, JDBC, FILE)
 • Custom: Crawler Apache Nutch
    • Leverage only crawl framework
    • Extend NutchIndexWriter; asynchronously pushing crawled documents
      back into ESB flow (reply-to)
•ESB makes distributed flow possible → content based routing
•Hot deploy → easy maintenance
•Reusing services across collection processes
•Search Engine independent
Collection Process Flow
Content Inbound
 • Push Inbound (Message Endpoint)
 • Syntactic Validation (Channel Purger)
 • Splitter: Documents Message → Document Messages 1..N
Content Validation
 • Semantic Validation (Channel Purger)
 • Invalid Message → Invalid Message Channel
Content Enrichment
 • Content Filter
 • Content Enricher (Enricher)
Content Indexer
 • Transformer (Message Translator) → SOLR Document Message
 • HTTP Transport (Channel Adapter)
 • Lucene/SOLR (SOLRJ) → Lucene/Solr index
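The flow above (splitter, semantic validation, transformer) can be sketched in a few functions. In the platform these steps are ESB components; this stand-alone version is illustrative only. The inbound `<documents>` format and the "non-empty id" rule are assumptions; the `<add><doc><field name="...">` structure is Solr's XML update format.

```python
import xml.etree.ElementTree as ET


def split(documents_xml):
    """Splitter: one <documents> message becomes N per-document field maps."""
    root = ET.fromstring(documents_xml)
    return [{f.get("name"): f.text or "" for f in doc} for doc in root.iter("document")]


def is_valid(doc):
    """Semantic validation: here, only require a non-empty id."""
    return bool(doc.get("id"))


def to_solr_add(doc):
    """Message translator: field map -> Solr XML <add> update message."""
    add = ET.Element("add")
    solr_doc = ET.SubElement(add, "doc")
    for name, value in doc.items():
        ET.SubElement(solr_doc, "field", name=name).text = value
    return ET.tostring(add, encoding="unicode")


def process(documents_xml):
    """Return (Solr messages, invalid documents) for one inbound message."""
    valid, invalid = [], []
    for doc in split(documents_xml):
        (valid if is_valid(doc) else invalid).append(doc)
    return [to_solr_add(d) for d in valid], invalid
```

The second return value plays the role of the invalid message channel; the first would be handed to the transport that posts to the search engine.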
Technology Stack
(Publication Process)


•Use flow from Apache Lucene/Solr
 • Leverage standard Solr components (synonyms, stopwords,
   stemming, MLT, spelling, faceted search, …)
 • Custom components: using Solr’s extensibility framework
     • Security: authority field in schema with Apache Shiro integration
     • Field filters (zipcode,…)


•User interfaces
 • Stateless: SolrJS, XSLTResponseWriter & VelocityResponseWriter
 • Stateful: Apache Wicket with Spring
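One way an authority field could be used for security is to append a filter query built from the user's roles. The sketch below is an assumption about how such a filter might look, not the platform's actual component: the roles would come from the security framework, and documents without an authority value are treated as public.

```python
def authority_filter(roles, field="authority"):
    """Build a Solr filter query that matches any of the user's roles."""
    if not roles:
        # Assumption: no roles means only documents without an authority value.
        return f"-{field}:[* TO *]"
    quoted = " OR ".join('"' + r.replace('"', '\\"') + '"' for r in sorted(roles))
    return f"{field}:({quoted})"
```

The returned string would be sent as an additional `fq` parameter, so it restricts results without influencing ranking.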
Enterprise Search
(Logical Architecture)
The logical architecture mapped to the technology stack:
• Collection Process: ServiceMix/Camel, with Nutch for crawling
• Search Engine: Lucene/SOLR
• Response Outbound, stateless: SolrJS/XSLT
• Response Outbound, stateful: Apache Wicket
Enterprise Search
(Logical Architecture)
The same logical architecture, with the Luminis Enricher Framework highlighted at the enrichment steps (Content Enrichment, Request Enrichment, Response Enrichment).
Luminis Enricher Framework



•Custom Enricher Framework
 • Existing ESB & SOLR enricher capabilities not sufficient.

 • Enriching = one or more actions (extraction, enhancing &
   filtering) performed on documents with fields

 • Same enricher to be used for:
    • Collection process:
       • Documents → enriching, filtering & splitting
    • Publication process:
       • Search requests → 'first-components' SearchComponent
       • Search response → 'last-components' SearchComponent
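The idea of one enricher serving both processes can be sketched as an ordered list of actions applied to any mapping of fields, whether that mapping is a document or a search request. This is not the Luminis framework's API; the class and action names are hypothetical.

```python
class Enricher:
    """Applies an ordered list of actions to any mapping of fields."""

    def __init__(self, *actions):
        self.actions = actions

    def __call__(self, fields):
        for action in self.actions:
            fields = action(dict(fields))
            if fields is None:  # an action may filter the item out entirely
                return None
        return fields


def lowercase(name):
    """Enhancing action: lower-case one field if it is present."""
    def action(fields):
        if name in fields:
            fields[name] = fields[name].lower()
        return fields
    return action


def drop(name):
    """Filtering action: remove one field if it is present."""
    def action(fields):
        fields.pop(name, None)
        return fields
    return action


# The same enricher instance handles a document and a search request.
normalise = Enricher(lowercase("q"), lowercase("title"), drop("debug"))
```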
Luminis Enricher Framework
[Figure: Solr search request flow. A <<SOLRQueryRequest>> query enters the <<SearchHandler>> RequestHandler, which runs its "first-components", then its "components" (the <<SearchComponent>>s query, facet, mlt, highlight, stats and debug), then its "last-components". The <<XML>> response passes through the <<QueryResponseWriter>> XSLTResponseWriter (<<XSLT>> XML2HTML) and is returned as the <<(X)HTML>> result.]
Luminis Enricher Framework
(architecture)

•Pipe-and-filter architecture
 • Documents flow through a series of actions
 • Output from one action is input to another action
     • Fields from the input document can be used in an action's clauses: values in
       expressions are filled by replacing Velocity-style patterns with field values
 • Conditional flows supported
 • Reuse of flows & subflows supported
[Figure: example flow]
   Document [[A1,A2],[B]]
     → Action (remove A2)
     → Document [[A1],[B]]
     → If [B=3]
          YES → Action (select C where ${B}) → Document [[A1],[B],[C1]]
          NO  → Action (select C where ${A}) → Document [[A1],[B],[C2]]
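
The example flow above can be sketched in a few lines of Python. This is only an illustration of the pipe-and-filter idea, not the Luminis implementation: the names `fill`, `remove_value`, `select_c`, `branch` and `pipeline`, and the `lookup` table standing in for an external source, are all assumptions made for this sketch.

```python
import re
from typing import Callable, Dict, List

Document = Dict[str, List[str]]          # field name -> list of values
Action = Callable[[Document], Document]  # an action maps a document to a document

PATTERN = re.compile(r"\$\{(\w+)\}")     # Velocity-style ${field} pattern


def fill(expression: str, doc: Document) -> str:
    """Replace each ${field} pattern with the first value of that field."""
    return PATTERN.sub(lambda m: doc[m.group(1)][0], expression)


def remove_value(field: str, value: str) -> Action:
    """Action that drops one value from a multivalued field."""
    def action(doc: Document) -> Document:
        out = {k: list(v) for k, v in doc.items()}
        out[field] = [v for v in out[field] if v != value]
        return out
    return action


def select_c(expression: str, lookup: Dict[str, str]) -> Action:
    """Action that fills the expression, looks the result up in a
    hypothetical table and stores the hit in field C."""
    def action(doc: Document) -> Document:
        out = {k: list(v) for k, v in doc.items()}
        out["C"] = [lookup[fill(expression, doc)]]
        return out
    return action


def branch(test: Callable[[Document], bool], yes: Action, no: Action) -> Action:
    """Conditional flow: run one of two actions depending on the test."""
    def action(doc: Document) -> Document:
        return yes(doc) if test(doc) else no(doc)
    return action


def pipeline(*actions: Action) -> Action:
    """Chain actions so the output of one is the input of the next."""
    def run(doc: Document) -> Document:
        for act in actions:
            doc = act(doc)
        return doc
    return run


lookup = {"3": "C1", "A1": "C2"}  # stand-in for an external knowledge source
flow = pipeline(
    remove_value("A", "A2"),
    branch(lambda d: d["B"] == ["3"],
           select_c("${B}", lookup),
           select_c("${A}", lookup)),
)
print(flow({"A": ["A1", "A2"], "B": ["3"]}))
```

Because every action has the same document-in, document-out shape, a whole `pipeline` is itself an action, which is one way to read "reuse of flows & subflows".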
Luminis Enricher Framework
(Configuration)

•Enricher flow and expression configuration via XML-based DSL
 • Conditional: if-then-else & switch-case-else (with regex support)
 • Actions: add & remove fields and field values using expressions
 • Expression handlers currently supported:
    •   Field
    •   Function (execute methods via Java reflection)
    •   HttpClient (retrieve content by URL described by field values)
    •   XSLT, XPath, XQuery (external XML databases)
    •   JDBC
    •   SPARQL (OpenRDF)
    •   Apache Lucene/Solr
    •   Apache Tika (metadata and text extraction)
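
As a rough illustration of the switch-case-else construct with regex support, the following Python sketch picks a value for a target field from the first matching pattern. It is not the framework's code: the function name `switch_field`, the example patterns and labels, and the choice of whole-value matching (`re.fullmatch`) are assumptions; the slides do not say how the DSL anchors its patterns.

```python
import re
from typing import Dict, List, Tuple

Document = Dict[str, List[str]]  # field name -> list of values


def switch_field(doc: Document, test_field: str,
                 cases: List[Tuple[str, str]], default: str,
                 target: str) -> Document:
    """Set `target` to the result of the first case whose regex matches
    the first value of `test_field`; fall back to `default` (the else)."""
    value = doc[test_field][0]
    chosen = default
    for pattern, result in cases:
        if re.fullmatch(pattern, value):  # assumption: whole-value match
            chosen = result
            break
    out = {k: list(v) for k, v in doc.items()}  # leave the input untouched
    out[target] = [chosen]
    return out


# Hypothetical cases, evaluated in order.
cases = [
    (r".*\.example\.org.*", "Example site"),
    (r"file:.*", "Local disk"),
]
doc = {"content.url": ["http://www.example.org/page"]}
print(switch_field(doc, "content.url", cases, "Other", "source"))
```

The cases are tried in order, so a more specific pattern should be listed before a more general one.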
Luminis Enricher Framework
         (Examples)
<enricher name="Field" >
 <field name="a">AA1</field>
 <field name="b">BB1</field>
 <field name="b">BB2</field>
 <multivalue-field name="c">CC1</multivalue-field>
 <multivalue-field name="c">CC2</multivalue-field>
 <if test="field::c" pattern="CC2">
   <then>
     <field name="e">EE1</field>
   </then>
 </if>
 <if test="field::a">
   <then>
     <field name="f">FF1</field>
   </then>
 </if>
 <rename-field name="b">d</rename-field>
 <remove-field name="a"/>
</enricher>
Luminis Enricher Framework
         (Examples)
<enricher name="XPath"
    xmlns:str="http://exslt.org/strings"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:html="http://www.w3.org/1999/xhtml">
  <field name="Description" expression-type="xpath">
    //html:meta[@name='DC.description']/@content
  </field>
  <multivalue-field name="Type" expression-type="xpath">
    //html:meta[@name='DC.type' and
      (@scheme='OVERHEIDbm.bekendmakingtypeGemeente' or
       @scheme='OVERHEIDbm.bekendmakingtypeProvincie' or
       @scheme='OVERHEIDbm.bekendmakingtypeWaterschap')
    ]/@content
  </multivalue-field>
  <field name="publisher" expression-type="xpath">
    fn:string-join(('Blow, ', 'blow, ', 'thou ', 'winter ', 'wind!'), '')
  </field>
  <field name="publisher" expression-type="xpath">
    fn:concat(//html:meta[@name='OVERHEID.organisationType']/@content,
              //html:meta[@name='DC.creator']/@content)
  </field>
</enricher>
Luminis Enricher Framework
         (Examples)
<enricher name="SPARQL">
  <field name="place">http://www.my.com/#channels</field>
  <field expression-type="sparql" repository="TESTRDF">
    <![CDATA[
      PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
      SELECT ?definition
      WHERE {
        ?${place} skos:definition ?definition.
      }
    ]]>
  </field>
</enricher>
Luminis Enricher Framework
         (Examples)
<enricher name="HttpAndTika">
  <field name="content.url"><![CDATA[http://na.apachecon.com/c/acna2010/speakers/501]]></field>
  <field expression-type="http" name="content.file">field:content.url</field>
  <field name="auteur" source="field::content.file">xpath://H1</field>
  <multivalue-field expression-type="tika.meta" source="field::content.file"/>
  <field name="content" expression-type="tika.text" source="field::content.file"/>
  <switch test="field::content.url">
    <case pattern=".*.rijksweb.nl.*"><field name="source">Rijksweb</field></case>
    <case pattern=".*.deventer.nl.*"><field name="source">Gemeente Deventer</field></case>
    <case pattern="file:.*"><field name="source">Locale Harde Schijf</field></case>
    <else><field name="source">Overige</field></else>
  </switch>
</enricher>
Luminis Enricher Framework
(Technology)

•Enricher and expression handlers are Java-based OSGi services:
 • Hot pluggable and updatable
 • Flow and expression configuration changes require no restart
 • Extensible: new expression handlers are immediately available in
   actions after installing the OSGi bundle
•Runs in Apache Felix
 • Collection Process: ServiceMix contains an OSGi container
 • Publication Process: custom OSGi loader for Lucene/Solr
•Centralized & transactional provisioning (Apache Ace)
 • Components & configuration
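
To make "hot pluggable" concrete, here is a minimal Python sketch of a registry in which expression handlers can be added and removed while the enricher keeps running. It only mimics the idea of a service registry in plain Python and does not use the OSGi API; the class `HandlerRegistry`, its methods and the two example handlers are hypothetical.

```python
from typing import Callable, Dict, List

Document = Dict[str, List[str]]                 # field name -> list of values
Handler = Callable[[str, Document], List[str]]  # (expression, doc) -> values


class HandlerRegistry:
    """Maps an expression type (for example "field") to a handler.
    Handlers may be registered or removed at any time, so the set of
    usable expression types can change without a restart."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def register(self, expression_type: str, handler: Handler) -> None:
        self._handlers[expression_type] = handler

    def unregister(self, expression_type: str) -> None:
        self._handlers.pop(expression_type, None)

    def evaluate(self, expression_type: str, expression: str,
                 doc: Document) -> List[str]:
        # The handler is looked up on every call, so a handler installed
        # a moment ago is picked up immediately.
        try:
            handler = self._handlers[expression_type]
        except KeyError:
            raise LookupError(
                f"no handler installed for {expression_type!r}") from None
        return handler(expression, doc)


registry = HandlerRegistry()
registry.register("field", lambda expr, doc: doc.get(expr, []))

doc = {"a": ["AA1"]}
print(registry.evaluate("field", "a", doc))

# Later, a new handler is "installed" and is usable right away.
registry.register("upper", lambda expr, doc: [v.upper() for v in doc.get(expr, [])])
print(registry.evaluate("upper", "a", {"a": ["aa1"]}))
```

Looking the handler up at call time rather than caching it is the design choice that lets installs and removals take effect immediately.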
Deployment Architecture
[Figure: deployment diagram]
 • Deployment Server (<<device>>): Felix OSGi (Apache) and Ace (Apache); <<PROVISIONING>> of the <<config>> files SOLR::solrconfig.xml, Luminis:Enricher.xml, SOLR::schema.xml and servicemix::config.xml
 • Master Collection Server (<<device>>): Apache Tomcat <<Container>> with ServiceMix (Apache), Felix OSGi (Apache), Enricher (Luminis), Nutch (Apache), Lucene/SOLR (Apache), Tika (Apache) and OpenRDF; <<Data Container>> SQL with Knowledge Models in a <<Database>> and an <<RDFTripleStore>>
 • Slave Publication Servers Slave1 and Slave2 (<<device>>): each an Apache Tomcat <<Container>> with Felix OSGi (Apache), Lucene/SOLR (Apache), Wicket (Apache) and Enricher (Luminis), configured by SOLR::solrconfig.xml, Luminis:Enricher.xml and SOLR::schema.xml
 • Firewall and HTTP Load Balancer (<<device>>) in front of the publication servers
 • Connector labels: <<HTTP>>, <<HTTP/ReST>>, <<JDBC>>
Conclusions



•Enterprise Search Solution is not Google search

•Open Source paves the way; misses some ingredients
  • Useful ingredients: Lucene/Solr, Nutch, Tika, ServiceMix/Camel,
    Wicket, MySQL, OpenRDF, Felix/Ace
  • Missing ingredients: Enricher

•Interesting developments:
  • Apache Chemistry (CMIS)
  • Apache Clerezza
  • Apache Nutch
  • Apache Connectors Framework (ManifoldCF)
Questions & (answers?)



Marc Teutelink
  marc.teutelink@luminis.eu

  @mteutelink




                MEAP December 2010 

Open source enterprise search and retrieval platform

  • 1. Open Source Search & Retrieval Platform Enterprise Search EAI Marc Teutelink Semantic Web Datum 21 augustus 2010
  • 2. How Apache open source software is used during the implementation of an Enterprise Search and Retrieval Platform (Lucene/SOLR, Nutch, Tika, ServiceMix/Camel, Felix/Ace)
  • 3. Marc Teutelink marc.teutelink@luminis.eu @mteutelink •Software architect at Luminis •15+ years experience in software development; specialized in Enterprise Search, Enterprise Application Integration and Semantic Web technology •Currently writing “Enterprise Search in Action” for Manning (Mid-2011)
  • 4. Agenda •Enterprise Search • What is Enterprise Search: Functions and features • Challenges • Logical Architecture •Enterprise Search Solution • Technology Stack • Collection Process • Publication Process • Enricher framework • Deployment •Conclusion
  • 5. What is Enterprise Search? “Enterprise Search offers a solution for searching, finding and presenting enterprise related information in the larger sense of the word” Enterprise search is all about searching through documents from any type and format from any sources located anywhere with the upmost flexibility • Web search: limited to public documents on the web • Desktop search: limited to private documents on the local machine • Enterprise search: no limitations on document type and location
  • 6. Enterprise Search (features) •Information Sources and Types • Wide range of sources: local and remote filesystems, content repositories, e-mail, databases, internet, intranet and extranet • Type not limited: any type ranging from structured to unstructured data, text and binary formats and compound formats (zip) •Usage • Not limited to interactive use  automated business processes •Security • Integrations with enterprise security infrastructure •User Interaction and personalization • Identity enables more personalized search results
  • 7. Enterprise Search (features) • Extended metadata • More metadata → better and more precise search results • More control over schema (for example Dynamic Fields) • Ranking • More control over ranking: personalized ranking (group) • Data extraction and derivation • Extract data using various techniques: XPath, XQuery • Derive data using external knowledge models: RDBMS, RDF Store, Web Services • Conditional extraction & derivation • Managing and monitoring • On-the-fly management (JMX) • Real time monitoring
  • 8. Enterprise Search (features) • User Interfaces • Web search • All about selling advertisements to the masses • Generic & minimalistic screens; focus on ads • Enterprise search • All about finding: rich navigation; focus on quick find • Small targeted audience • Specialized and customized screens (use of ontologies, taxonomies and classifications) • Use of identity (results customized to user) and web 2.0 • Grouping • Field collapsing, faceted search & clustering
  • 9. Enterprise Search (Challenges) • Performance and scalability • Rich functions and features • Manageability • Flexibility • Easy maintenance • Quick issue and problem solving • Reduce total cost of ownership
  • 10. Enterprise Search (Challenges) • Performance and scalability • Rich functions and features • Manageability • Flexibility • Easy maintenance • Quick issue and problem solving • Reduce total cost of ownership. Commercial Search Engines?
  • 11. Enterprise Search (Challenges) • Performance and scalability • Rich functions and features • Manageability • Flexibility • Easy maintenance • Quick issue and problem solving • Reduce total cost of ownership. Apache Based (Open Source) Search & Retrieval Platform
  • 12. Enterprise Search (Logical Architecture) [Diagram] Collection Process: Sources → Content Inbound (Pull: Crawling; Pull: Harvesting; Push: SOAP/REST) → Content Validation (Syntactic, Semantic) → Content Enrichment (Extraction, Enhancement, Filtering) → Indexing → Search Engine. Publication Process: Actors → Request Inbound (HTTP/Get: URL; HTTP/Post: SOAP/REST; API: Java, Perl, ...) → Request Validation (Syntactic, Semantic) → Request Enrichment (Redirection: suggestions; Enhancement: add/remove clauses) → Searching & Ordering (Filtering, Grouping, Sorting) → Response Enrichment (Redirection: more like this; Enhancement: metadata, editorial) → Response Outbound (Stateless: XSLT, SolrJS; Stateful: Webapp Framework)
  • 13. Enterprise Search (Collection Process): Sources • Any document format • Any type • Structured and unstructured • Textual and binary • Compound • Residing anywhere • Security [Diagram: collection process overview]
  • 14. Enterprise Search (Collection Process): Content Inbound • Pull (Crawling/Spidering) • Internet, intranet & extranet • Local and remote filesystems • Pull (Harvesting) • Databases • Content Repositories / Management Systems • Webservices inbound • Push • Webservices (SOAP/REST) • Real time indexing [Diagram: collection process overview]
  • 15. Enterprise Search (Collection Process): Content Validation • Syntactic validation • Based on DTD / XML-Schema • Structure and limited content • Semantic validation • Based on algorithms: Groovy, XPath, Regex, … • Think about exception handling • Placed anywhere in flow • During inbound: XML-Schema validation • After enrichment: validate derived metadata [Diagram: collection process overview]
  • 16. Enterprise Search (Collection Process): Content Enrichment • Extraction • Metadata • Content (free text of document) • Enhancing • Derive new and alter existing metadata • Filtering • Remove (parts of) metadata • Leverage external knowledge models • Conditional enrichment [Diagram: collection process overview]
  • 17. Enterprise Search (Collection Process): Indexing • Store in search engine(s) • Content based routing • Document boosting [Diagram: collection process overview]
  • 18. Enterprise Search (Publication Process): Request Inbound • HTTP/Get • URL based with parameters • Response in XML, JSON, … • HTTP/Post • XML (SOAP, REST) request • XML (SOAP, REST) response • API • Java, Perl, … • Wrappers on HTTP/Get [Diagram: publication process overview]
  • 19. Enterprise Search (Publication Process): Request Validation • Syntactic validation • Correct query syntax? • Semantic validation • Correct field filters? • Based on algorithms: Groovy, Regex • Placed anywhere in flow • @inbound: XML-Schema validation • @enrichment: validate derived request clauses [Diagram: publication process overview]
  • 20. Enterprise Search (Publication Process): Request Enrichment • Redirection • Spelling suggestions • Metadata suggestions • Enhancing • Add/remove clauses • Stemming, synonyms, stop words [Diagram: publication process overview]
  • 21. Enterprise Search (Publication Process): Searching & Ordering • Filtering • Field search • Grouping • Add group information • Field collapsing, faceted search & clustering • Sorting • Sort on field • Ranking [Diagram: publication process overview]
  • 22. Enterprise Search (Publication Process): Response Enrichment • Redirection • Suggestions • More like this • Enhancing • Add/remove response fields • Schema information • Editorial information [Diagram: publication process overview]
  • 23. Enterprise Search (Publication Process): Response Outbound • Stateless • No security • XSLT, SolrJS • Stateful • Security • Web 2.0 • Web Application Framework [Diagram: publication process overview]
  • 24. Technology Stack (Collection Process) • Use ESB for the flow: Apache ServiceMix with Camel • Leverage standard ESB components (Transformers, Validation, Splitter, Filter, Routers, Scripting) • Leverage standard ESB transports (WS, SMTP, JMS, JCR, JDBC, FILE) • Custom: Crawler Apache Nutch • Leverage only crawl framework • Extend NutchIndexWriter; asynchronously pushing crawled documents back into ESB flow (reply-to) • ESB makes distributed flow possible • Content based routing • Hot deploy → easy maintenance • Reusing services across collection processes • Search engine independent
  • 25. Collection Process Flow [Flow diagram in Enterprise Integration Pattern notation; stages are connected by message channels] Content Inbound: Document Push Inbound (Message Endpoint) → Syntactic Validation (Channel Purger) → Splitter (messages into documents 1..N). Content Validation: Semantic Validation (Channel Purger), with invalid messages routed to an Invalid Message Channel. Content Enrichment: Content Filter and Content Enricher. Content Indexer: Transformer (Message Translator) producing a SOLR Document Message → HTTP Transport (Channel Adapter) → Lucene/SOLR index (SolrJ)
  • 26. Technology Stack (Publication Process) • Use flow from Apache Lucene/Solr • Leverage standard Solr components (synonyms, stopwords, stemming, MLT, spelling, faceted search, …) • Custom components: using Solr’s extensibility framework • Security: authority field in schema with Apache Shiro integration • Field filters (zipcode, …) • User interfaces • Stateless: SolrJS, XSLTResponseWriter & VelocityResponseWriter • Stateful: Apache Wicket with Spring
  • 27. Enterprise Search (Logical Architecture) [Diagram repeated from slide 12: collection process and publication process around the search engine]
  • 28. Enterprise Search (Logical Architecture) [Diagram of slide 12, annotated with the technology used: ServiceMix/Camel and Nutch for the collection process, Apache Wicket and SolrJS/XSLT for response outbound, Lucene/SOLR as the search engine]
  • 29. Enterprise Search (Logical Architecture) [Diagram of slide 12, annotated with the Luminis Enricher Framework, which spans the collection and publication processes]
  • 30. Luminis Enricher Framework • Custom Enricher Framework • Existing ESB & SOLR enricher capabilities not sufficient • Enriching = one or more actions (extraction, enhancing & filtering) performed on documents with fields • Same enricher to be used for: • Collection process: documents → enriching, filtering & splitting • Publication process: search requests → ‘first-components’ SearchComponent; search response → ‘last-components’ SearchComponent
  • 31. Luminis Enricher Framework • Custom Enricher Framework • Existing ESB & SOLR enricher capabilities not sufficient • Enriching = one or more actions (extraction, enhancing & filtering) performed on documents with fields • Same enricher to be used for: • Collection process: documents → enriching, filtering & splitting • Publication process: search requests → ‘first-components’ SearchComponent; search response → ‘last-components’ SearchComponent [Shown over the collection process flow diagram of slide 25, with the indexer labelled SOLR Indexer (Channel Adapter)]
  • 32. Luminis Enricher Framework • Custom Enricher Framework • Existing ESB & SOLR enricher capabilities not sufficient • Enriching = one or more actions (extraction, enhancing & filtering) performed on documents with fields • Same enricher to be used for: • Collection process: documents → enriching, filtering & splitting • Publication process: search requests → ‘first-components’ SearchComponent; search response → ‘last-components’ SearchComponent [Shown over the flow diagram of slide 25 and a Solr request diagram: SolrQueryRequest (query) → SearchHandler (RequestHandler) with "first-components", "components" (SearchComponents query, facet, mlt, highlight, stats, debug) and "last-components" → XML response → XSLTResponseWriter (QueryResponseWriter, XSLT XML2HTML) → (X)HTML query result]
  • 33. Luminis Enricher Framework (architecture) • Pipe-and-filter architecture • Documents flow through a series of actions • Output from one action is input to another action • Fields from the input document can be used in an action’s clauses: values in expressions are filled in by replacing Velocity-style patterns (such as ${B}) with field values • Conditional flows supported • Reuse of flows & subflows supported
  • 34. Luminis Enricher Framework (architecture) • Pipe-and-filter architecture • Documents flow through a series of actions • Output from one action is input to another action • Fields from the input document can be used in an action’s clauses: values in expressions are filled in by replacing Velocity-style patterns (such as ${B}) with field values • Conditional flows supported • Reuse of flows & subflows supported [Example flow diagram] Document [[A1,A2],[B]] → Action (remove A2) → Document [[A1],[B]] → If [B=3]: YES → Action (select C where ${B}) → Document [[A1],[B],[C1]]; NO → Action (select C where ${A}) → Document [[A1],[B],[C2]]
  • 35. Luminis Enricher Framework (Configuration) • Enricher flow and expression configuration via XML based DSL • Conditional: if-then-else & switch-case-else (with regex support) • Actions: add & remove fields and field values using expressions • Expression handlers currently supported: • Field • Function (execute methods via Java Reflection) • HttpClient (retrieve content by URL described by field values) • XSLT, XPath, XQuery (external XML databases) • JDBC • SPARQL (OpenRDF) • Apache Lucene/Solr • Apache Tika (metadata and text extraction)
  • 36. Luminis Enricher Framework (Examples): Field enricher
      <enricher name="Field">
        <field name="a">AA1</field>
        <field name="b">BB1</field>
        <field name="b">BB2</field>
        <multivalue-field name="c">CC1</multivalue-field>
        <multivalue-field name="c">CC2</multivalue-field>
        <if test="field::c" pattern="CC2">
          <then>
            <field name="e">EE1</field>
          </then>
        </if>
        <if test="field::a">
          <then>
            <field name="f">FF1</field>
          </then>
        </if>
        <rename-field name="b">d</rename-field>
        <remove-field name="a"/>
      </enricher>
  • 37. Luminis Enricher Framework (Examples): XPath enricher (shown on the slide next to the Field enricher of slide 36)
      <enricher name="XPath"
          xmlns:str="http://exslt.org/strings"
          xmlns:fn="http://www.w3.org/2005/xpath-functions"
          xmlns:html="http://www.w3.org/1999/xhtml">
        <field name="Description" expression-type="xpath">
          //html:meta[@name='DC.description']/@content
        </field>
        <multivalue-field name="Type" expression-type="xpath">
          //html:meta[@name='DC.type' and
            (@scheme='OVERHEIDbm.bekendmakingtypeGemeente' or
             @scheme='OVERHEIDbm.bekendmakingtypeProvincie' or
             @scheme='OVERHEIDbm.bekendmakingtypeWaterschap')
          ]/@content
        </multivalue-field>
        <field name="publisher" expression-type="xpath">
          fn:string-join(('Blow, ', 'blow, ', 'thou ', 'winter ', 'wind!'), '')
        </field>
        <field name="publisher" expression-type="xpath">
          fn:concat(//html:meta[@name='OVERHEID.organisationType']/@content,
                    //html:meta[@name='DC.creator']/@content)
        </field>
      </enricher>
  • 38. Luminis Enricher Framework (Examples): SPARQL enricher (shown on the slide next to the Field and XPath enrichers of slides 36 and 37)
      <enricher name="SPARQL">
        <field name="place">http://www.my.com/#channels</field>
        <field expression-type="sparql" repository="TESTRDF">
          <![CDATA[
            PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
            SELECT ?definition
            WHERE {
              ?${place} skos:definition ?definition.
            }
          ]]>
        </field>
      </enricher>
  • 39. Luminis Enricher Framework (Examples): HttpAndTika enricher (shown on the slide next to the Field, XPath and SPARQL enrichers of slides 36 to 38)
      <enricher name="HttpAndTika">
        <field name="content.url"><![CDATA[http://na.apachecon.com/c/acna2010/speakers/501]]></field>
        <field expression-type="http" name="content.file">field:content.url</field>
        <field name="auteur" source="field::content.file">xpath://H1</field>
        <multivalue-field expression-type="tika.meta" source="field::content.file"/>
        <field name="content" expression-type="tika.text" source="field::content.file"/>
        <switch test="field::content.url">
          <case pattern=".*.rijksweb.nl.*"><field name="source">Rijksweb</field></case>
          <case pattern=".*.deventer.nl.*"><field name="source">Gemeente Deventer</field></case>
          <case pattern="file:.*"><field name="source">Locale Harde Schijf</field></case>
          <else><field name="source">Overige</field></else>
        </switch>
      </enricher>
  • 40. Luminis Enricher Framework (Technology) • Enricher and expression handlers are Java based OSGi services: • Hot pluggable and updatable • Flow and expression configuration changes → no restart • Extensible: new expression handlers immediately available in actions after installing the OSGi bundle • Runs in Apache Felix • Collection Process: ServiceMix contains an OSGi container • Publication Process: custom OSGi loader for Lucene/Solr • Centralized & transactional provisioning (Apache Ace) of components & configuration
  • 41. Deployment Architecture [Deployment diagram] Devices: Firewall; HTTP Load Balancer; Deployment Server (Apache Felix OSGi, Apache Ace); Master Collection Server (Apache ServiceMix on Felix OSGi with the Luminis Enricher, Apache Nutch and Apache Tika; Apache Tomcat container with Lucene/SOLR); Slave Publication Servers Slave1 and Slave2 (Apache Tomcat container with Felix OSGi, Lucene/SOLR, the Luminis Enricher and Apache Wicket). Data containers: SQL database and OpenRDF RDF triple store, both holding knowledge models. Configuration artifacts: SOLR::solrconfig.xml, SOLR::schema.xml, Luminis:Enricher.xml, servicemix::config.xml. Connections are labelled HTTP, HTTP/REST, JDBC and PROVISIONING
  • 42. Conclusions •Enterprise Search Solution is not Google search •Open Source paves the way; misses some ingredients • Useful ingredients: Lucene/Solr, Nutch, Tika, ServiceMix/Camel, Wicket, MySQL, OpenRDF, Felix/Ace • Missing ingredients: Enricher •Interesting developments: • Apache Chemistry (CMIS) • Apache Clerezza • Apache Nutch • Apache Connectors Framework (ManifoldCF)
  • 43. Questions & (answers?) Marc Teutelink marc.teutelink@luminis.eu @mteutelink MEAP December 2010 
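
The pipe-and-filter flow of slides 33 and 34 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Luminis Enricher Framework itself (which the slides describe as Java based OSGi services): documents are assumed to be plain dictionaries of field values, the helper names (`expand`, `remove_value`, `lookup`, `branch`, `pipeline`) are hypothetical, and the `KNOWLEDGE` table is a made-up stand-in for the external knowledge model queried by `select C where ${...}`.

```python
import re
from typing import Callable, Dict, List

Document = Dict[str, List[str]]          # field name -> list of values
Action = Callable[[Document], Document]  # one filter in the pipe

PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def expand(expression: str, doc: Document) -> str:
    """Replace each ${field} with the first value of that field.

    Raises KeyError/IndexError if the field is missing or empty.
    """
    return PLACEHOLDER.sub(lambda m: doc[m.group(1)][0], expression)


def _copy(doc: Document) -> Document:
    return {name: list(values) for name, values in doc.items()}


def remove_value(field: str, value: str) -> Action:
    """Action that removes one value from a field."""
    def action(doc: Document) -> Document:
        out = _copy(doc)
        out[field] = [v for v in out.get(field, []) if v != value]
        return out
    return action


def lookup(target: str, expression: str, table: Dict[str, str]) -> Action:
    """Hypothetical stand-in for 'select C where ${X}' on a knowledge model."""
    def action(doc: Document) -> Document:
        out = _copy(doc)
        key = expand(expression, doc)
        if key in table:
            out.setdefault(target, []).append(table[key])
        return out
    return action


def branch(test: Callable[[Document], bool], then: Action, otherwise: Action) -> Action:
    """Conditional flow: run one of two actions depending on the document."""
    def action(doc: Document) -> Document:
        return (then if test(doc) else otherwise)(doc)
    return action


def pipeline(*actions: Action) -> Action:
    """Chain actions: the output of one is the input of the next."""
    def run(doc: Document) -> Document:
        for act in actions:
            doc = act(doc)
        return doc
    return run


# Made-up knowledge model, chosen so the result matches the slide 34 diagram.
KNOWLEDGE = {"3": "C1", "A1": "C2"}

flow = pipeline(
    remove_value("A", "A2"),
    branch(lambda d: d.get("B") == ["3"],
           lookup("C", "${B}", KNOWLEDGE),
           lookup("C", "${A}", KNOWLEDGE)),
)

print(flow({"A": ["A1", "A2"], "B": ["3"]}))
# prints {'A': ['A1'], 'B': ['3'], 'C': ['C1']}
```

Each action returns a new document instead of mutating its input, which keeps reused flows and subflows free of side effects; whether the real framework works this way is not stated in the slides.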

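The `switch`/`case`/`else` block in the HttpAndTika example of slide 39 behaves like a first-match regular expression switch. A minimal Python sketch follows; the patterns and labels are copied from the slide, while the function name `classify_source` and the choice of full-string matching (as Java's `String.matches` would do) are assumptions, since the slides do not state how the framework matches patterns.

```python
import re
from typing import List, Tuple

# (pattern, label) pairs in the order they appear on slide 39.
CASES: List[Tuple[str, str]] = [
    (r".*.rijksweb.nl.*", "Rijksweb"),
    (r".*.deventer.nl.*", "Gemeente Deventer"),
    (r"file:.*", "Locale Harde Schijf"),
]
DEFAULT = "Overige"  # value used by the <else> branch


def classify_source(url: str) -> str:
    """Return the label of the first case whose pattern matches the whole URL."""
    for pattern, label in CASES:
        if re.fullmatch(pattern, url):  # assumption: full match, not search
            return label
    return DEFAULT


print(classify_source("http://na.apachecon.com/c/acna2010/speakers/501"))
# prints Overige
```

Note that the dots in the slide's patterns are unescaped, so they match any character; the sketch keeps them exactly as shown.
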
Editor's notes

  6. Content Repository vs Content Management Systems. Security: mention LDAP. Identity: you have to be authorized.
  13-24. Security: logging in on the source