SlideShare ist ein Scribd-Unternehmen logo
1 von 30
Downloaden Sie, um offline zu lesen
Text categorization with
   Lucene and Solr
     Tommaso Teofili
     tommaso [at] apache [dot] org




                                     1
About me
l      ASF member having fun with:
  l    Lucene / Solr
  l    Hama
  l    UIMA
  l    Stanbol
  l    … some others
l      SW engineer @ Adobe R&D




                                      2
Agenda
l    Classification
l    Lucene classification module
l    Solr text categorization services
l    Conclusions




                                          3
Classification
l    Let the algorithm assign one or more labels
      (classes) to some item given some
      previous knowledge
      l    Spam filter
      l    Tagging system
      l    Digit recognition system
      l    Text categorization
      l    etc.




                                           4
Classification?
why with Lucene?




                   5
The short story
l      Lucene already has a lot of features for
        common information retrieval needs
  l    Postings
  l    Term vectors
  l    Statistics
  l    Positions
  l    TF / IDF
  l    maybe Payloads
  l    etc.
l      We may avoid bringing in new components
        to do classification just leveraging what we
        get for free from Lucene
                                              6
The (slightly) longer story #1
l      While playing with NLP stuff
l      Need to implement a naïve bayes classifier
  l    Not possible to plug in stuff requiring touching
        the architecture
  l    Not really interested in (near) real time
        performance
l      Iteration 1
  l    Plain in memory Java stuff
l      Iteration 2
  l    Same stuff but using Lucene instead of loading
        things into memory
         l    Too much faster J
                                                    7
The (slightly) longer story #2
l      So I realized
  l    Lucene has so many features stored you can
        take advantage of for free
  l    Therefore writing the classification algorithm is
        relatively simple
  l    In many cases you’re just not adding anything to
        the architecture
         l    Your Lucene index was already there for searching
  l    Lucene index is, to some extent, already a model
        which we just need to “query” with the proper
        algorithm
  l    And it is fast enough
                                                           8
Lucene classification module
l      Work in progress on trunk
l      LUCENE-4345
  l    Establishing classification API
  l    With currently two implementations
         l    Naïve bayes
         l    K nearest neighbor




                                             9
Lucene classification module
l      Classifier API
  l    Training
  l    void train(atomicReader, contentField,
        classField, analyzer) throws IOException

         l    atomicReader : the reader on the Lucene index to
               use for classification
               l    still unsure if IR’d be better
         l    textFieldName : the name of the field which contains
               documents’ texts
         l    classFieldName : the name of the field which contains
               the class assigned to existing documents
         l    analyzer : the item used for analyzing the unseen
               texts
                                                            10
Lucene classification module
l      Classifier API
  l    Classifying
  l    ClassificationResult assignClass(String text)
        throws IOException
         l    text: the unseen text of the document to classify
         l    ClassificationResult : the object containing the
               assigned class along with the related score




                                                              11
K Nearest neighbor classifier
l    Fairly simple classification algorithm
l    Given some new unseen item
l    I search in my knowledge base the k items
      which are nearer to the new one
l    I get the k classes assigned to the k
      nearest items
l    I assign to the new item the class that is
      most frequent in the k returned items



                                           12
K Nearest neighbor classifier
l      How can we do this in Lucene?
  l    We have VSM for representing documents as
        vectors and eventually find distances
  l    Lucene MoreLikeThis module can do a lot for it
  l    Given a new document
         l    It’s represented as a MoreLikeThisQuery which filters
               out too frequent words and helps on keeping only the
               relevant tokens for finding the neighbors
         l    The query is executed returning only the first k results
         l    The result is then browsed in order to find the most
               frequent class and that is then assigned with a score
               of classFreq / k


                                                               13
Naïve Bayes classifier
l      Slightly more complicated
l      Based on probabilities
l      C = argmax( P(d|c) * P(c) )
  l    P(d|c) : likelihood
  l    P(c) : prior
  l    With some assumptions:
         l    bag of words assumption: positions don't matter
         l    conditional independence: the feature probabilities
               are independent given a class




                                                             14
Naïve Bayes classifier
l      Prior calculation is easy
  l    It’s the relative frequency of each class
        #docsWithClassC / #docs
l      Likelihood is easy too because of the bag
        of words assumption
  l    P(d|c) := P(x1,..,xn|c) == P(x1|c)*...P(xn|c)
  l    So we just need probabilities of single terms
         l    P(x|c) := (tf of x in documents with class c + 1)/
               (#terms in docs with class c + #docs)




                                                                15
Naïve Bayes classifier
l      Does the bag of words assumption affect
        the classifier’s precision?
  l    Yes in theory
         l    in text documents (nearby) words are strictly
               correlated
  l    Not always in practice
         l    depending on your index data it may or not have an
               impact




                                                               16
Using different indexes
l      The Classifier API makes usage of an
        AtomicReader to get the data for the
        training
l      It must not be the very same index used for
        every day index / search
  l    For performance reasons
  l    For enhancing classifier effectiveness
         l    Using more specific analyzers
         l    Indexing data in a different way
                l    e.g. one big document for each class and use kNN (with a
                      small k) or TF-IDF similarity



                                                                       17
Things to consider - bootstrapping
l      How are your first documents classified?
  l    Manually
         l    Categories are already there in the documents
         l    Someone is explicitly charged to do that (e.g. article
               authors) at some point in time
  l    (semi) automatically
         l    Using some existing service / library
                l    With or without human supervision

  l    In either case the classifier needs something to
        be fed with to be effective



                                                               18
Things to consider – tokenizing
l      How are your content field tokenized?
  l    Whitespace
         l    It doesn’t work for each language
  l    Standard
  l    Sentence
  l    What about using N-Grams?
  l    What about using Shingles?




                                                   19
Things to consider - filtering
l      Some words may / should be filtered while
  l    Training
  l    Classifying
l      Often
  l    Stopwords
  l    Punctuation
  l    Not relevant PoS tagged tokens



                                            20
Raw benchmarking
l      Tried both algorithms on ~1M docs index
  l    Naïve bayes is affected by the # of classes
  l    kNN is affected by k being large
l      None of them took more than 1-2m to train
        even with great number of classes or large
        k values




                                                  21
From Lucene to Solr
l      The Lucene classifiers can be easily used
        in Solr
  l    As specific search services
         l    A classification based more like this
  l    While indexing
         l    For automatic text categorization




                                                       22
Classification based MLT
l      Use case:
  l    “give me all the documents that belong to the
        same category of a new not indexed document”
  l    Slightly different from basic MLT since it does
        not return the nearest docs
  l    That is useful if the user doesn’t want / need to
        index the document and still want to find all the
        other documents of the same category, whatever
        this category means


                                                   23
Classification based MLT
l      ClassificationMLTHandler
  l    String d = req.getParams().get(DOC);
  l    ClassificationResult r = classifier.assignClass(d);
  l    String c = r.getAssignedClass();
  l    req.getSearcher().search(new TermQuery(new
        Term(classFieldName, c)), rows);




                                                    24
Automatic text categorization
l      Once a doc reaches Solr
l      We can use the Lucene classifiers to
        automate assigning document’s category
l      We can leverage existing Solr facilites for
        enhancing the indexing pipeline
  l    An UpdateChain can be decorated with one or
        more UpdateRequestProcessors




                                               25
Automatic text categorization
l      Configuration
  l    <updateRequestProcessorChain name=“ctgr">
  l    <processor
        class=”solr.CategorizationUpdateRequestProces
        sorFactory">
  l    <processor
        class="solr.RunUpdateProcessorFactory" />
  l    </updateRequestProcessorChain>



                                               26
Automatic text categorization
l      CategorizationUpdateRequestProcessor
l      void processAdd(AddUpdateCommand
        cmd) throws IOException
         l    String text = solrInputDocument.getFieldValue(“text”);
         l    String class = classifier.assignClass(text);
         l    solrInputDocument.addField(“cat”, class);
  l    Every now and then need to retrain to get latest
        stuff in the current index, but that can be done in
        the background without affecting performances


                                                             27
Automatic text categorization
l      CategorizationUpdateRequestProcessor
l      Finer grained control
  l    Use automatic text categorization only if a value
        does not exist for the “cat” field
  l    Add the classifier output class to the “cat” field
        only if it’s above a certain score




                                                      28
Wrap up
l      Simple classifiers with no or little effort
l      No architecture change
l      Both available to Lucene and Solr
l      Still reasonably fast
l      A lot more can be done
  l    Implement a MaxEnt Lucene based classifier
         l    which takes into account words correlation



                                                            29
Thanks!




          30

Weitere ähnliche Inhalte

Was ist angesagt?

NLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language ModelNLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language ModelHemantha Kulathilake
 
elasticsearch_적용 및 활용_정리
elasticsearch_적용 및 활용_정리elasticsearch_적용 및 활용_정리
elasticsearch_적용 및 활용_정리Junyi Song
 
Elastic search overview
Elastic search overviewElastic search overview
Elastic search overviewABC Talks
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearchpmanvi
 
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...Lucidworks
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneRahul Jain
 
Elasticsearch development case
Elasticsearch development caseElasticsearch development case
Elasticsearch development case일규 최
 
Deep Reinforcement Learning based Recommendation with Explicit User-ItemInter...
Deep Reinforcement Learning based Recommendation with Explicit User-ItemInter...Deep Reinforcement Learning based Recommendation with Explicit User-ItemInter...
Deep Reinforcement Learning based Recommendation with Explicit User-ItemInter...Kishor Datta Gupta
 
[2D1]Elasticsearch 성능 최적화
[2D1]Elasticsearch 성능 최적화[2D1]Elasticsearch 성능 최적화
[2D1]Elasticsearch 성능 최적화NAVER D2
 
검색엔진이 데이터를 다루는 법 김종민
검색엔진이 데이터를 다루는 법 김종민검색엔진이 데이터를 다루는 법 김종민
검색엔진이 데이터를 다루는 법 김종민종민 김
 
Elasticsearch V/s Relational Database
Elasticsearch V/s Relational DatabaseElasticsearch V/s Relational Database
Elasticsearch V/s Relational DatabaseRicha Budhraja
 
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalA Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalBhaskar Mitra
 
Neural Architectures for Named Entity Recognition
Neural Architectures for Named Entity RecognitionNeural Architectures for Named Entity Recognition
Neural Architectures for Named Entity RecognitionRrubaa Panchendrarajan
 
Introdução a web semântica e o case da globo.com
Introdução a web semântica e o case da globo.comIntrodução a web semântica e o case da globo.com
Introdução a web semântica e o case da globo.comRenan Moreira de Oliveira
 
Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene Dawid Weiss- Finite state automata in lucene
Dawid Weiss- Finite state automata in luceneLucidworks (Archived)
 
Deep Dive Into Elasticsearch
Deep Dive Into ElasticsearchDeep Dive Into Elasticsearch
Deep Dive Into ElasticsearchKnoldus Inc.
 
Using Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalUsing Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalBhaskar Mitra
 

Was ist angesagt? (20)

NLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language ModelNLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language Model
 
Lucene
LuceneLucene
Lucene
 
elasticsearch_적용 및 활용_정리
elasticsearch_적용 및 활용_정리elasticsearch_적용 및 활용_정리
elasticsearch_적용 및 활용_정리
 
Elastic search overview
Elastic search overviewElastic search overview
Elastic search overview
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearch
 
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of Lucene
 
Elasticsearch Introduction
Elasticsearch IntroductionElasticsearch Introduction
Elasticsearch Introduction
 
Elasticsearch development case
Elasticsearch development caseElasticsearch development case
Elasticsearch development case
 
Deep Reinforcement Learning based Recommendation with Explicit User-ItemInter...
Deep Reinforcement Learning based Recommendation with Explicit User-ItemInter...Deep Reinforcement Learning based Recommendation with Explicit User-ItemInter...
Deep Reinforcement Learning based Recommendation with Explicit User-ItemInter...
 
[2D1]Elasticsearch 성능 최적화
[2D1]Elasticsearch 성능 최적화[2D1]Elasticsearch 성능 최적화
[2D1]Elasticsearch 성능 최적화
 
검색엔진이 데이터를 다루는 법 김종민
검색엔진이 데이터를 다루는 법 김종민검색엔진이 데이터를 다루는 법 김종민
검색엔진이 데이터를 다루는 법 김종민
 
Elasticsearch V/s Relational Database
Elasticsearch V/s Relational DatabaseElasticsearch V/s Relational Database
Elasticsearch V/s Relational Database
 
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalA Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information Retrieval
 
Neural Architectures for Named Entity Recognition
Neural Architectures for Named Entity RecognitionNeural Architectures for Named Entity Recognition
Neural Architectures for Named Entity Recognition
 
Introdução a web semântica e o case da globo.com
Introdução a web semântica e o case da globo.comIntrodução a web semântica e o case da globo.com
Introdução a web semântica e o case da globo.com
 
Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene Dawid Weiss- Finite state automata in lucene
Dawid Weiss- Finite state automata in lucene
 
Deep Dive Into Elasticsearch
Deep Dive Into ElasticsearchDeep Dive Into Elasticsearch
Deep Dive Into Elasticsearch
 
Text Summarization
Text SummarizationText Summarization
Text Summarization
 
Using Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalUsing Text Embeddings for Information Retrieval
Using Text Embeddings for Information Retrieval
 

Ähnlich wie Text Categorization with Lucene and Solr

Object Oriented Programming.pptx
Object Oriented Programming.pptxObject Oriented Programming.pptx
Object Oriented Programming.pptxSAICHARANREDDYN
 
Programming in Scala - Lecture One
Programming in Scala - Lecture OneProgramming in Scala - Lecture One
Programming in Scala - Lecture OneAngelo Corsaro
 
Object oriented programming in python
Object oriented programming in pythonObject oriented programming in python
Object oriented programming in pythonnitamhaske
 
Summer Training Project On Python Programming
Summer Training Project On Python ProgrammingSummer Training Project On Python Programming
Summer Training Project On Python ProgrammingKAUSHAL KUMAR JHA
 
What To Leave Implicit
What To Leave ImplicitWhat To Leave Implicit
What To Leave ImplicitMartin Odersky
 
Object Oriented Programming Tutorial.pptx
Object Oriented Programming Tutorial.pptxObject Oriented Programming Tutorial.pptx
Object Oriented Programming Tutorial.pptxethiouniverse
 
Conventional and Reasonable
Conventional and ReasonableConventional and Reasonable
Conventional and ReasonableKevlin Henney
 
PRESENTATION ON PYTHON.pptx
PRESENTATION ON PYTHON.pptxPRESENTATION ON PYTHON.pptx
PRESENTATION ON PYTHON.pptxabhishek364864
 
Torch: a scientific computing framework for machine-learning practitioners
Torch: a scientific computing framework for machine-learning practitionersTorch: a scientific computing framework for machine-learning practitioners
Torch: a scientific computing framework for machine-learning practitionersHoffman Lab
 
Fundamental of deep learning
Fundamental of deep learningFundamental of deep learning
Fundamental of deep learningStanley Wang
 
Learning puppet chapter 2
Learning puppet chapter 2Learning puppet chapter 2
Learning puppet chapter 2Vishal Biyani
 
Lotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScript
Lotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScriptLotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScript
Lotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScriptBill Buchan
 
EKON 12 Running OpenLDAP
EKON 12 Running OpenLDAP EKON 12 Running OpenLDAP
EKON 12 Running OpenLDAP Max Kleiner
 

Ähnlich wie Text Categorization with Lucene and Solr (20)

Oop basic concepts
Oop basic conceptsOop basic concepts
Oop basic concepts
 
Java Notes
Java NotesJava Notes
Java Notes
 
Object Oriented Programming.pptx
Object Oriented Programming.pptxObject Oriented Programming.pptx
Object Oriented Programming.pptx
 
Programming in Scala - Lecture One
Programming in Scala - Lecture OneProgramming in Scala - Lecture One
Programming in Scala - Lecture One
 
Object oriented programming in python
Object oriented programming in pythonObject oriented programming in python
Object oriented programming in python
 
Proyecto
ProyectoProyecto
Proyecto
 
Summer Training Project On Python Programming
Summer Training Project On Python ProgrammingSummer Training Project On Python Programming
Summer Training Project On Python Programming
 
What To Leave Implicit
What To Leave ImplicitWhat To Leave Implicit
What To Leave Implicit
 
Introduction to Scala
Introduction to ScalaIntroduction to Scala
Introduction to Scala
 
NLP and LSA getting started
NLP and LSA getting startedNLP and LSA getting started
NLP and LSA getting started
 
Seminar dm
Seminar dmSeminar dm
Seminar dm
 
Object Oriented Programming Tutorial.pptx
Object Oriented Programming Tutorial.pptxObject Oriented Programming Tutorial.pptx
Object Oriented Programming Tutorial.pptx
 
Scala Days NYC 2016
Scala Days NYC 2016Scala Days NYC 2016
Scala Days NYC 2016
 
Conventional and Reasonable
Conventional and ReasonableConventional and Reasonable
Conventional and Reasonable
 
PRESENTATION ON PYTHON.pptx
PRESENTATION ON PYTHON.pptxPRESENTATION ON PYTHON.pptx
PRESENTATION ON PYTHON.pptx
 
Torch: a scientific computing framework for machine-learning practitioners
Torch: a scientific computing framework for machine-learning practitionersTorch: a scientific computing framework for machine-learning practitioners
Torch: a scientific computing framework for machine-learning practitioners
 
Fundamental of deep learning
Fundamental of deep learningFundamental of deep learning
Fundamental of deep learning
 
Learning puppet chapter 2
Learning puppet chapter 2Learning puppet chapter 2
Learning puppet chapter 2
 
Lotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScript
Lotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScriptLotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScript
Lotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScript
 
EKON 12 Running OpenLDAP
EKON 12 Running OpenLDAP EKON 12 Running OpenLDAP
EKON 12 Running OpenLDAP
 

Mehr von Tommaso Teofili

Affect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IRAffect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IRTommaso Teofili
 
Flexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakFlexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakTommaso Teofili
 
Data replication in Sling
Data replication in SlingData replication in Sling
Data replication in SlingTommaso Teofili
 
Search engines in the industry
Search engines in the industrySearch engines in the industry
Search engines in the industryTommaso Teofili
 
Scaling search in Oak with Solr
Scaling search in Oak with Solr Scaling search in Oak with Solr
Scaling search in Oak with Solr Tommaso Teofili
 
Machine learning with Apache Hama
Machine learning with Apache HamaMachine learning with Apache Hama
Machine learning with Apache HamaTommaso Teofili
 
Adapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGiAdapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGiTommaso Teofili
 
Domeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and ClerezzaDomeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and ClerezzaTommaso Teofili
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in SolrTommaso Teofili
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
Apache UIMA - Hands on code
Apache UIMA - Hands on codeApache UIMA - Hands on code
Apache UIMA - Hands on codeTommaso Teofili
 
Apache UIMA Introduction
Apache UIMA IntroductionApache UIMA Introduction
Apache UIMA IntroductionTommaso Teofili
 
OSS Enterprise Search EU Tour
OSS Enterprise Search EU TourOSS Enterprise Search EU Tour
OSS Enterprise Search EU TourTommaso Teofili
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platformTommaso Teofili
 
Information Extraction with UIMA - Usecases
Information Extraction with UIMA - UsecasesInformation Extraction with UIMA - Usecases
Information Extraction with UIMA - UsecasesTommaso Teofili
 
Apache UIMA and Metadata Generation
Apache UIMA and Metadata GenerationApache UIMA and Metadata Generation
Apache UIMA and Metadata GenerationTommaso Teofili
 
Data and Information Extraction on the Web
Data and Information Extraction on the WebData and Information Extraction on the Web
Data and Information Extraction on the WebTommaso Teofili
 
Apache UIMA and Semantic Search
Apache UIMA and Semantic SearchApache UIMA and Semantic Search
Apache UIMA and Semantic SearchTommaso Teofili
 

Mehr von Tommaso Teofili (19)

Affect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IRAffect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IR
 
Flexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakFlexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit Oak
 
Data replication in Sling
Data replication in SlingData replication in Sling
Data replication in Sling
 
Search engines in the industry
Search engines in the industrySearch engines in the industry
Search engines in the industry
 
Scaling search in Oak with Solr
Scaling search in Oak with Solr Scaling search in Oak with Solr
Scaling search in Oak with Solr
 
Machine learning with Apache Hama
Machine learning with Apache HamaMachine learning with Apache Hama
Machine learning with Apache Hama
 
Adapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGiAdapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGi
 
Oak / Solr integration
Oak / Solr integrationOak / Solr integration
Oak / Solr integration
 
Domeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and ClerezzaDomeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and Clerezza
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in Solr
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Apache UIMA - Hands on code
Apache UIMA - Hands on codeApache UIMA - Hands on code
Apache UIMA - Hands on code
 
Apache UIMA Introduction
Apache UIMA IntroductionApache UIMA Introduction
Apache UIMA Introduction
 
OSS Enterprise Search EU Tour
OSS Enterprise Search EU TourOSS Enterprise Search EU Tour
OSS Enterprise Search EU Tour
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platform
 
Information Extraction with UIMA - Usecases
Information Extraction with UIMA - UsecasesInformation Extraction with UIMA - Usecases
Information Extraction with UIMA - Usecases
 
Apache UIMA and Metadata Generation
Apache UIMA and Metadata GenerationApache UIMA and Metadata Generation
Apache UIMA and Metadata Generation
 
Data and Information Extraction on the Web
Data and Information Extraction on the WebData and Information Extraction on the Web
Data and Information Extraction on the Web
 
Apache UIMA and Semantic Search
Apache UIMA and Semantic SearchApache UIMA and Semantic Search
Apache UIMA and Semantic Search
 

Kürzlich hochgeladen

#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 

Kürzlich hochgeladen (20)

#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 

Text Categorization with Lucene and Solr

  • 1. Text categorization with Lucene and Solr Tommaso Teofili tommaso [at] apache [dot] org 1
  • 2. About me l  ASF member having fun with: l  Lucene / Solr l  Hama l  UIMA l  Stanbol l  … some others l  SW engineer @ Adobe R&D 2
  • 3. Agenda l  Classification l  Lucene classification module l  Solr text categorization services l  Conclusions 3
  • 4. Classification l  Let the algorithm assign one or more labels (classes) to some item given some previous knowledge l  Spam filter l  Tagging system l  Digit recognition system l  Text categorization l  etc. 4
  • 6. The short story l  Lucene already has a lot of features for common information retrieval needs l  Postings l  Term vectors l  Statistics l  Positions l  TF / IDF l  maybe Payloads l  etc. l  We may avoid bringing in new components to do classification just leveraging what we get for free from Lucene 6
  • 7. The (slightly) longer story #1 l  While playing with NLP stuff l  Need to implement a naïve bayes classifier l  Not possible to plug in stuff requiring touching the architecture l  Not really interested in (near) real time performance l  Iteration 1 l  Plain in memory Java stuff l  Iteration 2 l  Same stuff but using Lucene instead of loading things into memory l  Too much faster J 7
  • 8. The (slightly) longer story #2 l  So I realized l  Lucene has so many features stored you can take advantage of for free l  Therefore writing the classification algorithm is relatively simple l  In many cases you’re just not adding anything to the architecture l  Your Lucene index was already there for searching l  Lucene index is, to some extent, already a model which we just need to “query” with the proper algorithm l  And it is fast enough 8
  • 9. Lucene classification module l  Work in progress on trunk l  LUCENE-4345 l  Establishing classification API l  With currently two implementations l  Naïve bayes l  K nearest neighbor 9
  • 10. Lucene classification module l  Classifier API l  Training l  void train(atomicReader, contentField, classField, analyzer) throws IOException l  atomicReader : the reader on the Lucene index to use for classification l  still unsure if IR’d be better l  textFieldName : the name of the field which contains documents’ texts l  classFieldName : the name of the field which contains the class assigned to existing documents l  analyzer : the item used for analyzing the unseen texts 10
  • 11. Lucene classification module l  Classifier API l  Classifying l  ClassificationResult assignClass(String text) throws IOException l  text: the unseen text of the document to classify l  ClassificationResult : the object containing the assigned class along with the related score 11
  • 12. K Nearest neighbor classifier l  Fairly simple classification algorithm l  Given some new unseen item l  I search in my knowledge base the k items which are nearer to the new one l  I get the k classes assigned to the k nearest items l  I assign to the new item the class that is most frequent in the k returned items 12
  • 13. K Nearest neighbor classifier l  How can we do this in Lucene? l  We have VSM for representing documents as vectors and eventually find distances l  Lucene MoreLikeThis module can do a lot for it l  Given a new document l  It’s represented as a MoreLikeThisQuery which filters out too frequent words and helps on keeping only the relevant tokens for finding the neighbors l  The query is executed returning only the first k results l  The result is then browsed in order to find the most frequent class and that is then assigned with a score of classFreq / k 13
  • 14. Naïve Bayes classifier l  Slightly more complicated l  Based on probabilities l  C = argmax( P(d|c) * P(c) ) l  P(d|c) : likelihood l  P(c) : prior l  With some assumptions: l  bag of words assumption: positions don't matter l  conditional independence: the feature probabilities are independent given a class 14
  • 15. Naïve Bayes classifier l  Prior calculation is easy l  It’s the relative frequency of each class #docsWithClassC / #docs l  Likelihood is easy too because of the bag of words assumption l  P(d|c) := P(x1,..,xn|c) == P(x1|c)*...P(xn|c) l  So we just need probabilities of single terms l  P(x|c) := (tf of x in documents with class c + 1)/ (#terms in docs with class c + #docs) 15
  • 16. Naïve Bayes classifier l  Does the bag of words assumption affect the classifier’s precision? l  Yes in theory l  in text documents (nearby) words are strictly correlated l  Not always in practice l  depending on your index data it may or not have an impact 16
  • 17. Using different indexes l  The Classifier API makes usage of an AtomicReader to get the data for the training l  It must not be the very same index used for every day index / search l  For performance reasons l  For enhancing classifier effectiveness l  Using more specific analyzers l  Indexing data in a different way l  e.g. one big document for each class and use kNN (with a small k) or TF-IDF similarity 17
  • 18. Things to consider - bootstrapping l  How are your first documents classified? l  Manually l  Categories are already there in the documents l  Someone is explicitly charged to do that (e.g. article authors) at some point in time l  (semi) automatically l  Using some existing service / library l  With or without human supervision l  In either case the classifier needs something to be fed with to be effective 18
  • 19. Things to consider – tokenizing l  How are your content field tokenized? l  Whitespace l  It doesn’t work for each language l  Standard l  Sentence l  What about using N-Grams? l  What about using Shingles? 19
  • 20. Things to consider - filtering l  Some words may / should be filtered while l  Training l  Classifying l  Often l  Stopwords l  Punctuation l  Not relevant PoS tagged tokens 20
  • 21. Raw benchmarking l  Tried both algorithms on ~1M docs index l  Naïve bayes is affected by the # of classes l  kNN is affected by k being large l  None of them took more than 1-2m to train even with great number of classes or large k values 21
  • 22. From Lucene to Solr l  The Lucene classifiers can be easily used in Solr l  As specific search services l  A classification based more like this l  While indexing l  For automatic text categorization 22
  • 23. Classification based MLT l  Use case: l  “give me all the documents that belong to the same category of a new not indexed document” l  Slightly different from basic MLT since it does not return the nearest docs l  That is useful if the user doesn’t want / need to index the document and still want to find all the other documents of the same category, whatever this category means 23
  • 24. Classification based MLT l  ClassificationMLTHandler l  String d = req.getParams().get(DOC); l  ClassificationResult r = classifier.assignClass(d); l  String c = r.getAssignedClass(); l  req.getSearcher().search(new TermQuery(new Term(classFieldName, c)), rows); 24
  • 25. Automatic text categorization l  Once a doc reaches Solr l  We can use the Lucene classifiers to automate assigning document’s category l  We can leverage existing Solr facilites for enhancing the indexing pipeline l  An UpdateChain can be decorated with one or more UpdateRequestProcessors 25
  • 26. Automatic text categorization l  Configuration l  <updateRequestProcessorChain name=“ctgr"> l  <processor class=”solr.CategorizationUpdateRequestProces sorFactory"> l  <processor class="solr.RunUpdateProcessorFactory" /> l  </updateRequestProcessorChain> 26
  • 27. Automatic text categorization l  CategorizationUpdateRequestProcessor l  void processAdd(AddUpdateCommand cmd) throws IOException l  String text = solrInputDocument.getFieldValue(“text”); l  String class = classifier.assignClass(text); l  solrInputDocument.addField(“cat”, class); l  Every now and then need to retrain to get latest stuff in the current index, but that can be done in the background without affecting performances 27
  • 28. Automatic text categorization l  CategorizationUpdateRequestProcessor l  Finer grained control l  Use automatic text categorization only if a value does not exist for the “cat” field l  Add the classifier output class to the “cat” field only if it’s above a certain score 28
  • 29. Wrap up l  Simple classifiers with no or little effort l  No architecture change l  Both available to Lucene and Solr l  Still reasonably fast l  A lot more can be done l  Implement a MaxEnt Lucene based classifier l  which takes into account words correlation 29
  • 30. Thanks! 30