See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011
Text analytics is a large and interesting subject, covering a wide range of topics. In the world of enterprise search, however, the usual application of text analytics rarely ranges beyond extracting semi-structured information from the source data. Yet some of the more advanced concepts in text analytics, such as automatic text categorization, can easily be leveraged to turn a search installation from a search tool into a tool for discovery.
2. What will I cover?
Intro
About Text Analytics
Benefits and possibilities
Examples
Solution Techniques to Examples
Conclusions
3. My Background
Daniel Ling
Findwise
Enterprise Search and Findability Consultant
Experience and expertise
5+ years of Enterprise Search Experience
20+ enterprise search implementations across a range of industries
Lucene, FAST ESP, Solr
Apache Solr is my primary search platform
Focus areas include Findability and Search Architecture and
Implementation, Text Analytics, and Document Processing.
5. Text Analytics in the Enterprise
Challenges:
80% of data in the Enterprise is unstructured.
Reduce the time spent looking for information (currently 9.6 hours per week)
Reduce the time spent reading documents / e-mails (currently 14.5 hours per week)
Benefits:
More predictable scale and domain
Well-understood domain
Supporting content for analytics can be identified
6. Text Analytics
The definition
A set of linguistic, statistical and machine learning techniques
used to model and structure the information content of textual
sources.
- Wikipedia.org
10. Benefits and possibilities
Text analytics can bring some structure to the unstructured content
Enhance discovery and findability of content
• Works well together with search
Increase relevance and precision with extracted keywords and metadata
Generating content for dynamic pages / topic pages
• Selection of documents and extracts from documents
Track and discover sentiments
Reduce the time users need to analyze content
16. Example Solution: Entity Extraction
Rule-based entity extraction
Combination of lists and regular expressions
Works within well-understood domains.
Requires maintaining lists.
List sources: countries from the World Factbook, public companies from
Google Finance, customers from the CRM.
Workflow: Document for indexing > Update Request Handler >
Update Chain (lookup and match entities) > Writes to index
Diagram: Update Chain (processor: lists | input fields | entity fields) > Lucene Index (entity fields)
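The lookup-and-match step can be sketched outside Solr in a few lines of Python. This is a minimal illustration, not the processor from the talk: the list contents, the e-mail pattern, the `_entity` field suffix, and the function names are all assumptions made for the example.

```python
import re

# Hypothetical lookup lists; in practice these would be loaded from
# maintained resources (country list, company list, CRM export).
ENTITY_LISTS = {
    "country": ["Sweden", "Spain", "United Kingdom"],
    "company": ["Findwise", "Acme Corp"],
}

# Hypothetical regex-based entity type: e-mail addresses.
ENTITY_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def compile_list(terms):
    """Build one case-insensitive, whole-word regex from a list of terms."""
    # Longest terms first so multi-word names win over their prefixes.
    alternatives = sorted(terms, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in alternatives) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


LIST_PATTERNS = {name: compile_list(terms) for name, terms in ENTITY_LISTS.items()}


def extract_entities(doc, input_fields=("title", "body")):
    """Return a copy of doc with a multi-valued <type>_entity field per type found."""
    text = " ".join(str(doc.get(f, "")) for f in input_fields)
    enriched = dict(doc)
    for name, pattern in {**LIST_PATTERNS, **ENTITY_PATTERNS}.items():
        # Deduplicate while keeping first-seen order.
        found = list(dict.fromkeys(m.group(0) for m in pattern.finditer(text)))
        if found:
            enriched[name + "_entity"] = found
    return enriched
```

Calling `extract_entities` on a document with `title` and `body` fields returns the same document plus fields such as `country_entity`, which is the shape an update processor would write to the index.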
17. Example Solution: Entity Extraction
Register a custom class that looks up resources and writes the entities
it finds to specific Solr fields. The class is set up in solrconfig.xml.
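The configuration shown on the original slide is not part of this transcript. As a rough sketch of what such a registration can look like, an update request processor chain in solrconfig.xml might be declared as follows. The class name `com.example.EntityExtractionProcessorFactory` and its three parameters are hypothetical placeholders; only the `updateRequestProcessorChain` element and the two `solr.*` factories are standard Solr.

```xml
<!-- Sketch only: the custom factory class and its parameters are
     hypothetical stand-ins for the class described on the slide. -->
<updateRequestProcessorChain name="entity-extraction">
  <processor class="com.example.EntityExtractionProcessorFactory">
    <str name="listFile">countries.txt</str>
    <str name="inputFields">title,body</str>
    <str name="entityField">country_entity</str>
  </processor>
  <!-- Standard processors: log the update, then write it to the index. -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The custom processor is placed before `solr.RunUpdateProcessorFactory` so that the entity fields are already on the document when it is written to the index.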
18. Document Categorization
Assign a label to the document / content / data.
Labels can represent a category or a sentiment.
A threshold value must be reached before a category label is applied.
Statistics and “knowledge” from previous examples can be used.
20. Example Solution: Document Categorization
[Figure omitted] *
Train the component: Mallet (Machine Learning for Language
Toolkit).
• Alternative components include a Lucene (TF-IDF) index
(MoreLikeThis), OpenNLP, TextCat, Classifier4J.
Run new documents against the model/index of trained
documents.
Training can be done from an interface, ad hoc, or from pre-categorized documents in the index.
* Figure from the book Taming Text.
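To make the train-then-evaluate cycle and the threshold concrete, here is a small self-contained sketch. It does not use Mallet or any of the components listed above; it substitutes a multinomial Naive Bayes classifier written from scratch, and the class name, method names, and the 0.7 default threshold are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesCategorizer:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.doc_counts = Counter()              # training documents per label
        self.term_counts = defaultdict(Counter)  # term frequencies per label
        self.vocabulary = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.doc_counts[label] += 1
        self.term_counts[label].update(tokens)
        self.vocabulary.update(tokens)

    def probabilities(self, text):
        """Return {label: posterior probability}; empty if nothing was trained."""
        if not self.doc_counts:
            return {}
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        log_scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.term_counts[label]
            total_terms = sum(counts.values())
            score = math.log(n_docs / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_terms + vocab_size))
            log_scores[label] = score
        # Normalize in log space to avoid underflow on long documents.
        peak = max(log_scores.values())
        exp_scores = {l: math.exp(s - peak) for l, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {l: v / norm for l, v in exp_scores.items()}

    def categorize(self, text, threshold=0.7):
        """Return the best label, or None when no label reaches the threshold."""
        probs = self.probabilities(text)
        if not probs:
            return None
        label, prob = max(probs.items(), key=lambda item: item[1])
        return label if prob >= threshold else None
```

Returning `None` below the threshold leaves the document unlabeled rather than forcing a weak guess, which matches the idea of requiring a minimum match before labeling.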
22. Example Solution: Document Categorization
Evaluate the new document against the trained model.
Set the evaluated category tag on the document in the pipeline.
Diagram: input document > Update Chain (processor) > Lucene Index (category field)
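The code from the original slide is not included in this transcript. The flow it describes, a processor that tags the document before it reaches the index, can be approximated with a plain-Python pipeline. Everything here is a stand-in: `keyword_classifier` is a trivial placeholder for a trained model, `InMemoryIndex` takes the place of the Lucene index, and the field names and threshold are assumed.

```python
def keyword_classifier(text):
    """Hypothetical stand-in for a trained model: returns (label, score)."""
    if "football" in text.lower():
        return ("sports", 0.9)
    return ("unknown", 0.0)


class CategoryProcessor:
    """Pipeline stage that sets a category field when the score is high enough."""

    def __init__(self, classify, input_field="body",
                 category_field="category", threshold=0.7):
        self.classify = classify  # callable: text -> (label, score)
        self.input_field = input_field
        self.category_field = category_field
        self.threshold = threshold

    def process(self, doc):
        label, score = self.classify(doc.get(self.input_field, ""))
        if score >= self.threshold:
            doc[self.category_field] = label
        return doc


class InMemoryIndex:
    """Final stage: stores documents in a list instead of a real index."""

    def __init__(self):
        self.docs = []

    def process(self, doc):
        self.docs.append(doc)
        return doc


def run_chain(processors, doc):
    """Pass the document through each stage in order, like an update chain."""
    for processor in processors:
        doc = processor.process(doc)
    return doc
```

Keeping the classifier behind a simple callable means the stage does not care whether the score comes from Mallet, a MoreLikeThis query, or another component.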
23. Document Summarization
Summarize a document, at index time or on demand.
Leverages the knowledge and term statistics of the document
and the index.
Picks the “most important” sentences based on the statistics and
displays those.
27. Example Solution: Document Summarization
Custom RequestHandler that receives the document ID and the field to
summarize.
Custom SearchComponent that selects the top sentences.
Selects a subset of sentences and sends these back in a field.
Diagram: RequestHandler (SearchComponent for summarization) and Lucene Index
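The sentence selection itself can be sketched independently of Solr. The following is one possible scoring scheme, not necessarily the one used in the talk: each term is weighted by its frequency in the document and its rarity in the index (TF-IDF), and each sentence is scored by the average weight of its terms. The `doc_freq` and `num_docs` arguments are assumed to come from index statistics; the function names and the naive sentence splitter are illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, doc_freq, num_docs, max_sentences=2):
    """Pick the highest-scoring sentences and return them in original order."""
    sentences = split_sentences(text)
    term_freq = Counter(tokenize(text))

    def weight(term):
        # TF-IDF: frequent in this document, rare across the index.
        idf = math.log((num_docs + 1) / (doc_freq.get(term, 0) + 1)) + 1.0
        return term_freq[term] * idf

    def score(sentence):
        terms = tokenize(sentence)
        # Average term weight, so length alone does not favor a sentence.
        return sum(weight(t) for t in terms) / len(terms) if terms else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return [sentences[i] for i in chosen]
```

Returning the chosen sentences in their original order keeps the extract readable when it is sent back in a response field.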
28. Wrap Up
• Examples: Entity Extraction, Document Categorization,
Summarization.
• Technology: You can take small steps and gain a great
deal, since you can leverage features and
components of Solr and Lucene (as well as other open
source NLP frameworks).
• Value: Benefits of text analytics include increased
discovery, findability and productivity from the
solution.