Solr 4
1. Solr 4.1
Abhey Gupta
Software Engineer (Java)
Value First Digital Pvt Ltd
2. Outline
This presentation works through a series of questions to assess the
usability of Solr in MIS 3:
– What is Solr ?
– Why use Solr ?
– Current Scenario
– Scope of Improvement
– Indexing Data
– Import MIS Data
– Challenges
– Demo
– Query Example
3. What is Solr?
Solr is the popular, blazing fast open source enterprise search platform
from the Apache Lucene project.
• Its major features include powerful full-text search, hit
highlighting, faceted search, dynamic clustering, database
integration, rich document (e.g., Word, PDF) handling, and
geospatial search.
• Solr is highly scalable, providing distributed search and index
replication, and it powers the search and navigation features
of many of the world's largest internet sites.
4. What is Lucene?
Apache Lucene is a high-performance, full-featured text search engine library
written entirely in Java. It is a technology suitable for nearly any application
that requires full-text search, especially cross-platform.
– An open source Java-based IR library with best practice indexing
and query capabilities, fast and lightweight search and indexing.
– 100% Java (.NET, Perl and other versions too).
– Stable, mature API.
– Continuously improved and tuned over more than 10 years.
– Cleanly implemented, easy to embed in an application.
– Compact, portable index representation.
– Programmable text analyzers, spell checking and highlighting.
– Not a crawler or a text extraction tool.
5. Who uses Lucene/Solr?
Here are five noteworthy public sites that use Solr to handle search:
– WhiteHouse.gov – The Obama administration's keystone web site runs
on Drupal and Solr!
– Netflix – Solr powers basic movie searching on this extremely busy
site.
– Internet Archive – Search this vast repository of music, documents
and video using Solr.
– StubHub.com – This ticket reseller uses Solr to help visitors search
for concerts and sporting events.
– The Smithsonian Institution – Search the Smithsonian’s collection of
over 4 million items.
7. Why use Solr?
Assuming the user has a relational DB, why use Solr? If your use case
requires a person to type words into a search box, you want a text search
engine like Solr.
Databases and Solr have complementary strengths and weaknesses.
SQL supports very simple wildcard-based text search with some simple
normalization like matching upper case to lower case. The problem is that
these are full table scans. In Solr all searchable words are stored in an
"inverted index", which searches orders of magnitude faster.
For details, please consult the link below:
– http://wiki.apache.org/solr/WhyUseSolr
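The difference between a table scan and an inverted index lookup can be illustrated with a toy example. This is only a minimal sketch of the idea, not how Solr or Lucene is implemented; the sample rows and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for a message table.
ROWS = {
    1: "Your order has shipped",
    2: "Order received, thank you",
    3: "Payment failed for your card",
}


def full_scan(rows, word):
    """Inspect every row, roughly what a LIKE '%word%' query has to do."""
    word = word.lower()
    return sorted(rid for rid, text in rows.items() if word in text.lower())


def tokenize(text):
    """Very simple analyzer: split on whitespace, strip punctuation, lowercase."""
    return [t.strip(".,!?").lower() for t in text.split()]


def build_inverted_index(rows):
    """Map each token to the set of row ids that contain it."""
    index = defaultdict(set)
    for rid, text in rows.items():
        for token in tokenize(text):
            if token:
                index[token].add(rid)
    return index


def index_lookup(index, word):
    """A single dictionary lookup instead of a pass over all rows."""
    return sorted(index.get(word.lower(), set()))


INDEX = build_inverted_index(ROWS)
```

The index is built once at indexing time; each search is then a lookup by token. Note that the scan matches substrings while the index matches whole tokens, so the two can differ for partial words.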
8. Current Scenario
Currently, MIS 3 uses MySQL full-text search for text-based search, which
lags behind Solr in query speed and text search capability.
[Diagram: USER -> (1. full-text search) -> MIS UI (80) ->
(2. full-text search query) -> MYSQL -> (3. full table scan for text
search) -> (4. result)]
9. Scope of Improvement
Instead of querying MySQL for text search, we can deploy Solr in between.
Solr returns the result, and because it searches an inverted index, the
querying is fast and efficient.
[Diagram: components USER, MIS UI (80), SOLR and MYSQL.
USER -> (1. full-text search) -> MIS UI (80) ->
(2. full-text search query) -> SOLR -> (3. scan for tokenized text,
search in inverted index) -> (4. result)]
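Step 2 of this flow, the query from the MIS UI to Solr, is a plain HTTP request to Solr's select handler. The sketch below only builds the request URL. The host, the core name `mis` and the field name `message` are assumptions for illustration, not values from MIS 3.

```python
from urllib.parse import urlencode


def escape_phrase(text):
    """Escape backslashes and double quotes so the text is safe inside a quoted phrase."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_select_url(base_url, core, text, field="text", rows=10):
    """Build a Solr /select URL that searches one field for a phrase.

    base_url, core and field are hypothetical and must match the real deployment.
    """
    params = {
        "q": f'{field}:"{escape_phrase(text)}"',  # phrase query on one field
        "rows": rows,                             # maximum number of results
        "wt": "json",                             # ask for a JSON response
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"


URL = build_select_url("http://localhost:8983/solr", "mis", "hello world", field="message")
```

The resulting URL can be fetched with any HTTP client; the MIS UI would then read the matching documents from the JSON response.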
10. Indexing Data in Solr
A Solr index can accept data from many different sources, including XML
files, comma-separated value (CSV) files, data extracted from tables in a
database, and files in common file formats such as Microsoft Word or PDF.
Here are the most common ways of loading data into a Solr index:
– Importing data from a structured data store (such as a database) with
the Data Import Handler.
– Using the Solr Cell framework built on Apache Tika for ingesting
binary files or structured files such as Office, Word, PDF, and other
proprietary formats.
– Uploading XML files by sending HTTP requests to the Solr server
from any environment where such requests can be generated.
– Writing a custom Java application to ingest data through Solr's Java
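As an illustration of the XML option, the sketch below builds the `<add>` message that Solr's XML update format expects. The field names `id` and `text` are assumptions and would have to match the fields defined in the Solr schema. Sending the message is left out; it would be an HTTP POST of this body to the update handler.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Turn a list of dicts into a Solr <add> update message.

    Each dict becomes one <doc>; each key/value pair becomes one <field>.
    """
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)  # ElementTree escapes <, > and & on output
    return ET.tostring(add, encoding="unicode")


# Hypothetical document with assumed field names.
XML_BODY = build_add_xml([{"id": "m1", "text": "a < b"}])
```

Documents added this way only become searchable after a commit.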
11. Indexing MIS Data
MIS has structured data on the MIS servers and structured files on the
services servers, so we can index the data in two ways:
– Data Import Handler on the MIS database
• This is easier to manage, as it needs to be deployed only on
the MIS servers, which are very few.
• We can import data incrementally (delta import).
– Script to import CSV files from the services
• This increases the effort of managing and deploying scripts
on the services servers.
• Partial import needs to be implemented for DLRLOG data.
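The idea behind a delta (incremental) import is to fetch only the rows changed since the last indexing run. The sketch below shows that selection step with an in-memory SQLite table standing in for MySQL. The column names, including `last_modified`, are assumptions; the real MIS tables may track changes differently.

```python
import sqlite3


def fetch_delta(conn, last_index_time):
    """Return only rows modified after the previous indexing run."""
    cur = conn.execute(
        "SELECT messageid, text FROM mtlog "
        "WHERE last_modified > ? ORDER BY messageid",
        (last_index_time,),
    )
    return cur.fetchall()


def demo():
    """Build a small hypothetical table and select the rows changed since a cutoff."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mtlog (messageid TEXT, text TEXT, last_modified TEXT)")
    conn.executemany(
        "INSERT INTO mtlog VALUES (?, ?, ?)",
        [
            ("m1", "old message", "2013-01-01 10:00:00"),
            ("m2", "new message", "2013-01-02 09:30:00"),
            ("m3", "newer message", "2013-01-02 11:00:00"),
        ],
    )
    try:
        # ISO-style timestamps compare correctly as strings.
        return fetch_delta(conn, "2013-01-01 23:59:59")
    finally:
        conn.close()
```

With the Data Import Handler the same selection is expressed as a delta query in its configuration, and the handler keeps track of the last index time itself.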
12. Indexing Bean
Solr can also index beans directly. In the services we build a bean for
every MT and DLR, and we can import these directly into Solr.
This could create unnecessary load, as the API would index one bean per
message.
[Diagram: API -> SOLR, with one index-data call per MT and DLR]
13. Data Import Handler
We can import data from MySQL into the Solr index. We can do this in two
ways: distributed or central.
[Diagram: two layouts of SOLR and MYSQL nodes, one distributed and one
central]
14. Import CSV
We can import data into the Solr index from each service. We can also do
this in two ways: distributed or central.
[Diagram: two layouts of services, each with an import SCRIPT, feeding
SOLR, one distributed and one central]
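A CSV import script on a service server has to remember which files it has already sent. The sketch below shows one possible approach: a plain text ledger of processed file names. The directory layout, the ledger file and the `post_csv` callable are all hypothetical. In a real script `post_csv` would send the bytes to Solr's CSV update handler.

```python
import tempfile
from pathlib import Path


def import_new_csv_files(csv_dir, ledger_path, post_csv):
    """Send every CSV file not yet listed in the ledger, then record it."""
    ledger = Path(ledger_path)
    done = set(ledger.read_text().splitlines()) if ledger.exists() else set()
    imported = []
    for path in sorted(Path(csv_dir).glob("*.csv")):
        if path.name in done:
            continue  # already sent in an earlier run
        post_csv(path.read_bytes())
        # Record the file only after the post succeeded.
        with ledger.open("a") as fh:
            fh.write(path.name + "\n")
        imported.append(path.name)
    return imported


def run_demo():
    """Run the import twice; the second run must pick up only the new file."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        ledger = tmp / "processed.txt"
        sent = []  # stands in for the Solr server
        (tmp / "a.csv").write_text("id,text\nm1,hello\n")
        first = import_new_csv_files(tmp, ledger, sent.append)
        (tmp / "b.csv").write_text("id,text\nm2,world\n")
        second = import_new_csv_files(tmp, ledger, sent.append)
        return first, second, len(sent)
```

Writing the ledger entry after the post means a failed post is retried on the next run, at the cost of a possible duplicate send if the script dies between the two steps.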
15. Challenges
Every import scenario trades its advantages off against some disadvantages
and challenges.
For example:
– DIH: The Data Import Handler requires SQL joins to import data
from mttextlog, mtlog and dlrlog.
• Alternatively, we can get the message IDs from Solr and then query
MySQL again for the complete data, using an IN clause on the message ID.
– CSV Import: It requires scripts to be deployed on every service
server, and a lot of bookkeeping to track which files have been
processed and which have not.
– BEAN Import : It requires changes at API level and could result into