3. Problem Description
●Search large data sets spread across
thousands of documents, databases, etc. with a
simple query
●SQL does not support full-text search across
multiple fields with ranking and other data-mining
features
●Data may include geospatial information that requires
operations such as searching by distance and by buffer
●Most data-intensive applications also demand
high availability and persistence
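As a rough illustration of the distance and buffer searches mentioned above, the sketch below computes great-circle distance with the haversine formula and keeps only the points inside a circular buffer. The function names and the mean Earth radius are assumptions made for illustration; they are not taken from the project.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_buffer(points, center, radius_km):
    """Return the (lat, lon) points lying within radius_km of center."""
    return [p for p in points
            if haversine_km(center[0], center[1], p[0], p[1]) <= radius_km]
```

A search engine with spatial support applies the same idea against an index instead of scanning every point.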
4. About Lucene, Solr, and Hadoop HDFS
●Lucene: Java Index Engine
oRanked searching
oQuery types: phrase, wildcard, proximity
oSearching by document field
oSorting
●Solr:
oWeb application that interacts with the Lucene engine
oRESTful interfaces for searching, indexing, deleting, etc.
oExtends Lucene: geospatial search, schema integration, monitoring,
index sharding
●Hadoop HDFS
oDistributed file system
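The query types listed above can be sent to Solr's select handler over HTTP. The sketch below only builds the request URL; the host, the core name "facilities", and the field names are hypothetical, while the phrase, wildcard, and proximity forms follow standard Lucene query syntax.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, query, **params):
    """Build a Solr select URL; extra keyword arguments become query parameters."""
    args = {"q": query, "wt": "json"}
    args.update(params)
    return f"{base_url}/{core}/select?{urlencode(args)}"


# Hypothetical field names, used only to show the query syntax.
EXAMPLE_QUERIES = {
    "phrase": 'name:"fire station"',               # exact phrase in one field
    "wildcard": "name:hosp*",                      # prefix wildcard
    "proximity": 'description:"child support"~3',  # terms within 3 positions
}

url = build_select_url(
    "http://localhost:8983/solr",  # assumed local Solr instance
    "facilities",                  # hypothetical core name
    EXAMPLE_QUERIES["phrase"],
    sort="name asc",
    rows=10,
)
```

Fetching this URL with any HTTP client would return the ranked matches as JSON.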
5. Project Implementation
●Implementation of Hadoop (Cloudera distribution)
●Integration with the file system through FUSE
●Setup of Solr instances
●Data manipulation in the databases (Oracle and SQL Server)
●Data indexing from the databases into Solr
●Distributed Search implementation in Solr
●Solr Client Web Application development
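One way to picture the distributed search step is Solr's "shards" request parameter, which lists the cores that the receiving instance should query and merge. The sketch below is an assumption about how such a request could be built; the host names and core name are hypothetical and do not describe the project's actual topology.

```python
from urllib.parse import urlencode


def build_distributed_url(coordinator, shard_cores, query):
    """Build a select URL that asks one Solr core to fan the query out to several shards.

    coordinator and shard_cores are "host:port/solr/core" strings without a scheme,
    which is the form the shards parameter expects.
    """
    args = {"q": query, "shards": ",".join(shard_cores), "wt": "json"}
    return f"http://{coordinator}/select?{urlencode(args)}"


# Hypothetical two-node layout.
shards = ["host1:8983/solr/core1", "host2:8983/solr/core1"]
distributed_url = build_distributed_url(shards[0], shards, "*:*")
```

The instance that receives the request queries each listed shard and returns a single merged, ranked result set.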
7. Data tested
●Public Data from the Carl Vinson Institute of Government -
ITOS
o10 schemas indexed
oMore than 500 columns indexed including Location
information
oApproximately 200,000 documents created
oMore than 15,000,000 data items indexed for each
document
oInformation related to: government buildings, clinics,
hospitals, fire stations, teen centers, service facilities,
shelters, child support offices, historical resources, and
archaeological sites
9. Conclusions
●HDFS offers high availability for storing index documents
●Solr offers a lightweight solution for implementing a powerful search
engine
●Solr is a "cheap" solution for implementing a basic geospatial search
engine
●Solr's RESTful API makes it easy to integrate with any enterprise
system
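To give a sense of how little is needed for a basic geospatial search, the sketch below builds the parameters for a radius filter using Solr's geofilt query parser. The field name "location" and the coordinates are hypothetical, and the field is assumed to be indexed with a spatial field type.

```python
from urllib.parse import urlencode


def geofilt_params(field, lat, lon, radius_km, query="*:*"):
    """Return Solr parameters that keep only documents within radius_km of a point."""
    # geofilt takes the spatial field (sfield), the center point (pt), and the distance in km (d).
    fq = f"{{!geofilt sfield={field} pt={lat},{lon} d={radius_km}}}"
    return {"q": query, "fq": fq, "wt": "json"}


# Hypothetical field name and center point.
params = geofilt_params("location", 33.95, -83.37, 10)
query_string = urlencode(params)  # append to a core's /select? URL
```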