Ads final project

Information Retrieval Theory:
A case study involving Apache Lucene + Solr: a
distributed search engine
By
Alok Dhamanaskar
Manuel Correa

Outline
●Problem Description
●About Lucene, Solr, and Hadoop HDFS
●Solution: Implementation
●Data tested
●Demo
●Conclusions
●Questions

Problem Description
●Search large data sets spanning thousands of
documents, databases, etc. with a simple query
●Standard SQL does not support ranked full-text search
across multiple fields, or other data mining operations
●Data might contain geospatial information, which calls
for searching by distance and buffers
●Most data-intensive applications also demand
high availability and persistent storage

About Lucene, Solr, and Hadoop HDFS
●Lucene: Java Index Engine
oRanked searching
oQuery types: phrase, wildcard, proximity
oSearching by document field
oSorting
●Solr:
oWeb application that interacts with the Lucene engine
oRESTful interfaces for searching, indexing, deleting, etc.
oExtends Lucene: geospatial search, schema integration, monitoring,
index sharding
●Hadoop HDFS:
oDistributed file system
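
As a minimal sketch of how the Lucene query types above could be sent through Solr's HTTP search interface, the snippet below only builds the request URLs and does not contact a server. The host, port, and path in SOLR_SELECT and the field names "name" and "description" are assumptions for illustration, not values from the project.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; host, port, and path are assumptions.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_select_url(query, rows=10, sort=None, base=SOLR_SELECT):
    """Return a GET URL for Solr's select handler."""
    params = {"q": query, "rows": rows, "wt": "json"}
    if sort:
        params["sort"] = sort
    return base + "?" + urlencode(params)


# Lucene query syntax for the query types listed above.
# The field names "name" and "description" are hypothetical.
phrase_query = 'name:"fire station"'                # exact phrase
wildcard_query = "name:hosp*"                       # prefix wildcard
proximity_query = 'description:"child support"~5'   # terms within 5 positions

for q in (phrase_query, wildcard_query, proximity_query):
    # "score desc" asks for results ranked by relevance.
    print(build_select_url(q, sort="score desc"))
```

Any HTTP client can issue the resulting URLs, which is what makes integration from other systems straightforward.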

Project Implementation
●Hadoop deployment (Cloudera distribution)
●Integration with the file system through FUSE
●Setup of Solr instances
●Data manipulation in the databases (Oracle and SQL Server)
●Indexing of data from the databases into Solr
●Distributed search implementation in Solr
●Development of a Solr client web application
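
The distributed search step can be illustrated with Solr's "shards" request parameter, which asks one instance to fan a query out to several index shards and merge the results. The sketch below only constructs such a request. The two shard addresses are hypothetical, and the slides do not say how the project's instances were actually laid out.

```python
from urllib.parse import urlencode


def build_distributed_url(coordinator, shards, query, rows=10):
    """Build a select URL that asks one Solr instance to query all shards."""
    params = {
        "q": query,
        "rows": rows,
        "wt": "json",
        # Comma-separated host:port/path entries, without the http:// prefix.
        "shards": ",".join(shards),
    }
    return "http://" + coordinator + "/select?" + urlencode(params)


# Hypothetical shard addresses (two Solr instances on one machine).
shards = ["localhost:8983/solr", "localhost:7574/solr"]

# Any instance can act as the coordinator; here the first shard does.
url = build_distributed_url(shards[0], shards, "name:clinic")
print(url)
```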

Data tested
●Public data from the Carl Vinson Institute of Government -
ITOS
o10 schemas indexed
oMore than 500 columns indexed, including location
information
oApproximately 200,000 documents created
oMore than 15,000,000 data items indexed for each
document
oInformation related to: government buildings, clinics,
hospitals, fire stations, teen centers, service facilities,
shelters, child support offices, historical resources, and
archaeological sites
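
Indexed location information of this kind could be searched by distance with Solr's geofilt filter, which keeps only documents within a given radius (in kilometers) of a point, much like a buffer in GIS terms. The sketch below builds the request parameters only. The field name "location", the "type" field, and the coordinates are assumptions, not part of the tested data set.

```python
from urllib.parse import urlencode


def build_geofilt_params(lat, lon, radius_km, field="location", query="*:*"):
    """Return Solr parameters that restrict results to a radius around a point."""
    # geofilt takes the spatial field (sfield), the center point (pt)
    # as "lat,lon", and the distance (d) in kilometers.
    fq = "{!geofilt sfield=%s pt=%s,%s d=%s}" % (field, lat, lon, radius_km)
    return {"q": query, "fq": fq, "wt": "json"}


# Hypothetical example: hospitals within 5 km of a point.
params = build_geofilt_params(33.9519, -83.3576, 5, query="type:hospital")
print(urlencode(params))
```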

DEMO
●Hadoop HDFS in pseudo-distributed
mode
●HDFS mounted with FUSE
●Configuration of Solr instances
●Solr client web application

Conclusions
●HDFS offers high availability for storing the index documents
●Solr offers a lightweight way to implement a powerful search
engine
●Solr is a "cheap" way to implement a basic geospatial search
engine
●Solr's RESTful API makes it easy to integrate with any enterprise
system
