Information Retrieval Theory:
A case study involving Apache Lucene + Solr: A
distributed Search Engine
By
Alok Dhamanaskar
Manuel Correa
Outline
● Problem Description
● About Lucene, Solr, and Hadoop HDFS
● Solution: Implementation
● Data tested
● Demo
● Conclusions
● Questions
Problem Description
● Search large data sets spanning
thousands of documents, databases, etc. with a
simple query
● SQL alone does not support full-text search across
multiple fields with ranking and other data
mining
● Data might include geospatial data and require
operations such as searching by distance and by buffer
● Most data-intensive applications also demand
high availability and persistence
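To make "searching by distance" concrete, here is a minimal sketch in plain Python that keeps only the records within a given radius of a point, using the haversine great-circle formula. It does not show how Solr does this internally. The record layout, names, and coordinates are hypothetical, and the Earth radius is the usual mean-radius approximation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_distance(records, center, radius_km):
    """Return the records whose (lat, lon) lies within radius_km of center."""
    lat0, lon0 = center
    return [r for r in records if haversine_km(lat0, lon0, r["lat"], r["lon"]) <= radius_km]


# Hypothetical sample records; names and coordinates are illustrative only.
facilities = [
    {"name": "Clinic A", "lat": 33.95, "lon": -83.38},
    {"name": "Fire Station B", "lat": 33.96, "lon": -83.37},
    {"name": "Hospital C", "lat": 34.95, "lon": -83.38},
]
nearby = within_distance(facilities, (33.95, -83.38), 5.0)
```

A linear scan like this is fine for a handful of records. At the scale described in this deck, the filtering would normally be left to the search engine's spatial index.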
About Lucene, Solr, and Hadoop HDFS
● Lucene: Java indexing and search engine
o Ranked searching
o Query types: phrase, wildcard, proximity
o Searching on individual document fields
o Sorting
● Solr:
o Web application that interacts with the Lucene engine
o RESTful interfaces for searching, indexing, deleting, etc.
o Extends Lucene: geospatial search, schema integration, monitoring,
index sharding
● Hadoop HDFS:
o Distributed file system
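The three query types listed above map onto the standard Lucene query syntax, which Solr accepts in the `q` parameter of its select handler. The sketch below only builds query strings and a request URL; it sends nothing. The base URL and the field name `name` are assumptions for illustration, and the helpers do not escape special characters in user input.

```python
from urllib.parse import urlencode

# Assumed address of a local Solr instance; adjust host, port, and path as needed.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def phrase_query(field, phrase):
    """Exact phrase match, e.g. name:"fire station"."""
    return f'{field}:"{phrase}"'


def wildcard_query(field, prefix):
    """Prefix wildcard match, e.g. name:hosp*."""
    return f"{field}:{prefix}*"


def proximity_query(field, phrase, slop):
    """Terms within `slop` positions of each other, e.g. name:"child support"~3."""
    return f'{field}:"{phrase}"~{slop}'


def select_url(query, rows=10, sort=None):
    """Build a URL-encoded select request that asks for a JSON response."""
    params = {"q": query, "rows": rows, "wt": "json"}
    if sort:
        params["sort"] = sort  # e.g. "name asc"
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = select_url(wildcard_query("name", "hosp"), rows=5)
```

Fetching `url` with any HTTP client returns the ranked matches. Passing `sort="name asc"` overrides the default ordering by relevance score.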
Project Implementation
● Deployment of Hadoop (Cloudera distribution)
● Integration with the file system through FUSE
● Setup of Solr instances
● Data preparation in the databases (Oracle and SQL Server)
● Indexing of data from the databases into Solr
● Distributed search implementation in Solr
● Solr client web application development
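For the distributed search step, one common approach is Solr's `shards` request parameter: a query sent to any one instance is fanned out to every listed shard and the results are merged. The slides do not say which mechanism the project used, so the following is a sketch under that assumption. The host names are hypothetical.

```python
from urllib.parse import urlencode


def distributed_select_url(coordinator, shards, query, rows=10):
    """Build a select URL that asks one Solr instance to query all listed shards."""
    params = {
        "q": query,
        "rows": rows,
        # Comma-separated host:port/path entries, written without the http:// scheme.
        "shards": ",".join(shards),
    }
    return f"http://{coordinator}/select?" + urlencode(params)


# Hypothetical shard addresses; any shard can act as the coordinator.
SHARDS = ["solr1:8983/solr", "solr2:8983/solr"]
url = distributed_select_url(SHARDS[0], SHARDS, "*:*")
```

The client only talks to the coordinating instance, so adding a shard means changing the list, not the client code.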
Solution: Implementation
Data tested
● Public data from the Carl Vinson Institute of Government -
ITOS
o 10 schemas indexed
o More than 500 columns indexed, including location
information
o Approximately 200,000 documents created
o More than 15,000,000 data items indexed for each
document
o Information related to: government buildings, clinics,
hospitals, fire stations, teen centers, service facilities,
shelters, child support offices, historical resources, and
archaeological sites
DEMO
● Hadoop HDFS pseudo-distributed
setup
● HDFS mounted with FUSE
● Solr instance configuration
● Solr client web application
Conclusions
● HDFS offers highly available storage for index documents
● Solr offers a lightweight way to implement a powerful search
engine
● Solr is a "cheap" way to implement a basic geospatial search
engine
● Solr's RESTful API makes it easy to integrate with any enterprise
system
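As an illustration of the integration point, the sketch below turns database rows into a JSON payload of the kind Solr's update handler accepts (a JSON array of documents). It is not the project's actual indexing code, which the slides do not show. The column names, the sample row, and the "lat,lon" string format for the location field are assumptions.

```python
import json


def row_to_doc(columns, row, id_column):
    """Map one database row (a tuple) to a Solr-style document, skipping NULL values.

    Raises KeyError if the id column is missing or NULL, since every
    document needs a unique key.
    """
    doc = {col: val for col, val in zip(columns, row) if val is not None}
    doc["id"] = str(doc.pop(id_column))
    return doc


def build_update_payload(columns, rows, id_column):
    """Serialize rows as a JSON array of documents, ready to POST to the update handler."""
    return json.dumps([row_to_doc(columns, row, id_column) for row in rows])


# Hypothetical column names and a single sample row.
COLUMNS = ("FACILITY_ID", "NAME", "PHONE", "LOCATION")
ROWS = [(101, "Clinic A", None, "33.95,-83.38")]
payload = build_update_payload(COLUMNS, ROWS, "FACILITY_ID")
```

The payload would then be sent with an HTTP POST and a JSON content type, followed by a commit. Any system that can issue HTTP requests can index data this way.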
Questions?
