4. Why Search + Big Data?
What Hadoop is good at | What Search is good at
Distributed file storage | Free-text retrieval
Store large data sets | Index large data sets
Distributed processing | Textual analysis
 | Filtering and sorting
= Intelligence discovery system for large textual data sets
5. How we Integrated Search and Big Data
HBase Replication Facade
Take advantage of the results of analytical Pig and Hive jobs
in Hadoop to make retrieval more intelligent
Done with built-in replication, and it scales
Fast access, since it is in memory
Push architecture, so it's near real time
CRUD
Store in HDFS and Search in LW/Solr
Gives a reference back to the source when integrated this way
HBase has a RESTful API to retrieve data given the ID that Solr
would have after replication/indexing
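The ID-based lookup above can be sketched in Python. This is a minimal sketch, not the implementation used in the demo: it assumes an HBase REST server reachable at a base URL such as http://localhost:8080, a hypothetical table named `tweets`, and that the Solr document ID equals the HBase row key. The function names are illustrative. The HBase REST server returns row keys, column names, and values base64-encoded in its JSON responses, so the sketch decodes them.

```python
import base64
import json
import urllib.parse
import urllib.request


def row_url(base_url, table, row_key):
    """Build the HBase REST URL for a single row (GET /<table>/<row>)."""
    return "{}/{}/{}".format(
        base_url.rstrip("/"),
        urllib.parse.quote(table, safe=""),
        urllib.parse.quote(row_key, safe=""),
    )


def decode_row(payload):
    """Decode the base64-encoded keys, columns, and values of a JSON row response."""
    def b64(text):
        return base64.b64decode(text).decode("utf-8")

    rows = {}
    for row in payload.get("Row", []):
        cells = {b64(c["column"]): b64(c["$"]) for c in row.get("Cell", [])}
        rows[b64(row["key"])] = cells
    return rows


def fetch_row(base_url, table, row_key):
    """Fetch one row over HTTP; needs a running HBase REST server (assumed)."""
    request = urllib.request.Request(
        row_url(base_url, table, row_key),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return decode_row(json.load(response))
```

Given an ID from a Solr result, `fetch_row("http://localhost:8080", "tweets", doc_id)` would return the source record as a dictionary mapping row key to its columns, which is the "reference to source" the slide describes.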
7. A Use Case of this Architecture
Monitor tweets with words “Hadoop”,
“Lucidworks”, and “Big Data”
Automatically extract URLs mentioned when
talking about these terms
In near real time, visualize which URLs are
mentioned with these terms
Discover the URLs that are becoming most
popular when mentioned with the topics “Big
Data”, “Lucidworks”, and “Hadoop”; those
might be URLs you want to read
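The extraction and ranking steps of this use case can be illustrated with a small, self-contained Python sketch. It is not the demo's code (the demo does this work inside the Flume, HBase, and Solr pipeline); the regular expression, the function names, and the in-memory list of tweets are assumptions made for illustration.

```python
import re
from collections import Counter

# The monitored terms, matched case-insensitively.
TERMS = ("hadoop", "lucidworks", "big data")

# A deliberately simple URL pattern: "http(s)://..." or "www...." up to whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)[^\s<>\"']+", re.IGNORECASE)


def extract_urls(text):
    """Return the URLs in a tweet, with trailing punctuation stripped."""
    return [match.rstrip(".,;:!?)") for match in URL_PATTERN.findall(text)]


def mentions_term(text):
    """True if the tweet contains at least one monitored term."""
    lowered = text.lower()
    return any(term in lowered for term in TERMS)


def popular_urls(tweets, top_n=10):
    """Count URLs across tweets that mention a monitored term, most frequent first."""
    counts = Counter()
    for text in tweets:
        if mentions_term(text):
            counts.update(extract_urls(text))
    return counts.most_common(top_n)
```

Calling `popular_urls` on a batch of tweet texts returns `(url, count)` pairs, which is the same ranking the dashboard surfaces in near real time.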
8. Demo
Anyone want to send a tweet? Just use
one or more of the words “Hadoop”,
“Lucidworks”, “Big Data”
Add any URL to the tweet that you’d
like to share. Try:
www.avalonconsult.com or
www.lucidworks.com
9. So much potential
You can apply this to so many things.
Do intelligent entity extraction to discover
topics with Solr's UIMA integration
Do similar analysis of popular mentions
and people for the topics of your choice
Endless …
Any questions?
10. Team
Client Implementation done by Kevin
Risden @ Avalon
(risdenk@avalonconsult.com)
Demo Architecture Team
Varun Rao @ Avalon
(raov@avalonconsult.com)
Pritesh Patel @ Avalon
(patelp@avalonconsult.com)
Editor's notes
We’ve all seen this.
You see search showing up there, but what does that really mean?
--Is it push or is it pull?
Well we have multiple options
--Directly from ingestion, you can send to Solr with the respective serializer classes.
--HBase is interesting. It’s the NoSQL, table-oriented store on top of HDFS
--Notice that all of these are pushes. I haven’t included pull options yet, but they do exist.
--One thing to note, however, is that HBase does have a web access layer where you can make RESTful calls to grab data.
Complementary
= Intelligence system of large textual data sets
--HBase is the NoSQL store on top of HDFS
--It is distributed, with a Master and RegionServers
--There is an open source project called the HBase Indexer that creates a façade
Most importantly, you can store data in HDFS and search it with Solr without storing it in Solr, taking advantage of the strengths of both.
This is what the architecture of this setup looks like.
--Our data source is Twitter.
--Flume is serializing it and writing directly to HBase
--HBase is set up with a façade replication that, behind the scenes, is an indexer to Solr
--Then we are using SiLK (i.e. Banana) to visualize the data that comes through
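Once the tweets are indexed, one way to ask Solr which URLs are mentioned most is a facet query. The sketch below is an assumption-laden illustration, not the dashboard's actual query: the collection name `tweets` and the field name `urls` are hypothetical, and the Solr base URL is a placeholder. The request parameters (`facet`, `facet.field`, `facet.limit`, `facet.mincount`, `rows`, `wt`) are standard Solr parameters, and the parser assumes Solr's default JSON facet layout, a flat list alternating value and count.

```python
import urllib.parse


def facet_query_url(solr_base, collection, field, limit=10):
    """Build a Solr select URL that returns only facet counts for one field."""
    params = {
        "q": "*:*",           # match every indexed tweet
        "rows": 0,            # no documents needed, only the facet counts
        "facet": "true",
        "facet.field": field,
        "facet.limit": limit,
        "facet.mincount": 1,  # skip values that never occur
        "wt": "json",
    }
    return "{}/{}/select?{}".format(
        solr_base.rstrip("/"), collection, urllib.parse.urlencode(params)
    )


def facet_counts(response, field):
    """Turn Solr's flat [value, count, value, count, ...] list into pairs."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))
```

For example, `facet_query_url("http://localhost:8983/solr", "tweets", "urls")` yields a URL that can be fetched with any HTTP client, and `facet_counts` converts the decoded JSON response into `(url, count)` pairs ready for charting.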
You can apply this type of architecture to many use cases …