INTEGRATE SOLR WITH REAL-TIME STREAM
PROCESSING APPLICATIONS
Timothy Potter
@thelabdude
linkedin.com/thelabdude
whoami
independent consultant search / big data projects
soon to be joining engineering team @LucidWorks
co-author Solr In Action
previously big data architect Dachis Group
my storm story
re-designed a complex batch-oriented indexing
pipeline based on Hadoop (Oozie, Pig, Hive, Sqoop)
to real-time storm topology
agenda
walk through how to develop a storm topology
common integration points with Solr
(near real-time indexing, percolator, real-time get)
example
listen to click events from 1.usa.gov URL shortener
(bit.ly) to determine trending US government sites
stream of click events:
http://developer.usa.gov/1usagov
http://www.smartgrid.gov -> http://1.usa.gov/ayu0Ru
beyond word count
tackle real challenges you’ll encounter when
developing a storm topology
and what about ... unit testing, dependency injection,
measure runtime behavior of your components, separation of
concerns, reducing boilerplate, hiding complexity ...
storm
open source distributed computation system
scalability, fault-tolerance, guaranteed message
processing (optional)
storm primitives
• tuple: ordered list of values
• stream: unbounded sequence of tuples
• spout: emit a stream of tuples (source)
• bolt: performs some operation on each tuple
• topology: DAG of spouts and bolts
solution requirements
• receive click events from 1.usa.gov stream
• count frequency of pages in a time window
• rank top N sites per time window
• extract title, body text, image for each link
• persist rankings and metadata for visualization
trending snapshot (sept 12, 2013)
[diagram: trending topology. Components: 1.usa.gov Spout, EnrichLink Bolt, Solr Indexing Bolt, Rolling Count Bolt, Intermediate Rankings Bolt, Total Rankings Bolt, Persist Rankings Bolt. The rolling count and rankings bolts are provided by the storm-starter project. Data stores: Solr, Metrics DB. External service: embed.ly API. Stream groupings shown: field grouping on bit.ly hash, field grouping on obj, global grouping. Legend: data store, bolt, spout, stream grouping]
stream grouping
• shuffle: random distribution of tuples to all instances of a bolt
• field(s): group tuples by one or more fields in common
• global: reduce down to one
• all: replicate stream to all instances of a bolt
source: https://github.com/nathanmarz/storm/wiki/Concepts
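To make the fields grouping concrete, here is a minimal, Storm-free sketch of the idea: tuples that share a grouping key always reach the same bolt instance. The class name FieldsGroupingSketch, the method taskFor, and the hash-modulo scheme are illustrative assumptions, not Storm's actual implementation.

```java
import java.util.Objects;

// Conceptual sketch only: maps a grouping key to one of N bolt tasks.
// The hash-modulo scheme is an assumption for illustration.
class FieldsGroupingSketch {

    // The same key always yields the same task index in [0, numTasks).
    static int taskFor(String groupingKey, int numTasks) {
        if (numTasks <= 0) {
            throw new IllegalArgumentException("numTasks must be positive");
        }
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(Objects.hashCode(groupingKey), numTasks);
    }

    public static void main(String[] args) {
        // "2BktiW" is the sample global bit.ly hash from the deck's test data
        int first = taskFor("2BktiW", 3);
        int second = taskFor("2BktiW", 3);
        System.out.println("same task for same key: " + (first == second));
    }
}
```

This property is what lets a bolt keep per-key state, such as a local cache or a counter, without coordinating with its sibling instances.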
useful storm concepts
• bolts can receive input from many spouts
• tuples in a stream can be grouped together
• streams can be split and joined
• bolts can inject new tuples into the stream
• components can be distributed across a cluster at a
configurable parallelism level
• optionally, storm keeps track of each tuple emitted by a spout
(ack or fail)
tools
• Spring framework – dependency injection, configuration, unit
testing, mature, etc.
• Groovy – keeps your code tidy and elegant
• Mockito – ignore stuff your test doesn’t care about
• Netty – fast & powerful NIO networking library
• Coda Hale metrics – get visibility into how your bolts and
spouts are performing (at a very low-level)
spout
easy! just produce a stream of tuples ...
and ... avoid blocking when waiting for more data, ease off throttle if topology
is not processing fast enough, deal with failed tuples, choose if it should use
message Ids for each tuple emitted, data model / schema, etc ...
[diagram: SpringSpout wraps a StreamingDataProvider (POJO) and SpringBolt wraps a StreamingDataAction (POJO), inside a Spring container (1 per topology per JVM). Spring dependency injection supplies collaborators such as JDBC or a WebService. The wrappers hide the complexity of implementing the Storm contract, so the developer focuses on business logic.]
streaming data provider
class OneUsaGovStreamingDataProvider implements StreamingDataProvider, MessageHandler {
  // injected by Spring (dependency injection)
  MessageStream messageStream
  ...
  void open(Map stormConf) { messageStream.receive(this) }
  boolean next(NamedValues nv) {
    // non-blocking call to get the next message from 1.usa.gov
    String msg = queue.poll()
    if (msg) {
      // use the Jackson JSON parser to create an object from the raw incoming data
      OneUsaGovRequest req = objectMapper.readValue(msg, OneUsaGovRequest)
      if (req != null && req.globalBitlyHash != null) {
        nv.set(OneUsaGovTopology.GLOBAL_BITLY_HASH, req.globalBitlyHash)
        nv.set(OneUsaGovTopology.JSON_PAYLOAD, req)
        return true
      }
    }
    return false
  }
  // called by the message stream for each raw message received
  void handleMessage(String msg) { queue.offer(msg) }
}
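The hand-off between the network thread and the spout thread can be sketched without Storm or Spring. The following is a minimal illustration, assuming a ConcurrentLinkedQueue as the buffer; the class QueueBackedProviderSketch and the "payload" key are hypothetical names, not part of the deck's code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-in for a streaming data provider; not the deck's actual class.
class QueueBackedProviderSketch {

    // Thread-safe and non-blocking: the network thread offers, the spout thread polls.
    private final Queue<String> queue = new ConcurrentLinkedQueue<>();

    // Called by the message source whenever a raw message arrives.
    void handleMessage(String msg) {
        queue.offer(msg);
    }

    // Called by the spout; fills 'out' and returns true only if a message was waiting.
    boolean next(Map<String, Object> out) {
        String msg = queue.poll(); // returns null immediately when the queue is empty
        if (msg == null || msg.isEmpty()) {
            return false;
        }
        out.put("payload", msg);
        return true;
    }

    public static void main(String[] args) {
        QueueBackedProviderSketch provider = new QueueBackedProviderSketch();
        Map<String, Object> values = new HashMap<>();
        System.out.println(provider.next(values)); // nothing queued yet
        provider.handleMessage("{\"g\": \"2BktiW\"}");
        System.out.println(provider.next(values));
    }
}
```

Returning false instead of waiting keeps the spout's calling thread free. An unbounded queue can grow without limit if the topology falls behind, so a real provider would likely need a bound or a throttle.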
jackson json to java
@JsonIgnoreProperties(ignoreUnknown = true)
class OneUsaGovRequest implements Serializable {
@JsonProperty("a")
String userAgent;
@JsonProperty("c")
String countryCode;
@JsonProperty("nk")
int knownUser;
@JsonProperty("g")
String globalBitlyHash;
@JsonProperty("h")
String encodingUserBitlyHash;
@JsonProperty("l")
String encodingUserLogin;
...
}
Spring converts json to java object for you:
<bean id="restTemplate"
class="org.springframework.web.client.RestTemplate">
<property name="messageConverters">
<list>
<bean id="messageConverter"
class="...json.MappingJackson2HttpMessageConverter">
</bean>
</list>
</property>
</bean>
spout data provider spring-managed bean
<bean id="oneUsaGovStreamingDataProvider"
class="com.bigdatajumpstart.storm.OneUsaGovStreamingDataProvider">
<property name="messageStream">
<bean class="com.bigdatajumpstart.netty.HttpClient">
<constructor-arg index="0" value="${streamUrl}"/>
</bean>
</property>
</bean>
Note: when building the StormTopology to submit to Storm, you do:
builder.setSpout("1.usa.gov-spout",
  new SpringSpout("oneUsaGovStreamingDataProvider", spoutFields), 1)
spout data provider unit test
class OneUsaGovStreamingDataProviderTest extends StreamingDataProviderTestBase {
  @Test
  void testDataProvider() {
    // mock json to simulate data from 1.usa.gov feed
    String jsonStr = '''{
      "a": "user-agent", "c": "US",
      "nk": 0, "tz": "America/Los_Angeles",
      "gr": "OR", "g": "2BktiW",
      "h": "12Me4B2", "l": "usairforce",
      "al": "en-us", "hh": "1.usa.gov",
      "r": "http://example.com/foo",
      ...
    }'''
    OneUsaGovStreamingDataProvider dataProvider = new OneUsaGovStreamingDataProvider()
    // use Mockito to satisfy dependencies not needed for this test
    dataProvider.setMessageStream(mock(MessageStream))
    dataProvider.open(stormConf) // Config setup in base class
    dataProvider.handleMessage(jsonStr)
    NamedValues record = new NamedValues(OneUsaGovTopology.spoutFields)
    // asserts to verify data provider works correctly
    assertTrue dataProvider.next(record)
    ...
  }
}
rolling count bolt
• counts frequency of links in a sliding time window
• emits topN in current window every M seconds
• uses tick tuple trick provided by Storm to emit every
M seconds (configurable)
• provided with the storm-starter project
http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
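The sliding-window idea can be sketched in plain Java. This is a simplified, single-threaded illustration that assumes the window is split into a fixed number of slots and that advance() is called once per tick. The class SlidingWindowCounterSketch is hypothetical and is not the storm-starter implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified slot-based sliding window counter (illustrative only).
class SlidingWindowCounterSketch {

    private final int numSlots;
    private final Map<String, long[]> countsByKey = new HashMap<>();
    private int headSlot = 0;

    SlidingWindowCounterSketch(int numSlots) {
        if (numSlots < 1) {
            throw new IllegalArgumentException("numSlots must be >= 1");
        }
        this.numSlots = numSlots;
    }

    // Record one occurrence of key in the current (head) slot.
    void increment(String key) {
        countsByKey.computeIfAbsent(key, k -> new long[numSlots])[headSlot]++;
    }

    // Total per key across all slots currently in the window.
    Map<String, Long> totals() {
        Map<String, Long> totals = new HashMap<>();
        for (Map.Entry<String, long[]> e : countsByKey.entrySet()) {
            long sum = 0;
            for (long c : e.getValue()) {
                sum += c;
            }
            totals.put(e.getKey(), sum);
        }
        return totals;
    }

    // Call on every tick: move to the next slot and clear it,
    // which drops the oldest counts from the window.
    void advance() {
        headSlot = (headSlot + 1) % numSlots;
        for (long[] slots : countsByKey.values()) {
            slots[headSlot] = 0;
        }
        // forget keys that no longer have any counts in the window
        countsByKey.values().removeIf(slots -> {
            for (long c : slots) {
                if (c != 0) {
                    return false;
                }
            }
            return true;
        });
    }
}
```

With two slots, a count recorded in one slot survives exactly one call to advance() and is dropped by the second, which is the behavior a time window of two tick intervals needs.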
enrich link metadata bolt
• calls out to embed.ly API
• caches results locally in the bolt instance
• relies on field grouping (incoming tuples)
• outputs data to be indexed in Solr
• benefits from parallelism to enrich more links
concurrently (watch those rate limits)
embed.ly service
class EmbedlyService {
  @Autowired
  RestTemplate restTemplate
  String apiKey
  private Timer apiTimer = MetricsSupport.timer(EmbedlyService, "apiCall")
  Embedly getLinkMetadata(String link) {
    String urlEncoded = URLEncoder.encode(link,"UTF-8")
    URI uri = new URI("https://api.embed.ly/1/oembed?key=${apiKey}&url=${urlEncoded}")
    Embedly embedly = null
    // simple closure to time our requests to the Web service
    MetricsSupport.withTimer(apiTimer, {
      embedly = restTemplate.getForObject(uri, Embedly)
    })
    return embedly
  }
}
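The timing-closure pattern can be illustrated with the Java standard library alone. This sketch assumes a hand-rolled timer that only tracks a call count and total elapsed time. TimedCallSketch is a hypothetical name and does not reproduce the Coda Hale Timer or the deck's MetricsSupport helper.

```java
import java.util.function.Supplier;

// Illustrative stand-in for a metrics timer: counts calls and sums elapsed time.
class TimedCallSketch {

    private long calls;
    private long totalNanos;

    // Runs the given work, recording its duration even if it throws.
    <T> T withTimer(Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            calls++;
            totalNanos += System.nanoTime() - start;
        }
    }

    long getCalls() {
        return calls;
    }

    // Mean duration in milliseconds, or 0 if nothing has been timed yet.
    double meanMillis() {
        return calls == 0 ? 0.0 : (totalNanos / 1_000_000.0) / calls;
    }

    public static void main(String[] args) {
        TimedCallSketch timer = new TimedCallSketch();
        String result = timer.withTimer(() -> "metadata for http://example.com/foo");
        System.out.println(result + " (calls=" + timer.getCalls() + ")");
    }
}
```

Recording in a finally block means failed calls are timed as well, which matters when the slow requests are the ones that time out.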
integrate coda hale metrics
• capture runtime behavior of the components in your
topology
• Coda Hale metrics - http://metrics.codahale.com/
• output metrics every N minutes
• report metrics to JMX, ganglia, graphite, etc
metrics
-- Meters ----------------------------------------------------------------------
EnrichLinkBoltLogic.solrQueries
count = 97
mean rate = 0.81 events/second
1-minute rate = 0.89 events/second
5-minute rate = 1.62 events/second
15-minute rate = 1.86 events/second
SolrBoltLogic.linksIndexed
count = 60
mean rate = 0.50 events/second
1-minute rate = 0.41 events/second
5-minute rate = 0.16 events/second
15-minute rate = 0.06 events/second
-- Timers ----------------------------------------------------------------------
EmbedlyService.apiCall
count = 60
mean rate = 0.50 calls/second
1-minute rate = 0.40 calls/second
5-minute rate = 0.16 calls/second
15-minute rate = 0.06 calls/second
min = 138.70 milliseconds
max = 7642.92 milliseconds
mean = 1148.29 milliseconds
stddev = 1281.40 milliseconds
median = 652.83 milliseconds
75% <= 1620.96 milliseconds
...
storm cluster concepts
• nimbus: master node (~job tracker in Hadoop)
• zookeeper: cluster management / coordination
• supervisor: one per node in the cluster to manage worker
processes
• worker: one or more per supervisor (JVM process)
• executor: thread in worker
• task: work performed by a spout or bolt
[diagram: Nimbus, Zookeeper and the topology JAR sit above the nodes (Node 1 ... M nodes). Each node runs a Supervisor (1 per node) and its workers (Worker 1 on port 6701 ... N workers). Each worker is a JVM process running executors (threads). Each component (spout or bolt) is distributed across a cluster of workers based on a configurable parallelism.]
@Override
StormTopology build(StreamingApp app) throws Exception {
...
TopologyBuilder builder = new TopologyBuilder()
builder.setSpout("1.usa.gov-spout",
new SpringSpout("oneUsaGovStreamingDataProvider", spoutFields), 1)
builder.setBolt("enrich-link-bolt",
new SpringBolt("enrichLinkAction", enrichedLinkFields), 3)
.fieldsGrouping("1.usa.gov-spout", globalBitlyHashGrouping)
...
The last argument to setSpout / setBolt (1 and 3 above) is a parallelism hint to the framework; it can be rebalanced.
solr integration points
• real-time get
• near real-time indexing (NRT)
• percolate (match incoming docs to pre-existing
queries)
real-time get
use Solr for fast lookups by document ID
class SolrClient {
@Autowired
SolrServer solrServer
SolrDocument get(String docId, String... fields) {
SolrQuery q = new SolrQuery()
q.setRequestHandler("/get")
q.set("id", docId)
q.setFields(fields)
QueryRequest req = new QueryRequest(q)
req.setResponseParser(new BinaryResponseParser())
QueryResponse rsp = req.process(solrServer)
return (SolrDocument)rsp.getResponse().get("doc")
}
}
q.setRequestHandler("/get") sends the request to the "get" request handler
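The same lookup can also be sent over plain HTTP. The sketch below only builds the request URL, assuming the real-time get handler is mounted at /get and accepts the id and fl parameters. The base URL, collection name, and class name RealTimeGetUrlSketch are placeholders for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Builds a real-time get URL; assumes the handler is registered at "/get".
class RealTimeGetUrlSketch {

    static String buildUrl(String baseUrl, String docId, String... fields) {
        try {
            StringBuilder sb = new StringBuilder(baseUrl)
                    .append("/get?id=")
                    .append(URLEncoder.encode(docId, "UTF-8"));
            if (fields.length > 0) {
                // fl takes a comma-separated list of fields to return
                sb.append("&fl=").append(URLEncoder.encode(String.join(",", fields), "UTF-8"));
            }
            return sb.toString();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // placeholder host and collection name
        System.out.println(buildUrl("http://localhost:8983/solr/collection1", "2BktiW", "id", "title"));
    }
}
```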
near real-time indexing
• If possible, use CloudSolrServer to route documents directly
to the correct shard leaders (SOLR-4816)
• Use <openSearcher>false</openSearcher> for auto “hard”
commits
• Use auto soft commits as needed
• Use parallelism of Storm bolt to distribute indexing work to N
nodes
http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
percolate
• match incoming documents to pre-configured
queries (inverted search)
– example: Is this tweet related to campaign Y for brand X?
• use storm’s distributed computation support to
evaluate M pre-configured queries per doc
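A toy version of inverted search shows the shape of the problem: the queries are stored up front and each incoming document is tested against all of them. This sketch assumes each query is just a set of required terms (AND semantics) and uses naive tokenization. It is not Lucene's MemoryIndex or Solr, and PercolatorSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy percolator: each stored query is a set of required terms (AND semantics).
class PercolatorSketch {

    // Insertion-ordered so matches come back in registration order.
    private final Map<String, Set<String>> queries = new LinkedHashMap<>();

    void register(String queryId, String... requiredTerms) {
        Set<String> terms = new HashSet<>();
        for (String t : requiredTerms) {
            terms.add(t.toLowerCase(Locale.ROOT));
        }
        queries.put(queryId, terms);
    }

    // Returns the ids of all stored queries that the document satisfies.
    List<String> percolate(String docText) {
        // naive tokenization: lowercase, split on non-word characters
        Set<String> docTerms = new HashSet<>(
                Arrays.asList(docText.toLowerCase(Locale.ROOT).split("\\W+")));
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Set<String>> q : queries.entrySet()) {
            if (docTerms.containsAll(q.getValue())) {
                matches.add(q.getKey());
            }
        }
        return matches;
    }
}
```

The cost per document grows with the number of stored queries, which is why spreading documents across many bolt instances helps throughput.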
two possible approaches
• Lucene-only solution using MemoryIndex
– See presentation by Charlie Hull and Alan Woodward
• EmbeddedSolrServer
– Full solrconfig.xml / schema.xml
– RAMDirectory
– Relies on Storm to scale up documents / second
– Easy solution for up to a few thousand queries
[diagram: a Twitter Spout distributes tuples by random stream grouping to PercolatorBolt 1 ... PercolatorBolt N (could be 100's of these), each with its own EmbeddedSolrServer. Pre-configured queries are stored in a database; ZeroMQ pub/sub pushes query changes to the percolators.]
tick tuples
• send a special kind of tuple to a bolt every N
seconds
if (TupleHelpers.isTickTuple(input)) {
// do special work
}
used in percolator to delete accumulated documents every minute or so ...
references
• Storm Wiki
• https://github.com/nathanmarz/storm/wiki/Documentation
• Overview: Krishna Gade
• http://www.slideshare.net/KrishnaGade2/storm-at-twitter
• Trending Topics: Michael Noll
• http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
• Understanding Parallelism: Michael Noll
• http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
get the code: https://github.com/thelabdude/lsrdublin
Q & A
Manning coupon codes for conference related books:
http://deals.manningpublications.com/RevolutionsEU2013.html

Integrate Solr with real-time stream processing applications

  • 1.
  • 2. INTEGRATE SOLR WITH REAL-TIME STREAM PROCESSING APPLICATIONS Timothy Potter @thelabdude linkedin.com/thelabdude
  • 3. whoami independent consultant search / big data projects soon to be joining engineering team @LucidWorks co-author Solr In Action previously big data architect Dachis Group
  • 4. my storm story re-designed a complex batch-oriented indexing pipeline based on Hadoop (Oozie, Pig, Hive, Sqoop) to real-time storm topology
  • 5. agenda walk through how to develop a storm topology common integration points with Solr (near real-time indexing, percolator, real-time get)
  • 6. example listen to click events from 1.usa.gov URL shortener (bit.ly) to determine trending US government sites stream of click events: http://developer.usa.gov/1usagov http://www.smartgrid.gov -> http://1.usa.gov/ayu0Ru
  • 7. beyond word count tackle real challenges you’ll encounter when developing a storm topology and what about ... unit testing, dependency injection, measure runtime behavior of your components, separation of concerns, reducing boilerplate, hiding complexity ...
  • 8. storm open source distributed computation system scalability, fault-tolerance, guaranteed message processing (optional)
  • 9. storm primitives • tuple: ordered list of values • stream: unbounded sequence of tuples • spout: emit a stream of tuples (source) • bolt: performs some operation on each tuple • topology: dag of spouts and tuples
  • 10. solution requirements • receive click events from 1.usa.gov stream • count frequency of pages in a time window • rank top N sites per time window • extract title, body text, image for each link • persist rankings and metadata for visualization
  • 12.
  • 14. stream grouping • shuffle: random distribution of tuples to all instances of a bolt • field(s): group tuples by one or more fields in common • global: reduce down to one • all: replicate stream to all instances of a bolt source: https://github.com/nathanmarz/storm/wiki/Concepts
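The four groupings can be read as rules for choosing which bolt task receives a tuple. The sketch below is a stand-alone illustration using only the JDK; it is not Storm's implementation, and the class and method names (`GroupingSketch`, `shuffle`, `fields`, `global`, `all`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.Random;

// Stand-alone illustration of the four groupings; NOT Storm's own code.
final class GroupingSketch {
    private GroupingSketch() {}

    // shuffle: pick any task at random so load spreads evenly.
    static int shuffle(int numTasks, Random rng) {
        return rng.nextInt(numTasks);
    }

    // field(s): equal field values always map to the same task index.
    static int fields(int numTasks, Object... fieldValues) {
        return Math.floorMod(Objects.hash(fieldValues), numTasks);
    }

    // global: the whole stream is reduced down to a single task.
    static int global() {
        return 0;
    }

    // all: every task receives its own copy of the tuple.
    static List<Integer> all(int numTasks) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            targets.add(i);
        }
        return targets;
    }
}
```

For example, `GroupingSketch.fields(3, "2BktiW")` returns the same task index every time it is called with that hash, which is the property the field groupings on the bit.ly hash rely on.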
  • 15. useful storm concepts • bolts can receive input from many spouts • tuples in a stream can be grouped together • streams can be split and joined • bolts can inject new tuples into the stream • components can be distributed across a cluster at a configurable parallelism level • optionally, storm keeps track of each tuple emitted by a spout (ack or fail)
  • 16. tools • Spring framework – dependency injection, configuration, unit testing, mature, etc. • Groovy – keeps your code tidy and elegant • Mockito – ignore stuff your test doesn’t care about • Netty – fast & powerful NIO networking library • Coda Hale metrics – get visibility into how your bolts and spouts are performing (at a very low-level)
  • 17. spout easy! just produce a stream of tuples ... and ... avoid blocking when waiting for more data, ease off throttle if topology is not processing fast enough, deal with failed tuples, choose if it should use message Ids for each tuple emitted, data model / schema, etc ...
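Several of these spout concerns (not blocking, easing off when the topology falls behind, replaying failed tuples by message ID) can be sketched with a small bookkeeping class. This is a minimal JDK-only sketch under assumptions: `PendingMessages` and its methods are hypothetical names, not part of Storm or of the talk's code, and a real spout would call something like it from its next-tuple, ack, and fail callbacks.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical helper (not a Storm class): tracks in-flight messages by ID.
final class PendingMessages {
    private final Queue<String> incoming = new ArrayDeque<>();
    private final Map<Long, String> inFlight = new HashMap<>();
    private final int maxInFlight;
    private long nextId = 0;

    PendingMessages(int maxInFlight) {
        this.maxInFlight = maxInFlight;
    }

    // Called by whatever receives raw data from the source.
    void offer(String msg) {
        incoming.add(msg);
    }

    // Non-blocking: returns a message ID to emit with, or null when there is
    // nothing to emit or too many messages are still unacknowledged.
    Long next() {
        if (inFlight.size() >= maxInFlight) {
            return null; // ease off the throttle
        }
        String msg = incoming.poll(); // returns null immediately when empty
        if (msg == null) {
            return null;
        }
        long id = nextId++;
        inFlight.put(id, msg);
        return id;
    }

    String payload(long id) {
        return inFlight.get(id);
    }

    // Fully processed downstream: forget the message.
    void ack(long id) {
        inFlight.remove(id);
    }

    // Failed downstream: put the message back on the queue for replay.
    void fail(long id) {
        String msg = inFlight.remove(id);
        if (msg != null) {
            incoming.add(msg);
        }
    }

    int inFlightCount() {
        return inFlight.size();
    }
}
```

Capping the number of in-flight messages is one simple way to throttle; whether replay on failure is appropriate depends on whether downstream processing tolerates duplicates.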
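  • 18. [Diagram] SpringSpout delegates to a StreamingDataProvider (POJO); SpringBolt delegates to a StreamingDataAction (POJO). Both POJOs live in a Spring container (1 per topology per JVM) and receive collaborators such as JDBC and WebService clients through Spring dependency injection. This hides the complexity of implementing the Storm contract, so the developer focuses on business logic.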
  • 19. streaming data provider class OneUsaGovStreamingDataProvider implements StreamingDataProvider, MessageHandler { MessageStream messageStream ... void open(Map stormConf) { messageStream.receive(this) } boolean next(NamedValues nv) { String msg = queue.poll() if (msg) { OneUsaGovRequest req = objectMapper.readValue(msg, OneUsaGovRequest) if (req != null && req.globalBitlyHash != null) { nv.set(OneUsaGovTopology.GLOBAL_BITLY_HASH, req.globalBitlyHash) nv.set(OneUsaGovTopology.JSON_PAYLOAD, req) return true } } return false } void handleMessage(String msg) { queue.offer(msg) } } Callouts: Spring dependency injection (messageStream); non-blocking call to get the next message from 1.usa.gov (queue.poll()); use the Jackson JSON parser to create an object from the raw incoming data (objectMapper.readValue).
  • 20. jackson json to java @JsonIgnoreProperties(ignoreUnknown = true) class OneUsaGovRequest implements Serializable { @JsonProperty("a") String userAgent; @JsonProperty("c") String countryCode; @JsonProperty("nk") int knownUser; @JsonProperty("g") String globalBitlyHash; @JsonProperty("h") String encodingUserBitlyHash; @JsonProperty("l") String encodingUserLogin; ... } Spring converts JSON to a Java object for you: <bean id="restTemplate" class="org.springframework.web.client.RestTemplate"> <property name="messageConverters"> <list> <bean id="messageConverter" class="...json.MappingJackson2HttpMessageConverter"> </bean> </list> </property> </bean>
  • 21. spout data provider spring-managed bean <bean id="oneUsaGovStreamingDataProvider" class="com.bigdatajumpstart.storm.OneUsaGovStreamingDataProvider"> <property name="messageStream"> <bean class="com.bigdatajumpstart.netty.HttpClient"> <constructor-arg index="0" value="${streamUrl}"/> </bean> </property> </bean> Note: when building the StormTopology to submit to Storm, you do: builder.setSpout("1.usa.gov-spout", new SpringSpout("oneUsaGovStreamingDataProvider", spoutFields), 1)
  • 22. spout data provider unit test class OneUsaGovStreamingDataProviderTest extends StreamingDataProviderTestBase { @Test void testDataProvider() { String jsonStr = '''{ "a": "user-agent", "c": "US", "nk": 0, "tz": "America/Los_Angeles", "gr": "OR", "g": "2BktiW", "h": "12Me4B2", "l": "usairforce", "al": "en-us", "hh": "1.usa.gov", "r": "http://example.com/foo", ... }''' OneUsaGovStreamingDataProvider dataProvider = new OneUsaGovStreamingDataProvider() dataProvider.setMessageStream(mock(MessageStream)) dataProvider.open(stormConf) // Config setup in base class dataProvider.handleMessage(jsonStr) NamedValues record = new NamedValues(OneUsaGovTopology.spoutFields) assertTrue dataProvider.next(record) ... } } Callouts: mock JSON to simulate data from the 1.usa.gov feed; use Mockito to satisfy dependencies not needed for this test; asserts to verify the data provider works correctly.
  • 23. rolling count bolt • counts frequency of links in a sliding time window • emits topN in current window every M seconds • uses tick tuple trick provided by Storm to emit every M seconds (configurable) • provided with the storm-starter project http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
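The sliding-window idea behind the rolling count can be sketched with a ring of per-slot counters: every tick advances the ring and wipes the slot that just fell out of the window. This is a simplified JDK-only sketch, not the storm-starter implementation; `RollingCounter` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified slot-based sliding window counter; NOT the storm-starter code.
final class RollingCounter {
    private final List<Map<String, Long>> slots = new ArrayList<>();
    private int head = 0;

    RollingCounter(int numSlots) {
        for (int i = 0; i < numSlots; i++) {
            slots.add(new HashMap<>());
        }
    }

    // Count one occurrence in the current (newest) slot.
    void increment(String key) {
        slots.get(head).merge(key, 1L, Long::sum);
    }

    // Called on each tick: move to the next slot and wipe its old contents,
    // which drops the oldest counts out of the window.
    void advance() {
        head = (head + 1) % slots.size();
        slots.get(head).clear();
    }

    // Totals across every slot still inside the window.
    Map<String, Long> totals() {
        Map<String, Long> sum = new HashMap<>();
        for (Map<String, Long> slot : slots) {
            slot.forEach((k, v) -> sum.merge(k, v, Long::sum));
        }
        return sum;
    }

    // Keys with the highest totals, ties broken alphabetically.
    List<String> topN(int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(totals().entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

With two slots and one tick per M seconds, the window covers roughly the last 2 × M seconds; more slots give a smoother window at the cost of more memory.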
  • 24. enrich link metadata bolt • calls out to embed.ly API • caches results locally in the bolt instance • relies on field grouping (incoming tuples) • outputs data to be indexed in Solr • benefits from parallelism to enrich more links concurrently (watch those rate limits)
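Local caching pays off here because field grouping sends every tuple with the same bit.ly hash to the same bolt instance. A bounded least-recently-used cache is one way to do it; the sketch below assumes single-threaded access from one bolt executor, and `LinkMetadataCache` is a hypothetical name, not a class from the talk's code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical bounded LRU cache for per-link metadata; not thread-safe.
final class LinkMetadataCache<V> {
    private final Map<String, V> cache;

    LinkMetadataCache(final int maxEntries) {
        // accessOrder=true keeps the least recently used entry first,
        // so removeEldestEntry evicts it once the cache is over capacity.
        this.cache = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Returns the cached value, calling the loader (e.g. a web service
    // client) only on a cache miss.
    V get(String key, Function<String, V> loader) {
        V value = cache.get(key);
        if (value == null) {
            value = loader.apply(key);
            cache.put(key, value);
        }
        return value;
    }

    int entryCount() {
        return cache.size();
    }
}
```

Each cache hit is one fewer call against the external API's rate limit.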
  • 25. embed.ly service class EmbedlyService { @Autowired RestTemplate restTemplate String apiKey private Timer apiTimer = MetricsSupport.timer(EmbedlyService, "apiCall") Embedly getLinkMetadata(String link) { String urlEncoded = URLEncoder.encode(link,"UTF-8") URI uri = new URI("https://api.embed.ly/1/oembed?key=${apiKey}&url=${urlEncoded}") Embedly embedly = null MetricsSupport.withTimer(apiTimer, { embedly = restTemplate.getForObject(uri, Embedly) }) return embedly } } Callouts: integrate Coda Hale metrics (apiTimer); simple closure to time our requests to the Web service (MetricsSupport.withTimer).
  • 26. metrics • capture runtime behavior of the components in your topology • Coda Hale metrics - http://metrics.codahale.com/ • output metrics every N minutes • report metrics to JMX, ganglia, graphite, etc
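To make the timing idea concrete without pulling in a metrics library, here is a minimal JDK-only sketch of wrapping a call in a timer. It is not the Coda Hale API; `SimpleTimer` and its methods are hypothetical, and a real metrics library additionally tracks rates and percentiles like those shown in the talk's sample report.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a metrics timer; NOT the Coda Hale Timer class.
final class SimpleTimer {
    private long count = 0;
    private long totalNanos = 0;
    private long maxNanos = 0;

    // Runs the work, records how long it took (even if it throws),
    // and passes the result through.
    <T> T time(Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            long elapsed = System.nanoTime() - start;
            count++;
            totalNanos += elapsed;
            maxNanos = Math.max(maxNanos, elapsed);
        }
    }

    long count() {
        return count;
    }

    double meanMillis() {
        return count == 0 ? 0.0 : totalNanos / 1_000_000.0 / count;
    }

    double maxMillis() {
        return maxNanos / 1_000_000.0;
    }
}
```

A call site would look like `timer.time(() -> service.getLinkMetadata(link))`, where `service` is whatever client is being measured.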
  • 27. -- Meters ---------------------------------------------------------------------- EnrichLinkBoltLogic.solrQueries count = 97 mean rate = 0.81 events/second 1-minute rate = 0.89 events/second 5-minute rate = 1.62 events/second 15-minute rate = 1.86 events/second SolrBoltLogic.linksIndexed count = 60 mean rate = 0.50 events/second 1-minute rate = 0.41 events/second 5-minute rate = 0.16 events/second 15-minute rate = 0.06 events/second -- Timers ---------------------------------------------------------------------- EmbedlyService.apiCall count = 60 mean rate = 0.50 calls/second 1-minute rate = 0.40 calls/second 5-minute rate = 0.16 calls/second 15-minute rate = 0.06 calls/second min = 138.70 milliseconds max = 7642.92 milliseconds mean = 1148.29 milliseconds stddev = 1281.40 milliseconds median = 652.83 milliseconds 75% <= 1620.96 milliseconds ...
  • 28. storm cluster concepts • nimbus: master node (~job tracker in Hadoop) • zookeeper: cluster management / coordination • supervisor: one per node in the cluster to manage worker processes • worker: one or more per supervisor (JVM process) • executor: thread in worker • task: work performed by a spout or bolt
  • 29. [Diagram] Nimbus, Zookeeper, and M nodes. Each node (e.g. Node 1) has a Supervisor (1 per node) managing N workers (e.g. Worker 1 on port 6701); each worker is a JVM process containing executors (threads). The Topology JAR is also shown. Each component (spout or bolt) is distributed across a cluster of workers based on a configurable parallelism.
  • 30. @Override StormTopology build(StreamingApp app) throws Exception { ... TopologyBuilder builder = new TopologyBuilder() builder.setSpout("1.usa.gov-spout", new SpringSpout("oneUsaGovStreamingDataProvider", spoutFields), 1) builder.setBolt("enrich-link-bolt", new SpringBolt("enrichLinkAction", enrichedLinkFields), 3) .fieldsGrouping("1.usa.gov-spout", globalBitlyHashGrouping) ... parallelism hint to the framework (can be rebalanced)
  • 31. solr integration points • real-time get • near real-time indexing (NRT) • percolate (match incoming docs to pre-existing queries)
  • 32. real-time get use Solr for fast lookups by document ID class SolrClient { @Autowired SolrServer solrServer SolrDocument get(String docId, String... fields) { SolrQuery q = new SolrQuery() q.setRequestHandler("/get") q.set("id", docId) q.setFields(fields) QueryRequest req = new QueryRequest(q) req.setResponseParser(new BinaryResponseParser()) QueryResponse rsp = req.process(solrServer) return (SolrDocument)rsp.getResponse().get("doc") } } send the request to the “get” request handler
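The same lookup can be expressed as a plain HTTP request to the `/get` handler, which is handy for checking a document by hand. The sketch below only builds the URL; the base URL and core name in the usage note are assumptions for illustration, and `RealTimeGetUrl` is a hypothetical helper.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that builds a real-time get URL for a Solr core.
final class RealTimeGetUrl {
    private RealTimeGetUrl() {}

    // coreUrl is the base URL of the core or collection (an assumption here).
    static String build(String coreUrl, String docId, String... fields) {
        StringBuilder url = new StringBuilder(coreUrl)
                .append("/get?id=")
                .append(URLEncoder.encode(docId, StandardCharsets.UTF_8));
        if (fields.length > 0) {
            // fl limits the fields returned, mirroring q.setFields(fields).
            url.append("&fl=")
               .append(URLEncoder.encode(String.join(",", fields), StandardCharsets.UTF_8));
        }
        return url.toString();
    }
}
```

For instance, `RealTimeGetUrl.build("http://localhost:8983/solr/collection1", "2BktiW", "title", "image")` yields a URL ending in `/get?id=2BktiW&fl=title%2Cimage`.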
  • 33. near real-time indexing • If possible, use CloudSolrServer to route documents directly to the correct shard leaders (SOLR-4816) • Use <openSearcher>false</openSearcher> for auto “hard” commits • Use auto soft commits as needed • Use parallelism of Storm bolt to distribute indexing work to N nodes http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
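The commit advice above maps to two settings in the update handler section of solrconfig.xml. The fragment below is a sketch: the element names are Solr's, but the `maxTime` values (15 seconds for hard commits, 1 second for soft commits) are illustrative assumptions to be tuned for your own latency and durability needs.

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit: flush to stable storage regularly, but do not open
       a new searcher, so it stays cheap. -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit: controls how quickly new documents become visible. -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>
```

A shorter soft-commit interval makes documents searchable sooner at the cost of more frequent searcher reopening.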
  • 34. percolate • match incoming documents to pre-configured queries (inverted search) – example: Is this tweet related to campaign Y for brand X? • use storm’s distributed computation support to evaluate M pre-configured queries per doc
  • 35. two possible approaches • Lucene-only solution using MemoryIndex – See presentation by Charlie Hull and Alan Woodward • EmbeddedSolrServer – Full solrconfig.xml / schema.xml – RAMDirectory – Relies on Storm to scale up documents / second – Easy solution for up to a few thousand queries
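The inverted-search idea can be shown with a toy matcher: queries are registered up front, and each incoming document is tested against all of them. This is a JDK-only stand-in and uses neither Lucene's MemoryIndex nor EmbeddedSolrServer; a "query" here is just a set of terms that must all appear, and `TinyPercolator` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy percolator: a stand-in for MemoryIndex or EmbeddedSolrServer.
final class TinyPercolator {
    // Query ID -> terms that must all be present in a matching document.
    private final Map<String, Set<String>> queries = new LinkedHashMap<>();

    void register(String queryId, String... requiredTerms) {
        Set<String> terms = new HashSet<>();
        for (String term : requiredTerms) {
            terms.add(term.toLowerCase(Locale.ROOT));
        }
        queries.put(queryId, terms);
    }

    // Returns the IDs of every registered query the document matches,
    // in registration order.
    List<String> percolate(String docText) {
        String[] tokens = docText.toLowerCase(Locale.ROOT).split("\\W+");
        Set<String> docTerms = new HashSet<>(Arrays.asList(tokens));
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Set<String>> query : queries.entrySet()) {
            if (docTerms.containsAll(query.getValue())) {
                matches.add(query.getKey());
            }
        }
        return matches;
    }
}
```

The cost per document grows with the number of registered queries, which is why spreading documents across many bolt instances helps throughput.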
  • 36. [Diagram] Twitter Spout → (random stream grouping) → PercolatorBolt 1 ... PercolatorBolt N (could be 100's of these), each with its own EmbeddedSolrServer. Pre-configured queries are stored in a database; ZeroMQ pub/sub pushes query changes to the percolator bolts.
  • 37. tick tuples • send a special kind of tuple to a bolt every N seconds if (TupleHelpers.isTickTuple(input)) { // do special work } used in percolator to delete accumulated documents every minute or so ...
  • 38. references • Storm Wiki • https://github.com/nathanmarz/storm/wiki/Documentation • Overview: Krishna Gade • http://www.slideshare.net/KrishnaGade2/storm-at-twitter • Trending Topics: Michael Noll • http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/ • Understanding Parallelism: Michael Noll • http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
  • 39. get the code: https://github.com/thelabdude/lsrdublin Q & A Manning coupon codes for conference related books: http://deals.manningpublications.com/RevolutionsEU2013.html