SlideShare ist ein Scribd-Unternehmen logo
1 von 39
Downloaden Sie, um offline zu lesen
INTEGRATE SOLR WITH REAL-TIME STREAM
PROCESSING APPLICATIONS
Timothy Potter
@thelabdude
linkedin.com/thelabdude
whoami
independent consultant search / big data projects
soon to be joining engineering team @LucidWorks
co-author Solr In Action
previously big data architect Dachis Group
my storm story
re-designed a complex batch-oriented indexing
pipeline based on Hadoop (Oozie, Pig, Hive, Sqoop)
to real-time storm topology
agenda
walk through how to develop a storm topology
common integration points with Solr
(near real-time indexing, percolator, real-time get)
example
listen to click events from 1.usa.gov URL shortener
(bit.ly) to determine trending US government sites
stream of click events:
http://developer.usa.gov/1usagov
http://www.smartgrid.gov -> http://1.usa.gov/ayu0Ru
beyond word count
tackle real challenges you’ll encounter when
developing a storm topology
and what about ... unit testing, dependency injection,
measure runtime behavior of your components, separation of
concerns, reducing boilerplate, hiding complexity ...
storm
open source distributed computation system
scalability, fault-tolerance, guaranteed message
processing (optional)
storm primitives
• tuple: ordered list of values
• stream: unbounded sequence of tuples
• spout: emit a stream of tuples (source)
• bolt: performs some operation on each tuple
• topology: dag of spouts and tuples
solution requirements
• receive click events from 1.usa.gov stream
• count frequency of pages in a time window
• rank top N sites per time window
• extract title, body text, image for each link
• persist rankings and metadata for visualization
trending snapshot (sept 12, 2013)
Solr
Metrics
DB
EnrichLink
Bolt
Solr
Indexing
Bolt
1.usa.gov
Spout
Rolling
Count
Bolt
Intermediate
Rankings
Bolt
Total
Rankings
Bolt
embed.ly
API
field
grouping
bit.ly hash
field
grouping
bit.ly hash
global
grouping
Persist
Rankings
Bolt
field
grouping
obj
global
grouping
provided by in the
storm-starter project
data store
bolt
spout
stream
grouping
stream grouping
• shuffle: random distribution of tuples to all instances of a bolt
• field(s): group tuples by one or more fields in common
• global: reduce down to one
• all: replicate stream to all instances of a bolt
source: https://github.com/nathanmarz/storm/wiki/Concepts
useful storm concepts
• bolts can receive input from many spouts
• tuples in a stream can be grouped together
• streams can be split and joined
• bolts can inject new tuples into the stream
• components can be distributed across a cluster at a
configurable parallelism level
• optionally, storm keeps track of each tuple emitted by a spout
(ack or fail)
tools
• Spring framework – dependency injection, configuration, unit
testing, mature, etc.
• Groovy – keeps your code tidy and elegant
• Mockito – ignore stuff your test doesn’t care about
• Netty – fast & powerful NIO networking library
• Coda Hale metrics – get visibility into how your bolts and
spouts are performing (at a very low-level)
spout
easy! just produce a stream of tuples ...
and ... avoid blocking when waiting for more data, ease off throttle if topology
is not processing fast enough, deal with failed tuples, choose if it should use
message Ids for each tuple emitted, data model / schema, etc ...
SpringBoltSpringSpout
Streaming
DataAction
(POJO)
Streaming
DataProvider
(POJO)
Spring container (1 per topology per JVM)
Spring
Dependency
Injection
JDBC WebService
Hide complexity
of implementing
Storm contract
developer
focuses on
business
logic
streaming data provider
class OneUsaGovStreamingDataProvider implements StreamingDataProvider, MessageHandler {
MessageStream messageStream
...
void open(Map stormConf) { messageStream.receive(this) }
boolean next(NamedValues nv) {
String msg = queue.poll()
if (msg) {
OneUsaGovRequest req = objectMapper.readValue(msg, OneUsaGovRequest)
if (req != null && req.globalBitlyHash != null) {
nv.set(OneUsaGovTopology.GLOBAL_BITLY_HASH, req.globalBitlyHash)
nv.set(OneUsaGovTopology.JSON_PAYLOAD, req)
return true
}
}
return false
}
void handleMessage(String msg) { queue.offer(msg) }
Spring Dependency Injection
non-blocking call to get the
next message from 1.usa.gov
use Jackson JSON parser
to create an object from the
raw incoming data
jackson json to java
@JsonIgnoreProperties(ignoreUnknown = true)
class OneUsaGovRequest implements Serializable {
@JsonProperty("a")
String userAgent;
@JsonProperty("c")
String countryCode;
@JsonProperty("nk")
int knownUser;
@JsonProperty("g")
String globalBitlyHash;
@JsonProperty("h")
String encodingUserBitlyHash;
@JsonProperty("l")
String encodingUserLogin;
...
}
Spring converts json to java object for you:
<bean id="restTemplate"
class="org.springframework.web.client.RestTemplate">
<property name="messageConverters">
<list>
<bean id="messageConverter”
class="...json.MappingJackson2HttpMessageConverter">
</bean>
</list>
</property>
</bean>
spout data provider spring-managed bean
<bean id="oneUsaGovStreamingDataProvider"
class="com.bigdatajumpstart.storm.OneUsaGovStreamingDataProvider">
<property name="messageStream">
<bean class="com.bigdatajumpstart.netty.HttpClient">
<constructor-arg index="0" value="${streamUrl}"/>
</bean>
</property>
</bean>
builder.setSpout("1.usa.gov-spout",
new SpringSpout("oneUsaGovStreamingDataProvider", spoutFields), 1)
Note: when building the StormTopology to submit to Storm, you do:
class OneUsaGovStreamingDataProviderTest extends StreamingDataProviderTestBase {
@Test
void testDataProvider() {
String jsonStr = '''{
"a": "user-agent", "c": "US",
"nk": 0, "tz": "America/Los_Angeles",
"gr": "OR", "g": "2BktiW",
"h": "12Me4B2", "l": "usairforce",
"al": "en-us", "hh": "1.usa.gov",
"r": "http://example.com/foo",
...
}'''
OneUsaGovStreamingDataProvider dataProvider = new OneUsaGovStreamingDataProvider()
dataProvider.setMessageStream(mock(MessageStream))
dataProvider.open(stormConf) // Config setup in base class
dataProvider.handleMessage(jsonStr)
NamedValues record = new NamedValues(OneUsaGovTopology.spoutFields)
assertTrue dataProvider.next(record)
...
}
}
spout data provider unit test
mock json to simulate
data from 1.usa.gov feed
use Mockito to satisfy
dependencies not needed
for this test
asserts to verify
data provider
works correctly
rolling count bolt
• counts frequency of links in a sliding time window
• emits topN in current window every M seconds
• uses tick tuple trick provided by Storm to emit every
M seconds (configurable)
• provided with the storm-starter project
http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
• calls out to embed.ly API
• caches results locally in the bolt instance
• relies on field grouping (incoming tuples)
• outputs data to be indexed in Solr
• benefits from parallelism to enrich more links
concurrently (watch those rate limits)
enrich link metadata bolt
embed.ly service
class EmbedlyService {
@Autowired
RestTemplate restTemplate
String apiKey
private Timer apiTimer = MetricsSupport.timer(EmbedlyService, "apiCall")
Embedly getLinkMetadata(String link) {
String urlEncoded = URLEncoder.encode(link,"UTF-8")
URI uri = new URI("https://api.embed.ly/1/oembed?key=${apiKey}&url=${urlEncoded}")
Embedly embedly = null
MetricsSupport.withTimer(apiTimer, {
embedly = restTemplate.getForObject(uri, Embedly)
})
return embedly
}
simple closure to time our
requests to the Web service
integrate coda hale metrics
• capture runtime behavior of the components in your
topology
• Coda Hale metrics - http://metrics.codahale.com/
• output metrics every N minutes
• report metrics to JMX, ganglia, graphite, etc
metrics
-- Meters ----------------------------------------------------------------------
EnrichLinkBoltLogic.solrQueries
count = 97
mean rate = 0.81 events/second
1-minute rate = 0.89 events/second
5-minute rate = 1.62 events/second
15-minute rate = 1.86 events/second
SolrBoltLogic.linksIndexed
count = 60
mean rate = 0.50 events/second
1-minute rate = 0.41 events/second
5-minute rate = 0.16 events/second
15-minute rate = 0.06 events/second
-- Timers ----------------------------------------------------------------------
EmbedlyService.apiCall
count = 60
mean rate = 0.50 calls/second
1-minute rate = 0.40 calls/second
5-minute rate = 0.16 calls/second
15-minute rate = 0.06 calls/second
min = 138.70 milliseconds
max = 7642.92 milliseconds
mean = 1148.29 milliseconds
stddev = 1281.40 milliseconds
median = 652.83 milliseconds
75% <= 1620.96 milliseconds
...
storm cluster concepts
• nimbus: master node (~job tracker in Hadoop)
• zookeeper: cluster management / coordination
• supervisor: one per node in the cluster to manage worker
processes
• worker: one or more per supervisor (JVM process)
• executor: thread in worker
• task: work performed by a spout or bolt
Worker 1 (port 6701)
Nimbus
Supervisor (1 per node)
Topology
JAR
Node 1
JVM process
executor
(thread)
... N workers
... M nodes
Each component (spout or bolt)
is distributed across a cluster of
workers based on a configurable
parallelism
Zookeeper
@Override
StormTopology build(StreamingApp app) throws Exception {
...
TopologyBuilder builder = new TopologyBuilder()
builder.setSpout("1.usa.gov-spout",
new SpringSpout("oneUsaGovStreamingDataProvider", spoutFields), 1)
builder.setBolt("enrich-link-bolt",
new SpringBolt("enrichLinkAction", enrichedLinkFields), 3)
.fieldsGrouping("1.usa.gov-spout", globalBitlyHashGrouping)
...
parallelism hint to
the framework
(can be rebalanced)
solr integration points
• real-time get
• near real-time indexing (NRT)
• percolate (match incoming docs to pre-existing
queries)
real-time get
use Solr for fast lookups by document ID
class SolrClient {
@Autowired
SolrServer solrServer
SolrDocument get(String docId, String... fields) {
SolrQuery q = new SolrQuery()
q.setRequestHandler("/get")
q.set("id", docId)
q.setFields(fields)
QueryRequest req = new QueryRequest(q)
req.setResponseParser(new BinaryResponseParser())
QueryResponse rsp = req.process(solrServer)
return (SolrDocument)rsp.getResponse().get("doc")
}
}
send the request to the
“get” request handler
near real-time indexing
• If possible, use CloudSolrServer to route documents directly
to the correct shard leaders (SOLR-4816)
• Use <openSearcher>false</openSearcher> for auto “hard”
commits
• Use auto soft commits as needed
• Use parallelism of Storm bolt to distribute indexing work to N
nodes
http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
percolate
• match incoming documents to pre-configured
queries (inverted search)
– example: Is this tweet related to campaign Y for brand X?
• use storm’s distributed computation support to
evaluate M pre-configured queries per doc
two possible approaches
• Lucene-only solution using MemoryIndex
– See presentation by Charlie Hull and Alan Woodward
• EmbeddedSolrServer
– Full solrconfig.xml / schema.xml
– RAMDirectory
– Relies on Storm to scale up documents / second
– Easy solution for up to a few thousand queries
Twitter
Spout
PercolatorBolt 1
Embedded
SolrServer
Pre-configured
queries stored in
a database
PercolatorBolt N
Embedded
SolrServer
... Could be 100’s of these
random
stream
grouping ZeroMQ
pub/sub to push
query changes
to percolator
tick tuples
• send a special kind of tuple to a bolt every N
seconds
if (TupleHelpers.isTickTuple(input)) {
// do special work
}
used in percolator to delete accumulated documents every minute or so ...
references
• Storm Wiki
• https://github.com/nathanmarz/storm/wiki/Documentation
• Overview: Krishna Gade
• http://www.slideshare.net/KrishnaGade2/storm-at-twitter
• Trending Topics: Michael Knoll
• http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-
trending-topics-in-storm/
• Understanding Parallelism: Michael Knoll
• http://www.michael-noll.com/blog/2012/10/16/understanding-the-
parallelism-of-a-storm-topology/
get the code: https://github.com/thelabdude/lsrdublin
Q & A
Manning coupon codes for conference related books:
http://deals.manningpublications.com/RevolutionsEU2013.html

Weitere ähnliche Inhalte

Was ist angesagt?

Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology
Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis TechnologySimple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology
Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis TechnologyLucidworks
 
ApacheCon NA 2015 Spark / Solr Integration
ApacheCon NA 2015 Spark / Solr IntegrationApacheCon NA 2015 Spark / Solr Integration
ApacheCon NA 2015 Spark / Solr Integrationthelabdude
 
Solr 6 Feature Preview
Solr 6 Feature PreviewSolr 6 Feature Preview
Solr 6 Feature PreviewYonik Seeley
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and SparkLucidworks
 
Ingesting and Manipulating Data with JavaScript
Ingesting and Manipulating Data with JavaScriptIngesting and Manipulating Data with JavaScript
Ingesting and Manipulating Data with JavaScriptLucidworks
 
Native Code, Off-Heap Data & JSON Facet API for Solr (Heliosearch)
Native Code, Off-Heap Data & JSON Facet API for Solr (Heliosearch)Native Code, Off-Heap Data & JSON Facet API for Solr (Heliosearch)
Native Code, Off-Heap Data & JSON Facet API for Solr (Heliosearch)Yonik Seeley
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...thelabdude
 
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...Lucidworks
 
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...Lucidworks
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Databricks
 
Building a near real time search engine & analytics for logs using solr
Building a near real time search engine & analytics for logs using solrBuilding a near real time search engine & analytics for logs using solr
Building a near real time search engine & analytics for logs using solrlucenerevolution
 
Faster Data Analytics with Apache Spark using Apache Solr
Faster Data Analytics with Apache Spark using Apache SolrFaster Data Analytics with Apache Spark using Apache Solr
Faster Data Analytics with Apache Spark using Apache SolrChitturi Kiran
 
Webinar: Solr & Spark for Real Time Big Data Analytics
Webinar: Solr & Spark for Real Time Big Data AnalyticsWebinar: Solr & Spark for Real Time Big Data Analytics
Webinar: Solr & Spark for Real Time Big Data AnalyticsLucidworks
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Indexing and Query Optimizer (Mongo Austin)
Indexing and Query Optimizer (Mongo Austin)Indexing and Query Optimizer (Mongo Austin)
Indexing and Query Optimizer (Mongo Austin)MongoDB
 
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...Lucidworks
 
Solr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approachSolr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approachAlexandre Rafalovitch
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrChristos Manios
 
Webinar: What's New in Solr 7
Webinar: What's New in Solr 7 Webinar: What's New in Solr 7
Webinar: What's New in Solr 7 Lucidworks
 

Was ist angesagt? (20)

Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology
Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis TechnologySimple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology
Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology
 
ApacheCon NA 2015 Spark / Solr Integration
ApacheCon NA 2015 Spark / Solr IntegrationApacheCon NA 2015 Spark / Solr Integration
ApacheCon NA 2015 Spark / Solr Integration
 
Solr 6 Feature Preview
Solr 6 Feature PreviewSolr 6 Feature Preview
Solr 6 Feature Preview
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and Spark
 
Ingesting and Manipulating Data with JavaScript
Ingesting and Manipulating Data with JavaScriptIngesting and Manipulating Data with JavaScript
Ingesting and Manipulating Data with JavaScript
 
Native Code, Off-Heap Data & JSON Facet API for Solr (Heliosearch)
Native Code, Off-Heap Data & JSON Facet API for Solr (Heliosearch)Native Code, Off-Heap Data & JSON Facet API for Solr (Heliosearch)
Native Code, Off-Heap Data & JSON Facet API for Solr (Heliosearch)
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
 
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
 
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
 
Building a near real time search engine & analytics for logs using solr
Building a near real time search engine & analytics for logs using solrBuilding a near real time search engine & analytics for logs using solr
Building a near real time search engine & analytics for logs using solr
 
Faster Data Analytics with Apache Spark using Apache Solr
Faster Data Analytics with Apache Spark using Apache SolrFaster Data Analytics with Apache Spark using Apache Solr
Faster Data Analytics with Apache Spark using Apache Solr
 
Webinar: Solr & Spark for Real Time Big Data Analytics
Webinar: Solr & Spark for Real Time Big Data AnalyticsWebinar: Solr & Spark for Real Time Big Data Analytics
Webinar: Solr & Spark for Real Time Big Data Analytics
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Indexing and Query Optimizer (Mongo Austin)
Indexing and Query Optimizer (Mongo Austin)Indexing and Query Optimizer (Mongo Austin)
Indexing and Query Optimizer (Mongo Austin)
 
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
 
Solr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approachSolr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approach
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Webinar: What's New in Solr 7
Webinar: What's New in Solr 7 Webinar: What's New in Solr 7
Webinar: What's New in Solr 7
 

Ähnlich wie Integrate Solr with real-time stream processing applications

Springone2gx 2014 Reactive Streams and Reactor
Springone2gx 2014 Reactive Streams and ReactorSpringone2gx 2014 Reactive Streams and Reactor
Springone2gx 2014 Reactive Streams and ReactorStéphane Maldini
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataDataWorks Summit/Hadoop Summit
 
WSO2Con EU 2016: An Introduction to the WSO2 Analytics Platform
WSO2Con EU 2016: An Introduction to the WSO2 Analytics PlatformWSO2Con EU 2016: An Introduction to the WSO2 Analytics Platform
WSO2Con EU 2016: An Introduction to the WSO2 Analytics PlatformWSO2
 
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleData Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleSriram Krishnan
 
ACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics PatternsACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics PatternsSrinath Perera
 
DEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics
DEBS 2015 Tutorial : Patterns for Realtime Streaming AnalyticsDEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics
DEBS 2015 Tutorial : Patterns for Realtime Streaming AnalyticsSriskandarajah Suhothayan
 
FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...
FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...
FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...Zhenzhong Xu
 
Concurrency (Fisher Syer S2GX 2010)
Concurrency (Fisher Syer S2GX 2010)Concurrency (Fisher Syer S2GX 2010)
Concurrency (Fisher Syer S2GX 2010)Dave Syer
 
Introduction to WSO2 Data Analytics Platform
Introduction to  WSO2 Data Analytics PlatformIntroduction to  WSO2 Data Analytics Platform
Introduction to WSO2 Data Analytics PlatformSrinath Perera
 
Real time stream processing presentation at General Assemb.ly
Real time stream processing presentation at General Assemb.lyReal time stream processing presentation at General Assemb.ly
Real time stream processing presentation at General Assemb.lyVarun Vijayaraghavan
 
Building a system for machine and event-oriented data - SF HUG Nov 2015
Building a system for machine and event-oriented data - SF HUG Nov 2015Building a system for machine and event-oriented data - SF HUG Nov 2015
Building a system for machine and event-oriented data - SF HUG Nov 2015Felicia Haggarty
 
Building a system for machine and event-oriented data - Velocity, Santa Clara...
Building a system for machine and event-oriented data - Velocity, Santa Clara...Building a system for machine and event-oriented data - Velocity, Santa Clara...
Building a system for machine and event-oriented data - Velocity, Santa Clara...Eric Sammer
 
Stream Processing in SmartNews #jawsdays
Stream Processing in SmartNews #jawsdaysStream Processing in SmartNews #jawsdays
Stream Processing in SmartNews #jawsdaysSmartNews, Inc.
 
Cleveland HUG - Storm
Cleveland HUG - StormCleveland HUG - Storm
Cleveland HUG - Stormjustinjleet
 
Apache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San JoseApache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San JoseHao Chen
 
Realtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQRealtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQXin Wang
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 

Ähnlich wie Integrate Solr with real-time stream processing applications (20)

Springone2gx 2014 Reactive Streams and Reactor
Springone2gx 2014 Reactive Streams and ReactorSpringone2gx 2014 Reactive Streams and Reactor
Springone2gx 2014 Reactive Streams and Reactor
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
WSO2Con EU 2016: An Introduction to the WSO2 Analytics Platform
WSO2Con EU 2016: An Introduction to the WSO2 Analytics PlatformWSO2Con EU 2016: An Introduction to the WSO2 Analytics Platform
WSO2Con EU 2016: An Introduction to the WSO2 Analytics Platform
 
The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache Storm
 
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleData Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
 
ACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics PatternsACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics Patterns
 
DEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics
DEBS 2015 Tutorial : Patterns for Realtime Streaming AnalyticsDEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics
DEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics
 
FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...
FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...
FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...
 
AI Development with H2O.ai
AI Development with H2O.aiAI Development with H2O.ai
AI Development with H2O.ai
 
Concurrency (Fisher Syer S2GX 2010)
Concurrency (Fisher Syer S2GX 2010)Concurrency (Fisher Syer S2GX 2010)
Concurrency (Fisher Syer S2GX 2010)
 
Introduction to WSO2 Data Analytics Platform
Introduction to  WSO2 Data Analytics PlatformIntroduction to  WSO2 Data Analytics Platform
Introduction to WSO2 Data Analytics Platform
 
Real time stream processing presentation at General Assemb.ly
Real time stream processing presentation at General Assemb.lyReal time stream processing presentation at General Assemb.ly
Real time stream processing presentation at General Assemb.ly
 
Building a system for machine and event-oriented data - SF HUG Nov 2015
Building a system for machine and event-oriented data - SF HUG Nov 2015Building a system for machine and event-oriented data - SF HUG Nov 2015
Building a system for machine and event-oriented data - SF HUG Nov 2015
 
Building a system for machine and event-oriented data - Velocity, Santa Clara...
Building a system for machine and event-oriented data - Velocity, Santa Clara...Building a system for machine and event-oriented data - Velocity, Santa Clara...
Building a system for machine and event-oriented data - Velocity, Santa Clara...
 
Stream Processing in SmartNews #jawsdays
Stream Processing in SmartNews #jawsdaysStream Processing in SmartNews #jawsdays
Stream Processing in SmartNews #jawsdays
 
Cleveland HUG - Storm
Cleveland HUG - StormCleveland HUG - Storm
Cleveland HUG - Storm
 
Apache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real TimeApache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real Time
 
Apache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San JoseApache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San Jose
 
Realtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQRealtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQ
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 

Mehr von thelabdude

Running Solr in the Cloud at Memory Speed with Alluxio
Running Solr in the Cloud at Memory Speed with AlluxioRunning Solr in the Cloud at Memory Speed with Alluxio
Running Solr in the Cloud at Memory Speed with Alluxiothelabdude
 
NYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / SolrNYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / Solrthelabdude
 
Solr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloudSolr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloudthelabdude
 
Deploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
Deploying and managing SolrCloud in the cloud using the Solr Scale ToolkitDeploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
Deploying and managing SolrCloud in the cloud using the Solr Scale Toolkitthelabdude
 
Scaling Through Partitioning and Shard Splitting in Solr 4
Scaling Through Partitioning and Shard Splitting in Solr 4Scaling Through Partitioning and Shard Splitting in Solr 4
Scaling Through Partitioning and Shard Splitting in Solr 4thelabdude
 
Boosting Documents in Solr (Lucene Revolution 2011)
Boosting Documents in Solr (Lucene Revolution 2011)Boosting Documents in Solr (Lucene Revolution 2011)
Boosting Documents in Solr (Lucene Revolution 2011)thelabdude
 
Dachis Group Pig Hackday: Pig 202
Dachis Group Pig Hackday: Pig 202Dachis Group Pig Hackday: Pig 202
Dachis Group Pig Hackday: Pig 202thelabdude
 

Mehr von thelabdude (7)

Running Solr in the Cloud at Memory Speed with Alluxio
Running Solr in the Cloud at Memory Speed with AlluxioRunning Solr in the Cloud at Memory Speed with Alluxio
Running Solr in the Cloud at Memory Speed with Alluxio
 
NYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / SolrNYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / Solr
 
Solr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloudSolr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloud
 
Deploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
Deploying and managing SolrCloud in the cloud using the Solr Scale ToolkitDeploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
Deploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
 
Scaling Through Partitioning and Shard Splitting in Solr 4
Scaling Through Partitioning and Shard Splitting in Solr 4Scaling Through Partitioning and Shard Splitting in Solr 4
Scaling Through Partitioning and Shard Splitting in Solr 4
 
Boosting Documents in Solr (Lucene Revolution 2011)
Boosting Documents in Solr (Lucene Revolution 2011)Boosting Documents in Solr (Lucene Revolution 2011)
Boosting Documents in Solr (Lucene Revolution 2011)
 
Dachis Group Pig Hackday: Pig 202
Dachis Group Pig Hackday: Pig 202Dachis Group Pig Hackday: Pig 202
Dachis Group Pig Hackday: Pig 202
 

Kürzlich hochgeladen

Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfAarwolf Industries LLC
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...BookNet Canada
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Jeffrey Haguewood
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Mark Simos
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Karmanjay Verma
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsYoss Cohen
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialJoão Esperancinha
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessWSO2
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxAna-Maria Mihalceanu
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 

Kürzlich hochgeladen (20)

Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorial
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with Platformless
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance Toolbox
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 

Integrate Solr with real-time stream processing applications

  • 1.
  • 2. INTEGRATE SOLR WITH REAL-TIME STREAM PROCESSING APPLICATIONS Timothy Potter @thelabdude linkedin.com/thelabdude
  • 3. whoami independent consultant search / big data projects soon to be joining engineering team @LucidWorks co-author Solr In Action previously big data architect Dachis Group
  • 4. my storm story re-designed a complex batch-oriented indexing pipeline based on Hadoop (Oozie, Pig, Hive, Sqoop) to real-time storm topology
  • 5. agenda walk through how to develop a storm topology common integration points with Solr (near real-time indexing, percolator, real-time get)
  • 6. example listen to click events from 1.usa.gov URL shortener (bit.ly) to determine trending US government sites stream of click events: http://developer.usa.gov/1usagov http://www.smartgrid.gov -> http://1.usa.gov/ayu0Ru
  • 7. beyond word count tackle real challenges you’ll encounter when developing a storm topology and what about ... unit testing, dependency injection, measure runtime behavior of your components, separation of concerns, reducing boilerplate, hiding complexity ...
  • 8. storm open source distributed computation system scalability, fault-tolerance, guaranteed message processing (optional)
  • 9. storm primitives • tuple: ordered list of values • stream: unbounded sequence of tuples • spout: emit a stream of tuples (source) • bolt: performs some operation on each tuple • topology: dag of spouts and tuples
  • 10. solution requirements • receive click events from 1.usa.gov stream • count frequency of pages in a time window • rank top N sites per time window • extract title, body text, image for each link • persist rankings and metadata for visualization
  • 12.
  • 14. stream grouping • shuffle: random distribution of tuples to all instances of a bolt • field(s): group tuples by one or more fields in common • global: reduce down to one • all: replicate stream to all instances of a bolt source: https://github.com/nathanmarz/storm/wiki/Concepts
  • 15. useful storm concepts • bolts can receive input from many spouts • tuples in a stream can be grouped together • streams can be split and joined • bolts can inject new tuples into the stream • components can be distributed across a cluster at a configurable parallelism level • optionally, storm keeps track of each tuple emitted by a spout (ack or fail)
  • 16. tools • Spring framework – dependency injection, configuration, unit testing, mature, etc. • Groovy – keeps your code tidy and elegant • Mockito – ignore stuff your test doesn’t care about • Netty – fast & powerful NIO networking library • Coda Hale metrics – get visibility into how your bolts and spouts are performing (at a very low-level)
  • 17. spout easy! just produce a stream of tuples ... and ... avoid blocking when waiting for more data, ease off throttle if topology is not processing fast enough, deal with failed tuples, choose if it should use message Ids for each tuple emitted, data model / schema, etc ...
  • 18. SpringBoltSpringSpout Streaming DataAction (POJO) Streaming DataProvider (POJO) Spring container (1 per topology per JVM) Spring Dependency Injection JDBC WebService Hide complexity of implementing Storm contract developer focuses on business logic
  • 19. streaming data provider class OneUsaGovStreamingDataProvider implements StreamingDataProvider, MessageHandler { MessageStream messageStream ... void open(Map stormConf) { messageStream.receive(this) } boolean next(NamedValues nv) { String msg = queue.poll() if (msg) { OneUsaGovRequest req = objectMapper.readValue(msg, OneUsaGovRequest) if (req != null && req.globalBitlyHash != null) { nv.set(OneUsaGovTopology.GLOBAL_BITLY_HASH, req.globalBitlyHash) nv.set(OneUsaGovTopology.JSON_PAYLOAD, req) return true } } return false } void handleMessage(String msg) { queue.offer(msg) } Spring Dependency Injection non-blocking call to get the next message from 1.usa.gov use Jackson JSON parser to create an object from the raw incoming data
  • 20. jackson json to java @JsonIgnoreProperties(ignoreUnknown = true) class OneUsaGovRequest implements Serializable { @JsonProperty("a") String userAgent; @JsonProperty("c") String countryCode; @JsonProperty("nk") int knownUser; @JsonProperty("g") String globalBitlyHash; @JsonProperty("h") String encodingUserBitlyHash; @JsonProperty("l") String encodingUserLogin; ... } Spring converts json to java object for you: <bean id="restTemplate" class="org.springframework.web.client.RestTemplate"> <property name="messageConverters"> <list> <bean id="messageConverter” class="...json.MappingJackson2HttpMessageConverter"> </bean> </list> </property> </bean>
  • 21. spout data provider spring-managed bean <bean id="oneUsaGovStreamingDataProvider" class="com.bigdatajumpstart.storm.OneUsaGovStreamingDataProvider"> <property name="messageStream"> <bean class="com.bigdatajumpstart.netty.HttpClient"> <constructor-arg index="0" value="${streamUrl}"/> </bean> </property> </bean> builder.setSpout("1.usa.gov-spout", new SpringSpout("oneUsaGovStreamingDataProvider", spoutFields), 1) Note: when building the StormTopology to submit to Storm, you do:
  • 22. class OneUsaGovStreamingDataProviderTest extends StreamingDataProviderTestBase { @Test void testDataProvider() { String jsonStr = '''{ "a": "user-agent", "c": "US", "nk": 0, "tz": "America/Los_Angeles", "gr": "OR", "g": "2BktiW", "h": "12Me4B2", "l": "usairforce", "al": "en-us", "hh": "1.usa.gov", "r": "http://example.com/foo", ... }''' OneUsaGovStreamingDataProvider dataProvider = new OneUsaGovStreamingDataProvider() dataProvider.setMessageStream(mock(MessageStream)) dataProvider.open(stormConf) // Config setup in base class dataProvider.handleMessage(jsonStr) NamedValues record = new NamedValues(OneUsaGovTopology.spoutFields) assertTrue dataProvider.next(record) ... } } spout data provider unit test mock json to simulate data from 1.usa.gov feed use Mockito to satisfy dependencies not needed for this test asserts to verify data provider works correctly
  • 23. rolling count bolt • counts frequency of links in a sliding time window • emits topN in current window every M seconds • uses tick tuple trick provided by Storm to emit every M seconds (configurable) • provided with the storm-starter project http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
  • 24. • calls out to embed.ly API • caches results locally in the bolt instance • relies on field grouping (incoming tuples) • outputs data to be indexed in Solr • benefits from parallelism to enrich more links concurrently (watch those rate limits) enrich link metadata bolt
  • 25. embed.ly service class EmbedlyService { @Autowired RestTemplate restTemplate String apiKey private Timer apiTimer = MetricsSupport.timer(EmbedlyService, "apiCall") Embedly getLinkMetadata(String link) { String urlEncoded = URLEncoder.encode(link,"UTF-8") URI uri = new URI("https://api.embed.ly/1/oembed?key=${apiKey}&url=${urlEncoded}") Embedly embedly = null MetricsSupport.withTimer(apiTimer, { embedly = restTemplate.getForObject(uri, Embedly) }) return embedly } simple closure to time our requests to the Web service integrate coda hale metrics
  • 26. • capture runtime behavior of the components in your topology • Coda Hale metrics - http://metrics.codahale.com/ • output metrics every N minutes • report metrics to JMX, ganglia, graphite, etc metrics
  • 27. -- Meters ---------------------------------------------------------------------- EnrichLinkBoltLogic.solrQueries count = 97 mean rate = 0.81 events/second 1-minute rate = 0.89 events/second 5-minute rate = 1.62 events/second 15-minute rate = 1.86 events/second SolrBoltLogic.linksIndexed count = 60 mean rate = 0.50 events/second 1-minute rate = 0.41 events/second 5-minute rate = 0.16 events/second 15-minute rate = 0.06 events/second -- Timers ---------------------------------------------------------------------- EmbedlyService.apiCall count = 60 mean rate = 0.50 calls/second 1-minute rate = 0.40 calls/second 5-minute rate = 0.16 calls/second 15-minute rate = 0.06 calls/second min = 138.70 milliseconds max = 7642.92 milliseconds mean = 1148.29 milliseconds stddev = 1281.40 milliseconds median = 652.83 milliseconds 75% <= 1620.96 milliseconds ...
  • 28. storm cluster concepts • nimbus: master node (~job tracker in Hadoop) • zookeeper: cluster management / coordination • supervisor: one per node in the cluster to manage worker processes • worker: one or more per supervisor (JVM process) • executor: thread in worker • task: work performed by a spout or bolt
  • 29. Worker 1 (port 6701) Nimbus Supervisor (1 per node) Topology JAR Node 1 JVM process executor (thread) ... N workers ... M nodes Each component (spout or bolt) is distributed across a cluster of workers based on a configurable parallelism Zookeeper
  • 30. @Override StormTopology build(StreamingApp app) throws Exception { ... TopologyBuilder builder = new TopologyBuilder() builder.setSpout("1.usa.gov-spout", new SpringSpout("oneUsaGovStreamingDataProvider", spoutFields), 1) builder.setBolt("enrich-link-bolt", new SpringBolt("enrichLinkAction", enrichedLinkFields), 3) .fieldsGrouping("1.usa.gov-spout", globalBitlyHashGrouping) ... parallelism hint to the framework (can be rebalanced)
  • 31. solr integration points • real-time get • near real-time indexing (NRT) • percolate (match incoming docs to pre-existing queries)
  • 32. real-time get use Solr for fast lookups by document ID class SolrClient { @Autowired SolrServer solrServer SolrDocument get(String docId, String... fields) { SolrQuery q = new SolrQuery() q.setRequestHandler("/get") q.set("id", docId) q.setFields(fields) QueryRequest req = new QueryRequest(q) req.setResponseParser(new BinaryResponseParser()) QueryResponse rsp = req.process(solrServer) return (SolrDocument)rsp.getResponse().get("doc") } } send the request to the “get” request handler
  • 33. near real-time indexing • If possible, use CloudSolrServer to route documents directly to the correct shard leaders (SOLR-4816) • Use <openSearcher>false</openSearcher> for auto “hard” commits • Use auto soft commits as needed • Use parallelism of Storm bolt to distribute indexing work to N nodes http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
  • 34. percolate • match incoming documents to pre-configured queries (inverted search) – example: Is this tweet related to campaign Y for brand X? • use storm’s distributed computation support to evaluate M pre-configured queries per doc
  • 35. two possible approaches • Lucene-only solution using MemoryIndex – See presentation by Charlie Hull and Alan Woodward • EmbeddedSolrServer – Full solrconfig.xml / schema.xml – RAMDirectory – Relies on Storm to scale up documents / second – Easy solution for up to a few thousand queries
  • 36. Twitter Spout PercolatorBolt 1 Embedded SolrServer Pre-configured queries stored in a database PercolatorBolt N Embedded SolrServer ... Could be 100’s of these random stream grouping ZeroMQ pub/sub to push query changes to percolator
  • 37. tick tuples • send a special kind of tuple to a bolt every N seconds if (TupleHelpers.isTickTuple(input)) { // do special work } used in percolator to delete accumulated documents every minute or so ...
  • 38. references • Storm Wiki • https://github.com/nathanmarz/storm/wiki/Documentation • Overview: Krishna Gade • http://www.slideshare.net/KrishnaGade2/storm-at-twitter • Trending Topics: Michael Knoll • http://www.michael-noll.com/blog/2013/01/18/implementing-real-time- trending-topics-in-storm/ • Understanding Parallelism: Michael Knoll • http://www.michael-noll.com/blog/2012/10/16/understanding-the- parallelism-of-a-storm-topology/
  • 39. get the code: https://github.com/thelabdude/lsrdublin Q & A Manning coupon codes for conference related books: http://deals.manningpublications.com/RevolutionsEU2013.html