Integrate Solr with real-time stream processing applications

Timothy Potter
@thelabdude
linkedin.com/thelabdude

whoami
independent consultant search / big data projects
soon to be joining engineering team @LucidWorks
co-author Solr In Action
previously big data architect Dachis Group

my storm story
re-designed a complex batch-oriented indexing
pipeline based on Hadoop (Oozie, Pig, Hive, Sqoop)
to real-time storm topology

agenda
walk through how to develop a storm topology
common integration points with Solr
(near real-time indexing, percolator, real-time get)

example
listen to click events from 1.usa.gov URL shortener
(bit.ly) to determine trending US government sites
stream of click events:
http://developer.usa.gov/1usagov
http://www.smartgrid.gov -> http://1.usa.gov/ayu0Ru

beyond word count
tackle real challenges you’ll encounter when
developing a storm topology
and what about ... unit testing, dependency injection,
measure runtime behavior of your components, separation of
concerns, reducing boilerplate, hiding complexity ...

storm
open source distributed computation system
scalability, fault-tolerance, guaranteed message
processing (optional)

storm primitives
• tuple: ordered list of values
• stream: unbounded sequence of tuples
• spout: emit a stream of tuples (source)
• bolt: performs some operation on each tuple
• topology: DAG of spouts and bolts

solution requirements
• receive click events from 1.usa.gov stream
• count frequency of pages in a time window
• rank top N sites per time window
• extract title, body text, image for each link
• persist rankings and metadata for visualization

trending snapshot (sept 12, 2013)

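[Diagram: topology for the 1.usa.gov example]
• 1.usa.gov Spout -> EnrichLink Bolt (field grouping on bit.ly hash); EnrichLink Bolt calls the embed.ly API
• EnrichLink Bolt -> Solr Indexing Bolt -> Solr
• 1.usa.gov Spout -> Rolling Count Bolt (field grouping on bit.ly hash)
• Rolling Count Bolt -> Intermediate Rankings Bolt (field grouping on obj)
• Intermediate Rankings Bolt -> Total Rankings Bolt (global grouping)
• Total Rankings Bolt -> Persist Rankings Bolt (global grouping) -> Metrics DB
• Rolling Count, Intermediate Rankings, and Total Rankings bolts are provided in the storm-starter project
• legend: data store, bolt, spout, stream grouping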

stream grouping
• shuffle: random distribution of tuples to all instances of a bolt
• field(s): group tuples by one or more fields in common
• global: reduce down to one
• all: replicate stream to all instances of a bolt
source: https://github.com/nathanmarz/storm/wiki/Concepts
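
A minimal sketch of the idea behind field(s) grouping: tuples that share the same values for the grouping fields always land on the same bolt instance. The class and method names are hypothetical and this is not Storm's actual implementation; it only illustrates the hash-and-modulo principle.

```java
import java.util.List;
import java.util.Objects;

// Hypothetical helper (not part of Storm): illustrates how a fields grouping
// can map equal grouping values to the same task index.
final class FieldsGroupingSketch {
    private FieldsGroupingSketch() {}

    /** Picks a task index in [0, numTasks) from the values of the grouping fields. */
    static int chooseTask(List<?> groupingValues, int numTasks) {
        if (numTasks <= 0) {
            throw new IllegalArgumentException("numTasks must be positive");
        }
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(Objects.hashCode(groupingValues), numTasks);
    }
}
```

With three bolt instances, every click event carrying the same bit.ly hash would resolve to the same index, which is what makes per-instance state (counts, caches) consistent.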

useful storm concepts
• bolts can receive input from many spouts
• tuples in a stream can be grouped together
• streams can be split and joined
• bolts can inject new tuples into the stream
• components can be distributed across a cluster at a
configurable parallelism level
• optionally, storm keeps track of each tuple emitted by a spout
(ack or fail)

tools
• Spring framework – dependency injection, configuration, unit
testing, mature, etc.
• Groovy – keeps your code tidy and elegant
• Mockito – ignore stuff your test doesn’t care about
• Netty – fast & powerful NIO networking library
• Coda Hale metrics – get visibility into how your bolts and
spouts are performing (at a very low-level)

spout
easy! just produce a stream of tuples ...
and ... avoid blocking when waiting for more data, ease off throttle if topology
is not processing fast enough, deal with failed tuples, choose if it should use
message Ids for each tuple emitted, data model / schema, etc ...

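[Diagram: SpringSpout and SpringBolt]
• SpringSpout delegates to a StreamingDataProvider (POJO)
• SpringBolt delegates to a StreamingDataAction (POJO)
• the POJOs live in a Spring container (1 per topology per JVM) and receive their dependencies (JDBC, WebService) through Spring dependency injection
• SpringSpout / SpringBolt hide the complexity of implementing the Storm contract; the developer focuses on business logic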

streaming data provider
class OneUsaGovStreamingDataProvider implements StreamingDataProvider, MessageHandler {
    MessageStream messageStream // set by Spring dependency injection
    ...
    void open(Map stormConf) { messageStream.receive(this) }

    boolean next(NamedValues nv) {
        // non-blocking call to get the next message from 1.usa.gov
        String msg = queue.poll()
        if (msg) {
            // use the Jackson JSON parser to create an object from the raw incoming data
            OneUsaGovRequest req = objectMapper.readValue(msg, OneUsaGovRequest)
            if (req != null && req.globalBitlyHash != null) {
                nv.set(OneUsaGovTopology.GLOBAL_BITLY_HASH, req.globalBitlyHash)
                nv.set(OneUsaGovTopology.JSON_PAYLOAD, req)
                return true
            }
        }
        return false
    }

    void handleMessage(String msg) { queue.offer(msg) }
}

jackson json to java
@JsonIgnoreProperties(ignoreUnknown = true)
class OneUsaGovRequest implements Serializable {
@JsonProperty("a")
String userAgent;
@JsonProperty("c")
String countryCode;
@JsonProperty("nk")
int knownUser;
@JsonProperty("g")
String globalBitlyHash;
@JsonProperty("h")
String encodingUserBitlyHash;
@JsonProperty("l")
String encodingUserLogin;
...
}
Spring converts json to java object for you:
<bean id="restTemplate"
class="org.springframework.web.client.RestTemplate">
<property name="messageConverters">
<list>
<bean id="messageConverter"
class="...json.MappingJackson2HttpMessageConverter">
</bean>
</list>
</property>
</bean>

spout data provider spring-managed bean
<bean id="oneUsaGovStreamingDataProvider"
class="com.bigdatajumpstart.storm.OneUsaGovStreamingDataProvider">
<property name="messageStream">
<bean class="com.bigdatajumpstart.netty.HttpClient">
<constructor-arg index="0" value="${streamUrl}"/>
</bean>
</property>
</bean>
Note: when building the StormTopology to submit to Storm, you do:
builder.setSpout("1.usa.gov-spout",
    new SpringSpout("oneUsaGovStreamingDataProvider", spoutFields), 1)

spout data provider unit test
class OneUsaGovStreamingDataProviderTest extends StreamingDataProviderTestBase {
    @Test
    void testDataProvider() {
        // mock json to simulate data from 1.usa.gov feed
        String jsonStr = '''{
            "a": "user-agent", "c": "US",
            "nk": 0, "tz": "America/Los_Angeles",
            "gr": "OR", "g": "2BktiW",
            "h": "12Me4B2", "l": "usairforce",
            "al": "en-us", "hh": "1.usa.gov",
            "r": "http://example.com/foo",
            ...
        }'''
        OneUsaGovStreamingDataProvider dataProvider = new OneUsaGovStreamingDataProvider()
        // use Mockito to satisfy dependencies not needed for this test
        dataProvider.setMessageStream(mock(MessageStream))
        dataProvider.open(stormConf) // Config setup in base class
        dataProvider.handleMessage(jsonStr)
        NamedValues record = new NamedValues(OneUsaGovTopology.spoutFields)
        // asserts to verify data provider works correctly
        assertTrue dataProvider.next(record)
        ...
    }
}

rolling count bolt
• counts frequency of links in a sliding time window
• emits topN in current window every M seconds
• uses tick tuple trick provided by Storm to emit every
M seconds (configurable)
• provided with the storm-starter project
http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
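
A simplified, single-threaded sketch of the sliding-window idea: counts are kept in per-tick slots, the oldest slot is dropped as the window advances, and the top N keys are read from the summed totals. This is not the storm-starter implementation; the class and method names are hypothetical, and `advance()` stands in for what a tick tuple would trigger.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the storm-starter RollingCountBolt.
final class SlidingWindowCounterSketch {
    private final int numSlots;
    private final Deque<Map<String, Long>> slots = new ArrayDeque<>();

    SlidingWindowCounterSketch(int numSlots) {
        if (numSlots <= 0) {
            throw new IllegalArgumentException("numSlots must be positive");
        }
        this.numSlots = numSlots;
        slots.addLast(new HashMap<>());
    }

    /** Counts one occurrence of the key in the current (newest) slot. */
    void increment(String key) {
        slots.peekLast().merge(key, 1L, Long::sum);
    }

    /** Call on every tick: opens a new slot and drops the oldest once the window is full. */
    void advance() {
        slots.addLast(new HashMap<>());
        if (slots.size() > numSlots) {
            slots.removeFirst();
        }
    }

    /** Sums counts across all slots still inside the window. */
    Map<String, Long> totals() {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> slot : slots) {
            slot.forEach((k, v) -> totals.merge(k, v, Long::sum));
        }
        return totals;
    }

    /** Returns up to n keys with the highest counts, highest first (ties broken by key). */
    List<String> topN(int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(totals().entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            keys.add(entries.get(i).getKey());
        }
        return keys;
    }
}
```

In a real bolt, `increment` would run for each incoming tuple and `advance` followed by `topN` would run when a tick tuple arrives.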

enrich link metadata bolt
• calls out to embed.ly API
• caches results locally in the bolt instance
• relies on field grouping (incoming tuples)
• outputs data to be indexed in Solr
• benefits from parallelism to enrich more links
concurrently (watch those rate limits)
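
A minimal sketch of a per-bolt-instance cache, assuming a simple size-bounded LRU policy (the slides do not say which policy the bolt uses). The class name and the loader function are hypothetical; the loader stands in for the embed.ly call. Because field grouping sends the same bit.ly hash to the same bolt instance, a local cache like this can avoid repeated API calls for popular links.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: size-bounded LRU cache local to one bolt instance.
// Not thread-safe; assumes it is only touched by the bolt's executor thread.
final class LinkMetadataCacheSketch<V> {
    private final Map<String, V> cache;
    private final Function<String, V> loader;

    LinkMetadataCacheSketch(final int maxEntries, Function<String, V> loader) {
        this.loader = loader;
        // accessOrder = true makes iteration order least-recently-used first
        this.cache = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached value, calling the loader (e.g. a web service) only on a miss. */
    V get(String key) {
        V value = cache.get(key);
        if (value == null) {
            value = loader.apply(key);
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    int size() {
        return cache.size();
    }
}
```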

embed.ly service
class EmbedlyService {
    @Autowired
    RestTemplate restTemplate
    String apiKey
    // integrate coda hale metrics
    private Timer apiTimer = MetricsSupport.timer(EmbedlyService, "apiCall")

    Embedly getLinkMetadata(String link) {
        String urlEncoded = URLEncoder.encode(link, "UTF-8")
        URI uri = new URI("https://api.embed.ly/1/oembed?key=${apiKey}&url=${urlEncoded}")
        Embedly embedly = null
        // simple closure to time our requests to the Web service
        MetricsSupport.withTimer(apiTimer, {
            embedly = restTemplate.getForObject(uri, Embedly)
        })
        return embedly
    }
}

metrics
• capture runtime behavior of the components in your
topology
• Coda Hale metrics - http://metrics.codahale.com/
• output metrics every N minutes
• report metrics to JMX, ganglia, graphite, etc
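
A dependency-free sketch of what a timing wrapper around a call does: record how many calls were made and how long they took. This is not the Coda Hale API and not the MetricsSupport helper from the talk; the class and method names are hypothetical, and a real metrics library also tracks rates and percentiles.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

// Hypothetical sketch of a call timer; a metrics library would replace this in practice.
final class SimpleTimerSketch {
    private final LongAdder count = new LongAdder();
    private final LongAdder totalNanos = new LongAdder();

    /** Runs the work, recording its duration even if it throws. */
    <T> T time(Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            totalNanos.add(System.nanoTime() - start);
            count.increment();
        }
    }

    long count() {
        return count.sum();
    }

    /** Mean duration in milliseconds, or 0.0 if nothing was timed yet. */
    double meanMillis() {
        long n = count.sum();
        return n == 0 ? 0.0 : totalNanos.sum() / 1_000_000.0 / n;
    }
}
```

Wrapping the web service call in `time(...)` keeps the measurement code out of the business logic, which is the same motivation as the closure-based helper in the embed.ly service.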

-- Meters ----------------------------------------------------------------------
EnrichLinkBoltLogic.solrQueries
count = 97
mean rate = 0.81 events/second
1-minute rate = 0.89 events/second
SolrBoltLogic.linksIndexed
count = 60
mean rate = 0.50 events/second
-- Timers ----------------------------------------------------------------------
EmbedlyService.apiCall
count = 60
mean rate = 0.50 calls/second
1-minute rate = 0.40 calls/second
min = 138.70 milliseconds
max = 7642.92 milliseconds
mean = 1148.29 milliseconds
stddev = 1281.40 milliseconds
median = 652.83 milliseconds
75% <= 1620.96 milliseconds
...

storm cluster concepts
• nimbus: master node (~job tracker in Hadoop)
• zookeeper: cluster management / coordination
• supervisor: one per node in the cluster to manage worker
processes
• worker: one or more per supervisor (JVM process)
• executor: thread in worker
• task: work performed by a spout or bolt

[Diagram: Storm cluster]
• Nimbus (receives the topology JAR) and Zookeeper
• Node 1 ... M nodes, each with a Supervisor (1 per node)
• Worker 1 (port 6701) ... N workers; each worker is a JVM process running executors (threads)
• each component (spout or bolt) is distributed across a cluster of workers based on a configurable parallelism

@Override
StormTopology build(StreamingApp app) throws Exception {
    ...
    TopologyBuilder builder = new TopologyBuilder()
    builder.setSpout("1.usa.gov-spout",
        new SpringSpout("oneUsaGovStreamingDataProvider", spoutFields), 1)
    // the last argument (3) is a parallelism hint to the framework (can be rebalanced)
    builder.setBolt("enrich-link-bolt",
        new SpringBolt("enrichLinkAction", enrichedLinkFields), 3)
        .fieldsGrouping("1.usa.gov-spout", globalBitlyHashGrouping)
    ...

solr integration points
• real-time get
• near real-time indexing (NRT)
• percolate (match incoming docs to pre-existing
queries)

real-time get
use Solr for fast lookups by document ID
class SolrClient {
    @Autowired
    SolrServer solrServer

    SolrDocument get(String docId, String... fields) {
        SolrQuery q = new SolrQuery()
        // send the request to the "get" request handler
        q.setRequestHandler("/get")
        q.set("id", docId)
        q.setFields(fields)
        QueryRequest req = new QueryRequest(q)
        req.setResponseParser(new BinaryResponseParser())
        QueryResponse rsp = req.process(solrServer)
        return (SolrDocument)rsp.getResponse().get("doc")
    }
}

near real-time indexing
• If possible, use CloudSolrServer to route documents directly
to the correct shard leaders (SOLR-4816)
• Use <openSearcher>false</openSearcher> for auto “hard”
commits
• Use auto soft commits as needed
• Use parallelism of Storm bolt to distribute indexing work to N
nodes
http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

percolate
• match incoming documents to pre-configured
queries (inverted search)
– example: Is this tweet related to campaign Y for brand X?
• use storm’s distributed computation support to
evaluate M pre-configured queries per doc
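
A toy sketch of inverted search in plain Java: queries are registered up front, and each incoming document is checked against all of them. This is only an illustration of the concept; it uses naive term matching rather than Lucene's MemoryIndex or an EmbeddedSolrServer, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: each query is a set of terms that must all appear in the document.
final class PercolatorSketch {
    private final Map<String, Set<String>> queries = new LinkedHashMap<>();

    /** Registers a pre-configured query under an id. */
    void register(String queryId, String... requiredTerms) {
        Set<String> terms = new HashSet<>();
        for (String term : requiredTerms) {
            terms.add(term.toLowerCase(Locale.ROOT));
        }
        queries.put(queryId, terms);
    }

    /** Returns the ids of all registered queries that match the incoming document. */
    List<String> percolate(String docText) {
        // naive tokenization: lower-case and split on non-word characters
        Set<String> tokens = new HashSet<>(
                Arrays.asList(docText.toLowerCase(Locale.ROOT).split("\\W+")));
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Set<String>> query : queries.entrySet()) {
            if (tokens.containsAll(query.getValue())) {
                matches.add(query.getKey());
            }
        }
        return matches;
    }
}
```

Since every document is evaluated against every query, throughput depends on how many bolt instances share the incoming stream, which is where Storm's parallelism helps.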

two possible approaches
• Lucene-only solution using MemoryIndex
– See presentation by Charlie Hull and Alan Woodward
• EmbeddedSolrServer
– Full solrconfig.xml / schema.xml
– RAMDirectory
– Relies on Storm to scale up documents / second
– Easy solution for up to a few thousand queries

[Diagram: percolator topology]
• Twitter Spout -> PercolatorBolt 1 ... PercolatorBolt N (random stream grouping); could be 100's of these
• each PercolatorBolt has its own EmbeddedSolrServer
• pre-configured queries are stored in a database
• ZeroMQ pub/sub to push query changes to the percolator

tick tuples
• send a special kind of tuple to a bolt every N
seconds
if (TupleHelpers.isTickTuple(input)) {
// do special work
}
used in percolator to delete accumulated documents every minute or so ...

references
• Storm Wiki
• https://github.com/nathanmarz/storm/wiki/Documentation
• Overview: Krishna Gade
• http://www.slideshare.net/KrishnaGade2/storm-at-twitter
• Trending Topics: Michael Noll
• http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
• Understanding Parallelism: Michael Noll
• http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/

get the code: https://github.com/thelabdude/lsrdublin
Q & A
Manning coupon codes for conference related books:
http://deals.manningpublications.com/RevolutionsEU2013.html
