2. What are we doing?
● Discovering relevant domains in the Fashion Web
● Using a modified HITS algorithm, a.k.a. Hubs & Authorities
Why use Kafka Streams?
● Why not Spark, Hadoop, Whatevs?
● Advantages of Kafka Streams
Show me the nitty gritty!
● Okay!
8. HITS in a nutshell
Adjacency Matrix
Image courtesy of http://faculty.ycp.edu/~dbabcock/PastCourses/cs360/lectures/lecture15.html
9. HITS in a nutshell
● Why not just counts?
● If you point to good sites, you’re a better Hub
● If good sites point to you, you’re a better Authority
● Each iteration, the weights change, until they reach convergence
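To make the iteration above concrete, here is a minimal plain-Java sketch of textbook HITS on a tiny adjacency matrix. It is not the modified variant used in the talk, whose details are not shown on the slides. The class name `HitsSketch`, the four-page example graph, the L2 normalisation and the stopping tolerance are all assumptions made for illustration.

```java
import java.util.Arrays;

// Illustrative only: textbook HITS, not the modified version from the talk.
class HitsSketch {

    // adj[i][j] != 0 means page i links to page j.
    // Returns { hubScores, authorityScores }.
    static double[][] hits(int[][] adj, int maxIterations, double tolerance) {
        int n = adj.length;
        double[] hub = new double[n];
        double[] auth = new double[n];
        Arrays.fill(hub, 1.0);
        Arrays.fill(auth, 1.0);

        for (int iter = 0; iter < maxIterations; iter++) {
            double[] newAuth = new double[n];
            double[] newHub = new double[n];

            // Authority of j: sum of the hub scores of the pages that link to j.
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    if (adj[i][j] != 0) {
                        newAuth[j] += hub[i];
                    }
                }
            }
            normalize(newAuth);

            // Hub of i: sum of the (updated) authority scores of the pages i links to.
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    if (adj[i][j] != 0) {
                        newHub[i] += newAuth[j];
                    }
                }
            }
            normalize(newHub);

            // Total absolute change in both score vectors since the last iteration.
            double change = 0.0;
            for (int k = 0; k < n; k++) {
                change += Math.abs(newAuth[k] - auth[k]) + Math.abs(newHub[k] - hub[k]);
            }
            hub = newHub;
            auth = newAuth;
            if (change < tolerance) {
                break; // weights have (approximately) converged
            }
        }
        return new double[][] {hub, auth};
    }

    // Scale the vector to unit Euclidean length; leave an all-zero vector alone.
    static void normalize(double[] v) {
        double sumOfSquares = 0.0;
        for (double x : v) {
            sumOfSquares += x * x;
        }
        double norm = Math.sqrt(sumOfSquares);
        if (norm == 0.0) {
            return;
        }
        for (int i = 0; i < v.length; i++) {
            v[i] /= norm;
        }
    }

    public static void main(String[] args) {
        // Hypothetical graph: pages 0 and 1 link to page 2; page 2 links to page 3.
        int[][] adj = {
            {0, 0, 1, 0},
            {0, 0, 1, 0},
            {0, 0, 0, 1},
            {0, 0, 0, 0}
        };
        double[][] scores = hits(adj, 100, 1e-9);
        System.out.println("hubs        = " + Arrays.toString(scores[0]));
        System.out.println("authorities = " + Arrays.toString(scores[1]));
    }
}
```

In this example page 2 ends up with the highest authority score, and pages 0 and 1 tie as the strongest hubs because they both point at it. A plain inbound-link count would rank page 3 higher relative to page 2 than the iteration does.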
14. Please tell me you aren’t using Kafka to do matrix multiplication!
No, I am not crazy enough* to use Kafka to do matrix multiplication!
*Although, I probably did spend too much time thinking about it on the bus!
15. Why not use Map-Reduce?
Basically what it was invented for, right?
18. You could use Spark…
● High infrastructure overhead if not using it for anything else
● Bad initial experience with Spark Streaming snapshotting and recovery
● Already using Kafka!
19. Why use Kafka Streams?
● Has the primitives necessary for Map-Reduce
○ Map step: groupBy, groupByKey
○ Reduce step: reduce, aggregate
● Focus is on your data, not distributed computing machinery
● Streaming allows us to have (near) real-time, up-to-date data
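The map/group/reduce shape described above can be sketched without any cluster at all. The following is an in-memory analogue using plain `java.util.stream`, not the Kafka Streams API: `flatMap` plays the map step, `Collectors.groupingBy` stands in for groupByKey, and `Collectors.reducing` stands in for reduce. The class name `LinkMapReduceSketch`, the `edgeCounts` method, the `source->target` key format and the sample domains are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative in-memory analogue of map -> group by key -> reduce.
// This is NOT Kafka Streams code; it only mirrors the shape of the computation.
class LinkMapReduceSketch {

    // Each input entry is (source domain, domains linked from one crawled page).
    // The result maps "source->target" to the number of links seen for that edge.
    static Map<String, Integer> edgeCounts(List<Map.Entry<String, List<String>>> pages) {
        return pages.stream()
            // Map step: emit one edge key per outbound link on the page.
            .flatMap(page -> page.getValue().stream()
                .map(target -> page.getKey() + "->" + target))
            // Group by edge key, then reduce each group by summing ones.
            .collect(Collectors.groupingBy(
                edge -> edge,
                TreeMap::new,
                Collectors.reducing(0, edge -> 1, Integer::sum)));
    }

    public static void main(String[] args) {
        // Hypothetical crawl results: two pages from a.example, one from d.example.
        List<Map.Entry<String, List<String>>> pages = List.of(
            Map.entry("a.example", List.of("b.example", "c.example")),
            Map.entry("a.example", List.of("b.example")),
            Map.entry("d.example", List.of("b.example")));
        System.out.println(edgeCounts(pages));
    }
}
```

In a streaming setting the same reduction would be updated incrementally as new pages arrive, rather than recomputed over a finished list as it is here.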
21. Kafka Streams 101
● No explicit consumers/producers - plumbing handled for you
● Topics are still a fundamental communication piece
● Think of individual records flowing through KStreams & KTables
22. Kafka Streams 101 - KStream
● Focus is on specific functional transformations - map, filter, flatMap
● Also supports various flavours of joins with other KStreams
● Usually created from one or more topics or a transformation on another KStream
23. Kafka Streams 101 - KTable
● Still offers functional transforms, but on a primary-keyed table
● Offers persistent storage
● Created from aggregations on a KStream or transforms on other KTables
24. Kafka Streams 101 - KGroupedStream
● The bridge between KStream and KTable
● Created by doing a groupBy or groupByKey on a KStream
● Create KTables from it with reduce, aggregate, or count
25. Data Flow
[Diagram] Input Topic → KStream (KStream ops: flatMap) → KTable ops: groupByKey, reduce, toStream → Output Topic (log compacted)
Services shown: Domain Link Extractor, Domain Reducer, HITS Calculator & API
33. TL;DR
● Kafka Streams can help solve problems in your application domain
● Focus on your data!
● Naturally decompose the problem into flexible microservices