Relevance & Personalization with Real 
Time Big Data Analytics" 
Ameya 
Kanitkar 
ameya@groupon.com
About Me" 
• Lead 
Engineer 
working 
on 
Real 
Time 
Data 
Infrastructure 
@ 
Groupon 
• Graduate 
of 
Carnegie 
Mellon 
...
What are Groupon Deals?"
Our Relevance Scenario" 
Users
Scaling: Keeping Up With a Changing Business" 
Growing 
Deals 
Growing 
Users 
2011 
2012 
2014 
20+ 
400+ 
2000+ 
• 100 
...
Changing Business: Shift from Email to Mobile" 
• Growth 
in 
Mobile 
Business 
• Reducing 
dependence 
on 
email 
markeQn...
Deal Personalization Infrastructure Use Cases" 
Deliver Personalized 
Emails! 
Deliver Personalized 
Website & Mobile 
Exp...
Deal Personalization Infrastructure Use Cases" 
Deliver Personalized 
Website, Mobile and Email 
Experience! 
Deal 
Level ...
Earlier System" 
Email 
Offline 
PersonalizaQon 
Map/Reduce 
Online 
Deal 
PersonalizaQon 
API 
MySQL 
Store 
Data 
Pipeli...
Earlier System" 
Email 
Offline 
PersonalizaQon 
Map/Reduce 
Online 
Deal 
PersonalizaQon 
Data 
Pipeline 
API 
MySQL 
Sto...
Email 
Offline 
PersonalizaQon 
Map/Reduce 
Real 
Time 
Data 
Pipeline 
Online 
Deal 
PersonalizaQon 
API 
Ideal 
Data 
St...
Email 
Offline 
PersonalizaQon 
Map/Reduce 
Web 
Site 
Logs 
Online 
Deal 
PersonalizaQon 
API 
HBase 
Final Design" 
Mobi...
Two Challenges With HBase" 
How 
to 
scale 
100,000 
writes/ 
second? 
HBase 
HBase 
• How 
to 
run 
Map 
Reduce 
Programs...
Scaling with HBase" 
• Every 
new 
write 
is 
wri,en 
as 
a 
new 
column 
in 
HBase. 
This 
also 
ensures 
de-­‐duplica;on...
Final Design" 
Real 
Time 
HBase 
Batch 
HBase 
ReplicaQon 
Bulk 
Load 
data 
via 
HFiles 
Map 
Reduce 
Over 
HBase
Leveraging System for Real Time Analytics" 
Various 
requirements 
from 
relevance 
algorithms 
to 
pre-­‐ 
compute 
real ...
Leveraging System for Real Time Analytics" 
More 
Complex 
Examples 
Category 
Level 
MulQdimensional 
Performance 
Metric...
Leveraging System for Real Time Analytics" 
Even 
More 
Complex 
Examples 
How 
does 
women 
in 
Barcelona 
from 
Las 
Ram...
Power of Simple Counting" 
Turns 
out 
all 
earlier 
quesQons 
can 
be 
answered 
if 
we 
could 
count 
appropriate 
event...
Real Time Analytics Infrastructure" 
Ka_a 
Topic 
– 
With 
Real 
Time 
User 
events 
Storm 
– 
Running 
AnalyQcs 
Topology...
Real Time Analytics Infrastructure - 
Explained" 
Ka_a 
Topic 
– 
With 
Real 
Time 
User 
events 
Read 
user 
event 
Data ...
Scaling Challenges - Kafka - Storm" 
• Massive 
reads 
and 
writes 
to 
Ka_a 
cluster 
created 
backpressure 
in 
Storm 
t...
Scaling Challenges - Redis" 
• Reduce 
memory 
footprint 
– 
use 
hashes. 
Very 
memory 
efficient 
compared 
to 
normal 
...
When Small is Big – Bloom Filters" 
• Since 
both 
Ka_a 
and 
Storm 
can 
send 
same 
data 
twice 
specially 
at 
scale, 
...
Avoiding Errors – Backups/ Recovery 
Strategy" 
With 
such 
a 
high 
volume 
system, 
which 
also 
drives 
so 
much 
reven...
We Are Hiring…"
Questions?"
Ques;ons? 
Thanks! 
ameya@groupon.com! 
www.groupon.com/techjobs!
Nächste SlideShare
Wird geladen in …5
×

Ameya Kanitkar – Scaling Real Time Analytics with Storm & HBase - NoSQL matters Barcelona 2014

1.219 Aufrufe

Veröffentlicht am

Ameya Kanitkar – Scaling Real Time Analytics with Storm & HBase

Relevance and Personalization is crucial to building personalized local commerce experience at Groupon. Talk covers overview of the real time analytics infrastructure that handles over 1 million events/ second and stores and scales to billions of data points. Solution includes leveraging Apache HBase & Redis for storage and Apache Storm for building real time analytics. Solution includes various architecture design choices and tradeoffs including some interesting algorithmic choices such as Bloom Filters & Hyper Log Log. Attendees can take away learnings from our real-life experience that can help them understand various tuning methods, their tradeoffs and apply them in their solutionsNote: This is a default abstract – talk can be customized to focus on one area or the other – or can be change to 40 minute / 25 minute talk. More importantly – we feel its a fascinating story of how these big data technologies can be actually used in real life situations – and powerful applications are built when you combine NoSQL store such as HBase and real time processing power of Storm. Also some times small is big with clever algorithms such as bloom filters and hyperloglog which can be handy when building scalable systems. I am very passionate to share this story – so do let me know if you have any questions – I hope we can work out some way to make this very useful for the conference attendees

Veröffentlicht in: Daten & Analysen
0 Kommentare
2 Gefällt mir
Statistik
Notizen
  • Als Erste(r) kommentieren

Keine Downloads
Aufrufe
Aufrufe insgesamt
1.219
Auf SlideShare
0
Aus Einbettungen
0
Anzahl an Einbettungen
30
Aktionen
Geteilt
0
Downloads
27
Kommentare
0
Gefällt mir
2
Einbettungen 0
Keine Einbettungen

Keine Notizen für die Folie

Ameya Kanitkar – Scaling Real Time Analytics with Storm & HBase - NoSQL matters Barcelona 2014

  1. 1. Relevance & Personalization with Real Time Big Data Analytics" Ameya Kanitkar ameya@groupon.com
  2. 2. About Me" • Lead Engineer working on Real Time Data Infrastructure @ Groupon • Graduate of Carnegie Mellon University & Pune University
  3. 3. What are Groupon Deals?"
  4. 4. Our Relevance Scenario" Users
  5. 5. Scaling: Keeping Up With a Changing Business" Growing Deals Growing Users 2011 2012 2014 20+ 400+ 2000+ • 100 Million+ subscribers • We need to store data like, user click history, email records, service logs etc. This tunes to billions of data points and TB’s of data
  6. 6. Changing Business: Shift from Email to Mobile" • Growth in Mobile Business • Reducing dependence on email markeQng • Change in strategy from daily deal model to deal 100 Million+ App Downloads marketplace
  7. 7. Deal Personalization Infrastructure Use Cases" Deliver Personalized Emails! Deliver Personalized Website & Mobile Experience! Offline System Online System Email Personalize billions of emails for hundreds of millions of users Personalize one of the most popular e-­‐commerce mobile & web app for hundreds of millions of users & page views
  8. 8. Deal Personalization Infrastructure Use Cases" Deliver Personalized Website, Mobile and Email Experience! Deal Level Performance User Behavior Performance Deliver Relevant Experience with High Quality Deals!
  9. 9. Earlier System" Email Offline PersonalizaQon Map/Reduce Online Deal PersonalizaQon API MySQL Store Data Pipeline (User Logs, Email Records, User History etc)
  10. 10. Earlier System" Email Offline PersonalizaQon Map/Reduce Online Deal PersonalizaQon Data Pipeline API MySQL Store • Scaling MySQL for data such as user click history, email records was painful unless we shard data • Data Pipeline is not “Real Time”
  11. 11. Email Offline PersonalizaQon Map/Reduce Real Time Data Pipeline Online Deal PersonalizaQon API Ideal Data Store • Common data store that serves data to both online and offline systems • Data store that scales to hundreds of millions of records • Data store that plays well with our exisQng hadoop based systems • Real Time pipeline that scales and can process about 100,000 messages/ second Ideal System"
  12. 12. Email Offline PersonalizaQon Map/Reduce Web Site Logs Online Deal PersonalizaQon API HBase Final Design" Mobile Logs Ka_a Message Broker Storm
  13. 13. Two Challenges With HBase" How to scale 100,000 writes/ second? HBase HBase • How to run Map Reduce Programs over HBase without affecQng read latency • How to batch load data in HBase without affecQng read latencies
  14. 14. Scaling with HBase" • Every new write is wri,en as a new column in HBase. This also ensures de-­‐duplica;on • At the ;me of wri;ng new event, no read happens ! Column Family 0 User_id 1 Qmestamp_DealId_eventType: {Impression1} Qmestamp_DealId_eventType: {Impression2}
  15. 15. Final Design" Real Time HBase Batch HBase ReplicaQon Bulk Load data via HFiles Map Reduce Over HBase
  16. 16. Leveraging System for Real Time Analytics" Various requirements from relevance algorithms to pre-­‐ compute real ;me analy;cs for be,er targe;ng Category Level MulQdimensional Performance Metrics Deal Level Performance Metrics How does women in Barcelona convert for Pizza deals? How does women in Barcelona are converQng for a parQcular pizza deal?
  17. 17. Leveraging System for Real Time Analytics" More Complex Examples Category Level MulQdimensional Performance Metrics Deal Level Performance Metrics How does women in Barcelona from Las Ramblas area aged 45-­‐50 convert for New York Style Pizza, when deal is located within 2 miles, and when deal is priced between €10-­‐€20? How does women in Barcelona from Las Ramblas area aged 45-­‐50 convert for New York Style Pizza, for this parQcular deal which happens to be about 2 miles away, and deal is priced for €12
  18. 18. Leveraging System for Real Time Analytics" Even More Complex Examples How does women in Barcelona from Las Ramblas area aged 45-­‐50 convert for New York Style Pizza, when deal is located within 2 miles, and when deal is priced between €10-­‐ €20 who also like AcQviQes such as Biking and who have been very acQve customer of Groupon deals on mobile plakorm? How does women in Barcelona from Las Ramblas area aged 45-­‐50 convert for New York Style Pizza, for this parQcular deal which happens to be about 2 miles away, and deal is priced for €12, who also like acQviQes such as biking and who have been very acQve customer of Groupon deals on mobile plakorm?
  19. 19. Power of Simple Counting" Turns out all earlier quesQons can be answered if we could count appropriate events in appropriate bucket No of Purchases by Women in Palo Alto for No Deal Impressions by Women in Palo Alto for Pizza Deals Conversion rate Pizza Deals for pizza deals for women in Palo Alto =
  20. 20. Real Time Analytics Infrastructure" Ka_a Topic – With Real Time User events Storm – Running AnalyQcs Topology Real Time infrastructure processing 100,000 requests/ second Redis 1 … Storm Topology calculaQng various dimensions/ buckets and updates appropriate Redis bucket. Redis is sharded from client side Redis cluster handles over 3 Million events per second. Stores over 14 Billion unique keys Redis 2 Redis N
  21. 21. Real Time Analytics Infrastructure - Explained" Ka_a Topic – With Real Time User events Read user event Data from Ka_a Find out which all buckets this event falls Increase event counter for appropriate bucket in Redis Redis Shards Storm
  22. 22. Scaling Challenges - Kafka - Storm" • Massive reads and writes to Ka_a cluster created backpressure in Storm topologies that produce data into Ka_a • Storm was hard to scale. We had to try various number of combinaQons to finalize how many bolts of each type are required for steady state operaQons and overall how many workers are needed etc. This took more Qme than we anQcipated • Use “maxSpoutPending” serng in Storm topologies. We found it to be very useful to shield your topologies from sudden increase in traffic • Build your enQre infrastructure – where data duplicates are allowed
  23. 23. Scaling Challenges - Redis" • Reduce memory footprint – use hashes. Very memory efficient compared to normal Redis keys • In order to support high write operaQons turned off AOF, turned on RDB backups Easiest of all other infrastructure pieces – Ka_a, Storm, HBase
  24. 24. When Small is Big – Bloom Filters" • Since both Ka_a and Storm can send same data twice specially at scale, it was important to build downstream infrastructure that can handle duplicate data. • However, by very nature AnalyQcs Topology (CounQng Topology) cannot handle duplicates • Storing individual messages for billions of messages is way too expensive and would take lot more memory • So we used bloom filters. At a very small % error rate, we could effecQvely de dup data with a very small memory footprint. We store bloom filters in Redis using redis “bit” support
  25. 25. Avoiding Errors – Backups/ Recovery Strategy" With such a high volume system, which also drives so much revenue for the company good backup/ recovery strategy is necessary Redis RDB Backups every X hours. RDB backups are stored in HDFS for later use HBase HBase Snapshot funcQonality is used. Snapshot taken every X hours. Can be loaded as necessary Ka_a/ Storm All input into Ka_a topic is stored in HDFS. So any hour/ day can be replayed from HDFS if necessary
  26. 26. We Are Hiring…"
  27. 27. Questions?"
  28. 28. Ques;ons? Thanks! ameya@groupon.com! www.groupon.com/techjobs!

×