Data aggregation and analysis problems become notoriously thorny as traffic scales up: conventional databases break down at scale, and map/reduce frameworks such as Hadoop carry a substantial development and operational complexity burden. Wanelo, an online community for all the world's shopping that brings together stores, products and 10M users in one social platform, became frustrated that the aggregation and analysis tools it used when data was small (venerable Unix data processing utilities like grep, awk, cut, sed, uniq and sort) couldn't be used once data became large. Manta, a new cloud-based object storage system that lets data be processed where it is stored, gave Wanelo a solution that no longer required moving data between storage and compute. Building on Manta, Wanelo has developed a system that allows the team to tackle big data analysis using Unix utilities, resulting in a cost-effective and scalable solution. In this talk Konstantin discussed Wanelo's experiences building their system on Manta, including their motivations and the alternatives they considered, which led to a Manta-based implementation of fully parallelized cohort retention analysis in four lines of shell.
Using Joyent Manta to Scale Event-based Data Collection and Analysis at Wanelo
1. Proprietary and
Presenter: Konstantin Gredeskoul
CTO, Wanelo.com
Based on work of Atasay Gökkaya and other engineers
"It's a Unix System! I know this!"
Using Manta to Scale Event-based Data
Collection and Analysis
@kig
@kigster
2.
■ Wanelo (“Wah-nee-lo”, from Want, Need, Love) is a global platform for all the world’s shopping
3. How Wanelo Works
■ Users find products on online stores
■ They post these products to Wanelo via a JavaScript “bookmarklet”
■ Others discover these products on Wanelo via feed, trending, search, etc.
■ Users then save products they discovered to their own collections
5. Wanelo is a Social Network
■ Users can follow other users. Following is one-directional, like Twitter, and public
■ Besides following other users, you can follow individual stores on Wanelo
■ The result is a personalized shopping feed, much like Twitter’s information feed
■ After seeing a product on Wanelo, users can buy the product on the original site
8. Final word about Wanelo...
We are slightly obsessed with cat pictures =)
9. Recording User Events: Why?
■ Let’s say a user saves a product
■ Naturally we create a row in our main data store (PostgreSQL)
■ But we also want to record this event to an append-only log table, for future analysis
■ In an ideal world, this append-only table has every user-generated event of interest
10. Hey, What’s the Scale Here?
■ 10M users
■ 7M products saved over 1B times
■ 200K+ stores
■ Backend peaks at 200,000 RPMs
■ Generating between 5M and
20M user events per day
11. Recording Events: Stupidly
■ We are just starting: what’s the simplest thing
we can do? Our traffic is still pretty low.
■ Let’s create a database table and append to
that. Simple? Yes.
■ Scalable? Hell No.
■ One month after launch, we hit the wall.
12. Let’s Scale Data Collection
■ OK, so inserting 10M records into PostgreSQL
per day is pretty stupid. Even I know that.
■ We looked around for various options. There
were many. Flume, Fluentd, Scribe. Meh.
■ We chose rsyslog: clients can buffer records,
send cheap UDP packets.
■ More than one log collector for redundancy
13. Scaling Event Data Collection
■ rsyslog rocks. We are now sending 20M events per day from 40+ hosts
■ rsyslog is dumping them into an ASCII pipe-delimited file
■ logadm rotates the file daily. We get a 1GB+ file per day of activity
■ We have solved the data collection problem for a long time, and very cheaply.
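As a minimal sketch of what a client might emit, the helper below builds one pipe-delimited record. The function name `format_event` is hypothetical, and the field order is an assumption taken from the log sample shown later in this deck; the talk does not show the actual client code or the rsyslog configuration.

```shell
#!/bin/sh
# Hypothetical helper: build one pipe-delimited event record.
# Assumed field order (from the log sample in this deck):
# user_id|platform|action_type|object|object_id|secondary_object|sec_obj_id|timestamp
format_event() {
    # Optional 8th argument overrides the timestamp (handy for testing);
    # otherwise the current Unix time is used.
    ts="${8:-$(date +%s)}"
    printf '%s|%s|%s|%s|%s|%s|%s|%s\n' \
        "${1:-}" "${2:-}" "${3:-}" "${4:-}" "${5:-}" "${6:-}" "${7:-}" "$ts"
}

# Example: a user saving a product into a collection.
format_event 8524264 ipad SaveAction Product 5757428 Collection 29399687 1368341942
```

In a real deployment the line would be handed to the local syslog daemon rather than printed, so that rsyslog can buffer and forward it to the collectors.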
15. Now What?
■ So now we have 100s of files, closing in on
500GB of data
■ We want to ask some intelligent questions
■ For example: how many people who signed up
four weeks ago are still active? (cohort
retention)
■ How many products saved does it take for a
user to become engaged?
16. Let’s Dive Deeper
■ Here is an example of our log file (spaces/alignment added for readability)
user_id platform action_type   object  object_id secondary_object sec_obj_id timestamp
---------------------------------------------------------------------------------------
8524264|ipad    |SaveAction   |Product|5757428  |Collection      |29399687  |1368341942
7555287|android |SaveAction   |Product|5758908  |GiftsCollection |26680024  |1368341942
3924118|iphone  |SaveAction   |Product|1979020  |Collection      |29463107  |1368341942
1285811|ipad    |SessionAction|User   |1285811  |                |          |1368341942
8246365|ipod    |SaveAction   |Product|7930662  |Collection      |28523544  |1378895196
1233612|desktop |SessionAction|User   |1233612  |                |          |1378895196
9654098|desktop |PostAction   |Product|7962904  |Store           |158163    |1378895197
9654098|desktop |SaveAction   |Product|7962904  |GiftsCollection |34407722  |1378895197
843456 |iphone  |SessionAction|User   |843456   |                |          |1378895197
9005146|android |SaveAction   |Product|6389593  |GiftsCollection |32117206  |1378895197
6721497|desktop |CommentAction|Product|7930418  |Comment         |37304732  |1378895197
17. Parsing ASCII files is simple
■ What we get with this file format is simplicity
■ grep, sort, uniq, comm, awk, wc
■ These UNIX tools have been optimized for four decades! I challenge you to write a faster grep!
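To illustrate how far these tools go on this format, here is a small sketch that counts events per action type. The sample records are copied from the log excerpt above with the alignment spaces removed; the function names `sample_events` and `count_by_action` are made up for this example.

```shell
#!/bin/sh
# A few records from the log excerpt (alignment spaces removed).
sample_events() {
    cat <<'EOF'
8524264|ipad|SaveAction|Product|5757428|Collection|29399687|1368341942
7555287|android|SaveAction|Product|5758908|GiftsCollection|26680024|1368341942
1285811|ipad|SessionAction|User|1285811|||1368341942
9654098|desktop|PostAction|Product|7962904|Store|158163|1378895197
9654098|desktop|SaveAction|Product|7962904|GiftsCollection|34407722|1378895197
EOF
}

# Count events per action type (field 3), most frequent first.
# Output lines have the form "<action_type> <count>".
count_by_action() {
    cut -d'|' -f3 | sort | uniq -c | sort -rn | awk '{ print $2, $1 }'
}

sample_events | count_by_action
```

The same pipeline works unchanged on a full daily file: replace `sample_events` with `cat` of the log file.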
19. Let’s Ask Some Questions
■ How many unique users followed someone or something on iPad on 06/26/2013?
cat user_actions_20130626.log |
  awk -F'|' '$2 == "ipad" && $3 == "FollowAction" { print $1 }' |
  sort |
  uniq |
  wc -l
20. What About Registrations?
■ How many total user registrations happened across all platforms on the same day, 06/26/2013?
cat user_actions_20130626.log |
  grep -F -e '|RegisterAction|' |
  wc -l
21. How fast is it really?
■ It takes about 10 seconds to grep through a 1.5GB file (a single day of recorded events)
> time gunzip -c user_actions.log.20130512.gz |
>   /usr/bin/grep SaveAction | wc -l
......
real    0m9.584s
user    0m12.195s
sys     0m1.672s
22. Can we go back a whole year?
■ On one hand, we know how to do it...
■ The problem is: 10 seconds x 360 files
■ Sounds like a data warehouse!
/run query; /come back the next day
■ Now we are talking hours of parsing!
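The arithmetic behind that estimate can be spelled out as a quick back-of-envelope calculation, using only the two figures from the slides (about 10 seconds per daily file, 360 daily files). The variable names are illustrative.

```shell
#!/bin/sh
# Back-of-envelope estimate using the figures from the slides.
seconds_per_file=10
files=360

total_seconds=$((seconds_per_file * files))
echo "serial scan: ${total_seconds}s (~$((total_seconds / 60)) minutes) per question"
```

Every additional question costs another full pass over the files, which is how a handful of queries turns into hours when the files are scanned one after another.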
24. Map/Reduce
■ Decidedly, Map/Reduce requires a new
way of thinking
■ Today we have many related projects,
such as Hadoop, HDFS, Spark, Hive,
Pig
■ Which means that it also requires learning
these (somewhat) new tools
25. On Demand or Permanent?
■ With Hadoop, one practical question is that of infrastructure lifecycle:
■ One can create an “on-demand” Hadoop cluster to run analytics
■ The “on-demand” solution is cheap: once queried, the Hadoop cluster can be killed
■ But this requires copying lots (TBs) of data from storage (typically S3), which takes time
26. Static Hadoop Cluster
■ With a continuously running Hadoop
cluster, the biggest issue is cost
■ It’s very expensive to keep a large cluster
around, sitting on top of a copy of a giant
dataset
27. Enter Joyent’s Manta
■ Distributed object store, sort of like S3
■ UNIX-like file system semantics for objects, and supports directories (YES!!!!)
■ Native compute on top of objects!
■ Strongly consistent rather than eventually consistent
28. Detailed look at Manta later at Surge 2013
Mark Cavage and David Pacheco (Joyent) will discuss building Manta in their talk “Scaling the Unix Philosophy to Big Data” on Friday @ 10am
29. User Events → Joyent Manta
■ Instead of saving daily event logs to NFS,
we now push them as objects to Manta
■ One object = one file = one day of events
■ Let’s look at an example...
30. Uploading and Downloading
> mmkdir /wanelo/stor/user_actions
> mput -f user_actions.20130911 /wanelo/stor/user_actions/20130911
> mget /wanelo/stor/user_actions/20130911 > user_actions.20130911
32. Beyond Object Store
■ What makes Manta unique is native
compute on top of our objects
■ We submit a compute job to Manta
■ Manta creates many virtual instances in
seconds (or even milliseconds)
■ We even get root access!
■ We parse our event objects in parallel
33. Manta’s “Map/Reduce”
■ Streams objects into initial phase
■ Pipes output of initial phase into the
input of the next phase (like UNIX!)
■ Each phase is either one-to-one (map
phase), or many-to-one (reduce)
34. Manta’s “Map/Reduce”
[Diagram: three input objects → three filtered objects → one combined result; stages labeled map phase 1, map phase 2, reduce phase]
It’s very familiar, because it’s so similar to piping on a single machine
35. Real Example
■ Let’s ask a more computationally expensive question:
■ How many times was a store followed in the last three months?
36. Aggregating Store Follows
■ Map phase:
grep -F -e '|FollowAction|' |
  grep -F -e '|Store|' |
  wc -l
■ Reduce phase (sum up all the numbers):
awk '{ total += $1 } END { print total }'
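The two phases can be tried out on a single machine before submitting anything to Manta. The sketch below is a local simulation only: a loop over local files stands in for Manta running one map task per stored object, and the function names (`map_phase`, `reduce_phase`, `run_job`) and the demo records are made up for illustration. The talk does not show the exact job submission command, so none is given here.

```shell
#!/bin/sh
# Local, single-machine simulation of the two phases (no Manta required).

map_phase() {
    # Count store follows in one day's events, read from stdin.
    grep -F -e '|FollowAction|' | grep -F -e '|Store|' | wc -l
}

reduce_phase() {
    # Sum the per-object counts emitted by the map tasks.
    awk '{ total += $1 } END { print total + 0 }'
}

run_job() {
    # Usage: run_job FILE...
    # One map task per file; all map outputs are piped into one reducer.
    for f in "$@"; do
        map_phase < "$f"
    done | reduce_phase
}

# Tiny demo with two made-up daily files.
tmp=$(mktemp -d)
printf '%s\n' '1|ipad|FollowAction|Store|10|||1378895196' \
              '2|iphone|FollowAction|User|20|||1378895196' > "$tmp/day1"
printf '%s\n' '3|desktop|FollowAction|Store|11|||1378895197' \
              '4|android|SaveAction|Product|30|Collection|40|1378895197' \
              '5|ipad|FollowAction|Store|10|||1378895197' > "$tmp/day2"
run_job "$tmp/day1" "$tmp/day2"   # prints 3
rm -rf "$tmp"
```

Because each map task emits a single number, the reducer's input stays tiny no matter how large the daily files are.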
37. Cohort Retention Analysis
■ We can save output of map/reduce jobs
in another stored object
■ “Cohort” is a set of unique users sharing
a particular property
■ Let’s save a unique set of users who
registered between 21 and 28 days ago
into a temporary object
38. Cohort Retention Analysis, ctd
■ Map phase runs only on the 7 days of the given week:
awk -F'|' '{ if ($3 == "RegisterAction") { print $1 } }'
■ Reduce phase saves the result into a temporary object:
sort |
  uniq |
  mtee /wanelo/stor/tmp/cohort_user_ids
39. Cohort Retention Analysis, ctd
■ Now we just need to get unique users active this week, and intersect them with the temporary object
■ Map phase runs on the last 7 days:
awk -F'|' '{ print $1 }'
■ Reduce phase intersects:
sort |
  uniq > period_uniq_ids &&
  comm -12 period_uniq_ids \
    /assets/wanelo/stor/tmp/cohort_user_ids |
  wc -l
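The intersection step can be sketched locally with the same tools. This is an illustration under assumptions: the helper names (`unique_ids`, `retained_count`) and the demo records are invented, and local temporary files stand in for the Manta objects used in the talk.

```shell
#!/bin/sh
# Local sketch of the cohort intersection; file names are hypothetical.

unique_ids() {
    # Print the unique user ids (field 1) of events on stdin, optionally
    # restricted to one action type given as $1. LC_ALL=C keeps the sort
    # order consistent with what comm expects.
    awk -F'|' -v action="${1:-}" 'action == "" || $3 == action { print $1 }' |
        LC_ALL=C sort -u
}

retained_count() {
    # Usage: retained_count COHORT_IDS_FILE ACTIVE_IDS_FILE
    # comm -12 prints only the lines present in both sorted files.
    LC_ALL=C comm -12 "$1" "$2" | wc -l | tr -d ' '
}

tmp=$(mktemp -d)
# Registration week: users 1, 2 and 3 registered (user 9 did not).
printf '%s\n' '1|ipad|RegisterAction|User|1|||100' \
              '2|iphone|RegisterAction|User|2|||101' \
              '3|desktop|RegisterAction|User|3|||102' \
              '9|desktop|SaveAction|Product|5|Collection|6|103' |
    unique_ids RegisterAction > "$tmp/cohort_user_ids"
# Current week: users 2, 3 and 9 were active.
printf '%s\n' '2|iphone|SaveAction|Product|5|Collection|6|900' \
              '3|desktop|SessionAction|User|3|||901' \
              '3|desktop|SaveAction|Product|7|Collection|8|902' \
              '9|desktop|SessionAction|User|9|||903' |
    unique_ids > "$tmp/period_uniq_ids"
retained_count "$tmp/cohort_user_ids" "$tmp/period_uniq_ids"   # prints 2
rm -rf "$tmp"
```

`comm` requires both inputs to be sorted the same way, which is why both id lists go through the same `sort` invocation before being compared.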
40. Other Uses of Manta @ Wanelo
■ We can migrate user images to Manta instead of S3, and serve them via CDN
■ If we need to create a new image format, we submit a job that uses CLI tools to generate the new format or thumbnail size
■ We can (and do!) push database backups and PostgreSQL archive logs to Manta
41. Conclusion
■ We were able to create a very cost-efficient way to store a massive number of events
■ Manta allows us to perform complex algebraic queries on our event data, very quickly and cheaply