Proprietary and
Presenter: Konstantin Gredeskoul
CTO, Wanelo.com
Based on work of Atasay Gökkaya and other engineers
"It's a Unix System! I know this!"
Using Manta to Scale Event-based Data
Collection and Analysis
@kig
@kigster
■ Wanelo (“Wah-nee-lo”, from Want, Need, Love) is a global platform for all the world’s shopping
How Wanelo Works
■ Users find products on online stores
■ They post these products to Wanelo via a JavaScript “bookmarklet”
■ Others discover these products on Wanelo via feed, trending, search, etc.
■ Users then save the products they discovered to their own collections
Wanelo is a Social Network
■ Users can follow other users. Following is bi-directional, like Twitter, and public
■ Besides following other users, you can follow individual stores on Wanelo
■ The result is a personalized shopping feed, much like Twitter’s information feed
■ After seeing a product on Wanelo, users can buy the product on the original site
Mobile: iOS + Android
60K ratings
Backend Stack & Key Vendors
■ MRI Ruby 2.0 & Rails 3
■ PostgreSQL 9.2, solr, redis,
memcached, twemproxy, nginx, haproxy
■ Joyent Cloud, SmartOS
ZFS, ARC cache, raw I/O performance, SMF, Zones, DTrace
■ Joyent Manta: Analytics and Backups
■ Chef, Opscode Enterprise
Full server automation, zero manual installs
■ Images: AWS S3 behind Fastly CDN
■ Circonus, NewRelic, statsd, Boundary
Final word about Wanelo...
We are slightly obsessed with cat pictures =)
Recording User Events: Why?
■ Let’s say a user saves a product
■ Naturally, we create a row in our main data store (PostgreSQL)
■ But we also want to record this event in an append-only log table, for future analysis
■ In an ideal world, this append-only table has every user-generated event of interest
Hey, What’s the Scale Here?
■ 10M users
■ 7M products saved over 1B times
■ 200K+ stores
■ Backend peaks at 200,000 RPM (requests per minute)
■ Generating between 5M and 20M user events per day
Recording Events: Stupidly
■ We are just starting: what’s the simplest thing
we can do? Our traffic is still pretty low.
■ Let’s create a database table and append to
that. Simple? Yes.
■ Scalable? Hell No.
■ One month after launch, we hit the wall.
Let’s Scale Data Collection
■ OK, so inserting 10M records into PostgreSQL
per day is pretty stupid. Even I know that.
■ We looked around for various options. There
were many. Flume, Fluentd, Scribe. Meh.
■ We chose rsyslog: clients can buffer records,
send cheap UDP packets.
■ More than one log collector for redundancy
Scaling Event Data Collection
■ rsyslog rocks. We are now sending 20M events per day from 40+ hosts
■ rsyslog is dumping them into an ASCII pipe-delimited file
■ logadm rotates the file daily. We get a 1GB+ file per day of activity
■ We have solved the data collection problem for a long time, and very cheaply.
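As a rough illustration of what one pipe-delimited record looks like before it is handed to syslog, here is a minimal sketch. The helper name `format_event` is hypothetical, and the field order is taken from the sample log shown in this deck; how Wanelo's application actually built and sent its records is not shown in the slides.

```shell
# Hypothetical helper: join event fields into one pipe-delimited record.
# Assumed field order (from the sample log in this deck):
# user_id|platform|action_type|object|object_id|secondary_object|sec_obj_id|timestamp
format_event() {
    # "$*" joins the arguments with the first character of IFS;
    # the subshell keeps the IFS change local.
    ( IFS='|'; printf '%s\n' "$*" )
}

# Example record: user 8524264 saves product 5757428 to collection 29399687.
event_line=$(format_event 8524264 ipad SaveAction Product 5757428 Collection 29399687 1368341942)
echo "$event_line"
```

A line built this way could then be passed to a syslog client (for example the standard `logger` utility) and forwarded by rsyslog.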
Now What?
■ So now we have 100s of files, closing in on
500GB of data
■ We want to ask some intelligent questions
■ For example: how many people who signed up
four weeks ago are still active? (cohort
retention)
■ How many products saved does it take for a
user to become engaged?
Let’s Dive Deeper
■ Here is an example of our log file
(spaces/alignment added for readability)
user_id  platform  action_type  object  object_id  secondary_object  sec_obj_id  timestamp
---------------------------------------------------------------------------------
8524264|ipad   |SaveAction   |Product|5757428|Collection     |29399687|1368341942
7555287|android|SaveAction   |Product|5758908|GiftsCollection|26680024|1368341942
3924118|iphone |SaveAction   |Product|1979020|Collection     |29463107|1368341942
1285811|ipad   |SessionAction|User   |1285811|               |        |1368341942
8246365|ipod   |SaveAction   |Product|7930662|Collection     |28523544|1378895196
1233612|desktop|SessionAction|User   |1233612|               |        |1378895196
9654098|desktop|PostAction   |Product|7962904|Store          |158163  |1378895197
9654098|desktop|SaveAction   |Product|7962904|GiftsCollection|34407722|1378895197
843456 |iphone |SessionAction|User   |843456 |               |        |1378895197
9005146|android|SaveAction   |Product|6389593|GiftsCollection|32117206|1378895197
6721497|desktop|CommentAction|Product|7930418|Comment        |37304732|1378895197
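To make the record layout concrete, the sketch below counts events per `action_type` (the third field). The function name `count_by_action` is an assumption for illustration. Because the sample was padded with spaces for readability, the sketch strips spaces from the field before counting; real log lines presumably need no such step.

```shell
# Count events per action_type in a pipe-delimited log read from stdin.
# gsub removes the padding spaces that were added for readability.
count_by_action() {
    awk -F'|' '
        { gsub(/ /, "", $3); counts[$3]++ }
        END { for (a in counts) print a, counts[a] }
    ' | sort
}

# Three rows copied from the sample log.
sample='8524264|ipad   |SaveAction   |Product|5757428|Collection     |29399687|1368341942
7555287|android|SaveAction   |Product|5758908|GiftsCollection|26680024|1368341942
1285811|ipad   |SessionAction|User   |1285811|               |        |1368341942'

action_counts=$(printf '%s\n' "$sample" | count_by_action)
echo "$action_counts"
```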
Parsing ASCII files is simple
■ What we get with this file format is simplicity
■ grep, sort, uniq, comm, awk, wc
■ These UNIX tools have been optimized for four decades! I challenge you to write a faster grep!
Have YOU brushed up on your
AWK skillz?
Let’s Ask Some Questions
■ How many unique users followed someone or something on iPad on 06/26/2013?
awk -F'|' '$2 == "ipad" && $3 == "FollowAction" { print $1 }' \
    user_actions_20130626.log |
  sort |
  uniq |
  wc -l
What About Registrations?
■ How many total user registrations happened across all platforms on the same day, 06/26/2013?
cat user_actions_20130626.log |
  grep -F -e '|RegisterAction|' |
  wc -l
How fast is it really?
■ It takes about 10 seconds to grep through a
1.5GB (single day of recorded events) file
> time gunzip -c user_actions.log.20130512.gz |
>     /usr/bin/grep SaveAction | wc -l
......
real    0m9.584s
user    0m12.195s
sys     0m1.672s
Can we go back a whole year?
■ On one hand, we know how to do it...
■ The problem is: 10 seconds x 360 files
■ Sounds like a data warehouse!
/run query; /come back the next day
■ Now we are talking hours of parsing!
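The arithmetic behind this slide: at roughly 10 seconds per daily file, 360 files take about 3,600 seconds, or an hour, for a single question when scanned one after another. The sketch below shows that sequential single-machine approach. The function name `count_in_archives`, the file names, and the sample rows are assumptions for illustration.

```shell
# Count lines matching a fixed string across gzip-compressed daily logs,
# one file after another (the slow, single-machine way).
count_in_archives() {
    pattern=$1
    shift
    for archive in "$@"; do
        gunzip -c "$archive" | grep -F -c -e "$pattern"
    done | awk '{ total += $1 } END { print total + 0 }'
}

# Tiny stand-in for two days of logs (hypothetical data and file names).
workdir=$(mktemp -d)
printf '1|ipad|SaveAction|Product|10|Collection|100|1368341942\n' |
    gzip > "$workdir/day1.gz"
printf '2|iphone|SaveAction|Product|11|Collection|101|1368428342\n3|desktop|SessionAction|User|3|||1368428343\n' |
    gzip > "$workdir/day2.gz"

total_saves=$(count_in_archives '|SaveAction|' "$workdir/day1.gz" "$workdir/day2.gz")
echo "$total_saves"
rm -rf "$workdir"
```

Each file is independent of the others, which is why this work parallelizes so naturally.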
Map/Reduce
■ Google published this model in 2004
■ It describes a way to parallelize algorithms
across huge data sets
Map/Reduce
■ Decidedly, Map/Reduce requires a new
way of thinking
■ Today we have many related projects,
such as Hadoop, HDFS, Spark, Hive,
Pig
■ Which means that it also requires learning
these (somewhat) new tools
On Demand or Permanent?
■ With Hadoop, one practical question is that of infrastructure lifecycle:
■ One can create an “on-demand” Hadoop cluster to run analytics
■ The “on-demand” solution is cheap: once queried, the Hadoop cluster can be killed
■ But this requires copying lots (TBs) of data from storage (typically S3), which takes time
Static Hadoop Cluster
■ With a continuously running Hadoop
cluster, the biggest issue is cost
■ It’s very expensive to keep a large cluster
around, sitting on top of a copy of a giant
dataset
Enter Joyent’s Manta
■ Distributed Object Store, sort of like S3
■ UNIX-like file system semantics for
objects, and supports directories (YES!!!!)
■ Native compute on top of objects!
■ Strongly consistent instead of eventual
consistency
Detailed look at Manta later at Surge2013
Mark Cavage and David Pacheco (Joyent) will
discuss building Manta in “Scaling the Unix
Philosophy to Big Data” talk on Friday @ 10am
User Events → Joyent Manta
■ Instead of saving daily event logs to NFS,
we now push them as objects to Manta
■ One object = one file = one day of events
■ Let’s look at an example...
Uploading and Downloading
> mput -f user_actions.20130911 /wanelo/stor/user_actions/20130911
> mget /wanelo/stor/user_actions/20130911 > user_actions.20130911
> mmkdir /wanelo/stor/user_actions
Listing Uploaded User Events
> mls /wanelo/stor/user_actions
  ....
  20130909
  20130910
  20130911
  20130912
Beyond Object Store
■ What makes Manta unique is native
compute on top of our objects
■ We submit a compute job to Manta
■ Manta creates many virtual instances in
seconds (or even milliseconds)
■ We even get root access!
■ We parse our event objects in parallel
Manta’s “Map/Reduce”
■ Streams objects into the initial phase
■ Pipes the output of the initial phase into the input of the next phase (like UNIX!)
■ Each phase is either one-to-one (map phase) or many-to-one (reduce phase)
Manta’s “Map/Reduce”
[Diagram: three input objects are each turned into a filtered object, and the filtered objects are merged into one combined result. Phase labels: map phase 1, map phase 2, reduce phase]
It’s very familiar, because it’s so similar
to piping on a single machine
Real Example
■ Let’s ask a more computationally expensive question:
■ How many times was a store followed in the last three months?
Aggregating Store Follows
■ Map phase:
grep -F -e '|FollowAction|' |
  grep -F -e '|Store|' |
  wc -l
■ Reduce phase (sum up all the numbers):
awk '{ total += $1 } END { print total }'
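The same two phases can be tried on a single machine before being submitted as a job. The sketch below is a local simulation only: it runs the map command once per file and pipes all per-file counts into the reduce command, mimicking the one-to-one and many-to-one phases. The function names, file names, and sample rows (including the field layout of a FollowAction record) are assumptions for illustration, not Wanelo data.

```shell
# Map phase from the slide: runs once per input object (here: per file).
map_phase() {
    grep -F -e '|FollowAction|' | grep -F -e '|Store|' | wc -l
}

# Reduce phase from the slide: sums the per-object counts.
reduce_phase() {
    awk '{ total += $1 } END { print total + 0 }'
}

# Hypothetical two-day fixture; the real objects would live in Manta.
workdir=$(mktemp -d)
cat > "$workdir/20130910" <<'EOF'
11|ipad|FollowAction|Store|158163|||1378800000
12|iphone|FollowAction|User|11|||1378800001
13|desktop|SaveAction|Product|7962904|Collection|29399687|1378800002
EOF
cat > "$workdir/20130911" <<'EOF'
14|android|FollowAction|Store|158163|||1378886400
15|ipad|FollowAction|Store|200001|||1378886401
EOF

store_follows=$(
    for object in "$workdir"/*; do
        map_phase < "$object"
    done | reduce_phase
)
echo "$store_follows"
rm -rf "$workdir"
```

Locally the loop runs the map phase serially; the point of Manta is that each object's map phase can run in parallel next to the data.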
Cohort Retention Analysis
■ We can save output of map/reduce jobs
in another stored object
■ “Cohort” is a set of unique users sharing
a particular property
■ Let’s save a unique set of users who
registered between 21 and 28 days ago
into a temporary object
Cohort Retention Analysis, ctd
■ Map phase runs only on the 7 days of the given week
awk -F'|' '$3 == "RegisterAction" { print $1 }'
■ Reduce phase saves the result into a temporary object
sort |
  uniq |
  mtee /wanelo/stor/tmp/cohort_user_ids
Cohort Retention Analysis, ctd
■ Now we just need to get the unique users active this week, and intersect them with the temporary object
■ Map phase runs on the last 7 days
awk -F'|' '{ print $1 }'
■ Reduce phase intersects
sort |
  uniq > period_uniq_ids &&
  comm -12 period_uniq_ids \
    /assets/wanelo/stor/tmp/cohort_user_ids |
  wc -l
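The whole cohort calculation can be rehearsed locally with the same tools. This is a minimal sketch under stated assumptions: the two input files and their rows are made up, and local files stand in for the Manta objects. `comm -12` prints only the lines common to both inputs, and it expects both to be sorted under the same collation, hence the pinned locale.

```shell
# comm needs both inputs sorted under the same collation, so pin the locale.
LC_ALL=C
export LC_ALL

workdir=$(mktemp -d)

# Hypothetical events from the registration week (the cohort).
cat > "$workdir/cohort_week" <<'EOF'
101|ipad|RegisterAction|User|101|||1376000000
102|iphone|RegisterAction|User|102|||1376000100
103|desktop|RegisterAction|User|103|||1376000200
101|ipad|SaveAction|Product|5757428|Collection|29399687|1376000300
EOF

# Hypothetical events from the most recent week.
cat > "$workdir/this_week" <<'EOF'
101|ipad|SaveAction|Product|5758908|Collection|29399687|1378000000
103|desktop|SessionAction|User|103|||1378000100
103|desktop|SaveAction|Product|1979020|Collection|29463107|1378000200
204|android|SessionAction|User|204|||1378000300
EOF

# Step 1: unique user ids that registered in the cohort week.
awk -F'|' '$3 == "RegisterAction" { print $1 }' "$workdir/cohort_week" |
    sort -u > "$workdir/cohort_user_ids"

# Step 2: unique user ids active this week.
awk -F'|' '{ print $1 }' "$workdir/this_week" |
    sort -u > "$workdir/period_uniq_ids"

# Step 3: ids present in both files are the retained users.
cohort_size=$(wc -l < "$workdir/cohort_user_ids" | tr -d ' ')
retained=$(comm -12 "$workdir/period_uniq_ids" "$workdir/cohort_user_ids" |
    wc -l | tr -d ' ')
echo "retained $retained of $cohort_size"
rm -rf "$workdir"
```

Dividing the retained count by the cohort size gives the retention rate for that cohort.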
Other Uses of Manta @ Wanelo
■ We can migrate user images to Manta instead of S3, and serve them via CDN
■ If we need to create a new image format, we submit a job that uses CLI tools to generate the new format or thumbnail size
■ We can (and do!) push database backups and PostgreSQL archive logs to Manta
Conclusion
■ We were able to create a very cost-efficient way to store massive amounts of events
■ Manta allows us to perform complex algebraic queries on our event data, quickly and cheaply
And we are just scratching the surface of
what’s possible with Manta...
Thanks!
apidocs.joyent.com/manta
github.com/wanelo
github.com/wanelo-chef
Wanelo’s technical blog:
building.wanelo.com
@kig
@kigster

Using Joyent Manta to Scale Event-based Data Collection and Analysis at Wanelo

  • 1. Proprietary and Presenter: Konstantin Gredeskoul CTO, Wanelo.com Based on work of Atasay Gökkaya and other engineers "It's a Unix System! I know this!" Using Manta to Scale Event-based Data Collection and Analysis @kig @kigster
  • 2. Proprietary and ■ Wanelo (“Wah-nee-lo” from Want, Need Love) is a global platform for all the world’s shopping
  • 3. Proprietary and ■ Users find products on online stores ■ They post these products to Wanelo via, a javascript “bookmarklet” ■ Others discover these products on Wanelo via feed, trending, search, etc ■ Users then save products they discovered to their own collections How Wanelo Works
  • 5. Proprietary and ■ Users can follow other users. Following is bi-directional, like Twitter, and public ■ Besides following other users, you can follow individual stores on wanelo ■ Result is a personalized shopping feed, much like Twitter’s information feed ■ After seeing a product on Wanelo, users can buy the product on the original site Wanelo is a Social Network
  • 6. Proprietary and Mobile: iOS + Android 60K ratings
  • 7. Backend Stack & Key Vendors Proprietary and ■ MRI Ruby 2.0 & Rails 3 ■ PostgreSQL 9.2, solr, redis, memcached, twemproxy, nginx, haproxy ■ Joyent Cloud, SmartOS ZFS, ARC Cache, raw IO performance, SMF, Zones, dTrace ■ Joyent Manta: Analytics and Backups ■ Chef, Opscode Enterprise Full server automation, zero manual installs ■ Images: AWS S3 behind Fastly CDN ■ Circonus, NewRelic, statsd, Boundary
  • 8. Final word about Wanelo... Proprietary and We are slightly obsessed with cat pictures =)
  • 9. Recording User Events: Why? Proprietary and ■ Let’s say user saves a product ■ Naturally we create a row in our main data store (PostgreSQL) ■ But we also want to record this event to an append-only log table, for future analysis ■ In the ideal world, this append-only table has every user-generated event of interest
  • 10. Hey, What’s the Scale Here? ■ 10M users ■ 7M products saved over 1B times ■ 200K+ stores ■ Backend peaks at 200,000 requests per minute (RPM) ■ Generating between 5M and 20M user events per day
  • 11. Recording Events: Stupidly ■ We are just starting: what’s the simplest thing we can do? Our traffic is still pretty low. ■ Let’s create a database table and append to that. Simple? Yes. ■ Scalable? Hell No. ■ One month after launch, we hit the wall.
  • 12. Let’s Scale Data Collection ■ OK, so inserting 10M records into PostgreSQL per day is pretty stupid. Even I know that. ■ We looked around for various options. There were many. Flume, Fluentd, Scribe. Meh. ■ We chose rsyslog: clients can buffer records and send cheap UDP packets. ■ More than one log collector for redundancy
  • 13. Scaling Event Data Collection ■ rsyslog rocks. We are now sending 20M events per day from 40+ hosts ■ rsyslog is dumping them into an ASCII pipe-delimited file ■ logadm rotates the file daily. We get a 1GB+ file per day of activity ■ We have solved the data collection problem for a long time, and very cheaply.
  • 15. Now What? ■ So now we have 100s of files, closing in on 500GB of data ■ We want to ask some intelligent questions ■ For example: how many people who signed up four weeks ago are still active? (cohort retention) ■ How many saved products does it take for a user to become engaged?
  • 16. Let’s Dive Deeper ■ Here is an example of our log file (spaces/alignment added for readability)
      user_id platform action_type   object  object_id secondary_object sec_obj_id timestamp
      ---------------------------------------------------------------------------------------
      8524264|ipad    |SaveAction   |Product|5757428  |Collection      |29399687  |1368341942
      7555287|android |SaveAction   |Product|5758908  |GiftsCollection |26680024  |1368341942
      3924118|iphone  |SaveAction   |Product|1979020  |Collection      |29463107  |1368341942
      1285811|ipad    |SessionAction|User   |1285811  |                |          |1368341942
      8246365|ipod    |SaveAction   |Product|7930662  |Collection      |28523544  |1378895196
      1233612|desktop |SessionAction|User   |1233612  |                |          |1378895196
      9654098|desktop |PostAction   |Product|7962904  |Store           |158163    |1378895197
      9654098|desktop |SaveAction   |Product|7962904  |GiftsCollection |34407722  |1378895197
      843456 |iphone  |SessionAction|User   |843456   |                |          |1378895197
      9005146|android |SaveAction   |Product|6389593  |GiftsCollection |32117206  |1378895197
      6721497|desktop |CommentAction|Product|7930418  |Comment         |37304732  |1378895197
  • 17. Parsing ASCII files is simple ■ What we get with this file format is simplicity ■ grep, sort, uniq, comm, awk, wc ■ These UNIX tools have been optimized for four decades! I challenge you to write a faster grep!
  • 18. Have YOU brushed up on your AWK skillz?
  • 19. Let’s Ask Some Questions ■ How many unique users followed someone or something on iPad on 06/26/2013?
      cat user_actions_20130626.log | awk -F'|' '{ if ($2 == "ipad" && $3 == "FollowAction") { print $1 } }' | sort | uniq | wc -l
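To make the pipeline above concrete, here is a minimal sketch that runs the same filter over a tiny made-up log. The sample records, user ids, and the `sample_log` and `ipad_followers` names are assumptions for illustration only; the real files are unpadded pipe-delimited records as described on the earlier slide. `sort -u` stands in for `sort | uniq`.

```shell
# Hypothetical sample data in the pipe-delimited layout from the talk
# (no alignment padding); all ids below are made up.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
1001|ipad|FollowAction|User|2001|||1372230000
1001|ipad|FollowAction|Store|3001|||1372230060
1002|ipad|FollowAction|User|2002|||1372230120
1003|iphone|FollowAction|User|2003|||1372230180
1004|ipad|SaveAction|Product|4001|Collection|5001|1372230240
EOF

# Print the user_id of every iPad follow event, then count distinct ids.
# tr strips the padding some wc implementations add around the number.
ipad_followers=$(awk -F'|' '$2 == "ipad" && $3 == "FollowAction" { print $1 }' "$sample_log" \
  | sort -u | wc -l | tr -d ' ')

echo "unique iPad followers: $ipad_followers"
rm -f "$sample_log"
```

User 1001 follows twice but is counted once, which is the point of deduplicating before `wc -l`.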
  • 20. What About Registrations? ■ How many total user registrations happened across all platforms on the same day, 06/26/2013?
      cat user_actions_20130626.log | grep -F -e '|RegisterAction|' | wc -l
  • 21. How fast is it really? ■ It takes about 10 seconds to grep through a 1.5GB file (a single day of recorded events)
      > time gunzip -c user_actions.log.20130512.gz |
      >     /usr/bin/grep SaveAction | wc -l
      ......
      real    0m9.584s
      user    0m12.195s
      sys     0m1.672s
  • 22. Can we go back a whole year? ■ On one hand, we know how to do it... ■ The problem is: 10 seconds x 360 files ■ Sounds like a data warehouse! /run query; /come back the next day ■ Now we are talking hours of parsing!
  • 23. Map/Reduce ■ Google published this model in 2004 ■ It describes a way to parallelize algorithms across huge data sets
  • 24. Map/Reduce ■ Decidedly, Map/Reduce requires a new way of thinking ■ Today we have many related projects, such as Hadoop, HDFS, Spark, Hive, Pig ■ Which means that it also requires learning these (somewhat) new tools
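As a back-of-the-envelope check of the "10 seconds x 360 files" estimate, the arithmetic can be sketched as follows. It assumes a single sequential pass at roughly the 10 seconds per daily file measured on the previous slide; the variable names are illustrative.

```shell
# Rough sequential cost of one simple grep over a year of daily files.
seconds_per_file=10   # measured above for one ~1.5GB daily file
files=360             # roughly one file per day for a year

total_seconds=$((seconds_per_file * files))
total_minutes=$((total_seconds / 60))

echo "one pass: $total_seconds seconds ($total_minutes minutes)"
```

That is about an hour for a single trivial query on one machine, before any sorting or multi-pass analysis is added.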
  • 25. On Demand or Permanent? ■ With Hadoop, one practical question is that of infrastructure lifecycle ■ One can create an “on-demand” Hadoop cluster to run analytics ■ The “on-demand” solution is cheap: once queried, the Hadoop cluster can be killed ■ But this requires copying lots (TBs) of data from storage (typically S3), which takes time
  • 26. Static Hadoop Cluster ■ With a continuously running Hadoop cluster, the biggest issue is cost ■ It’s very expensive to keep a large cluster around, sitting on top of a copy of a giant dataset
  • 27. Enter Joyent’s Manta ■ Distributed Object Store, sort of like S3 ■ UNIX-like file system semantics for objects, and supports directories (YES!!!!) ■ Native compute on top of objects! ■ Strongly consistent instead of eventually consistent
  • 28. Detailed look at Manta later at Surge2013: Mark Cavage and David Pacheco (Joyent) will discuss building Manta in the “Scaling the Unix Philosophy to Big Data” talk on Friday @ 10am
  • 29. User Events → Joyent Manta ■ Instead of saving daily event logs to NFS, we now push them as objects to Manta ■ One object = one file = one day of events ■ Let’s look at an example...
  • 30. Uploading and Downloading
      > mmkdir /wanelo/stor/user_actions
      > mput -f user_actions.20130911 /wanelo/stor/user_actions/20130911
      > mget /wanelo/stor/user_actions/20130911 > user_actions.20130911
  • 31. Listing Uploaded User Events
      > mls /wanelo/stor/user_actions
      ....
      20130909
      20130910
      20130911
      20130912
  • 32. Beyond Object Store ■ What makes Manta unique is native compute on top of our objects ■ We submit a compute job to Manta ■ Manta creates many virtual instances in seconds (or even milliseconds) ■ We even get root access! ■ We parse our event objects in parallel
  • 33. Manta’s “Map/Reduce” ■ Streams objects into the initial phase ■ Pipes the output of the initial phase into the input of the next phase (like UNIX!) ■ Each phase is either one-to-one (map phase), or many-to-one (reduce)
  • 34. Manta’s “Map/Reduce” [Diagram: three input objects each yield a filtered object, and the filtered objects are combined into one result; stages labeled map phase 1, map phase 2, reduce phase] ■ It’s very familiar, because it’s so similar to piping on a single machine
  • 35. Real Example ■ Let’s ask a more computationally expensive question: ■ How many times was a store followed in the last three months?
  • 36. Aggregating Store Follows ■ Map phase:
      grep -F -e '|FollowAction|' | grep -F -e '|Store|' | wc -l
    ■ Reduce phase (sum up all the numbers):
      awk '{ total += $1 } END { print total }'
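The shape of this job can be tried on a single machine before submitting anything to Manta. The sketch below is a local stand-in, not a Manta job: a plain `for` loop plays the role of the map phase running once per object, and the two daily files, their contents, and the `map_phase` and `reduce_phase` function names are made up for illustration.

```shell
# Two hypothetical daily event files standing in for stored objects.
workdir=$(mktemp -d)
cat > "$workdir/20130909" <<'EOF'
1001|ipad|FollowAction|Store|3001|||1378700000
1002|desktop|FollowAction|User|2002|||1378700060
1003|iphone|FollowAction|Store|3002|||1378700120
EOF
cat > "$workdir/20130910" <<'EOF'
1004|android|FollowAction|Store|3001|||1378786400
1005|desktop|SaveAction|Product|4001|Collection|5001|1378786460
EOF

# Map phase: reads one object on stdin, emits one count.
map_phase() {
  grep -F -e '|FollowAction|' | grep -F -e '|Store|' | wc -l
}

# Reduce phase: reads all per-object counts, emits their sum.
reduce_phase() {
  awk '{ total += $1 } END { print total }'
}

# Locally the map runs sequentially; the talk's point is that Manta
# runs this step in parallel, one instance per object.
store_follows=$(for f in "$workdir"/*; do map_phase < "$f"; done | reduce_phase)

echo "store follows: $store_follows"
rm -rf "$workdir"
```

Because each map invocation emits only a single number, the reduce phase has almost no data to move, whatever the size of the input objects.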
  • 37. Cohort Retention Analysis ■ We can save the output of map/reduce jobs in another stored object ■ A “cohort” is a set of unique users sharing a particular property ■ Let’s save a unique set of users who registered between 21 and 28 days ago into a temporary object
  • 38. Cohort Retention Analysis, ctd ■ Map phase runs only on the 7 days of the given week:
      awk -F'|' '{ if ($3 == "RegisterAction") { print $1 } }'
    ■ Reduce phase saves the result into a temporary object:
      sort | uniq | mtee /wanelo/stor/tmp/cohort_user_ids
  • 39. Cohort Retention Analysis, ctd ■ Now we just need to get the unique users active this week, and intersect them with the temporary object ■ Map phase runs on the last 7 days:
      awk -F'|' '{ print $1 }'
    ■ Reduce phase intersects:
      sort | uniq > period_uniq_ids && comm -12 period_uniq_ids /assets/wanelo/stor/tmp/cohort_user_ids | wc -l
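The two cohort steps can be rehearsed locally on toy data. This is a sketch under stated assumptions: the two log files, their records, and the local file names are invented, and ordinary files stand in for the Manta objects written by `mtee` and read from `/assets`. The one real constraint it illustrates is that `comm -12` prints only lines common to both inputs and expects both to be sorted the same way, which `sort -u` guarantees here.

```shell
workdir=$(mktemp -d)

# Hypothetical events from the cohort's registration week.
cat > "$workdir/cohort_week.log" <<'EOF'
1001|ipad|RegisterAction|User|1001|||1376000000
1002|desktop|RegisterAction|User|1002|||1376000100
1003|iphone|RegisterAction|User|1003|||1376000200
1001|ipad|SaveAction|Product|4001|Collection|5001|1376000300
EOF

# Hypothetical events from the current week.
cat > "$workdir/this_week.log" <<'EOF'
1001|ipad|SessionAction|User|1001|||1378400000
1003|iphone|SaveAction|Product|4002|Collection|5002|1378400100
1003|iphone|SessionAction|User|1003|||1378400200
1009|desktop|SessionAction|User|1009|||1378400300
EOF

# Step 1: unique ids of users who registered in the cohort week.
awk -F'|' '$3 == "RegisterAction" { print $1 }' "$workdir/cohort_week.log" \
  | sort -u > "$workdir/cohort_user_ids"

# Step 2: unique ids of users with any activity this week.
awk -F'|' '{ print $1 }' "$workdir/this_week.log" \
  | sort -u > "$workdir/period_uniq_ids"

# Step 3: the intersection is the retained part of the cohort.
cohort_size=$(wc -l < "$workdir/cohort_user_ids" | tr -d ' ')
retained=$(comm -12 "$workdir/period_uniq_ids" "$workdir/cohort_user_ids" | wc -l | tr -d ' ')

echo "retained $retained of $cohort_size registered users"
rm -rf "$workdir"
```

User 1009 is active this week but did not register in the cohort week, so the intersection drops that id.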
  • 40. Other Uses of Manta @ Wanelo ■ We can migrate user images to Manta instead of S3, and serve them via CDN ■ If we need to create a new image format, we submit a job that uses CLI tools to generate the new format or thumbnail size ■ We can (and do!) push database backups and PostgreSQL archive logs to Manta
  • 41. Conclusion ■ We were able to create a very cost-efficient way to store a massive number of events ■ Manta allows us to perform complex algebraic queries on our event data, both quickly and cheaply
  • 42. And we are just scratching the surface of what’s possible with Manta...