This talk was presented by Yongsheng Wu, head of big data and ML platform at Pinterest, at the Alluxio Bay Area meetup.
Yongsheng shares Pinterest's journey to build a fast and scalable big data and ML platform on AWS that handles Pinterest's request volume and data complexity at scale. He covers the platform's requirements, the challenges encountered, the technologies chosen, and the tradeoffs made.
6. Mission
Provide a highly scalable, reliable, secure, performant, efficient and
delightful-to-use big data and machine learning platform to enable rapid
product innovation and help make Pinterest a thriving business.
Vision
A big data and machine learning platform at scale enables every single
engineer at Pinterest to derive trustworthy, actionable insights and
apply ML to solve complex problems with ease and confidence.
8. Principles
● Put engineers first - make the platform delightful-to-use for all
engineers at Pinterest
● Keep it simple, get it right - build a simple yet sufficient
platform
● Enable speed and quality - enable all engineers at Pinterest to
move fast with scalable, reliable, secure, performant and efficient
solutions made easy by the platform
● Build with reusability and for reusability - embrace open
source technology, build with lego blocks and provide lego blocks to
all engineers at Pinterest
13. Feature Platform - Today
Pinterest’s data graph: Pin/Image/Board/User...
[Diagram: signals attached to the data graph via xJoin]
● pin’s text, texts, text languages, text scores, text tokens
● image info, video info, visual signal
● SEO signal, landing page
● link language, link country, link perf, link scores
● safe search, spam
● catvec_v0, pin’s catvec_v0, catvec_v1, pin’s catvec_v1
● topicvec_v4, pin’s topicvec_v4, country vecs
● annot_embedding v3, annotation_v2, annotation_v3, annotation_v4
14. Galaxy: next-gen feature platform
[Diagram: each developer contributes a code module together with its spec and metadata]
● Metadata-driven framework & dev API
● incremental dataflow execution engine
● signal data store (“column”-partitioned) and metadata repo (registry, stats)
● dependency management
● governance: enforcement & tracking
● Signal Access & Serving: retrieval API, serving, acl, ...
○ offline consumers (ML model training)
○ online consumers (ML model serving)
Layer labels in the diagram: ML Platform, BDP
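The dependency management and incremental execution ideas on this slide can be illustrated with a small sketch. It is not Galaxy's API: `SignalSpec`, `SignalRegistry`, `affected_by`, and the dependencies between the example signals are all hypothetical, chosen only to show how a metadata registry can reject undeclared dependencies and work out which signals need recomputing after an upstream change.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # Python 3.9+


@dataclass(frozen=True)
class SignalSpec:
    """Hypothetical metadata record for one signal."""
    name: str
    owner: str
    depends_on: tuple = ()


class SignalRegistry:
    """Toy metadata repo: stores specs and answers dependency questions."""

    def __init__(self):
        self._specs = {}

    def register(self, spec):
        # Enforcement: a signal may only depend on already registered signals.
        missing = [d for d in spec.depends_on if d not in self._specs]
        if missing:
            raise ValueError(f"{spec.name} depends on unknown signals: {missing}")
        self._specs[spec.name] = spec

    def execution_order(self):
        # Dependencies always come before the signals that consume them.
        graph = {s.name: set(s.depends_on) for s in self._specs.values()}
        return list(TopologicalSorter(graph).static_order())

    def affected_by(self, changed):
        # Incremental execution: recompute only the changed signals
        # and everything downstream of them.
        order = self.execution_order()
        dirty = set(changed)
        for name in order:
            if dirty & set(self._specs[name].depends_on):
                dirty.add(name)
        return [name for name in order if name in dirty]


registry = SignalRegistry()
registry.register(SignalSpec("pin_text", owner="team_a"))
registry.register(SignalSpec("image_info", owner="team_b"))
registry.register(SignalSpec("text_tokens", owner="team_a", depends_on=("pin_text",)))
registry.register(SignalSpec("text_languages", owner="team_c", depends_on=("text_tokens",)))
```

In this sketch, a change to `pin_text` marks `text_tokens` and `text_languages` for recomputation, while `image_info` is left alone.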
21. Much more complex in practice
Use cases: Related Pins, Ads, Home Feed
● Stack 1: Learner 1, Feature Extraction 1, Parameter Autotuning, Serving & Logging, Automation
● Stack 2: Learner 2, Feature Extraction 2, Data Monitoring, Serving & Logging, Automation
● Stack 3: Learner 3, Feature Extraction 3, Data Monitoring, Serving & Logging, Automation
● Distributed Training appears in only two of the stacks
Similar components, no sharing!
Incomplete stacks
22. Unified ML Platform
● Shared components: Learner, Feature Extraction, Parameter Autotuning, Data Monitoring, Distributed Training, Serving & Logging, Automation
● Use cases: Related Pins, Ads, Home Feed
● New use cases: Search, NUX Topic Picker, Notifications
● Client teams focus on business problems, not infra problems.
● Platform team specializes in infra problems.
● Quick to build new ML applications.
23. Unified Big Data ML Platform
● Speed & quality
● Single Use Case
○ 0 -> 1 made fast, easy and robust - create an ML model to solve a complex problem
○ 1 -> N made automated - such an ML model is continuously trained, improved, and deployed
● Many Use Cases on the Platform
○ N -> N² - most ML models are trained and served by the platform
25. Scorpion Training & Catwalk
● Scorpion Training: abstracts the user from the specific trainer package used
○ Tensorflow, XGBoost
○ future: other packages
● Catwalk: enables running training jobs on a distributed cluster
● Mesos: cluster resource management (CPUs, RAM, GPUs)
● Kubernetes: to replace Mesos in 2018
[Diagram: the layers are connected by a "runs on" arrow]
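One way to picture "abstracts the user from the specific trainer package" is a common trainer interface with pluggable backends selected by configuration. The sketch below is an assumption-laden illustration, not Scorpion's interface: `Trainer`, `TRAINERS`, and `run_training` are hypothetical names, and the two toy backends are stand-ins for real wrappers around packages such as Tensorflow or XGBoost.

```python
from abc import ABC, abstractmethod
from collections import Counter


class Trainer(ABC):
    """Hypothetical common interface that every backend implements."""

    @abstractmethod
    def fit(self, features, labels):
        """Train on the data and return a predict(feature_row) callable."""


class MajorityClassTrainer(Trainer):
    """Stand-in backend: always predicts the most common training label."""

    def fit(self, features, labels):
        majority = Counter(labels).most_common(1)[0][0]
        return lambda row: majority


class ThresholdTrainer(Trainer):
    """Stand-in backend: predicts 1 when the first feature exceeds its training mean."""

    def fit(self, features, labels):
        threshold = sum(row[0] for row in features) / len(features)
        return lambda row: int(row[0] > threshold)


# The job config names a backend; user code never imports the package itself.
TRAINERS = {"majority": MajorityClassTrainer, "threshold": ThresholdTrainer}


def run_training(job_config, features, labels):
    trainer_cls = TRAINERS.get(job_config["trainer"])
    if trainer_cls is None:
        raise ValueError(f"unknown trainer: {job_config['trainer']}")
    return trainer_cls().fit(features, labels)


features = [[0.0], [1.0], [2.0], [3.0]]
labels = [0, 1, 1, 1]
model = run_training({"trainer": "threshold"}, features, labels)
```

Switching packages then means changing the `trainer` value in the job config; adding a future package means registering one more entry in `TRAINERS`.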
30. Pixie: Graph walks
● The greatest asset of Pinterest is our pin-to-board graph
○ It captures relationships between pins (how objects are organized into collections)
○ Can be used to capture multiple different interactions: pins to boards, clicks by user,...
● We use Pixie for candidate generation: how to quickly go from 2B pins to 1k pins so that ML models can then score each pin separately
● Represent the user as a (set of) pin(s) Q and do a random walk from Q:
○ Bias the walk towards fresh pins, pins in the user’s local language, pins that males/females like
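The walk described above can be sketched roughly as follows. This is a minimal illustration, not Pixie's actual implementation: the function name `pixie_candidates`, the dict-based graph, the restart probability, and the `bias` callback are all assumptions made for the example. The walk alternates pin -> board -> pin, counts how often each pin is visited, and returns the most visited pins as candidates.

```python
import random
from collections import Counter


def pixie_candidates(pin_to_boards, board_to_pins, query_pins,
                     total_steps=10_000, restart_prob=0.5,
                     bias=None, top_k=5, seed=0):
    """Return up to top_k pins most visited by random walks started at query_pins.

    bias(pin) must return a positive weight; larger weights make the walk
    more likely to step onto that pin (e.g. fresh pins, matching language).
    """
    rng = random.Random(seed)
    bias = bias or (lambda pin: 1.0)
    query_pins = list(query_pins)
    visits = Counter()
    steps_per_query = total_steps // len(query_pins)

    for query in query_pins:
        current = query
        for _ in range(steps_per_query):
            boards = pin_to_boards.get(current)
            if not boards:
                current = query  # dead end: restart from the query pin
                continue
            board = rng.choice(boards)        # pin -> board
            pins = board_to_pins[board]
            weights = [bias(p) for p in pins]
            current = rng.choices(pins, weights=weights, k=1)[0]  # board -> pin
            visits[current] += 1
            if rng.random() < restart_prob:
                current = query  # restarts keep the walk close to the query

    for query in query_pins:
        visits.pop(query, None)  # do not recommend the query pins themselves
    return [pin for pin, _ in visits.most_common(top_k)]


# Tiny hypothetical graph with two disconnected groups of pins.
pin_to_boards = {"p1": ["b1"], "p2": ["b1", "b2"], "p3": ["b2"],
                 "p8": ["b9"], "p9": ["b9"]}
board_to_pins = {"b1": ["p1", "p2"], "b2": ["p2", "p3"], "b9": ["p8", "p9"]}
candidates = pixie_candidates(pin_to_boards, board_to_pins, ["p1"])
```

Because the walk only follows edges in the graph, pins that share no board path with the query (`p8`, `p9` here) are never visited, which is how the candidate set shrinks before any ML scoring happens.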
33. Big Data Platform
● [Product Enablement] Streaming engines
○ Spark Structured Streaming
○ Flink
○ ...
● [Scalability] Spinner - next gen workflow engine
● [Performance] Hive on Tez
● [Efficiency] Hadoop auto-scaling
● [Future Proofing] Spark on Kubernetes
● [Future Proofing] Hadoop 3.0
34. Galaxy: next-gen feature platform (diagram repeated from slide 14)
35. ML Platform
[Diagram: platform components, color-coded by owning team]
● Developer Frontend
● Linchpin DSL
● Learner
● off-the-shelf solutions: Tensorflow ...
● Scorpion Training
● Scorpion Serving
● Feature Extraction
● Feature Analysis
● Real-time Feature Sources
● Counting Service
● Parameter Autotuning
● Data Monitoring
● Model Eval & Comparison
● Model Version Management
● Model Deploy
● Model Runtime Validation
● Model Serving
● Logging
● Incremental & Real-Time Training Automation
Team key: ML Serving Systems, ML Training Platform
37. Key Learnings
● Unified big data ML platform greatly accelerates
product innovations
● Data lineage, quality and democracy are vital to
organization scalability
● Speed, quality & delightful-to-use