10.
Reason 1: minimizing training time
[Timeline figure: for Model 1, collecting the dataset, training, and serving happen in sequence; the gap before serving is the time-to-serve delay]
11.
Training time vs online performance
▪ Most (all?) recommendation algorithms need to predict
future behavior from past information
▪ If model training takes days, it might miss out on important
changes
▪ New items being introduced
▪ Popularity swings
▪ Changes in underlying feature distributions
▪ Time-to-serve can be a key factor in how good the
recommendations are online
12.
Training time vs experimentation speed
▪ Faster training time
=> more offline experiments and iterations
=> better models
▪ Many other factors at play (like modularity of the ML
framework), but training time is a key one
▪ How quickly can you iterate through e.g. model architectures
if training a model takes days?
13.
Reason 2: increasing dataset size
▪ If your model is complex enough (trees, DNNs, …) more data
could help
▪ … But this will have an impact on the training time
▪ … Which in turn could have a negative impact on
time-to-serve delay and experimentation speed
▪ And eventually you hit hard limits
15.
Topic-sensitive PageRank
▪ Popular graph diffusion algorithm
▪ Captures vertex importance with regard to a particular
source vertex
▪ Easy to distribute using Spark and GraphX
▪ Fast distributed implementation contributed by Netflix
(coming up in Spark 2.1!); a usage sketch follows below
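A minimal usage sketch, not the talk's actual code: the toy edge list, source vertex id, tolerance, and reset probability are made up for illustration, and only the single-source GraphX entry point is shown (the multi-source Spark 2.1 variant is referenced in a comment, under the name I recall for it).

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.{SparkConf, SparkContext}

object PersonalizedPageRankOnGraphX {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("topic-sensitive-pagerank").setMaster("local[*]"))

    // Toy graph; in the talk's example the source would be the "Seattle" vertex.
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1), Edge(1L, 4L, 1)))
    val graph = Graph.fromEdges(edges, 0)

    // Personalized PageRank from vertex 1L: iterate until ranks move by less than 1e-4,
    // restarting the walk at the source with probability 0.15.
    val ranks = graph.personalizedPageRank(1L, 0.0001, 0.15).vertices
    ranks.collect().sortBy(-_._2).foreach { case (id, rank) => println(s"$id -> $rank") }

    // Spark 2.1 also adds a variant that propagates several source vertices in one pass
    // (staticParallelPersonalizedPageRank, if I recall the name correctly).
    sc.stop()
  }
}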
16.
Iteration 0
We start by activating a single node, "Seattle".
[Graph figure: "Seattle" linked to neighboring vertices by edges labeled "related to", "shot in", "featured in", and "cast"]
17.
Iteration 1
With some probability we follow outbound edges; otherwise we go back to the origin (a toy sketch of this walk follows below).
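A toy single-machine sketch of this random walk with restart, in plain Scala; the adjacency map, restart probability, and iteration count are illustrative choices, not Netflix's implementation.

object TinyPersonalizedPageRank {
  /** Random walk with restart: follow an outbound edge with probability (1 - alpha),
    * otherwise jump back to the origin. Iterated enough times, this gives the
    * topic-sensitive PageRank scores with respect to `origin`. */
  def run(outLinks: Map[Int, Seq[Int]], origin: Int,
          alpha: Double = 0.15, iterations: Int = 30): Map[Int, Double] = {
    val vertices = (outLinks.keys ++ outLinks.values.flatten).toSeq.distinct
    // Iteration 0: all of the probability mass sits on the origin ("Seattle") vertex.
    var rank = vertices.map(v => v -> (if (v == origin) 1.0 else 0.0)).toMap

    for (_ <- 1 to iterations) {
      val spread = scala.collection.mutable.Map[Int, Double]().withDefaultValue(0.0)
      for ((v, r) <- rank) {
        val outs = outLinks.getOrElse(v, Seq.empty)
        if (outs.isEmpty) spread(origin) += r                // dangling vertices restart
        else outs.foreach(w => spread(w) += r / outs.size)   // follow outbound edges uniformly
      }
      // With probability alpha the walk restarts at the origin.
      rank = vertices.map(v =>
        v -> (alpha * (if (v == origin) 1.0 else 0.0) + (1 - alpha) * spread(v))).toMap
    }
    rank
  }

  def main(args: Array[String]): Unit = {
    val graph = Map(1 -> Seq(2, 3), 2 -> Seq(3), 3 -> Seq(1))
    println(run(graph, origin = 1))
  }
}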
20.
Latent Dirichlet Allocation
▪ Popular clustering / latent factors model
▪ Uncollapsed Gibbs sampler is fairly easy to distribute (see the sketch below)
[Plate diagram: per-topic word distributions, per-document topic distributions, and the topic label for document d and word w]
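A toy, single-machine sketch of one sweep of an uncollapsed Gibbs sampler for LDA, not the talk's Java or Spark/GraphX code. It assumes symmetric priors alpha = beta = 1 (so the Dirichlet posteriors have integer parameters and can be sampled with plain exponential draws) and uses the usual theta / phi / z notation. Step 1 is what makes the sampler easy to distribute: given theta and phi, every topic label can be resampled independently.

import scala.util.Random

object UncollapsedLdaSweep {
  val rng = new Random(0)

  // Gamma(shape, 1) for integer shape: sum of `shape` Exp(1) draws.
  def gammaInt(shape: Int): Double = (1 to shape).map(_ => -math.log(rng.nextDouble())).sum

  // Dirichlet(counts + 1) via normalized Gamma draws (valid because alpha = beta = 1).
  def dirichlet(counts: Array[Int]): Array[Double] = {
    val g = counts.map(c => gammaInt(c + 1))
    val s = g.sum
    g.map(_ / s)
  }

  // Draw an index k with probability proportional to p(k).
  def categorical(p: Array[Double]): Int = {
    val u = rng.nextDouble() * p.sum
    var acc = 0.0
    p.indices.find { k => acc += p(k); acc >= u }.getOrElse(p.length - 1)
  }

  /** docs(d)(n) is the word id of the n-th token of document d. */
  def sweep(docs: Array[Array[Int]], z: Array[Array[Int]],
            theta: Array[Array[Double]], phi: Array[Array[Double]],
            numTopics: Int, vocabSize: Int): Unit = {
    // 1) Given theta and phi, all topic labels are conditionally independent.
    for (d <- docs.indices; n <- docs(d).indices) {
      val w = docs(d)(n)
      val p = Array.tabulate(numTopics)(k => theta(d)(k) * phi(k)(w))
      z(d)(n) = categorical(p)
    }
    // 2) Resample per-document topic distributions from their Dirichlet posteriors.
    for (d <- docs.indices) {
      val counts = Array.fill(numTopics)(0)
      z(d).foreach(k => counts(k) += 1)
      theta(d) = dirichlet(counts)
    }
    // 3) Resample per-topic word distributions from their Dirichlet posteriors.
    for (k <- 0 until numTopics) {
      val counts = Array.fill(vocabSize)(0)
      for (d <- docs.indices; n <- docs(d).indices if z(d)(n) == k) counts(docs(d)(n)) += 1
      phi(k) = dirichlet(counts)
    }
  }

  def main(args: Array[String]): Unit = {
    val docs = Array(Array(0, 1, 1, 2), Array(2, 3, 3, 0))   // toy corpus, vocabulary of 4 words
    val numTopics = 2
    val z = docs.map(_.map(_ => rng.nextInt(numTopics)))     // random initial topic labels
    val theta = Array.fill(docs.length)(Array.fill(numTopics)(1.0 / numTopics))
    val phi = Array.fill(numTopics)(Array.fill(4)(0.25))
    for (_ <- 1 to 50) sweep(docs, z, theta, phi, numTopics, vocabSize = 4)
    theta.foreach(t => println(t.mkString(" ")))
  }
}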
28.
Topic-sensitive PageRank
▪ Distributed Spark/GraphX implementation
  ▪ Available in Spark 2.1
  ▪ Propagates multiple source vertices at once
▪ Alternative implementation (a rough sketch follows below)
  ▪ Single-threaded and single-machine for one source vertex
  ▪ Works on the full graph adjacency
  ▪ Scala/Breeze, horizontally scaled with Spark to propagate multiple
    source vertices at once
▪ Dimension of comparison: number of vertices for which we compute a
ranking
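A rough sketch of that horizontally scaled design, assuming the whole adjacency list fits in executor memory: broadcast the graph once, then let Spark farm out one single-source computation per vertex of interest. It reuses the toy TinyPersonalizedPageRank.run routine sketched earlier, not the Scala/Breeze code the slide refers to.

import org.apache.spark.{SparkConf, SparkContext}

object ScaleOutOverSources {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("ppr-scale-out").setMaster("local[*]"))

    val outLinks: Map[Int, Seq[Int]] = Map(1 -> Seq(2, 3), 2 -> Seq(3), 3 -> Seq(1))
    val sources = Seq(1, 2, 3) // the vertices we want rankings for (the "dimension" above)

    val adjacency = sc.broadcast(outLinks) // shipped to every executor once
    val rankings = sc.parallelize(sources)
      .map(src => src -> TinyPersonalizedPageRank.run(adjacency.value, src))
      .collectAsMap()

    rankings.foreach { case (src, ranks) => println(s"source $src -> $ranks") }
    sc.stop()
  }
}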
29.
Open-source DBpedia dataset
[Chart comparing running times of the two implementations as the number of source vertices grows]
▪ Sublinear rise in time with Spark/GraphX vs a linear rise in the horizontally scaled version
▪ Doubling the cluster size: 2.0x speedup for the horizontally scaled version vs 1.2x for Spark/GraphX
30.
Latent Dirichlet Allocation
▪ Distributed Spark/GraphX implementation
▪ Alternative implementation
  ▪ Single-machine, multi-threaded Java code
  ▪ NOT horizontally scaled
▪ Dimension of comparison: training set size
31.
[Chart: Netflix dataset, 100 topics; the Spark/GraphX setup uses 8x the resources of the multi-core setup]
[Chart: Wikipedia dataset, 100-topic LDA, on a 16 x r3.2xl cluster (source: Databricks)]
For very large datasets, Spark/GraphX outperforms multi-core
32.
Other comparisons
▪ Frank McSherry's blog post comparing several distributed
PageRank implementations against a single-threaded Rust
implementation on his laptop
▪ 1.5B edges for twitter_rv, 3.7B for uk_2007_05
▪ "If you are going to use a big data system for yourself, see if it
is faster than your laptop."
33.
Other comparisons
▪ GraphChi, a single-machine large-scale graph computation engine
developed at CMU, reports similar findings
34.
Now, is it faster?
No, unless your problem or dataset is huge :(
35.
To conclude...
▪ When distributing an algorithm, there are two opposing
forces:
▪ 1) Communication overhead (shifting data from node to node)
▪ 2) More raw computing power available
▪ Whether one overtakes the other depends on the size of your
problem
▪ Single-machine ML can be very efficient!
▪ Smarter algorithms can beat brute force
▪ Better data structures, input data formats, caching, optimization
algorithms, etc. can all make a huge difference
▪ Good core implementation is a prerequisite to distribution
▪ Easy to get large machines!
36.
To conclude...
▪ However, distribution lets you easily throw more hardware
at a problem
▪ Also, some algorithms/methods are better than others at
minimizing the communication overhead
▪ Iterative distributed graph algorithms can be inefficient in that
respect
▪ Can your problem fit on a single machine?
▪ Can your problem be partitioned?
▪ For SGD-like algorithms, parameter servers can be used to distribute while
keeping this overhead to a minimum (a toy sketch follows below)
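A toy, in-process illustration of the parameter-server idea, not any particular system; names like ToyParameterServer, pull, and push are invented for this sketch, and the linear model and data are made up. The point it shows: workers hold disjoint data shards and only exchange small parameter/gradient vectors with a shared store, which is what keeps communication overhead low.

object ToyParameterServer {
  // Shared parameter store: stands in for the parameter-server nodes.
  private val params = Array(0.0, 0.0)

  def pull(): Array[Double] = params.synchronized(params.clone())
  def push(grad: Array[Double], lr: Double): Unit = params.synchronized {
    grad.indices.foreach(i => params(i) -= lr * grad(i))
  }

  def main(args: Array[String]): Unit = {
    // Each "worker" owns a shard of (x, y) points; the model is y = w0 + w1 * x.
    val shards = Seq(Seq((1.0, 3.0), (2.0, 5.0)), Seq((3.0, 7.0), (4.0, 9.0)))
    val workers = shards.map { shard =>
      new Thread(() => {
        for (_ <- 1 to 2000) {
          val w = pull()                                   // fetch current parameters
          val grads = shard.map { case (x, y) =>
            val err = w(0) + w(1) * x - y                  // prediction error
            Array(err, err * x)                            // gradient of squared error
          }
          val g = grads.reduce((a, b) => Array(a(0) + b(0), a(1) + b(1))).map(_ / shard.size)
          push(g, 0.01)                                    // send only the small gradient back
        }
      })
    }
    workers.foreach(_.start()); workers.foreach(_.join())
    println(s"learned parameters: ${pull().mkString(", ")}") // should approach w0 = 1, w1 = 2
  }
}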