Say What You Mean: Scaling Machine Learning Algorithms Directly from Source Code: Scaling machine learning applications is hard. Even with powerful systems like Spark, TensorFlow, and Theano, the code you write has more to do with getting these systems to work at all than with your algorithm itself. But it doesn't have to be this way!
In this talk, I’ll discuss an alternate approach we’ve taken with Pyfora, an open-source platform for scalable machine learning and data science in Python. I’ll show how it produces efficient, large scale machine learning implementations directly from the source code of single-threaded Python programs. Instead of programming to a complex API, you can simply say what you mean and move on. I’ll show some classes of problem where this approach truly shines, discuss some practical realities of developing the system, and I’ll talk about some future directions for the project.
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
1. Write What You Mean
Thomas Peters, engineer
Scaling up machine learning algorithms directly from source code
2. Q: Why should I have to rewrite my program as my dataset gets larger?
3. Example: Nearest Neighbor
def sq_distance(p1, p2):
    return sum((c[0]-c[1])**2 for c in zip(p1, p2))

def index_of_nearest(q, points):
    return min((sq_distance(q, p), i)
               for i, p in enumerate(points))[1]

def nearest_center(points, centers):
    return [index_of_nearest(p, centers) for p in points]
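These functions run unmodified as ordinary single-threaded Python. A small usage sketch with made-up data:

```python
# Nearest-neighbor functions from the slide, restated so the
# example is self-contained.
def sq_distance(p1, p2):
    return sum((c[0] - c[1]) ** 2 for c in zip(p1, p2))

def index_of_nearest(q, points):
    return min((sq_distance(q, p), i)
               for i, p in enumerate(points))[1]

def nearest_center(points, centers):
    return [index_of_nearest(p, centers) for p in points]

# Made-up data: three points assigned to the nearer of two centers.
points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
centers = [(0.5, 0.5), (4.0, 4.0)]
print(nearest_center(points, centers))  # -> [0, 0, 1]
```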
7. Goals of Pyfora
• Provide identical semantics to regular Python
• Easily use hundreds of CPUs/GPUs and TBs of RAM
• Scale by analyzing source code, not by calling libraries
No more complex frameworks or APIs.
9. Approaches to Scaling
APIs and Frameworks
• Library of functions for specific patterns of parallelism
• Programmer (re)writes the program to fit the pattern.
10. Approaches to Scaling
APIs and Frameworks
• Library of functions for specific patterns of parallelism
• Programmer (re)writes the program to fit the pattern.
Programming Language
• Semantics of the calculation are entirely defined by the source code
• Compiler and runtime are responsible for efficient execution.
11. Approaches to Scaling
APIs and Frameworks
• MPI
• Hadoop
• Spark
Programming Languages
• SQL
• CUDA
• Cilk
• Python with Pyfora
12. API vs. Language
Pros
API:
• More control over performance
• Easy to integrate lots of different systems
Language:
• Simpler code
• Much more expressive
• Programs are easier to understand
• Cleaner failure modes
• Much deeper optimizations are possible
Cons
API:
• More code
• Program meaning obscured by implementation details
• Hard to debug when something goes wrong
Language:
• Very hard to implement
13. With a strong implementation, the "language approach" should win
• Any pattern that can be implemented in an API can be recognized in a language.
• Language-based systems see the entire source code, so they have more to work with than API-based systems.
• They can measure behavior at runtime and use it to optimize.
14. Example: Nearest Neighbors
def sq_distance(p1, p2):
    return sum((c[0]-c[1])**2 for c in zip(p1, p2))

def index_of_nearest(q, points):
    return min((sq_distance(q, p), i)
               for i, p in enumerate(points))[1]

def nearest_center(points, centers):
    return [index_of_nearest(p, centers) for p in points]
15. How can we make this fast?
• JIT compile to make single-threaded code fast
• Parallelize to use multiple CPUs
• Distribute data to use multiple machines
16. Why is this tricky?
Optimal behavior depends on the sizes and shapes of the data.
(Diagram: small "Centers" and "Points" sets.)
If both sets are small, don't bother to distribute.
17. Why is this tricky?
(Diagram: a tall, thin "Points" matrix next to a small "Centers" set.)
If "points" is tall and thin, it's natural to split it across many machines and replicate "centers".
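The "split points, replicate centers" plan can be sketched in plain single-process Python; the functions are restated from the example above, and `chunk_size` and the helper name are illustrative choices, not Pyfora's API:

```python
# Single-process sketch of "split points, replicate centers":
# each chunk of points is labeled independently against a full copy
# of `centers`, and per-chunk results are concatenated. On a cluster,
# each chunk would live on a different machine.
def sq_distance(p1, p2):
    return sum((a - b) ** 2 for a, b in zip(p1, p2))

def index_of_nearest(q, points):
    return min((sq_distance(q, p), i) for i, p in enumerate(points))[1]

def nearest_center_chunked(points, centers, chunk_size=1000):
    chunks = [points[i:i + chunk_size]
              for i in range(0, len(points), chunk_size)]
    results = [[index_of_nearest(p, centers) for p in chunk]
               for chunk in chunks]
    # Merge: concatenate the per-chunk label lists in order.
    return [label for chunk_result in results for label in chunk_result]
```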
18. Why is this tricky?
(Diagram: wide "Points" and "Centers" matrices.)
If "points" and "centers" are really wide (say, they're images), it would be better to split them vertically, compute distances between all pairs in slices, and merge them.
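A single-process sketch of that vertical split; the helper names and `slice_width` are illustrative:

```python
# Sketch of "split wide vectors vertically": each coordinate slice
# contributes a partial squared distance for every (point, center)
# pair, and the partials are merged by summing before taking the min.
def partial_sq_distances(points, centers, lo, hi):
    # Distance contribution of coordinates [lo, hi) for every pair.
    return [[sum((p[k] - c[k]) ** 2 for k in range(lo, hi))
             for c in centers]
            for p in points]

def nearest_center_sliced(points, centers, slice_width=256):
    dim = len(points[0])
    totals = [[0.0] * len(centers) for _ in points]
    for lo in range(0, dim, slice_width):
        hi = min(lo + slice_width, dim)
        partial = partial_sq_distances(points, centers, lo, hi)
        for i, row in enumerate(partial):
            for j, d in enumerate(row):
                totals[i][j] += d
    # Merge step: pick the nearest center from the summed distances.
    return [min(range(len(centers)), key=lambda j: totals[i][j])
            for i in range(len(points))]
```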
19. Why is this tricky?
You will end up writing totally different code for each of these situations.
The source code already contains the necessary structure.
The key is to defer decisions to runtime, when the system can actually see how big the datasets are.
20. Getting it right is valuable
• Much less work for the programmer
• Code is more readable
• Code becomes more reusable
• Use the language the way it was intended: in Python, for instance, the "row" objects can be anything that looks like a list.
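Because `sq_distance` only iterates over its arguments, any list-like object works as a row:

```python
import array

# sq_distance restated from the example above; it only needs its
# arguments to be iterable, so rows can be lists, tuples, arrays, etc.
def sq_distance(p1, p2):
    return sum((a - b) ** 2 for a, b in zip(p1, p2))

print(sq_distance([0, 0], (3, 4)))                    # list vs. tuple -> 25
print(sq_distance(array.array('d', [0, 0]), (3, 4)))  # array vs. tuple -> 25.0
```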
21. What are some other common implementation problems we can solve this way?
22. Problem: Wrong-sized chunking
• API-based frameworks require you to explicitly partition your data into chunks.
• If you are running a complex task, the runtime may be really long for a small subset of chunks. You'll end up waiting a long time for that last mapper.
• If your tasks allocate memory, you can run out of RAM and crash.
24. Solution: Dynamically rebalance
• This requires being able to interrupt running tasks as they're executing.
• Adding support for this to an API makes it much more complicated to use.
• This is much easier to do with compiler support.
25. Problem: Nested parallelism
Example:
• You have an iterative model
• There is lots of parallelism in each iteration
• But you also want to search over many hyperparameters
With API-based approaches, you have to manage this yourself, either by constructing a graph of subtasks or by figuring out how to flatten your workload into something that can be map-reduced.
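For contrast, the manual "flatten" workaround described above might look like this toy sketch built on the standard library; the names and the stand-in `fit_model` are illustrative, not the talk's actual workload:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def fit_model(rate, model, params):
    # Toy stand-in: gradient descent on (params - model)**2,
    # which converges to `model` for small enough rates.
    for _ in range(100):
        params = params - rate * 2.0 * (params - model)
    return params

models = [1.0, 2.0]
learning_rates = [0.1, 0.2]

# Flatten the nested (model x rate) loops into one task list that
# a map-style API can distribute.
tasks = list(product(models, learning_rates))
with ThreadPoolExecutor() as pool:
    flat = list(pool.map(lambda t: fit_model(t[1], t[0], 0.0), tasks))

# Restore the nested [model][rate] shape afterwards by hand.
fits = [flat[i * len(learning_rates):(i + 1) * len(learning_rates)]
        for i in range(len(models))]
```

The flattening and un-flattening bookkeeping is exactly the incidental code the language approach removes.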
26. Solution: infer parallelism from source
def fit_model(learning_rate, model, params):
    while not model.finished(params):
        params = model.update_params(learning_rate, params)
    return params

# two sources of parallelism:
fits = [[fit_model(rate, model, params) for rate in learning_rates]
        for model in models]
27. So how does Pyfora work?
• Operates on a subset of Python that restricts mutability (but we're relaxing this).
• Built a JIT compiler that can "pop" code back into the interpreter:
  • Can move sets of stack frames from one machine to another
  • Can rewrite selected stack frames to use futures if there is parallelism to exploit
• Carefully tracks what data a thread is using.
• Dynamically schedules threads and data on machines to optimize for cache locality.
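An illustrative sketch of the futures rewrite, not Pyfora's actual machinery: a comprehension whose calls are independent can be mechanically rewritten to futures with identical results.

```python
from concurrent.futures import ThreadPoolExecutor

def f(x):
    return x * x

xs = range(5)

# Serial form, as the programmer wrote it:
serial = [f(x) for x in xs]

# Futures form a compiler could rewrite it into, since the calls
# to f are independent of one another:
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(f, x) for x in xs]
    parallel = [fut.result() for fut in futures]

assert serial == parallel  # same semantics, now parallelizable
```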
28. import pyfora
executor = pyfora.connect("http://...")
data = executor.importS3Dataset("myBucket", "myData.csv")

def calibrate(dataframe, params):
    # some complex model with loops and parallelism
    ...

with executor.remotely:
    dataframe = pyfora.read_csv(data)
    models = [calibrate(dataframe, p) for p in params]

print(models.toLocal().result())
29. What are we working on?
• Relaxing immutability assumptions
• Compiler optimizations (immutable Python is a rich source of these)
• Automatic compilation and scheduling of data and compute on GPUs
30. Thanks!
• Check out the repo: github.com/ufora/ufora
• Read the docs: docs.pyfora.com
• Subscribe to “This Week in Data” (see top of ufora.com)
• Email me: tpeters@ufora.com