Spark Summit EU talk by Ram Sriharsha and Vlad Feinberg
1. Online Learning with Structured Streaming
Ram Sriharsha, Vlad Feinberg
@halfabrane
Spark Summit, Brussels
27 October 2016
2. What is online learning?
• Update model parameters on each data point
• In the batch setting, we get to see the entire dataset before updating
• Cannot visit data points again
• In the batch setting, we can iterate over data points as many times as we want!
3. An example: the perceptron
[Figure: a separating line with normal vector w; example points x on either side]
Goal: find the best line separating positive from negative examples on a plane.
Update rule: if y != sign(w.x), then w -> w + y*x
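The perceptron update above can be sketched in a few lines of plain Python (a toy single-machine illustration; all names are ours, not part of the talk's code):

```python
def sign(v):
    # treat 0 as a misclassification so the very first example triggers an update
    return 1 if v > 0 else -1

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def perceptron_update(w, x, y):
    """If the current weights misclassify (x, y), move w toward y * x."""
    if y != sign(dot(w, x)):
        return [wi + y * xi for wi, xi in zip(w, x)]
    return w

# linearly separable toy data: label = sign of the first coordinate
data = [([1.0, 0.5], 1), ([-1.0, 0.2], -1), ([2.0, -1.0], 1), ([-2.0, -0.5], -1)]
w = [0.0, 0.0]
for _ in range(10):            # a few passes suffice on separable data
    for x, y in data:
        w = perceptron_update(w, x, y)

print(all(sign(dot(w, x)) == y for x, y in data))  # True
```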
4. Why learn online?
• I want to adapt to changing patterns quickly
• data distribution can change
– e.g., the distribution of features that affect learning might change over time
• I need to learn a good model within resource + time constraints (large-scale learning)
• Time to a given accuracy might be faster for certain online algorithms
5. Online Classification Setting
• Pick a hypothesis
• For each labeled example (𝘅, y):
• Predict label ỹ using the hypothesis
• Observe the loss 𝓛(y, ỹ) (and its gradient)
• Learn from the mistake and update the hypothesis
• Goal: make as few mistakes as possible in comparison to the best hypothesis in hindsight
6. An example: Online SGD
• Initialize weights 𝘄
• Loss function 𝓛 is known.
• For each labeled example (𝘅, y):
• Perform update 𝘄 -> 𝘄 – η∇𝓛(y, 𝘄.𝘅)
• For each new example 𝘅:
• Predict ỹ = σ(𝘄.𝘅) (σ is called the link function)
[Figure: the loss 𝓛(y, 𝘄.𝘅) as a function of 𝘄, minimized at ẘ]
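A minimal single-machine sketch of this loop, using logistic loss so that the link function σ is the sigmoid (illustrative names, not the talk's implementation):

```python
import math

def sigmoid(z):                      # the link function σ
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_step(w, x, y, eta=0.1):
    """One online update for logistic loss with labels y in {0, 1}:
    the gradient of L(y, w.x) w.r.t. w is (sigmoid(w.x) - y) * x."""
    p = sigmoid(dot(w, x))
    return [wi - eta * (p - y) * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
stream = [([1.0, 1.0], 1), ([-1.0, -1.0], 0)] * 50   # simulated label stream
for x, y in stream:
    w = sgd_step(w, x, y)            # each example is seen once, then discarded

print(sigmoid(dot(w, [1.0, 1.0])) > 0.5)  # True: the positive region is learned
```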
7. Distributed Online Learning
• Synchronous
• On each worker:
– Load training data, compute gradients and update the model, push the model to the driver
• On some node:
– Perform model merge
• Asynchronous
• On each worker:
– Load training data, compute gradients and push them to a server
• On each server:
– Aggregate the gradients, perform the update step
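The synchronous variant can be simulated on a single machine: each "worker" runs sequential updates over its partition starting from the previous model, and the "driver" averages the resulting models. A sketch using a least-squares update (names and update rule are ours, for illustration):

```python
def local_train(model, partition, eta=0.1):
    """Worker side: sequential online updates (here, least-squares SGD)."""
    w = list(model)
    for x, y in partition:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        w = [wi - eta * err * xi for wi, xi in zip(w, x)]
    return w

def average_models(models):
    """Driver side: merge by coordinate-wise averaging."""
    return [sum(ws) / len(models) for ws in zip(*models)]

previous_model = [0.0, 0.0]
partitions = [                                  # one list per "worker"
    [([1.0, 0.0], 2.0), ([0.0, 1.0], 3.0)],
    [([1.0, 0.0], 2.0), ([0.0, 1.0], 3.0)],
]
local_models = [local_train(previous_model, p) for p in partitions]
model = average_models(local_models)
print(model)
```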
8. Challenges
• Not all algorithms admit efficient online versions
• Lack of infrastructure
• (Single machine) Vowpal Wabbit works great but is hard to use from Scala, Java and other languages.
• (Distributed) No implementation that is fault tolerant, scalable, robust
• Lack of an open source framework to provide extensible algorithms
• Adagrad, normalized learning, L1 regularization, …
• Online SGD, FTRL, ...
10. Structured Streaming
1. One single API: DataFrame for everything
- Same API for machine learning, batch processing, GraphX
- Dataset is a typed version of DataFrame for Scala and Java
2. End-to-end exactly-once guarantees
- The guarantees extend into the sources/sinks, e.g. MySQL, S3
3. Understands external event-time
- Handling late-arriving data
- Support for sessionization based on event-time
11. How does it work?
At any time, the output of the application is equivalent to executing a batch job on a prefix of the data.
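This prefix guarantee can be illustrated with a toy incremental count in plain Python (a simulation of the model, not the Spark API): after every trigger, the incrementally maintained result matches a batch computation over the prefix of the input seen so far.

```python
from collections import Counter

def batch_query(prefix):
    """The equivalent batch job: count occurrences over the whole prefix."""
    return Counter(prefix)

incremental = Counter()                     # streaming state, updated per trigger
seen = []                                   # the growing prefix of the input
triggers = [["a", "b"], ["b", "c"], ["a"]]  # data arriving at each trigger

for new_data in triggers:
    incremental.update(new_data)            # streaming: touch only the new rows
    seen.extend(new_data)
    assert incremental == batch_query(seen) # same as batch on the prefix

print(dict(incremental))  # {'a': 2, 'b': 2, 'c': 1}
```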
12. The Model
[Diagram: an append-only input table growing at each trigger (data up to 1, data up to 2, data up to 3), with a query run over it; trigger: every 1 sec]
Input: data from the source as an append-only table
Trigger: how frequently to check the input for new data
Query: operations on the input: the usual map/filter/reduce, plus new window and session ops
13. The Model (continued)
[Diagram: at each trigger (every 1 sec), the query runs over the input (data up to 1, 2, 3) and produces a result table (output for data up to 1, 2, 3), which is written to the output]
Result: final operated table, updated every trigger interval
Output: what part of the result to write to the data sink after every trigger
Complete output: write the full result table every time
14. The Model (continued)
[Diagram: same as above, with the chosen output mode determining what is written after each trigger]
Result: final operated table, updated every trigger interval
Output: what part of the result to write to the data sink after every trigger
Complete output: write the full result table every time
Delta output: write only the rows of the result that changed from the previous batch
Append output: write only new rows
*Not all output modes are feasible with all queries
16. Streaming ML on Structured Streaming
[Diagram: an append-only input table of labeled examples (data up to 1, 2, 3), with a stateful query run at each trigger; trigger: every 1 sec]
Input: append-only table containing labeled examples
Query: stateful aggregation query: picks up the last trained model, performs a distributed update + merge
17. Streaming ML on Structured Streaming (continued)
[Diagram: labeled examples up to time t arrive as input; at each trigger (every 1 sec), the result table holds the model for data up to t, which is written to the output]
Result: table of model parameters, updated every trigger interval
Complete mode: the table has one row, constantly being updated
Append mode (in the works): the table has a timestamp-keyed model, one row per trigger
Intermediate models would have the same state at this point of the computation for the (abstract) queries #1 and #2.
19. Solution: Make Assumptions
• The result may be partition-dependent, but we don't care as long as we get some valid result.
average-models(
P1: update(update(previous model, data[0]), data[1]),
P2: update(update(previous model, data[2]), data[3]))
• Only partition-dependent if update and average don't commute - can still be deterministic otherwise!
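A quick numerical check of this claim: with a purely additive update (which commutes with averaging in the required sense), two different partitionings of the same data yield the same averaged model. Toy Python, illustrative only:

```python
def update(w, d):
    return w + 0.5 * d        # purely additive update: commutes with averaging

def average(models):
    return sum(models) / len(models)

previous_model = 5.0

def run(partitions):
    """Fold each partition sequentially from the previous model, then average."""
    partials = []
    for part in partitions:
        w = previous_model
        for d in part:
            w = update(w, d)
        partials.append(w)
    return average(partials)

r1 = run([[1.0, 2.0], [3.0, 4.0]])   # one partitioning of the data
r2 = run([[1.0, 3.0], [2.0, 4.0]])   # a different partitioning
print(r1 == r2)  # True: the averaged result is partition-independent here
```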
20. Stateful Aggregator
• Within each partition
• Initialize with the previous state (instead of zero as in a regular aggregator)
• For each item, update the state
• Perform the reduce step
• Output the final state
Very general abstraction: works for sketches, online statistics (quantiles), online clustering …
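A sketch of the abstraction in plain Python (hypothetical names, shown here as a running sum whose state carries over across triggers):

```python
class StatefulAggregator:
    """A sum aggregator; any (zero, update, merge) triple would do."""
    def zero(self):
        return 0
    def update(self, s, item):      # per-item update within a partition
        return s + item
    def merge(self, a, b):          # reduce step across partitions
        return a + b

def run_trigger(agg, previous_state, partitions):
    # each partition folds its items locally, then a reduce merges the
    # partials; the previous trigger's state seeds the merge, not zero
    partials = []
    for part in partitions:
        s = agg.zero()
        for item in part:
            s = agg.update(s, item)
        partials.append(s)
    out = previous_state
    for p in partials:
        out = agg.merge(out, p)
    return out

agg = StatefulAggregator()
state = 0
state = run_trigger(agg, state, [[1, 2], [3]])    # first trigger
state = run_trigger(agg, state, [[4], [5, 6]])    # second trigger
print(state)  # 21
```

The same skeleton covers sketches, online statistics, and online clustering: only `zero`/`update`/`merge` change.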
21. How does it work?
[Diagram: the driver asks the labeled stream source whether there is more data; if yes, it runs the query. Map tasks read labeled examples and apply feature transforms and gradient updates; a reduce step performs model averaging. The last saved model is read from the state store, and the merged model is saved back to it.]
23. ML Estimator on Streams
• Interoperable with ML pipelines
[Diagram: streaming DF -> m = estimator.fit() -> m.writeStream -> streaming sink]
Input: stream of labelled data
Output: stream of models, updated over time.
25. Feature Creation
• Handle new features as they appear (e.g., IPs in fraud detection)
• Provide transformers, such as the HashingEncoder, that apply the hashing trick.
• Encode arbitrary (possibly categorical) data without knowing cardinality ahead of time by using a high-dimensional sparse mapping.
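The hashing trick maps arbitrary, possibly never-before-seen categorical values to indices in a fixed-dimensional sparse vector, with no dictionary of categories to maintain. A minimal Python sketch of the idea (the HashingEncoder itself is the talk's proposed transformer; these names are ours):

```python
import hashlib

DIM = 2 ** 20                    # fixed dimensionality, chosen up front

def hash_index(s, dim=DIM):
    """Stable hash of an arbitrary string to an index in [0, dim).
    (md5 is used for run-to-run stability, not for security.)"""
    return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16) % dim

def encode(features, dim=DIM):
    """Encode a dict of categorical features as a sparse {index: count} map."""
    vec = {}
    for name, value in features.items():
        idx = hash_index(f"{name}={value}", dim)
        vec[idx] = vec.get(idx, 0) + 1
    return vec

# a never-before-seen IP still gets an index; no vocabulary lookup needed
v = encode({"ip": "203.0.113.7", "country": "BE"})
print(sum(v.values()))  # 2
```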
26. API Goals
• Provide modern, regret-minimization-based online algorithms.
• Online logistic regression
• Adagrad
• Online gradient descent
• L2 regularization
• Input streams of any kind accepted.
• Streaming-aware feature engineering