Dmitry Larko, H2O.ai - Time Series in H2O Driverless AI - #H2OWorld 2019 NYC

Time series in Driverless AI
Dmitry Larko
Sr. Data Scientist
H2O.ai

• Some input data
• A target variable
• An objective (or a success metric) like RMSE or MAE
• Some allocated resources (time and hardware)
x1 x2 x3 x4 y (e.g. sales)
0.14 0.69 0.01 0.71 300
0.22 0.44 0.45 0.69 100
0.12 0.35 0.51 0.23 40
0.22 0.42 0.79 0.60 23
0.93 0.82 0.72 0.50 1900
0.32 0.58 0.28 0.22 231
0.95 0.59 0.68 0.09 700
0.34 0.58 0.35 0.81 423
0.05 0.80 0.28 0.86 222
0.23 0.49 0.63 0.03 190
0.05 0.34 0.53 0.73 890
0.74 0.02 0.33 0.56 1000
Driverless AI Process
- Data visualization (AutoViz)
- Feature engineering & selection
- Automated Modeling
- Model interpretability (MLI)
- Scoring pipeline (predictions)

What is a Time Series Problem?
[Figure: two "Sales over time" line charts (daily sales, late December 2017 to February 2018), one showing a linear relationship and one a nonlinear (seasonal) relationship]

[Figure: two line charts over 12/21/2017 to 3/11/2018: "sales per day (all groups)" and "sales by group" (group 1, group 2, group 3)]
time groups sales
01/01/2018 group1 30
01/01/2018 group2 100
01/01/2018 group3 10
02/01/2018 group1 60.2
02/01/2018 group2 200.2
02/01/2018 group3 20.2
03/01/2018 group1 90.3
03/01/2018 group2 300.3
03/01/2018 group3 30.3
04/01/2018 group1 120.4
04/01/2018 group2 400.4
04/01/2018 group3 40.4
Time Groups
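Each time group forms its own series, so lag features have to be computed within a group rather than across the whole table. A minimal pandas sketch using the table above (the column names `time`, `groups` and `sales` come from the slide; `lag1` and `total_per_period` are illustrative names, and this is not Driverless AI's internal implementation):

```python
import pandas as pd

df = pd.DataFrame({
    "time": ["01/01/2018"] * 3 + ["02/01/2018"] * 3
            + ["03/01/2018"] * 3 + ["04/01/2018"] * 3,
    "groups": ["group1", "group2", "group3"] * 4,
    "sales": [30, 100, 10, 60.2, 200.2, 20.2,
              90.3, 300.3, 30.3, 120.4, 400.4, 40.4],
})

# Rows are already sorted by time; shifting within each group ensures
# a group's lag never picks up another group's sales.
df["lag1"] = df.groupby("groups")["sales"].shift(1)

# Summing over the groups per time step gives the "all groups" series.
total_per_period = df.groupby("time", sort=False)["sales"].sum()
```

The first row of every group has no lag value, because there is no earlier observation for that group.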

Modeling Foundation
[Diagram: time periods 1 to 12 split into train, [Gap], test; the training period is further split into tvs train, [Gap], tvs valid, [Gap], test]
[Diagram: along the time axis, lag sizes are marked as valid or invalid relative to Gap | Forecast Horizon]
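The split with a gap can be sketched in a few lines of Python. The gap and horizon values below are illustrative, not taken from the slide, and the lag rule in `is_valid_lag` is an assumption made for this sketch (a lag is usable only if it reaches back past the gap and the full forecast horizon); it is not a statement of how Driverless AI implements the check.

```python
def time_split(periods, gap, horizon):
    """Split ordered periods into train / gap / test.

    The last `horizon` periods form the test set, the `gap` periods
    before them are left unused, and everything earlier is training data.
    """
    n_train = len(periods) - gap - horizon
    if n_train <= 0:
        raise ValueError("not enough periods for this gap and horizon")
    train = periods[:n_train]
    skipped = periods[n_train:n_train + gap]
    test = periods[n_train + gap:]
    return train, skipped, test


def is_valid_lag(lag, gap, horizon):
    """Assumed rule: the furthest prediction is gap + horizon steps past
    the last known value, so a shorter lag would need unknown data."""
    return lag >= gap + horizon


periods = list(range(1, 13))  # periods 1..12, as in the diagram
train, skipped, test = time_split(periods, gap=1, horizon=3)
```

With these illustrative values a lag of 4 is usable for every test period, while a lag of 3 would point into the gap for the last one.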

Date Day Month Year Weekday Weeknum IsHoliday
1/1/2018 1 1 2018 2 1 1
2/1/2018 2 1 2018 3 1 0
3/1/2018 3 1 2018 4 1 0
4/1/2018 4 1 2018 5 1 0
5/1/2018 5 1 2018 6 1 0
6/1/2018 6 1 2018 7 1 0
7/1/2018 7 1 2018 1 2 0
8/1/2018 8 1 2018 2 2 0
9/1/2018 9 1 2018 3 2 0
10/1/2018 10 1 2018 4 2 0
Feature Engineering
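The date features in the table can be reproduced with pandas. This is a sketch, assuming the dates are day/month/year, that weekdays are numbered Sunday=1 to Saturday=7, and that weeks start on Sunday (all three are inferred from the table values); the `holidays` set is a hypothetical one-entry calendar.

```python
import pandas as pd

# 1/1/2018 .. 10/1/2018 read as the first ten days of January 2018.
dates = pd.to_datetime(
    [f"{d}/1/2018" for d in range(1, 11)], format="%d/%m/%Y"
)
df = pd.DataFrame({"Date": dates})

holidays = {pd.Timestamp("2018-01-01")}  # hypothetical holiday calendar

df["Day"] = df["Date"].dt.day
df["Month"] = df["Date"].dt.month
df["Year"] = df["Date"].dt.year
# pandas uses Monday=0; the table uses Sunday=1 .. Saturday=7.
df["Weekday"] = (df["Date"].dt.dayofweek + 1) % 7 + 1
# %U counts Sunday-based weeks starting at 0; add 1 to match the table.
df["Weeknum"] = df["Date"].dt.strftime("%U").astype(int) + 1
df["IsHoliday"] = df["Date"].isin(holidays).astype(int)
```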

Date       Sales  Lag1  Lag2  Moving Average
1/1/2018   100    -     -     -
2/1/2018   150    100   -     100
3/1/2018   160    150   100   125
4/1/2018   200    160   150   155
5/1/2018   210    200   160   180
6/1/2018   150    210   200   205
7/1/2018   160    150   210   180
8/1/2018   120    160   150   155
9/1/2018   80     120   160   140
10/1/2018  70     80    120   100
Feature Engineering (cont.)
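The lag and moving-average columns of the sales table can be derived as follows. This is a pandas sketch; the moving average is taken here as the mean of the two most recent past values, which is inferred from the numbers on the slide, and `MovingAverage` is an illustrative column name.

```python
import pandas as pd

sales = pd.Series([100, 150, 160, 200, 210, 150, 160, 120, 80, 70],
                  name="Sales")
df = sales.to_frame()
df["Lag1"] = sales.shift(1)
df["Lag2"] = sales.shift(2)
# Shift first so the current row's own sales never enters its feature,
# then average over up to two past values.
df["MovingAverage"] = sales.shift(1).rolling(window=2, min_periods=1).mean()
```

Shifting before rolling matters: a plain `rolling(2).mean()` would include the current target value and leak it into the feature.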
• Lags on subsets of the specified group columns (e.g. {Store, Department} vs. {Department} vs. {Store})
• Exponentially Weighted Moving Averages (EWMA) of n-th order differenced lags
• Aggregation of lags (mean, std, sums, etc.)
• Interactions of lags (e.g. Lag2 - Lag1)
• Linear regression on lags (taking slope and/or intercept as new features)
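Several of the lag-derived features listed above can be sketched with pandas and NumPy. The window of three lags, the smoothing factor `alpha=0.5` and all feature names are illustrative choices, not Driverless AI defaults.

```python
import numpy as np
import pandas as pd

sales = pd.Series([100, 150, 160, 200, 210, 150, 160, 120, 80, 70],
                  dtype=float)

lags = pd.DataFrame({f"lag{k}": sales.shift(k) for k in (1, 2, 3)})
feats = pd.DataFrame(index=sales.index)

# Aggregations of lags (undefined until all three lags exist)
feats["lag_mean"] = lags.mean(axis=1, skipna=False)
feats["lag_std"] = lags.std(axis=1, skipna=False)

# Interaction of lags (the slide's example: Lag2 - Lag1)
feats["lag2_minus_lag1"] = lags["lag2"] - lags["lag1"]

# EWMA of first-order differenced lags
feats["ewma_diff"] = sales.shift(1).diff().ewm(alpha=0.5, adjust=False).mean()


def slope(row):
    """Slope of a least-squares line through lag3, lag2, lag1 (oldest first)."""
    if row.isna().any():
        return np.nan
    y = row[["lag3", "lag2", "lag1"]].to_numpy()
    return np.polyfit(np.arange(len(y)), y, deg=1)[0]


feats["lag_slope"] = lags.apply(slope, axis=1)
```

The slope feature gives a tree-based model a direct view of the recent local trend, which it cannot easily reconstruct from the raw lag values.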

What’s new?
Training Holdout Predictions / Backtesting
• The final pipeline is refitted on various train/valid splits to generate holdout predictions:
Split Time periods
1 1 2 3 4 5 6 7 8 9 10 11 12 13 14
2 1 2 3 4 5 6 7 8 9 10 11 12
3 1 2 3 4 5 6 7 8 9 10
4 1 2 3 4 5 6 7 8
5 1 2 3 4 5 6
[Diagram "Training & Validation/Holdout Data"; legend: training data, validation/holdout data, optional training data; horizontal axis: time]
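The backtesting splits can be generated with a short helper. This is a sketch under an assumption that the slide's diagram only partly confirms: each split ends `horizon` periods earlier than the previous one and holds out its last `horizon` periods. The function name and parameters are hypothetical.

```python
def backtest_splits(n_periods, horizon, n_splits):
    """Yield (train, holdout) period lists, latest split first."""
    for i in range(n_splits):
        end = n_periods - i * horizon
        if end - horizon < 1:
            break  # no training data left for this split
        train = list(range(1, end - horizon + 1))
        holdout = list(range(end - horizon + 1, end + 1))
        yield train, holdout


# 14 periods, 5 splits, 2 periods held out per split
splits = list(backtest_splits(n_periods=14, horizon=2, n_splits=5))
```

With these values the splits cover periods 1 to 14, 1 to 12, 1 to 10, 1 to 8 and 1 to 6, matching the rows of the diagram.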

Test Time Augmentation & Rolling Predictions
• If the test set is longer than the forecast horizon, predictions beyond the horizon are less accurate
• The lookup tables used to create features (lags, aggregates, etc.) are missing the necessary data, so the models operate with many missing values
[Diagram legend: missing data, present data]

1st Solution: Test Time Augmentation (TTA)
Model stays the same, only the memory of the fitted transformers is updated (TTA). Keep rolling the prediction window over the whole test set to get valid predictions.
Pros: no model changes, fast
Cons: model degradation over time
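A toy version of this rolling scheme, assuming a one-feature linear model on a lag equal to the forecast horizon. The model, `fit_lag_model` and `rolling_predict_tta` are all hypothetical stand-ins for a fitted Driverless AI pipeline; the point is only that the coefficients are fitted once while the lag memory keeps growing.

```python
import numpy as np


def fit_lag_model(history, lag):
    """Least-squares fit of y[t] = a * y[t - lag] + b on the history."""
    y = np.asarray(history, dtype=float)
    a, b = np.polyfit(y[:-lag], y[lag:], deg=1)
    return a, b


def rolling_predict_tta(train, test, horizon):
    """Roll a horizon-sized window over the test set with a fixed model."""
    a, b = fit_lag_model(train, lag=horizon)  # fitted once, never refitted
    memory = list(train)
    preds = []
    for start in range(0, len(test), horizon):
        window = test[start:start + horizon]
        for step in range(len(window)):
            # A lag of size `horizon` always falls inside the known memory.
            lagged = memory[len(memory) - horizon + step]
            preds.append(a * lagged + b)
        memory.extend(window)  # TTA: reveal the actuals, keep the model
    return np.array(preds)
```

Because the coefficients come only from the original training data, the predictions can drift if the relationship changes during the test period, which is the degradation noted above.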

2nd Solution: Extend Train and Refit
To get valid predictions for the whole test set, we extend the train set with the latest data and refit the original model to generate precise predictions. Roll the prediction window step by step over the whole test period and keep retraining the models.
Pros: most precise
Cons: time consuming
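The refit variant differs in one place: the model is retrained before every window on all data seen so far. As a sketch, the same hypothetical one-feature lag model stands in for the real pipeline, and `fit_lag_model` and `rolling_predict_refit` are illustrative names.

```python
import numpy as np


def fit_lag_model(history, lag):
    """Least-squares fit of y[t] = a * y[t - lag] + b on the history."""
    y = np.asarray(history, dtype=float)
    a, b = np.polyfit(y[:-lag], y[lag:], deg=1)
    return a, b


def rolling_predict_refit(train, test, horizon):
    """Roll a horizon-sized window over the test set, refitting each time."""
    history = list(train)
    preds = []
    for start in range(0, len(test), horizon):
        # Refit on the extended training data before predicting the window.
        a, b = fit_lag_model(history, lag=horizon)
        window = test[start:start + horizon]
        for step in range(len(window)):
            lagged = history[len(history) - horizon + step]
            preds.append(a * lagged + b)
        history.extend(window)  # extend the train set with the latest data
    return np.array(preds)
```

The cost is one full training run per window, which is why this option is the slower of the two.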

What’s new?
Bring Your Own Recipe (BYOR)
• Custom time series transformers or models to be used within Driverless AI
• Interface to bring in domain specific (or just additional) feature transformers
• Interface to bring in popular algorithms like ARIMA, LSTM, Prophet etc.
• Either as custom models or as feature transformations (i.e. using their predictions as input features for DAI)
• Example implementations available
• FBProphetModel
• ExponentialSmoothingModel
• AutoArimaTransformer
• ProphetTransformer
• …
https://github.com/h2oai/driverlessai-recipes/tree/master/transformers/timeseries
https://github.com/h2oai/driverlessai-recipes/tree/master/models/timeseries

Will be released soon:
Unknown Features at Prediction Time
• Some features might not be known at the time a prediction is made
• Driverless AI will make sure that only historical information for these features is used
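One way to picture this: a feature whose future values are unknown can enter the model only through values that are old enough to be available at prediction time. The sketch below is an illustration of that idea, not the Driverless AI implementation; the `temperature` column, the `horizon` value and the data are all hypothetical.

```python
import pandas as pd

horizon = 2  # illustrative forecast horizon, in rows

df = pd.DataFrame({
    "day": range(1, 7),
    "temperature": [10.0, 12.0, 11.0, 15.0, 14.0, 13.0],  # not known ahead
    "sales": [100, 150, 160, 200, 210, 150],
})

# Only values at least `horizon` rows old exist when the prediction is made.
df["temperature_lag"] = df["temperature"].shift(horizon)

# The raw column is dropped so the model never sees same-day temperature.
features = df.drop(columns=["temperature", "sales"])
```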

Time Aware Target Transformations
• Detrending
• Fast linear (least squares)
• Robust linear (RANSAC regression)
• Logistic growth
• Centering
• y'(t) = y(t) - c
• Differencing
• y'(t) = y(t) - y(t - k)
• Ratio
• y'(t) = y(t) / y(t - k)
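The centering, differencing and ratio formulas, plus the fast linear (least squares) detrending, can be written directly in NumPy. This is a sketch with hypothetical function names; the robust (RANSAC) and logistic growth variants are not shown.

```python
import numpy as np


def center(y, c):
    return y - c  # y'(t) = y(t) - c


def difference(y, k):
    out = np.full(len(y), np.nan)
    out[k:] = y[k:] - y[:-k]  # y'(t) = y(t) - y(t - k)
    return out


def ratio(y, k):
    out = np.full(len(y), np.nan)
    out[k:] = y[k:] / y[:-k]  # y'(t) = y(t) / y(t - k)
    return out


def detrend_linear(y):
    """Subtract a least-squares line; return residuals and line parameters."""
    t = np.arange(len(y))
    slope, intercept = np.polyfit(t, y, deg=1)
    return y - (slope * t + intercept), (slope, intercept)


def retrend_linear(residual_pred, t, params):
    """Invert the detrending for predictions made at time indices t."""
    slope, intercept = params
    return residual_pred + slope * np.asarray(t) + intercept


y = np.array([100., 150., 160., 200., 210., 150., 160., 120., 80., 70.])
residuals, params = detrend_linear(y)
```

The model is trained on the transformed target, and its predictions are mapped back with the inverse transform, here `retrend_linear`. The first `k` values of the differenced and ratio series are undefined.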

Time Aware Target Transformations (cont.)
• Example: Capture trends with tree based models
Without detrending With detrending

Prediction Intervals
• Based on the method from Williams & Goodman (1971)
• Very general approach:
• Makes no assumptions about the distribution of forecast errors
• Makes no assumptions about the model used to create forecasts
• General idea:
• Using time based holdout predictions to determine real forecast errors
• Constructing empirical prediction intervals based on forecast error quantiles
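The general idea can be sketched with NumPy: collect forecast errors on time-based holdout predictions, take their empirical quantiles, and add those to new forecasts. This follows the description above rather than any specific Driverless AI code; `empirical_interval` and its `coverage` parameter are hypothetical names.

```python
import numpy as np


def empirical_interval(holdout_actual, holdout_pred, forecast, coverage=0.9):
    """Prediction interval from the quantiles of holdout forecast errors."""
    errors = np.asarray(holdout_actual, dtype=float) - np.asarray(
        holdout_pred, dtype=float
    )
    alpha = 1.0 - coverage
    lo, hi = np.quantile(errors, [alpha / 2, 1 - alpha / 2])
    forecast = np.asarray(forecast, dtype=float)
    # No distributional assumption: the interval may be asymmetric.
    return forecast + lo, forecast + hi


lower, upper = empirical_interval(
    holdout_actual=[10, 12, 9, 11, 10],
    holdout_pred=[10, 10, 10, 10, 10],
    forecast=[20.0, 21.0],
    coverage=0.9,
)
```

Nothing here depends on the model that produced the forecasts, which is what makes the approach general.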

Masterminds behind DAI time series
• Data Scientists
• Former #1 & #4

Thank You
Twitter: @DmitryLarko
