This document discusses machine learning heuristics for short-term forecasting of time series data from self-tracking apps. It shows that classical forecasting methods such as linear regression, k-means clustering, and ARMA perform poorly on this type of noisy data. The document then presents a toolbox of forecasting heuristics and a randomized incremental algorithm that combines them using a term algebra. This approach achieves better average forecast accuracy than classical methods by addressing overfitting through regularization and other techniques. Forecasting is used in self-tracking apps to improve the user experience and to provide clues about the plausibility of causal hypotheses.
Machine Learning Heuristics for Short Time Series Forecasting with Quantified Self Data
Yves Caseau
National Academy of Technologies
Self-Tracking and Knomee Mobile App
Knomee is a self-tracking mobile app for iOS (one of many
thousands)
Knomee motto: « self-tracking with sense »
Data science applied to self tracking
Self-tracking apps generate time series
One to four values collected at each measurement over a period of time
Data is either self-declared (the user picks a value in a preset range) or automatically imported from a connected device (iPhone sensors, Apple Watch, or any HealthKit-compatible device such as a Withings scale)
Data files are accessible on:
https://github.com/ycaseau/KnomeeQuest/tree/master/data
20 samples
Ranging from 40 to 220 measures (x 4)
Quests: Causal Diagrams are proposed by the user
Self-tracking is organized around causal diagrams
A quest is made of a target tracker and up to three
factor trackers
The user makes the hypothesis that the factors may
contribute to the target
Using Judea Pearl's causal notation, we look for:
P(X | do(Y)): the impact of doing Y on X
Detect causality through active experiments
Correlation is not enough
A quest is a hypothesis; not all quests are meaningful
Factor causality is tricky (e.g. coffee as a symptom)
How to tell if the effort on factors is « worth it »?
Impact on the target
Key property of self-tracking data:
some input is purely random
{quest: ENERGY, icloud: true,
 energy: {
   type: 2, more: true,
   min: 1, max: 6, target: 4,
   labels: [crisis, sleepy, lapses, normal, energetic, hyper]},
 sleep: {
   type: 7, more: true,
   min: 4, max: 9, target: 7},
 steps: {
   type: 4, more: true,
   min: 0, max: 19000, target: 7000},
 weight: {
   type: 5, more: false,
   min: 75, max: 82, target: 78}
}
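A quest like the one above maps naturally onto a small data structure. A minimal Python sketch, with field names taken from the sample file (this is an interpretation, not Knomee's actual code):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Tracker:
    # One tracked variable: bounds, desired value, optional value labels
    type: int
    more: bool                          # assumption: True if higher is better
    min: float
    max: float
    target: float
    labels: Optional[List[str]] = None  # only for self-declared scales

@dataclass
class Quest:
    # A causal hypothesis: one target tracker + up to three factor trackers
    name: str
    target: Tracker
    factors: List[Tracker] = field(default_factory=list)  # at most 3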
Short Time-Series Forecasting
Our goal in this talk: how to forecast values from self-tracking data?
Forecasting gives a possible clue about the value of the causal hypothesis
(Granger causality)
We search for a robust method that does not break with random noise
Measuring success: iterative training protocol (sketched in code below)
For i in (2N/3 .. N), forecast TS[i] from (TS[1], …, TS[i-1])
– Apply the forecast at time[i]
– Measure the average distance to the real value TS[i]
– Compare to the « average » baseline performance
Realistic simulation of what happens in the app
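A minimal Python sketch of this protocol; `forecast` is a placeholder for any of the methods compared in this deck:

import numpy as np

def evaluate(ts, forecast):
    """Walk-forward protocol: forecast ts[i] from ts[:i] for i in 2N/3..N."""
    n = len(ts)
    errors, control = [], []
    for i in range(2 * n // 3, n):
        pred = forecast(ts[:i])                       # train on the prefix only
        errors.append(abs(pred - ts[i]))              # distance to the real value
        control.append(abs(np.mean(ts[:i]) - ts[i]))  # the "average" baseline
    return np.mean(errors), np.mean(control)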
Why it is hard:
short samples (small data)
mixed random inputs
Classical Methods yield poor results
Three classical ML algorithms, trained to
minimize distance, using implicit time
features and factors
Linear Regression
K-means Clustering (10 – 15 groups)
ARMA (AutoRegressive Moving Average)
Forecasting results are disappointing
The difficulty is not a surprise: we are trying to extract a small amount of information, which is only sometimes present
Improving a few percent over the average is the best we can expect
Overfitting very easily offsets the forecasting
gain
                     Linear Regression   K-means   ARMA
forecasting               18.34%          19.5%    18.9%
average                   17.5%           17.5%    17.5%
distance (squares)        0.655           0.81     0.525
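For reference, a sketch of how the ARMA baseline could be plugged into the walk-forward protocol above using statsmodels (the order (1, 0, 1) is an assumption; the deck does not give the exact hyper-parameters):

from statsmodels.tsa.arima.model import ARIMA

def arma_forecast(prefix):
    # ARMA(p, q) is ARIMA with d = 0; fit on the prefix, predict one step
    res = ARIMA(prefix, order=(1, 0, 1)).fit()
    return float(res.forecast(steps=1)[0])

# used with the protocol above: evaluate(ts, arma_forecast)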
[Diagram: the observed variation splits into random noise, variation linked to the tracked factors, and variation linked to non-collected factors; a « good quest » has a large factor-linked share, a « poor quest » is mostly random noise]
A Term-Algebra of Heuristics Combinations
Heuristic toolbox
MovingAverage – MA(k,discount)
Trend (time linear regression)
Weekly and Hourly patterns
Factor regression with explicit delay
CumSum (cumulative sum of differences to average)
Threshold regression with delay
Combined through a linear term algebra (see the sketch below)
Each term is a weighted combination of a few heuristics
Some other heuristics provide improvement with some quests but are left aside for lack
of robustness
Cycle analysis (detecting “biorhythms”)
Split (constant until date X, then T)
useful when something changed.
And(t1,t2) : Boolean conjunction of two factors
Mix[0.97](
  T[2.25-2.02/-1.00],
  wAvg["target"](10,1.00))
+ Cor[0.04]("track2"+16)
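A minimal Python sketch of what such an algebra can look like (names mirror the toolbox above; the exact semantics in Knomee may differ):

from typing import Callable, List

Series = List[float]                    # measures ordered by time
Heuristic = Callable[[Series], float]   # prefix -> forecast of next value

def moving_average(k: int, discount: float) -> Heuristic:
    """MA(k, discount): discounted mean of the last k measures."""
    def f(ts: Series) -> float:
        window = ts[-k:]
        weights = [discount ** (len(window) - 1 - i) for i in range(len(window))]
        return sum(w * x for w, x in zip(weights, window)) / sum(weights)
    return f

def trend() -> Heuristic:
    """Trend: linear regression on time, extrapolated one step ahead."""
    def f(ts: Series) -> float:
        n = len(ts)
        if n < 2:
            return ts[-1]
        xbar, ybar = (n - 1) / 2, sum(ts) / n
        slope = (sum((x - xbar) * (y - ybar) for x, y in enumerate(ts))
                 / sum((x - xbar) ** 2 for x in range(n)))
        return ybar + slope * (n - xbar)
    return f

def mix(w: float, t1: Heuristic, t2: Heuristic) -> Heuristic:
    """Mix[w](t1, t2): weighted linear combination of two sub-terms."""
    return lambda ts: w * t1(ts) + (1 - w) * t2(ts)

# e.g. a term shaped like the printed example above:
term = mix(0.97, trend(), moving_average(10, 1.0))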
Distances and Regularization
Time-series operations are weighted
The weight of each measure is proportional to the
distance to its next neighbor
Spaced measures are more important than repeated
ones
« Triangular distance »
The distance between two time series is the area
between the two curves
Regularization to avoid overfitting
Principle: add a penalty to the distance that reduces
the overall standard deviation
Best formula found for this data set (sketched below):
wDist(a,t) + max(0.0, stdev(a) - 0.02)
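A sketch of these two ingredients in Python; the trapezoid reading of the « triangular distance » is an assumption based on the area-between-curves description:

import numpy as np

def triangular_distance(times, a, b):
    """Area between two curves sampled at the same (irregular) times."""
    gap = np.diff(times)                      # spaced measures weigh more
    diff = np.abs(np.asarray(a) - np.asarray(b))
    return float(np.sum(gap * 0.5 * (diff[:-1] + diff[1:])))  # trapezoids

def regularized_distance(times, forecast, actual):
    # penalty shrinks the forecast's spread to fight overfitting
    penalty = max(0.0, float(np.std(forecast)) - 0.02)
    return triangular_distance(times, forecast, actual) + penalty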
Randomized Incremental Algorithms
Main algorithm is “Randomized Optimization” (RandOpt)
Create n random algebra terms
Combination of greedy heuristics (create the best possible term)
And randomization (coefficients / which sub-term to pick)
Depth is controlled with a global parameter
Optimized through local optimization
Each parameter of the algebra sub-terms (i.e., coefficients, delays, etc.) is optimized one by one
Hill-climbing local meta-heuristic
Three successive rounds
This is used in an "incremental mode" (sketched below):
For each new measure
Reuse previous best term, and improve through local optimization
Run "RandOpt" (100 iterations)
Keep best term
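A runnable sketch of this skeleton, with terms reduced to flat parameter vectors for simplicity (the real search works over algebra terms, not vectors):

import random

def hill_climb(params, score, rounds=3, step=0.1):
    """Optimize each parameter one by one, over three successive rounds."""
    best, best_s = list(params), score(params)
    for _ in range(rounds):
        for i in range(len(best)):
            for delta in (-step, step):
                cand = list(best)
                cand[i] += delta
                if (s := score(cand)) < best_s:
                    best, best_s = cand, s
    return best

def rand_opt(score, dim, n=100, previous_best=None):
    """n random terms + the previous best (incremental mode), each improved."""
    pool = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]
    if previous_best is not None:
        pool.append(previous_best)
    return min((hill_climb(p, score) for p in pool), key=score)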
What has not worked out so far
Evolutionary (genetic algorithm with cross-over)
Mutation (large-neighborhood local optimization)
Computational results
Average forecast error is 16.88% (control: the « average » baseline scores 17.5%)
Average square distance is 1.03 (worse than LR, ARMA, or k-means) because of regularization
Strong measures against overfitting (regularization, controlled term depth, a limited number of local optimization loops, plus other techniques)
Conclusion
Forecasting for self-tracking data is hard
We presented a reinforcement-based, generative machine learning approach that performs better than most classical techniques
This is due to the complex nature of the data
On (classical) sales time series, ARMA does better than the proposed approach
(close to LR)
Open question: how to detect the "intrinsic quality" of the quest and change the forecasting method / regularization parameters accordingly?
You can download the data and try your own approaches
Forecasting is used for two purposes in our mobile app:
User experience: forecasting makes data entry faster + gives a sense of playfulness
Granger causality: when the forecasting score is "good", this gives a sense of plausibility to the causal diagram hypothesis (represented by the "quest")