Abstract: Machine learning models may be very powerful, but many data sets are only released in aggregated form, precluding their direct use. Various heuristics can be used to bridge the gap, but they are typically domain-specific. The data augmentation (DA) algorithm, a classic tool from Bayesian computation, can be applied more generally. We will present a brief review of DA and how to apply it to disaggregation problems. We will also discuss a case study on disaggregating daily pricing data, along with a reference R package implementation.
2. ● Many data sets are only available in aggregated form,
● precluding use of stock statistics / ML directly.
● Data augmentation, a classic tool from Bayesian computation,
can be applied more generally.
● Disaggregating across and within observations
Executive Summary
4. A Data Set
n Price
42 2.406
33 2.283
10 2.114
10 2.815
2 1.691
1 2.033
1 2.061
1 0.133
1 0.627
5. ● Like to use Price ~ LN(μ, σ²)
● Lognormal has a nice interpretation as a random walk of ±%
● Also won't go negative
● Common Alternatives: Exponential, Gamma
● Estimate both parameters for later use
● Actually, we want to do so for 10k items
Modeling Price
6. Log-normal Recap
● If Y ~ N(μ, σ²), X = exp(Y) ~ LN(μ, σ²)
● E(X) = exp(μ + σ²/2)
● Var(X) = [exp(σ²) - 1] exp(2μ + σ²)
● Standard estimators:
● MoM - uses log of mean of X
● MLE - uses mean of log X
7. Log-normal Recap
● Method of Moments
● s² = ln(Σ X² / N) - 2 ln(Σ X / N)
● m = ln(Σ X / N) - s²/2
● Maximum Likelihood
● m = Σ ln X / N
● s² = Σ (ln X - m)² / N
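Both estimators are a few lines of R on raw (unaggregated) prices. A minimal sketch; the function names are mine, not from the reference package:

```r
# Method of moments for LN(m, s2): uses the log of the mean of X
lnorm_mom <- function(x) {
  N  <- length(x)
  s2 <- log(sum(x^2) / N) - 2 * log(sum(x) / N)
  m  <- log(sum(x) / N) - s2 / 2
  c(m = m, s2 = s2)
}

# Maximum likelihood for LN(m, s2): uses the mean of log(X)
lnorm_mle <- function(x) {
  m  <- mean(log(x))
  s2 <- mean((log(x) - m)^2)
  c(m = m, s2 = s2)
}
```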
8. Estimation v0.1
What if we just ignore n and plug hourly averages into our estimators?
=> Gives equal weight to (n=1, $=0.133) as (n=42, $=2.406)
=> Everything is biased towards the small observations
9. Estimation v0.2
What if we just plug in weighted sample averages?
● Method of Moments:
● m = 0.342, s² = 0.996
● Expected Value: exp(0.342 + 0.996/2) = 2.32
● Maximum Likelihood:
● m = 0.811, s² = 0.105
● Expected Value: exp(0.811 + 0.105/2) = 2.37
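The weighted MLE plug-in is easy to reproduce in R: treat each group mean as if it were n identical prices. A sketch on the example data (the MoM version is analogous; small differences come from rounding of the displayed inputs):

```r
n     <- c(42, 33, 10, 10, 2, 1, 1, 1, 1)
price <- c(2.406, 2.283, 2.114, 2.815, 1.691, 2.033, 2.061, 0.133, 0.627)

# Weighted MLE plug-in: pretend each group mean is n identical prices
m  <- sum(n * log(price)) / sum(n)           # ~0.811
s2 <- sum(n * (log(price) - m)^2) / sum(n)   # ~0.105
exp(m + s2 / 2)                              # ~2.37
```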
11. Are these trustworthy?
To check if these make sense:
● Simulate from both estimates as ground truth
● Apply both estimators
● Inspect bias
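A minimal version of this check in R, shown for the weighted MLE plug-in with slide 9's estimates taken as the assumed truth (the MoM variant is analogous):

```r
set.seed(1)
n <- c(42, 33, 10, 10, 2, 1, 1, 1, 1)

# Generate aggregated data from assumed-true parameters, re-estimate with the
# weighted plug-in, and average the error over many replications
simulate_once <- function(mu, s2) {
  price  <- sapply(n, function(k) mean(rlnorm(k, mu, sqrt(s2))))
  m_hat  <- sum(n * log(price)) / sum(n)
  s2_hat <- sum(n * (log(price) - m_hat)^2) / sum(n)
  c(m = m_hat, s2 = s2_hat)
}
reps <- t(replicate(2000, simulate_once(0.811, 0.105)))
colMeans(reps) - c(0.811, 0.105)   # bias of the plug-in estimator
```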
13. Why are these not working?
● Many distributions are additive:
● N(0,1) + N(1,1) => N(1,2)
● Pois(4) + Pois(5) => Pois(9)
● Log-normal is not!
● So (n=42, $=2.406) is not LN, even if the individual prices are
● It is in fact a marginal distribution
● containing 41 integrals :(
● What about the CLT?
● Even if (n=42) is approx N, (n=10) and (n=2) are probably not
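A quick way to see this in R (parameters here are illustrative): if the mean of k lognormal prices were itself lognormal, its log would be normal, which a QQ plot contradicts for small k.

```r
set.seed(1)
# Mean of k = 2 lognormal prices; if it were LN, log(mean) would be normal
x_bar <- replicate(1e4, mean(rlnorm(2, meanlog = 0.8, sdlog = 1)))
qqnorm(log(x_bar)); qqline(log(x_bar))   # tails bend away from the line
```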
14. A Data Set
n Price
42 2.406
33 2.283
10 2.114
10 2.815
2 1.691
1 2.033
1 2.061
1 0.133
1 0.627
15. Part 1
Main Points
Violate iid at your own risk!
● Do NOT plug and chug
● Do NOT expect weights to fix your problem
● Do NOT use predictive models
● Do NOT use multi-armed bandits
Get better, unaggregated data!
… but if you can't ...
19. Estimation
● MCMC using stock methods, e.g. Metropolis-Hastings
● MH requires:
● Target distribution / probability model
● State transition functions / proposal distributions
● MH outputs:
● Numerical samples from the target distribution
20. Proposal Distribution
● Transitions on m and s² - easy
● Transitions on the missing prices?
● hourly constraints on total $
● Don't want to propose out-of-bounds
● Option 1 - draw from a Dirichlet,
● use that to disaggregate, transition whole hours at once
● Big steps => lots of rejections
● Option 2 - pairwise transitions within a group (see the sketch below)
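A sketch of Option 2 in R under the LN(m, s2) model (my sketch, not the reference implementation). The proposal shifts money between two prices in the same group, so the group total (and hence the observed average) is preserved; because it is symmetric, the MH ratio reduces to the likelihood ratio of the two changed prices.

```r
# One pairwise transition within a group of latent prices
pairwise_step <- function(prices, m, s2, step = 0.1) {
  i <- sample(length(prices), 2)          # pick two prices in the group
  delta <- runif(1, -step, step)
  prop <- prices
  prop[i[1]] <- prop[i[1]] + delta        # shift money between the pair;
  prop[i[2]] <- prop[i[2]] - delta        # the group total is unchanged
  if (any(prop[i] <= 0)) return(prices)   # out of bounds => reject
  log_r <- sum(dlnorm(prop[i],   m, sqrt(s2), log = TRUE)) -
           sum(dlnorm(prices[i], m, sqrt(s2), log = TRUE))
  if (log(runif(1)) < log_r) prop else prices
}

# e.g. initialize the n = 42 group at its observed mean and take one step
prices <- pairwise_step(rep(2.406, 42), m = 0.811, s2 = 0.105)
```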
23. Part 2
Main Points
Switching from aggregated to long format shows that
aggregation can be thought of as a form of missing data.
However, group averages => constraints on the missing data.
In our example data, 97/101 points are missing,
but we can still get reasonable estimates via MCMC.
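The aggregated-to-long conversion is mechanical; in R (column names are mine):

```r
n     <- c(42, 33, 10, 10, 2, 1, 1, 1, 1)
price <- c(2.406, 2.283, 2.114, 2.815, 1.691, 2.033, 2.061, 0.133, 0.627)

# One row per underlying price; only the n = 1 groups are fully observed
long <- data.frame(
  group = rep(seq_along(n), times = n),
  price = ifelse(rep(n, n) == 1, rep(price, n), NA)
)
sum(is.na(long$price))   # 97 of 101 prices are missing
```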
25. A Data Set
n Price
42 2.406
33 2.283
10 2.114
10 2.815
2 1.691
1 2.033
1 2.061
1 0.133
1 0.627
26. Additional Challenges
What if aggregation is over multiple heterogeneous groups, and we need
to split the money between the groups ("disaggregate")?
Do we know the split a priori?
What if we don't?
27. A Grouped Data Set
Known Groups
Desktop Mobile Price
38 4 2.406
27 6 2.283
2 8 2.114
6 4 2.815
0 2 1.691
0 1 2.033
1 0 2.061
1 0 0.133
0 1 0.627
28. Common Heuristics
● Linear disaggregation (sketched below)
● Weighted averages by another name
● Doesn't account for variation in other columns
● Iterative Proportional Fitting
● If you have subtotals in all dimensions
● Alternates disaggregating by rows/columns
Desktop Mobile Price
38 4 2.406
27 6 2.283
2 8 2.114
6 4 2.815
0 2 1.691
0 1 2.033
1 0 2.061
1 0 0.133
0 1 0.627
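For concreteness, linear disaggregation on the table above is just a proportional split. A sketch of the heuristic in R, assuming desktop and mobile prices are equal within an hour:

```r
desktop <- c(38, 27, 2, 6, 0, 0, 1, 1, 0)
mobile  <- c( 4,  6, 8, 4, 2, 1, 0, 0, 1)
price   <- c(2.406, 2.283, 2.114, 2.815, 1.691, 2.033, 2.061, 0.133, 0.627)

# Split each hour's revenue in proportion to counts
total       <- (desktop + mobile) * price
desktop_rev <- total * desktop / (desktop + mobile)
mobile_rev  <- total - desktop_rev
```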
30. A Grouped Data Set
Unknown Groups
n Prime Sub Price
42 ? ? 2.406
33 ? ? 2.283
10 ? ? 2.114
10 ? ? 2.815
2 ? ? 1.691
1 ? ? 2.033
1 ? ? 2.061
1 ? ? 0.133
1 ? ? 0.627
31. A Grouped Data Set
Unknown Groups
n Prime Sub Price
42 30 12 2.406
33 23 9 2.283
10 7 3 2.114
10 8 2 2.815
2 2 0 1.691
1 1 0 2.033
1 1 0 2.061
1 0 1 0.133
1 0 1 0.627
33. Part 3
Main Points
By extending the previous model, we can deal with
"heterogeneous aggregates".
If the grouping variable is known, solve it like a regression problem.
If it is not known / latent, solve it like a mixture problem (see the sketch below).
Either way, going Bayes lets you borrow strength between aggregates,
which disaggregation heuristics are not good at.
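In the latent-group case, one additional MH move type suffices in principle: reassign a single latent price between groups and accept by the likelihood ratio. A hypothetical sketch (names are mine; equal mixing weights assumed for brevity, otherwise the ratio would also include the mixing proportions):

```r
# z[i] in {1, 2} is the latent group of price i; pars holds per-group LN params
flip_label <- function(z, prices, pars) {
  i <- sample(length(z), 1)
  z_new <- z
  z_new[i] <- 3 - z[i]                    # toggle group 1 <-> 2
  log_r <- dlnorm(prices[i], pars$m[z_new[i]], sqrt(pars$s2[z_new[i]]), log = TRUE) -
           dlnorm(prices[i], pars$m[z[i]],     sqrt(pars$s2[z[i]]),     log = TRUE)
  if (log(runif(1)) < log_r) z_new else z
}
```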