
Data Augmentation and Disaggregation by Neal Fultz


Abstract: Machine learning models may be very powerful, but many data sets are only released in aggregated form, precluding their use directly. Various heuristics can be used to bridge the gap, but they are typically domain-specific. The data augmentation algorithm, a classic tool from Bayesian computation, can be applied more generally. We will present a brief review of DA and how to apply it to disaggregation problems. We will also discuss a case study on disaggregating daily pricing data, along with a reference implementation as an R package.


1. Data Augmentation and Disaggregation
   Neal Fultz
   nfultz@system1.com
   July 26, 2017
   https://goo.gl/6uQrss
2. Executive Summary
   ● Many data sets are only available in aggregated form
     ○ Precluding use of stock statistics / ML directly.
   ● Data augmentation, a classic tool from Bayesian computation, can be applied more generally.
     ○ Disaggregating across and within observations
3. Part 1: Motivating Example
4. A Data Set

   n    Price
   42   2.406
   33   2.283
   10   2.114
   10   2.815
   2    1.691
   1    2.033
   1    2.061
   1    0.133
   1    0.627
5. Modeling Price
   ● We would like to use Price ~ LN(μ, σ²)
     ○ The lognormal has a nice interpretation as a random walk of ± percent changes
     ○ It also won't go negative
     ○ Common alternatives: Exponential, Gamma
   ● Estimate both parameters for later use
   ● Actually, we want to do so for 10k items
6. Log-normal Recap
   ● If Y ~ N(μ, σ²), then X = exp(Y) ~ LN(μ, σ²)
   ● E(X) = exp(μ + σ²/2)
   ● Var(X) = [exp(σ²) - 1] exp(2μ + σ²)
   ● Standard estimators:
     ○ MoM - uses the log of the mean of X
     ○ MLE - uses the mean of the log of X
7. Log-normal Recap
   ● Method of Moments
     ○ s² = ln(Σ X² / N) - 2 ln(Σ X / N)
     ○ m = ln(Σ X / N) - s² / 2
   ● Maximum Likelihood
     ○ m = Σ ln X / N
     ○ s² = Σ (ln X - m)² / N
   (Both estimators are sketched in R below.)
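For concreteness, here is a minimal R sketch of the two estimators applied to a vector of individual (unaggregated) prices; the helper names and the simulated sample are illustrative, not from the talk.

# Plug-in estimators for LN(m, s2) from individual prices x.
lnorm_mom <- function(x) {
  s2 <- log(mean(x^2)) - 2 * log(mean(x))  # s2 = ln(mean of X^2) - 2 ln(mean of X)
  c(m = log(mean(x)) - s2 / 2, s2 = s2)    # m  = ln(mean of X) - s2/2
}

lnorm_mle <- function(x) {
  m <- mean(log(x))                        # m  = mean of ln X
  c(m = m, s2 = mean((log(x) - m)^2))      # s2 = mean squared deviation of ln X
}

set.seed(42)
x <- rlnorm(10000, meanlog = 0.8, sdlog = sqrt(0.1))  # simulated individual prices
lnorm_mom(x)
lnorm_mle(x)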
8. Estimation v0.1
   What if we just ignore n and plug the hourly averages into our estimators?
   => Gives as much weight to (n=1, $=0.133) as to (n=42, $=2.406)
   => Everything is biased towards the small observations
9. Estimation v0.2
   What if we just plug in weighted sample averages?
   ● Method of Moments:
     ○ m = 0.342, s² = 0.996
     ○ Expected value: exp(0.342 + 0.996/2) = 2.32
   ● Maximum Likelihood:
     ○ m = 0.811, s² = 0.105
     ○ Expected value: exp(0.811 + 0.105/2) = 2.37
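The weighted plug-in mechanics look roughly like the following R sketch. This only shows the weighting-by-n idea; the slide's reported values come from the talk's own data handling and this sketch does not claim to reproduce them.

# Aggregated data from the slides: n prices averaged into each observed Price.
n <- c(42, 33, 10, 10, 2, 1, 1, 1, 1)
p <- c(2.406, 2.283, 2.114, 2.815, 1.691, 2.033, 2.061, 0.133, 0.627)

# Weighted plug-in MoM: pretend every unit in a group equals its group mean.
s2_mom <- log(sum(n * p^2) / sum(n)) - 2 * log(sum(n * p) / sum(n))
m_mom  <- log(sum(n * p) / sum(n)) - s2_mom / 2

# Weighted plug-in MLE: the same pretense, on the log scale.
m_mle  <- sum(n * log(p)) / sum(n)
s2_mle <- sum(n * (log(p) - m_mle)^2) / sum(n)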
10. Are these trustworthy?
    To check whether these make sense:
    ● Simulate from both estimates as ground truth
    ● Apply both estimators
    ● Inspect the bias
    (A simulation sketch follows.)
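One way to run that check in R, as a sketch under assumed true parameters: simulate individual prices, aggregate them with the same group sizes as the data, re-estimate, and look at the drift.

set.seed(1)
n <- c(42, 33, 10, 10, 2, 1, 1, 1, 1)

one_rep <- function(m = 0.8, s2 = 0.1) {
  prices <- rlnorm(sum(n), meanlog = m, sdlog = sqrt(s2))  # true unit prices
  p <- tapply(prices, rep(seq_along(n), n), mean)          # aggregate to group means
  m_hat  <- sum(n * log(p)) / sum(n)                       # weighted plug-in MLE
  s2_hat <- sum(n * (log(p) - m_hat)^2) / sum(n)
  c(m = m_hat, s2 = s2_hat)
}

rowMeans(replicate(2000, one_rep()))  # compare against truth: m = 0.8, s2 = 0.1
# Averaging within groups shrinks the spread, so s2_hat comes out biased low.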
11. Why are these not working?
    ● Many distributions are additive
      ○ N(0,1) + N(1,2) => N(1,3)
      ○ Pois(4) + Pois(5) => Pois(9)
    ● The log-normal is not!
      ○ So (n=42, $=2.406) is not LN, even if the individual prices are
      ○ It is in fact a marginal distribution
        ■ containing 41 integrals :(
    ● What about the CLT?
      ○ Even if (n=42) is approximately normal, (n=10) and (n=2) are probably not
12. A Data Set

    n    Price
    42   2.406
    33   2.283
    10   2.114
    10   2.815
    2    1.691
    1    2.033
    1    2.061
    1    0.133
    1    0.627
13. Part 1 Main Points
    Violate i.i.d. at your own risk!
    ● Do NOT plug and chug
    ● Do NOT expect weights to fix your problem
    ● Do NOT use predictive models
    ● Do NOT use multi-armed bandits
    Get better, unaggregated data! … but if you can't …
14. Part 2: Data Augmentation
15. A Data Set

    n    Price
    42   2.406
    33   2.283
    10   2.114
    10   2.815
    2    1.691
    1    2.033
    1    2.061
    1    0.133
    1    0.627
16. Long format…

    id    Group   Price
    1     1       2.406
    2     1       2.406
    3     1       2.406
    ...   ...     ...
    96    5       1.691
    97    5       1.691
    98    6       2.033
    99    7       2.061
    100   8       0.133
    101   9       0.627
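In R, the reshape from the aggregated table to this long format is a one-liner with rep(); a minimal sketch, where the per-unit prices are placeholders (the group mean) until the sampler imputes them:

agg <- data.frame(n = c(42, 33, 10, 10, 2, 1, 1, 1, 1),
                  price = c(2.406, 2.283, 2.114, 2.815, 1.691,
                            2.033, 2.061, 0.133, 0.627))

long <- data.frame(id    = seq_len(sum(agg$n)),
                   group = rep(seq_len(nrow(agg)), agg$n),
                   price = rep(agg$price, agg$n))  # placeholder: group mean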
17. Estimation
    ● MCMC using stock methods, e.g. Metropolis-Hastings
    ● MH requires:
      ○ A target distribution / probability model
      ○ State transition functions / proposal distributions
    ● MH outputs:
      ○ Numerical samples from the target distribution
18. Proposal Distribution
    ● Transitions on m and s² - easy
    ● Transitions on the T missing prices?
      ○ Hourly constraints on the total $
        ■ Don't want to propose out-of-bounds
      ○ Option 1 - draw from a Dirichlet,
        ■ use it to disaggregate, transitioning a whole hour at once
        ■ Big steps => lots of rejections
      ○ Option 2 - pairwise transitions within a group (sketched below)
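A minimal R sketch of Option 2's pairwise move, assuming the long-format table above and a current draw of (m, s²): shift an amount delta between two units of the same group, so the group total (and hence its observed mean) is preserved. The proposal is symmetric, so the MH ratio reduces to a likelihood ratio. The function name and tuning constant are illustrative.

pairwise_step <- function(price, group, m, s2, delta_sd = 0.1) {
  gs  <- unique(group[duplicated(group)])      # groups with at least 2 units
  g   <- gs[sample(length(gs), 1)]
  idx <- sample(which(group == g), 2)          # two units within that group
  delta <- rnorm(1, 0, delta_sd)
  prop  <- price[idx] + c(delta, -delta)       # group total is unchanged
  if (any(prop <= 0)) return(price)            # stay inside lognormal support
  log_r <- sum(dlnorm(prop,       m, sqrt(s2), log = TRUE)) -
           sum(dlnorm(price[idx], m, sqrt(s2), log = TRUE))
  if (log(runif(1)) < log_r) price[idx] <- prop
  price
}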
19. Part 2 Main Points
    Switching from aggregated to long format shows that aggregation can be thought of as a form of missing data.
    However, the group averages impose constraints on the missing data.
    In our example data, 97 of 101 points are missing, but we can still get reasonable estimates via MCMC.
20. Part 3: Disaggregation
21. A Data Set

    n    Price
    42   2.406
    33   2.283
    10   2.114
    10   2.815
    2    1.691
    1    2.033
    1    2.061
    1    0.133
    1    0.627
22. Additional Challenges
    What if the aggregation is over multiple heterogeneous groups, and we need to split the money between the groups ("disaggregate")?
    Do we know the split a priori? What if we don't?
23. A Grouped Data Set - Known Groups

    Desktop   Mobile   Price
    38        4        2.406
    27        6        2.283
    2         8        2.114
    6         4        2.815
    0         2        1.691
    0         1        2.033
    1         0        2.061
    1         0        0.133
    0         1        0.627
24. Common Heuristics
    ● Linear disaggregation
      ○ Weighted averages by another name
      ○ Doesn't account for variation in other columns
    ● Iterative Proportional Fitting (sketched below)
      ○ Applies if you have subtotals in all dimensions
      ○ Alternates disaggregating by rows/columns

    Desktop   Mobile   Price
    38        4        2.406
    27        6        2.283
    2         8        2.114
    6         4        2.815
    0         2        1.691
    0         1        2.033
    1         0        2.061
    1         0        0.133
    0         1        0.627
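For reference, a bare-bones IPF sketch in R, assuming a strictly positive seed matrix and margins that sum to the same total (both are needed for convergence); the function name and example numbers are illustrative.

ipf <- function(seed, row_tot, col_tot, tol = 1e-8, max_iter = 1000) {
  x <- seed
  for (i in seq_len(max_iter)) {
    x <- x * (row_tot / rowSums(x))               # rescale rows to row totals
    x <- sweep(x, 2, col_tot / colSums(x), `*`)   # rescale cols to col totals
    if (max(abs(rowSums(x) - row_tot)) < tol) break
  }
  x
}

# Example: split two row totals across two columns from a uniform seed.
ipf(matrix(1, 2, 2), row_tot = c(30, 70), col_tot = c(40, 60))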
25. Long format…

    id    Group   mobile   Price
    1     1       1        2.406
    2     1       0        2.406
    3     1       1        2.406
    ...   ...     ...      ...
    96    5       1        1.691
    97    5       1        1.691
    98    6       1        2.033
    99    7       0        2.061
    100   8       0        0.133
    101   9       1        0.627
26. A Grouped Data Set - Unknown Groups

    n    Prime   Sub   Price
    42   ?       ?     2.406
    33   ?       ?     2.283
    10   ?       ?     2.114
    10   ?       ?     2.815
    2    ?       ?     1.691
    1    ?       ?     2.033
    1    ?       ?     2.061
    1    ?       ?     0.133
    1    ?       ?     0.627
27. A Grouped Data Set - Unknown Groups

    n    Prime   Sub   Price
    42   30      12    2.406
    33   23      9     2.283
    10   7       3     2.114
    10   8       2     2.815
    2    2       0     1.691
    1    1       0     2.033
    1    1       0     2.061
    1    0       1     0.133
    1    0       1     0.627
28. Part 3 Main Points
    By extending the previous model, we can deal with "heterogeneous aggregates".
    If the grouping variable is known, solve it like a regression problem. If it is unknown / latent, solve it like a mixture problem (a label-sampling sketch follows below).
    Either way, going Bayes lets you borrow strength between aggregates, which disaggregation heuristics are not good at.
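For the latent case, the extra step in the sampler is conceptually small. A hypothetical sketch, assuming two lognormal components with current parameters m, s2 and mixing weights pi: sample each unit's label from its posterior component probabilities. This helper is illustrative, one piece of a larger Gibbs/MH sampler, not the talk's reference implementation.

sample_labels <- function(price, m, s2, pi) {
  # price: vector of unit prices; m, s2, pi: length-2 component parameters
  ll <- sapply(1:2, function(k)
    dlnorm(price, m[k], sqrt(s2[k]), log = TRUE) + log(pi[k]))
  pr <- exp(ll - apply(ll, 1, max))   # stabilize before normalizing
  pr <- pr / rowSums(pr)
  apply(pr, 1, function(p) sample(1:2, 1, prob = p))
}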
29. Questions?
