Data Augmentation and Disaggregation
Neal Fultz
nfultz@system1.com
July 26, 2017
https://goo.gl/6uQrss
Executive Summary
● Many data sets are only available in aggregated form
○ Precluding use of stock statistics / ML directly.
● Data augmentation, a classic tool from Bayesian computation, can be applied more generally.
○ Disaggregating across and within observations
Part 1: Motivating Example
A Data Set
n Price
42 2.406
33 2.283
10 2.114
10 2.815
2 1.691
1 2.033
1 2.061
1 0.133
1 0.627
Modeling Price
● Like to use \( \text{Price} \sim \mathrm{LN}(\mu, \sigma^2) \)
○ Lognormal has a nice interpretation as a random walk of ± % changes
○ Also won’t go negative
○ Common alternatives: Exponential, Gamma
● Estimate both parameters for later use
● Actually, we want to do so for 10k items
Log-normal Recap
● If \( Y \sim N(\mu, \sigma^2) \), then \( X = \exp(Y) \sim \mathrm{LN}(\mu, \sigma^2) \)
● \( E(X) = \exp(\mu + \sigma^2 / 2) \)
● \( \mathrm{Var}(X) = [\exp(\sigma^2) - 1] \exp(2\mu + \sigma^2) \)
● Standard estimators:
○ MoM - uses log of mean of X
○ MLE - uses mean of log X
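The moment formulas in this recap can be checked numerically. The sketch below is only an illustration: the parameter values 0.8 and 0.1 are hypothetical, and the helper names are not from the talk. It compares the closed-form mean and variance with the sample mean and variance of simulated log-normal draws.

```python
import math
import random

def lognormal_mean(mu, sigma2):
    """E(X) for X ~ LN(mu, sigma2)."""
    return math.exp(mu + sigma2 / 2)

def lognormal_var(mu, sigma2):
    """Var(X) for X ~ LN(mu, sigma2)."""
    return (math.exp(sigma2) - 1) * math.exp(2 * mu + sigma2)

# Hypothetical parameter values, chosen only for illustration.
mu, sigma2 = 0.8, 0.1
rng = random.Random(0)
# lognormvariate takes the standard deviation of log X, not the variance.
draws = [rng.lognormvariate(mu, math.sqrt(sigma2)) for _ in range(100_000)]

sample_mean = sum(draws) / len(draws)
sample_var = sum((x - sample_mean) ** 2 for x in draws) / len(draws)

print("mean: formula", lognormal_mean(mu, sigma2), "simulated", sample_mean)
print("var:  formula", lognormal_var(mu, sigma2), "simulated", sample_var)
```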
Log-normal Recap
● Method of Moments
○ \( s^2 = \ln(\sum X^2 / N) - 2 \ln(\sum X / N) \)
○ \( m = \ln(\sum X / N) - s^2 / 2 \)
● Maximum Likelihood
○ \( m = \sum \ln X / N \)
○ \( s^2 = \sum (\ln X - m)^2 / N \)
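Both estimators translate directly into code when individual (unaggregated) observations are available. The function names below are made up for this sketch; the formulas are the ones on this slide.

```python
import math

def lognormal_mom(xs):
    """Method-of-moments estimates (m, s2) from individual observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_x2 = sum(x * x for x in xs) / n
    s2 = math.log(mean_x2) - 2 * math.log(mean_x)
    m = math.log(mean_x) - s2 / 2
    return m, s2

def lognormal_mle(xs):
    """Maximum-likelihood estimates (m, s2) from individual observations."""
    n = len(xs)
    logs = [math.log(x) for x in xs]
    m = sum(logs) / n
    s2 = sum((v - m) ** 2 for v in logs) / n
    return m, s2

# Tiny hand-made sample, used only to show the call pattern.
sample = [1.0, math.exp(2.0)]
print(lognormal_mom(sample))
print(lognormal_mle(sample))
```

MoM works on the raw scale (log of the mean), while the MLE works on the log scale (mean of the logs), which is why the two respond differently once the inputs are averages rather than individual prices.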
Estimation v0.1
What if we just ignore n and plug the hourly averages into our estimators?
=> Gives equal weight to (n=1, $=0.133) as to (n=42, $=2.406)
=> Everything is biased towards the small observations
Estimation v0.2
What if we just plug in weighted sample averages?
● Method of Moments:
○ \( m = 0.342 \), \( s^2 = 0.996 \)
○ Expected Value: exp(.342 + .996/2) = 2.32
● Maximum Likelihood:
○ \( m = 0.811 \), \( s^2 = 0.105 \)
○ Expected Value: exp(.811 + .105/2) = 2.37
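A minimal sketch of the weighted plug-in, assuming that "weighted" means each hourly average is weighted by its n (the slide does not spell this out). Under that assumption the maximum-likelihood values come out close to the figures above; the method-of-moments values depend on how the second moment was weighted and need not match the slide.

```python
import math

# (n, average price) rows from the data set slide.
ROWS = [(42, 2.406), (33, 2.283), (10, 2.114), (10, 2.815), (2, 1.691),
        (1, 2.033), (1, 2.061), (1, 0.133), (1, 0.627)]

def weighted_mle(rows):
    """Plug group averages into the MLE formulas, weighting each row by n."""
    total = sum(n for n, _ in rows)
    m = sum(n * math.log(p) for n, p in rows) / total
    s2 = sum(n * (math.log(p) - m) ** 2 for n, p in rows) / total
    return m, s2

def weighted_mom(rows):
    """Plug group averages into the MoM formulas, weighting each row by n."""
    total = sum(n for n, _ in rows)
    mean_x = sum(n * p for n, p in rows) / total
    mean_x2 = sum(n * p * p for n, p in rows) / total
    s2 = math.log(mean_x2) - 2 * math.log(mean_x)
    m = math.log(mean_x) - s2 / 2
    return m, s2

for name, (m, s2) in [("MoM", weighted_mom(ROWS)), ("MLE", weighted_mle(ROWS))]:
    print(name, "m =", round(m, 3), "s2 =", round(s2, 3),
          "expected value =", round(math.exp(m + s2 / 2), 2))
```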
Are these trustworthy?
To check if these make sense:
● Simulate from both estimates as ground truth
● Apply both estimators
● Inspect bias
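The check described above can be sketched as a small simulation. Assumptions in this sketch: group sizes are taken from the n column of the data set, weights equal n, and the number of replications is arbitrary. Individual prices are drawn from a known log-normal, collapsed to group averages, and fed to both weighted estimators; the printed values are the average estimation errors.

```python
import math
import random

GROUP_SIZES = [42, 33, 10, 10, 2, 1, 1, 1, 1]  # the n column of the data set

def simulate_rows(m, s2, rng):
    """Draw individual LN(m, s2) prices, then keep only (n, group average)."""
    s = math.sqrt(s2)
    return [(n, sum(rng.lognormvariate(m, s) for _ in range(n)) / n)
            for n in GROUP_SIZES]

def weighted_mom(rows):
    total = sum(n for n, _ in rows)
    mean_x = sum(n * p for n, p in rows) / total
    mean_x2 = sum(n * p * p for n, p in rows) / total
    s2 = math.log(mean_x2) - 2 * math.log(mean_x)
    return math.log(mean_x) - s2 / 2, s2

def weighted_mle(rows):
    total = sum(n for n, _ in rows)
    m = sum(n * math.log(p) for n, p in rows) / total
    s2 = sum(n * (math.log(p) - m) ** 2 for n, p in rows) / total
    return m, s2

def average_estimates(estimator, m, s2, reps=300, seed=0):
    """Average (m_hat, s2_hat) over repeated simulated aggregated data sets."""
    rng = random.Random(seed)
    fits = [estimator(simulate_rows(m, s2, rng)) for _ in range(reps)]
    return sum(f[0] for f in fits) / reps, sum(f[1] for f in fits) / reps

# Use both sets of estimates from the previous slide as ground truth.
for truth in [(0.342, 0.996), (0.811, 0.105)]:
    for estimator in (weighted_mom, weighted_mle):
        m_hat, s2_hat = average_estimates(estimator, *truth)
        print(truth, estimator.__name__,
              "bias in m:", round(m_hat - truth[0], 3),
              "bias in s2:", round(s2_hat - truth[1], 3))
```

Averaging within a group removes most of the within-group spread, so a plug-in estimator applied to the averages tends to understate the variance of individual prices.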
Why are these not working?
● Many distributions are additive
○ N(0,1) + N(1,1) => N(1,2) (independent terms: means and variances add)
○ Pois(4) + Pois(5) => Pois(9)
● Log Normal is not!
○ So (n=42, $=2.406) is not LN, even if individual prices are
○ It is in fact a marginal distribution
■ contains 41 integrals :(
● What about CLT?
○ Even if (n=42) is approx N, (n=10) and (n=2) are probably not
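To see where the 41 integrals come from, write \( f \) for the density of one individual price (notation introduced here, with \( f(x) = 0 \) for \( x \le 0 \)). For \( n \) independent prices, the density of the group average follows from the standard convolution formula for a sum, rescaled by \( n \):

```latex
\[
f_{\bar{X}}(\bar{x})
  = n \int_0^\infty \cdots \int_0^\infty
    \Bigl[ \prod_{i=1}^{n-1} f(x_i) \Bigr]\,
    f\Bigl( n\bar{x} - \sum_{i=1}^{n-1} x_i \Bigr)\,
    dx_1 \cdots dx_{n-1}
\]
```

With \( n = 42 \) this leaves \( n - 1 = 41 \) nested integrals, and for the log-normal they have no closed form.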
Part 1
Main Points
Violate iid at your own risk!
● Do NOT plug and chug
● Do NOT expect weights will fix your problem
● Do NOT use predictive models
● Do NOT use multi-armed bandits
Get better, unaggregated data!

 but if you can’t ...
Part 2: Data Augmentation
A Data Set
n Price
42 2.406
33 2.283
10 2.114
10 2.815
2 1.691
1 2.033
1 2.061
1 0.133
1 0.627
Long format
id Group Price
1 1 2.406
2 1 2.406
3 1 2.406
... ... ...
96 5 1.691
97 5 1.691
98 6 2.033
99 7 2.061
100 8 0.133
101 9 0.627
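The expansion from aggregated rows to long format can be sketched as follows. The function name is illustrative; the initial fill-in value for every individual row is simply its group average, as in the table above.

```python
# (n, average price) rows from the data set slide.
ROWS = [(42, 2.406), (33, 2.283), (10, 2.114), (10, 2.815), (2, 1.691),
        (1, 2.033), (1, 2.061), (1, 0.133), (1, 0.627)]

def to_long(rows):
    """Repeat each group average n times, yielding (id, group, price) tuples."""
    long_rows = []
    next_id = 1
    for group, (n, price) in enumerate(rows, start=1):
        for _ in range(n):
            long_rows.append((next_id, group, price))
            next_id += 1
    return long_rows

long_rows = to_long(ROWS)
print(long_rows[:3])
print(long_rows[-6:])
```

Only the rows from groups with n = 1 are real observations; every other price is a placeholder that is known only through its group total.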
Estimation
● MCMC using stock methods, e.g. Metropolis-Hastings
● MH requires:
○ Target Distribution / probability model
○ State transition functions / proposal distributions
● MH outputs:
○ Numerical samples from target distribution
Proposal Distribution
● Transitions on m and \( s^2 \) - easy
● Transitions on missing T Prices ?
○ hourly constraints on total $
■ Don’t want to propose out-of-bounds
○ Option 1 - draw from a Dirichlet,
■ use that to disaggregate, transition whole hours at once
■ Big steps => lots of rejections
○ Option 2 - pairwise transitions within group
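Both options can be sketched as proposal functions that never leave the constraint set. This is an illustration under assumptions, not the talk's implementation: the Dirichlet uses equal concentration parameters (built from normalized Gamma draws), and the pairwise move redraws the split of two prices uniformly over their combined total.

```python
import random

def dirichlet_proposal(n, total, rng, alpha=1.0):
    """Option 1: redraw all n prices of a group at once, keeping their total."""
    # Normalized Gamma(alpha, 1) draws are Dirichlet(alpha, ..., alpha) weights.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    scale = total / sum(gammas)
    return [g * scale for g in gammas]

def pairwise_proposal(prices, rng):
    """Option 2: move money between two prices in a group, keeping their sum."""
    i, j = rng.sample(range(len(prices)), 2)
    pair_total = prices[i] + prices[j]
    proposal = list(prices)
    proposal[i] = rng.uniform(0.0, pair_total)
    proposal[j] = pair_total - proposal[i]
    return proposal

rng = random.Random(0)
group_total = 10 * 2.114            # the (n=10, $=2.114) row
whole_group = dirichlet_proposal(10, group_total, rng)
one_step = pairwise_proposal(whole_group, rng)
print(sum(whole_group), sum(one_step))
```

The pairwise move changes only two coordinates, so proposals stay close to the current state and are rejected less often than a full redraw of a 42-point group.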
Part 2
Main Points
Switching from aggregated to long format shows
aggregation can be thought of as a form of missing data.
However, group averages => constraints on the missing data.
In our example data, 97/101 points are missing,
but we can still get reasonable estimates via MCMC
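Putting the pieces together, here is a minimal Metropolis-Hastings sketch with data augmentation. Everything beyond the slides is an assumption: a flat prior on (m, log s), Gaussian random-walk steps of size 0.1, one pairwise update per iteration, and an arbitrary iteration count. A real run would need tuning, many more iterations, and convergence checks.

```python
import math
import random

# (n, average price) rows from the data set slide.
ROWS = [(42, 2.406), (33, 2.283), (10, 2.114), (10, 2.815), (2, 1.691),
        (1, 2.033), (1, 2.061), (1, 0.133), (1, 0.627)]

def log_density(x, m, s):
    """Log of the LN(m, s^2) density at x > 0."""
    z = (math.log(x) - m) / s
    return -math.log(x) - math.log(s) - 0.5 * math.log(2 * math.pi) - 0.5 * z * z

def run_sampler(rows, steps=5000, seed=0):
    rng = random.Random(seed)
    # Start from the long format: every latent price equals its group average.
    groups = [[price] * n for n, price in rows]
    splittable = [g for g in groups if len(g) >= 2]
    m, log_s = 0.0, 0.0

    def total_loglik(m_value, log_s_value):
        s = math.exp(log_s_value)
        return sum(log_density(x, m_value, s) for g in groups for x in g)

    current = total_loglik(m, log_s)
    draws = []
    for _ in range(steps):
        # 1. Random-walk update of (m, log s); flat prior is an assumption.
        m_new = m + rng.gauss(0.0, 0.1)
        log_s_new = log_s + rng.gauss(0.0, 0.1)
        proposed = total_loglik(m_new, log_s_new)
        if rng.random() < math.exp(min(0.0, proposed - current)):
            m, log_s, current = m_new, log_s_new, proposed

        # 2. Pairwise update inside one aggregated group; its total is unchanged.
        g = rng.choice(splittable)
        i, j = rng.sample(range(len(g)), 2)
        pair_total = g[i] + g[j]
        xi = rng.uniform(0.0, pair_total)
        xj = pair_total - xi
        if xi > 0.0 and xj > 0.0:
            s = math.exp(log_s)
            diff = (log_density(xi, m, s) + log_density(xj, m, s)
                    - log_density(g[i], m, s) - log_density(g[j], m, s))
            if rng.random() < math.exp(min(0.0, diff)):
                g[i], g[j] = xi, xj
                current += diff

        draws.append((m, math.exp(2 * log_s)))  # store (m, s^2)
    return draws, groups

draws, groups = run_sampler(ROWS)
kept = draws[len(draws) // 2:]  # discard the first half as burn-in
print("posterior mean of m:", sum(d[0] for d in kept) / len(kept))
print("posterior mean of s2:", sum(d[1] for d in kept) / len(kept))
```

The pair split is drawn uniformly over the pair total regardless of the current state, so the proposal is symmetric and the acceptance ratio reduces to the ratio of target densities.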
Part 3: Disaggregation
Additional Challenges
What if aggregation is over multiple heterogeneous groups, and we need
to split the money between the groups (“disaggregate”)?
Do we know the split a priori?
What if we don’t?
A Grouped Data Set
Known Groups
Desktop Mobile Price
38 4 2.406
27 6 2.283
2 8 2.114
6 4 2.815
0 2 1.691
0 1 2.033
1 0 2.061
1 0 0.133
0 1 0.627
Common Heuristics
● Linear disaggregation
○ Weighted averages by another name
○ Doesn’t account for variation in other columns
● Iterative Proportional Fitting
○ If you have subtotals in all dimensions
○ Alternates disaggregating by rows/columns
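Iterative Proportional Fitting can be sketched in a few lines. The seed table and the row and column subtotals below are hypothetical, not taken from the talk; the only requirement is that both sets of subtotals add up to the same grand total.

```python
def ipf(seed, row_totals, col_totals, iterations=100, tol=1e-10):
    """Rescale a seed table until its row and column sums match the subtotals."""
    table = [list(row) for row in seed]
    for _ in range(iterations):
        # Disaggregate by rows: scale each row to its target subtotal.
        for r, target in enumerate(row_totals):
            current = sum(table[r])
            if current > 0:
                table[r] = [v * target / current for v in table[r]]
        # Disaggregate by columns: scale each column to its target subtotal.
        for c, target in enumerate(col_totals):
            current = sum(row[c] for row in table)
            if current > 0:
                for row in table:
                    row[c] *= target / current
        # Columns match exactly after the column pass, so check the rows.
        if max(abs(sum(row) - t) for row, t in zip(table, row_totals)) < tol:
            break
    return table

# Hypothetical 2x2 example: rows could be hours, columns Desktop / Mobile.
fitted = ipf([[1.0, 2.0], [3.0, 4.0]], row_totals=[60.0, 40.0], col_totals=[70.0, 30.0])
for row in fitted:
    print([round(v, 3) for v in row])
```

IPF only reconciles a table with its margins; it has no probability model for the individual prices, which is one reason it cannot borrow strength across aggregates.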
Long format
id Group mobile Price
1 1 1 2.406
2 1 0 2.406
3 1 1 2.406
... ... ... ...
96 5 1 1.691
97 5 1 1.691
98 6 1 2.033
99 7 0 2.061
100 8 0 0.133
101 9 1 0.627
A Grouped Data Set
Unknown Groups
n Prime Sub Price
42 ? ? 2.406
33 ? ? 2.283
10 ? ? 2.114
10 ? ? 2.815
2 ? ? 1.691
1 ? ? 2.033
1 ? ? 2.061
1 ? ? 0.133
1 ? ? 0.627
A Grouped Data Set
Unknown Groups
n Prime Sub Price
42 30 12 2.406
33 23 9 2.283
10 7 3 2.114
10 8 2 2.815
2 2 0 1.691
1 1 0 2.033
1 1 0 2.061
1 0 1 0.133
1 0 1 0.627
Part 3
Main Points
By extending the previous model, we can deal with
“heterogeneous aggregates”.
If the grouping variable is known, solve like a regression problem.
If not known / latent, solve it like a mixture problem.
Either way, going Bayes lets you borrow strength between aggregates,
which disaggregation heuristics are not good at.
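The two cases differ only in the likelihood applied to the augmented (disaggregated) prices. The sketch below shows one possible form for each, and both forms are assumptions: a shift b on the log scale for mobile rows in the known-group case, and a two-component mixture with a shared s in the latent case. The slides say only "like a regression problem" and "like a mixture problem".

```python
import math

def log_density(x, m, s):
    """Log of the LN(m, s^2) density at x > 0."""
    z = (math.log(x) - m) / s
    return -math.log(x) - math.log(s) - 0.5 * math.log(2 * math.pi) - 0.5 * z * z

def regression_loglik(rows, m, b, s):
    """Known grouping: rows are (mobile, price); log-scale location is m + b * mobile."""
    return sum(log_density(price, m + b * mobile, s) for mobile, price in rows)

def mixture_loglik(prices, weight, m1, m2, s):
    """Latent grouping: each price comes from component 1 with probability weight."""
    total = 0.0
    for x in prices:
        p1 = weight * math.exp(log_density(x, m1, s))
        p2 = (1.0 - weight) * math.exp(log_density(x, m2, s))
        total += math.log(p1 + p2)
    return total

# Hypothetical augmented prices, used only to show the call pattern.
example_rows = [(1, 2.0), (0, 2.5), (1, 1.8), (0, 2.9)]
example_prices = [price for _, price in example_rows]
print(regression_loglik(example_rows, m=0.8, b=-0.2, s=0.3))
print(mixture_loglik(example_prices, weight=0.7, m1=0.9, m2=0.6, s=0.3))
```

Either function can replace the plain log-normal likelihood inside a Metropolis-Hastings sampler, with the extra parameters (b, or the mixture weight and second location) given their own proposal steps.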
Questions?
