Inference with Big Data: A Superpower Approach

Galit Shmuéli, Indian School of Business
SCECR 2012
With Mohit Dayal, Lalita Reddi, Bhim Pochiraju, Mingfeng Lin, and Hank Lucas
Big data studies (in information systems) are increasingly common

[Chart: number of IS papers with n > 10,000, 2004-2010]
Large-study IS papers: How Big?
“over 10,000 publicly available feedback text comments… in eBay”
  The Nature and Role of Feedback Text Comments in Online Marketplaces
  Pavlou & Dimoka, ISR 2006

“For our analysis, we have … 784,882 [portal visits]”
  Household-Specific Regressions Using Clickstream Data
  Goldfarb & Lu, Statistical Science 2006

“51,062 rare coin auctions that took place… on eBay”
  The Sound of Silence in Online Feedback
  Dellarocas & Wood, Management Science 2006

“We collected data on … [175,714] reviews from Amazon”
  Examining the Relationship Between Reviews and Sales
  Forman et al., ISR 2008

108,333 used vehicles offered in the wholesale automotive market
  Electronic vs. Physical Market Mechanisms
  Overby & Jap, Management Science 2009

“we use… 3.7 million records, encompassing transactions for the Federal Supply Service (FSS) of the U.S. Federal government in fiscal year 2000”
  Using Transaction Prices to Re-Examine Price Dispersion in Electronic Markets
  Ghose & Yao, ISR 2011
Apply a small-sample approach to Big Data studies?
It’s all about Power
Magnify effects
Separate signal from noise
Artwork: Running the Numbers by Chris Jordan (www.chrisjordan.com): 426,000 cell phones retired in the US every day
Power = Prob(detect H1 effect)
      = f(sample size, effect size, α, noise)
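To make the dependence of power on its four inputs concrete, here is a minimal sketch for one simple case: a two-sided one-sample z-test of H0: mean = 0 with known noise. The test, the function name `z_test_power`, and the example values are assumptions chosen for illustration; they do not come from the slides.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(n, effect_size, noise_sd=1.0, alpha=0.05):
    """Power of a two-sided one-sample z-test of H0: mean = 0 (illustrative)."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the z-statistic when the true mean equals effect_size.
    shift = effect_size * sqrt(n) / noise_sd
    # Probability that the z-statistic falls in either rejection region.
    upper = 1 - std_normal.cdf(critical - shift)
    lower = std_normal.cdf(-critical - shift)
    return upper + lower


# Hypothetical tiny effect (0.01 noise standard deviations) at growing n.
for n in (100, 10_000, 1_000_000):
    print(n, round(z_test_power(n, effect_size=0.01), 3))
```

With the effect size, α, and noise held fixed, power rises toward 1 as the sample size grows, which is the sense in which a very large sample magnifies small effects.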
The Promise
Rare events
Stronger validity
Small & complex effects
Statistical Technology

     Hypotheses
     Data Exploration
     Models
     Model Validation
     Inference
Chapter 1:
With Mohit Dayal & Lalita Reddi (ISB)


DATA VIZ: “BIG DATA” CHARTS
Scaling Up Data Visualization




[Figures: missing values; Big Data scatter plot; Big Data boxplot]
BIG DATA (SUPERPOWER) APPROACH:
Charts based on aggregation
Interactive viz (zoom & pan, filter, etc.)
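One way to build a chart based on aggregation is to replace the raw scatter plot with counts per grid cell, which can then be drawn as a heat map. The sketch below shows only the aggregation step; the function name `bin_counts`, the grid size, and the sample points are illustrative assumptions rather than the method used in the talk.

```python
from collections import Counter


def bin_counts(xs, ys, x_min, x_max, y_min, y_max, nx=50, ny=50):
    """Count points per cell of an nx-by-ny grid over the given bounds."""
    counts = Counter()
    for x, y in zip(xs, ys):
        # Skip points outside the plotting window.
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue
        # Map each coordinate to a cell index; the upper edge goes to the last cell.
        i = min(int((x - x_min) / (x_max - x_min) * nx), nx - 1)
        j = min(int((y - y_min) / (y_max - y_min) * ny), ny - 1)
        counts[(i, j)] += 1
    return counts


# Tiny hypothetical example on a 2-by-2 grid.
grid = bin_counts([0.0, 0.1, 0.9, 1.0], [0.0, 0.1, 0.9, 1.0],
                  0.0, 1.0, 0.0, 1.0, nx=2, ny=2)
print(dict(grid))
```

The plot then has at most nx × ny marks regardless of how many rows the data has, so overplotting and rendering time no longer grow with n.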
BIG DATA AND SMALL-SAMPLE
INFERENCE
Simple hypotheses (H1: b1>0); few hypotheses; few control variables; assumptions?

$f(y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \cdots + \beta_k X_k + \varepsilon$

Which model? What data?
Model fit; robustness; sign + statistical significance

$f(y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \cdots + \beta_k X_k$
What doesn’t scale up?
(wrong conclusions)

What are missed opportunities?
Chapter 2:
With Hank Lucas (UMD) & Mingfeng Lin (UoA)



TOO BIG TO FAIL:
LARGE SAMPLES AND FALSE DISCOVERIES
small p-values*** are not interesting
p-value ~ proximity of sample to H0
         = f(effect size, sample size, noise)

                       H0: b=0



Large sample result:
Deflated p-values
Auctions for digital cameras, Aug ’07 - Jan ’08
[thanks to Wolfgang Jank for the data!]

$\ln(\text{Price}) = b_0 + b_1 \ln(\text{minimumBid}) + b_2\,\text{reserve} + b_3 \ln(\text{sellerFeedback}) + b_4\,\text{Duration} + b_5\,(\text{controls}) + \varepsilon$


  H1: Higher minimum bids lead to higher final prices (b1>0)
  H2: Auctions with reserve price will sell for higher prices (b2>0)
  H3: Duration affects price (b4≠0)
  H4: The higher the seller feedback, the higher the price (b3>0)




                 n=341,136
“In a large sample, we can obtain very large t statistics with low p-values for our predictors, when, in fact, their effect on Y is very slight”
  Applied Statistics in Business & Economics, Doane & Seward
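The quoted point can be checked with simple arithmetic. The sketch below assumes a two-sided z-test in which the observed sample mean equals a fixed, very small effect; the function name `two_sided_p_value`, the effect size, and the noise level are hypothetical, while n = 341,136 is the sample size shown in the deck.

```python
from math import erfc, sqrt


def two_sided_p_value(effect, noise_sd, n):
    """Two-sided z-test p-value when the observed mean equals `effect`."""
    z = effect * sqrt(n) / noise_sd
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return erfc(abs(z) / sqrt(2))


# Same hypothetical effect (0.01 noise standard deviations), growing sample size.
for n in (100, 10_000, 341_136):
    print(n, two_sided_p_value(effect=0.01, noise_sd=1.0, n=n))
```

The effect is identical in all three rows; only the sample size changes, and the p-value drops from clearly non-significant to far below any conventional threshold.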
BIG DATA (SUPERPOWER) APPROACH:
Focus on size (ignore p-values)
Subsamples for robustness: “results quantitatively similar”
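A minimal sketch of the subsample check, assuming a single predictor and simulated data: estimate the slope on the full sample and on several random subsamples, then compare the magnitudes. The data-generating values, the subsample size, and the helper `ols_slope` are illustrative assumptions.

```python
import random


def ols_slope(xs, ys):
    """Slope of a simple least-squares regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(0)
TRUE_SLOPE = 0.05  # hypothetical small effect used to simulate the data
xs = [rng.uniform(0, 10) for _ in range(20_000)]
ys = [2.0 + TRUE_SLOPE * x + rng.gauss(0, 1) for x in xs]

full_slope = ols_slope(xs, ys)

# Re-estimate the slope on five random subsamples of 2,000 rows each.
subsample_slopes = []
for _ in range(5):
    idx = rng.sample(range(len(xs)), 2_000)
    subsample_slopes.append(ols_slope([xs[i] for i in idx], [ys[i] for i in idx]))

print(round(full_slope, 4), [round(s, 4) for s in subsample_slopes])
```

If the subsample estimates stay close to the full-sample estimate, the result can be reported as quantitatively similar; large swings across subsamples would suggest the full-sample coefficient is fragile.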
Chapter 3:


MODEL ASSUMPTIONS,
DIAGNOSTICS AND ADJUSTMENT
With big data, we’re in the realm of asymptotic behaviour (n → ∞)
Violated assumptions: less tinkering
Assumption violation   Coefficient bias   Standard errors   Redundant diagnostic tests
Under-specification    all                bias              -
Endogeneity*           all                2SLS is worse     Instrument strength (Sargan)
E(e) = 0               β0                 bias              Lack-of-fit
Non-normality          -                  -                 Anderson-Darling
Heteroscedasticity     -                  bias              Breusch-Pagan
Over-specification     -                  -                 -
Serial dependence      -                  bias              Durbin-Watson
Multicollinearity      -                  increase          Significant correlations
Influential outliers   -                  -                 Leverage (multiple testing)

*IV estimators only have desirable asymptotic, not finite sample, properties
BIG DATA (SUPERPOWER) APPROACH:
Focus on bias-related assumptions
Avoid statistical tests (p-value challenge)
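Why bias-related assumptions deserve the attention can be illustrated with a small simulation of under-specification. In this sketch the true model has two correlated predictors with coefficients of 1 each, but only the first is included in the regression. All names and numbers are hypothetical; with this setup the estimated slope converges to 1.5 rather than 1.

```python
import random
from math import sqrt


def slope_and_se(xs, ys):
    """Simple-regression slope and its conventional standard error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return slope, sqrt(rss / (n - 2) / sxx)


def simulate(n, rng):
    """Fit y on x1 only, although y also depends on the correlated x2."""
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    x2 = [0.5 * a + rng.gauss(0, 1) for a in x1]  # omitted variable
    y = [a + b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
    return slope_and_se(x1, y)


rng = random.Random(1)
small = simulate(200, rng)
large = simulate(20_000, rng)
print("n=200:    slope, se =", small)
print("n=20,000: slope, se =", large)
```

The standard error shrinks as n grows, but the estimate settles near 1.5 instead of the true coefficient of 1: more data buys precision, not freedom from bias.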
Chapter 4:
With Bhimsankaram Pochiraju & Mohit Dayal (ISB)




COMPLEX EFFECTS &
HETEROGENEITY
With Big Data:

Detect small (but important) effects

Detect rare events (in rare minorities)
Complex hypotheses (H1: b3>2); fixed effects; control variables; fewer assumptions

$f(y) = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_2 X_3 + \cdots + \beta_k X_k + \varepsilon$

Which model? What data?
Heterogeneous; clustering/mixtures; sub-samples; propensity scores, 2SLS
Predictive model fit; robustness; magnitude

$f(y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \cdots + \beta_k X_k$
Test complex
hypotheses
Moderators
Nonlinear relationships
Multiple categories
Control variables
The rovers have a magnifying camera… that scientists can use to carefully look at the fine structure of a rock


Quantify subtle effects
Specific measures (20 eBay categories)

“Moderator variables are difficult to detect” -Aguinis, 1994
Low R², yet non-zero coefficients
Stepwise Selection
OLS with Stepwise (AIC measure)
Logistic with variable selection (RELR)
       All independent variables
       All control variables
       Quadratic terms of continuous variables
       2-way interactions


Choose software carefully (R: “out of memory”)
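The candidate set for the selection procedure grows quickly once quadratic terms and 2-way interactions are added. The sketch below only enumerates the candidate terms; it does not run a stepwise search. The variable names are borrowed from the auction model earlier in the deck, and the control names and the choice of which variables count as continuous are assumptions.

```python
from itertools import combinations


def candidate_terms(independent, controls, continuous):
    """List main effects, quadratic terms, and all 2-way interactions."""
    main_effects = list(independent) + list(controls)
    continuous = set(continuous)
    # Quadratic terms only for variables flagged as continuous.
    quadratics = [f"{v}^2" for v in main_effects if v in continuous]
    # One interaction term per unordered pair of main effects.
    interactions = [f"{a}:{b}" for a, b in combinations(main_effects, 2)]
    return main_effects + quadratics + interactions


terms = candidate_terms(
    independent=["minimumBid", "reserve", "sellerFeedback", "duration"],
    controls=["control1", "control2"],  # hypothetical control variables
    continuous=["minimumBid", "sellerFeedback", "duration"],
)
print(len(terms))
```

The number of pairwise interactions is p(p-1)/2 for p main effects, so 50 main effects already produce 1,225 interaction columns, each as long as the dataset. That growth is one plausible reason a tool can run out of memory on a large sample.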
Heterogeneity: CART
• Identify non-linearities and
  interactions
• Does not identify different models
• Challenge: independent variables
  vs. control variables
Clustering

 1. Cluster all
    independent and
    control variables
 2. Fit separate
    regression models
    to each cluster

• Popular in risk analytics
• Fast, easy
• Does not guarantee
  distinct relationships
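The two steps above can be sketched on a toy dataset. To keep the example short it clusters on a single predictor with a bare-bones two-cluster k-means and uses noise-free data. The helper names, the data, and the simplification to one variable are assumptions for illustration, not the procedure used in the talk.

```python
def kmeans_1d_two_clusters(values, iterations=20):
    """Very small 1-D k-means with k=2, started at the min and max values."""
    centers = [min(values), max(values)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign each value to its nearest center.
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in values]
        # Move each center to the mean of its members.
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels


def ols_slope(xs, ys):
    """Slope of a simple least-squares regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data: two groups with opposite relationships between x and y.
xs = [i / 10 for i in range(10)] + [10 + i / 10 for i in range(10)]
ys = [1 + 2 * x for x in xs[:10]] + [30 - x for x in xs[10:]]

# Step 1: cluster on the predictor. Step 2: fit one regression per cluster.
labels = kmeans_1d_two_clusters(xs)
cluster_slopes = {}
for k in (0, 1):
    cx = [x for x, lab in zip(xs, labels) if lab == k]
    cy = [y for y, lab in zip(ys, labels) if lab == k]
    cluster_slopes[k] = ols_slope(cx, cy)

pooled_slope = ols_slope(xs, ys)
print(cluster_slopes, pooled_slope)
```

Here the clusters happen to coincide with the two relationships, so the per-cluster slopes differ from each other and from the pooled slope. Because clustering uses only the predictors and not the response, nothing guarantees this in general, which is the caveat in the last bullet.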
Finite Mixture
Regression

Search for k separate
regressions

Convergence issues
on entire dataset

Of 10 subsamples (n=30K), estimation converged for seven
Chapter 5:


MODEL VALIDATION
Improve model validation, comparison, and generalization

Internal & external validity

Robustness across subsamples
  non-random
  random (overlapping/non-overlapping)
Improve predictive validation

[Diagram: data partitioned into a training set and a holdout set]
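A minimal sketch of predictive validation with a holdout set, assuming simulated data and a one-predictor linear model: fit on the training set only, then compare prediction error on the two sets. The 80/20 split, the data-generating values, and the helper names are illustrative assumptions.

```python
import random
from math import sqrt


def fit_line(xs, ys):
    """Return (intercept, slope) of a simple least-squares fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope


rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(10_000)]
ys = [1.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

# Randomly partition row indices: 80% training, 20% holdout.
indices = list(range(len(xs)))
rng.shuffle(indices)
cut = int(0.8 * len(indices))
train, holdout = indices[:cut], indices[cut:]

# Fit on the training rows only.
intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])


def rmse(rows):
    """Root mean squared prediction error of the fitted line on given rows."""
    errors = [(ys[i] - (intercept + slope * xs[i])) ** 2 for i in rows]
    return sqrt(sum(errors) / len(errors))


train_rmse = rmse(train)
holdout_rmse = rmse(holdout)
print(round(train_rmse, 3), round(holdout_rmse, 3))
```

With a large sample, a sizeable holdout set can be set aside without starving the training set, so the holdout error is a usable estimate of out-of-sample performance.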
SMALL SAMPLE
MODELING APPROACH
Clark Kent ≤ Superman

Weitere ähnliche Inhalte

Andere mochten auch

Opening Data With Kaggle
Opening Data With KaggleOpening Data With Kaggle
Opening Data With KaggleGalit Shmueli
 
Too Large To Fail: Large Samples and False Discoveries
Too Large To Fail: Large Samples and False DiscoveriesToo Large To Fail: Large Samples and False Discoveries
Too Large To Fail: Large Samples and False DiscoveriesGalit Shmueli
 
On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)
On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)
On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)Galit Shmueli
 
What is Predictive About Partial Least Squares?
What is Predictive About Partial Least Squares?What is Predictive About Partial Least Squares?
What is Predictive About Partial Least Squares?Galit Shmueli
 
Linear Probability Models and Big Data: Kosher or Not?
Linear Probability Models and Big Data: Kosher or Not?Linear Probability Models and Big Data: Kosher or Not?
Linear Probability Models and Big Data: Kosher or Not?Galit Shmueli
 
Big Data & Analytics in the Digital Creative Industries
Big Data & Analytics in the Digital Creative IndustriesBig Data & Analytics in the Digital Creative Industries
Big Data & Analytics in the Digital Creative IndustriesGalit Shmueli
 
De-Mystefying Predictive Analytics
De-Mystefying Predictive AnalyticsDe-Mystefying Predictive Analytics
De-Mystefying Predictive AnalyticsGalit Shmueli
 
Prediction, Explanation and the Business Analytics Toolkit
Prediction, Explanation and the Business Analytics Toolkit Prediction, Explanation and the Business Analytics Toolkit
Prediction, Explanation and the Business Analytics Toolkit Galit Shmueli
 
Predictive analytics in Information Systems Research (TSWIM 2015 keynote)
Predictive analytics in Information Systems Research (TSWIM 2015 keynote)Predictive analytics in Information Systems Research (TSWIM 2015 keynote)
Predictive analytics in Information Systems Research (TSWIM 2015 keynote)Galit Shmueli
 

Andere mochten auch (9)

Opening Data With Kaggle
Opening Data With KaggleOpening Data With Kaggle
Opening Data With Kaggle
 
Too Large To Fail: Large Samples and False Discoveries
Too Large To Fail: Large Samples and False DiscoveriesToo Large To Fail: Large Samples and False Discoveries
Too Large To Fail: Large Samples and False Discoveries
 
On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)
On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)
On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)
 
What is Predictive About Partial Least Squares?
What is Predictive About Partial Least Squares?What is Predictive About Partial Least Squares?
What is Predictive About Partial Least Squares?
 
Linear Probability Models and Big Data: Kosher or Not?
Linear Probability Models and Big Data: Kosher or Not?Linear Probability Models and Big Data: Kosher or Not?
Linear Probability Models and Big Data: Kosher or Not?
 
Big Data & Analytics in the Digital Creative Industries
Big Data & Analytics in the Digital Creative IndustriesBig Data & Analytics in the Digital Creative Industries
Big Data & Analytics in the Digital Creative Industries
 
De-Mystefying Predictive Analytics
De-Mystefying Predictive AnalyticsDe-Mystefying Predictive Analytics
De-Mystefying Predictive Analytics
 
Prediction, Explanation and the Business Analytics Toolkit
Prediction, Explanation and the Business Analytics Toolkit Prediction, Explanation and the Business Analytics Toolkit
Prediction, Explanation and the Business Analytics Toolkit
 
Predictive analytics in Information Systems Research (TSWIM 2015 keynote)
Predictive analytics in Information Systems Research (TSWIM 2015 keynote)Predictive analytics in Information Systems Research (TSWIM 2015 keynote)
Predictive analytics in Information Systems Research (TSWIM 2015 keynote)
 

Ähnlich wie Inference with big data: SCECR 2012 Presentation

Machine learning algorithms and business use cases
Machine learning algorithms and business use casesMachine learning algorithms and business use cases
Machine learning algorithms and business use casesSridhar Ratakonda
 
machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...Armando Vieira
 
An Introduction to boosting
An Introduction to boostingAn Introduction to boosting
An Introduction to boostingbutest
 
Data Science Isn't a Fad: Let's Keep it That Way
Data Science Isn't a Fad: Let's Keep it That WayData Science Isn't a Fad: Let's Keep it That Way
Data Science Isn't a Fad: Let's Keep it That WayMelinda Thielbar
 
Presentation on Machine Learning and Data Mining
Presentation on Machine Learning and Data MiningPresentation on Machine Learning and Data Mining
Presentation on Machine Learning and Data Miningbutest
 
To explain or to predict
To explain or to predictTo explain or to predict
To explain or to predictGalit Shmueli
 
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...Ian Morgan
 
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...Bayes Nets meetup London
 
Rsqrd AI - ML Interpretability: Beyond Feature Importance
Rsqrd AI - ML Interpretability: Beyond Feature ImportanceRsqrd AI - ML Interpretability: Beyond Feature Importance
Rsqrd AI - ML Interpretability: Beyond Feature ImportanceAlessya Visnjic
 
Neural Networks and Deep Learning for Physicists
Neural Networks and Deep Learning for PhysicistsNeural Networks and Deep Learning for Physicists
Neural Networks and Deep Learning for PhysicistsHéloïse Nonne
 
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...Jonathan Stray
 
Summer 07-mfin7011-tang1922
Summer 07-mfin7011-tang1922Summer 07-mfin7011-tang1922
Summer 07-mfin7011-tang1922stone55
 
Introduction to Descriptive & Predictive Analytics
Introduction to Descriptive & Predictive AnalyticsIntroduction to Descriptive & Predictive Analytics
Introduction to Descriptive & Predictive AnalyticsDilum Bandara
 
Machine Learning ICS 273A
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273Abutest
 
Anomaly Detection for Real-World Systems
Anomaly Detection for Real-World SystemsAnomaly Detection for Real-World Systems
Anomaly Detection for Real-World SystemsManojit Nandi
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Data Science London
 
Responsible AI in Industry: Practical Challenges and Lessons Learned
Responsible AI in Industry: Practical Challenges and Lessons LearnedResponsible AI in Industry: Practical Challenges and Lessons Learned
Responsible AI in Industry: Practical Challenges and Lessons LearnedKrishnaram Kenthapadi
 
Representation Learning & Generative Modeling with Variational Autoencoder(VA...
Representation Learning & Generative Modeling with Variational Autoencoder(VA...Representation Learning & Generative Modeling with Variational Autoencoder(VA...
Representation Learning & Generative Modeling with Variational Autoencoder(VA...changedaeoh
 

Ähnlich wie Inference with big data: SCECR 2012 Presentation (20)

Machine learning algorithms and business use cases
Machine learning algorithms and business use casesMachine learning algorithms and business use cases
Machine learning algorithms and business use cases
 
machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...
 
An Introduction to boosting
An Introduction to boostingAn Introduction to boosting
An Introduction to boosting
 
Shmueli
ShmueliShmueli
Shmueli
 
Data Science Isn't a Fad: Let's Keep it That Way
Data Science Isn't a Fad: Let's Keep it That WayData Science Isn't a Fad: Let's Keep it That Way
Data Science Isn't a Fad: Let's Keep it That Way
 
Presentation on Machine Learning and Data Mining
Presentation on Machine Learning and Data MiningPresentation on Machine Learning and Data Mining
Presentation on Machine Learning and Data Mining
 
To explain or to predict
To explain or to predictTo explain or to predict
To explain or to predict
 
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...
 
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...
 
Rsqrd AI - ML Interpretability: Beyond Feature Importance
Rsqrd AI - ML Interpretability: Beyond Feature ImportanceRsqrd AI - ML Interpretability: Beyond Feature Importance
Rsqrd AI - ML Interpretability: Beyond Feature Importance
 
Neural Networks and Deep Learning for Physicists
Neural Networks and Deep Learning for PhysicistsNeural Networks and Deep Learning for Physicists
Neural Networks and Deep Learning for Physicists
 
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
 
Summer 07-mfin7011-tang1922
Summer 07-mfin7011-tang1922Summer 07-mfin7011-tang1922
Summer 07-mfin7011-tang1922
 
Introduction to Descriptive & Predictive Analytics
Introduction to Descriptive & Predictive AnalyticsIntroduction to Descriptive & Predictive Analytics
Introduction to Descriptive & Predictive Analytics
 
Machine Learning ICS 273A
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273A
 
Emerging patterns based classifier
Emerging patterns based classifierEmerging patterns based classifier
Emerging patterns based classifier
 
Anomaly Detection for Real-World Systems
Anomaly Detection for Real-World SystemsAnomaly Detection for Real-World Systems
Anomaly Detection for Real-World Systems
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
 
Responsible AI in Industry: Practical Challenges and Lessons Learned
Responsible AI in Industry: Practical Challenges and Lessons LearnedResponsible AI in Industry: Practical Challenges and Lessons Learned
Responsible AI in Industry: Practical Challenges and Lessons Learned
 
Representation Learning & Generative Modeling with Variational Autoencoder(VA...
Representation Learning & Generative Modeling with Variational Autoencoder(VA...Representation Learning & Generative Modeling with Variational Autoencoder(VA...
Representation Learning & Generative Modeling with Variational Autoencoder(VA...
 

Mehr von Galit Shmueli

“Improving” prediction of human behavior using behavior modification
“Improving” prediction of human behavior using behavior modification“Improving” prediction of human behavior using behavior modification
“Improving” prediction of human behavior using behavior modificationGalit Shmueli
 
Repurposing Classification & Regression Trees for Causal Research with High-D...
Repurposing Classification & Regression Trees for Causal Research with High-D...Repurposing Classification & Regression Trees for Causal Research with High-D...
Repurposing Classification & Regression Trees for Causal Research with High-D...Galit Shmueli
 
To Explain, To Predict, or To Describe?
To Explain, To Predict, or To Describe?To Explain, To Predict, or To Describe?
To Explain, To Predict, or To Describe?Galit Shmueli
 
Behavioral Big Data & Healthcare Research
Behavioral Big Data & Healthcare ResearchBehavioral Big Data & Healthcare Research
Behavioral Big Data & Healthcare ResearchGalit Shmueli
 
Reinventing the Data Analytics Classroom
Reinventing the Data Analytics ClassroomReinventing the Data Analytics Classroom
Reinventing the Data Analytics ClassroomGalit Shmueli
 
Behavioral Big Data & Healthcare Research: Talk at WiDS Taipei
Behavioral Big Data & Healthcare Research: Talk at WiDS TaipeiBehavioral Big Data & Healthcare Research: Talk at WiDS Taipei
Behavioral Big Data & Healthcare Research: Talk at WiDS TaipeiGalit Shmueli
 
Repurposing predictive tools for causal research
Repurposing predictive tools for causal researchRepurposing predictive tools for causal research
Repurposing predictive tools for causal researchGalit Shmueli
 
Statistical Modeling in 3D: Describing, Explaining and Predicting
Statistical Modeling in 3D: Describing, Explaining and PredictingStatistical Modeling in 3D: Describing, Explaining and Predicting
Statistical Modeling in 3D: Describing, Explaining and PredictingGalit Shmueli
 
Workshop on Information Quality
Workshop on Information QualityWorkshop on Information Quality
Workshop on Information QualityGalit Shmueli
 
Behavioral Big Data: Why Quality Engineers Should Care
Behavioral Big Data: Why Quality Engineers Should CareBehavioral Big Data: Why Quality Engineers Should Care
Behavioral Big Data: Why Quality Engineers Should CareGalit Shmueli
 
Statistical Modeling in 3D: Explaining, Predicting, Describing
Statistical Modeling in 3D: Explaining, Predicting, DescribingStatistical Modeling in 3D: Explaining, Predicting, Describing
Statistical Modeling in 3D: Explaining, Predicting, DescribingGalit Shmueli
 
Researcher Dilemmas using Behavioral Big Data in Healthcare (INFORMS DMDA Wo...
Researcher Dilemmas  using Behavioral Big Data in Healthcare (INFORMS DMDA Wo...Researcher Dilemmas  using Behavioral Big Data in Healthcare (INFORMS DMDA Wo...
Researcher Dilemmas using Behavioral Big Data in Healthcare (INFORMS DMDA Wo...Galit Shmueli
 
Prediction-based Model Selection in PLS-PM
Prediction-based Model Selection in PLS-PMPrediction-based Model Selection in PLS-PM
Prediction-based Model Selection in PLS-PMGalit Shmueli
 
When Prediction Met PLS: What We learned in 3 Years of Marriage
When Prediction Met PLS: What We learned in 3 Years of MarriageWhen Prediction Met PLS: What We learned in 3 Years of Marriage
When Prediction Met PLS: What We learned in 3 Years of MarriageGalit Shmueli
 
A Tree-Based Approach for Addressing Self-selection in Impact Studies with B...
A Tree-Based Approach  for Addressing Self-selection in Impact Studies with B...A Tree-Based Approach  for Addressing Self-selection in Impact Studies with B...
A Tree-Based Approach for Addressing Self-selection in Impact Studies with B...Galit Shmueli
 
Research Using Behavioral Big Data: A Tour and Why Mechanical Engineers Shoul...
Research Using Behavioral Big Data: A Tour and Why Mechanical Engineers Shoul...Research Using Behavioral Big Data: A Tour and Why Mechanical Engineers Shoul...
Research Using Behavioral Big Data: A Tour and Why Mechanical Engineers Shoul...Galit Shmueli
 
A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Bi...
A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Bi...A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Bi...
A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Bi...Galit Shmueli
 
Research Using Behavioral Big Data (BBD)
Research Using Behavioral Big Data (BBD)Research Using Behavioral Big Data (BBD)
Research Using Behavioral Big Data (BBD)Galit Shmueli
 
Analyzing Behavioral Big Data: Methodological, Practical, Ethical & Moral Issues
Analyzing Behavioral Big Data: Methodological, Practical, Ethical & Moral IssuesAnalyzing Behavioral Big Data: Methodological, Practical, Ethical & Moral Issues
Analyzing Behavioral Big Data: Methodological, Practical, Ethical & Moral IssuesGalit Shmueli
 
Big Data - To Explain or To Predict? Talk at U Toronto's Rotman School of Ma...
Big Data - To Explain or To Predict?  Talk at U Toronto's Rotman School of Ma...Big Data - To Explain or To Predict?  Talk at U Toronto's Rotman School of Ma...
Big Data - To Explain or To Predict? Talk at U Toronto's Rotman School of Ma...Galit Shmueli
 

Mehr von Galit Shmueli (20)

“Improving” prediction of human behavior using behavior modification
“Improving” prediction of human behavior using behavior modification“Improving” prediction of human behavior using behavior modification
“Improving” prediction of human behavior using behavior modification
 
Repurposing Classification & Regression Trees for Causal Research with High-D...
Repurposing Classification & Regression Trees for Causal Research with High-D...Repurposing Classification & Regression Trees for Causal Research with High-D...
Repurposing Classification & Regression Trees for Causal Research with High-D...
 
To Explain, To Predict, or To Describe?
To Explain, To Predict, or To Describe?To Explain, To Predict, or To Describe?
To Explain, To Predict, or To Describe?
 
Behavioral Big Data & Healthcare Research
Behavioral Big Data & Healthcare ResearchBehavioral Big Data & Healthcare Research
Behavioral Big Data & Healthcare Research
 
Reinventing the Data Analytics Classroom
Reinventing the Data Analytics ClassroomReinventing the Data Analytics Classroom
Reinventing the Data Analytics Classroom
 
Behavioral Big Data & Healthcare Research: Talk at WiDS Taipei
Behavioral Big Data & Healthcare Research: Talk at WiDS TaipeiBehavioral Big Data & Healthcare Research: Talk at WiDS Taipei
Behavioral Big Data & Healthcare Research: Talk at WiDS Taipei
 
Repurposing predictive tools for causal research
Repurposing predictive tools for causal researchRepurposing predictive tools for causal research
Repurposing predictive tools for causal research
 
Statistical Modeling in 3D: Describing, Explaining and Predicting
Statistical Modeling in 3D: Describing, Explaining and PredictingStatistical Modeling in 3D: Describing, Explaining and Predicting
Statistical Modeling in 3D: Describing, Explaining and Predicting
 
Workshop on Information Quality
Workshop on Information QualityWorkshop on Information Quality
Workshop on Information Quality
 
Behavioral Big Data: Why Quality Engineers Should Care
Behavioral Big Data: Why Quality Engineers Should CareBehavioral Big Data: Why Quality Engineers Should Care
Behavioral Big Data: Why Quality Engineers Should Care
 
Statistical Modeling in 3D: Explaining, Predicting, Describing
Statistical Modeling in 3D: Explaining, Predicting, DescribingStatistical Modeling in 3D: Explaining, Predicting, Describing
Statistical Modeling in 3D: Explaining, Predicting, Describing
 
Researcher Dilemmas using Behavioral Big Data in Healthcare (INFORMS DMDA Wo...
Researcher Dilemmas  using Behavioral Big Data in Healthcare (INFORMS DMDA Wo...Researcher Dilemmas  using Behavioral Big Data in Healthcare (INFORMS DMDA Wo...
Researcher Dilemmas using Behavioral Big Data in Healthcare (INFORMS DMDA Wo...
 
Prediction-based Model Selection in PLS-PM
Prediction-based Model Selection in PLS-PMPrediction-based Model Selection in PLS-PM
Prediction-based Model Selection in PLS-PM
 
When Prediction Met PLS: What We learned in 3 Years of Marriage
When Prediction Met PLS: What We learned in 3 Years of MarriageWhen Prediction Met PLS: What We learned in 3 Years of Marriage
When Prediction Met PLS: What We learned in 3 Years of Marriage
 
A Tree-Based Approach for Addressing Self-selection in Impact Studies with B...
A Tree-Based Approach  for Addressing Self-selection in Impact Studies with B...A Tree-Based Approach  for Addressing Self-selection in Impact Studies with B...
A Tree-Based Approach for Addressing Self-selection in Impact Studies with B...
 
Research Using Behavioral Big Data: A Tour and Why Mechanical Engineers Shoul...
Research Using Behavioral Big Data: A Tour and Why Mechanical Engineers Shoul...Research Using Behavioral Big Data: A Tour and Why Mechanical Engineers Shoul...
Research Using Behavioral Big Data: A Tour and Why Mechanical Engineers Shoul...
 
A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Bi...
A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Bi...A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Bi...
A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Bi...
 
Research Using Behavioral Big Data (BBD)
Research Using Behavioral Big Data (BBD)Research Using Behavioral Big Data (BBD)
Research Using Behavioral Big Data (BBD)
 
Analyzing Behavioral Big Data: Methodological, Practical, Ethical & Moral Issues
Analyzing Behavioral Big Data: Methodological, Practical, Ethical & Moral IssuesAnalyzing Behavioral Big Data: Methodological, Practical, Ethical & Moral Issues
Analyzing Behavioral Big Data: Methodological, Practical, Ethical & Moral Issues
 
Big Data - To Explain or To Predict? Talk at U Toronto's Rotman School of Ma...
Big Data - To Explain or To Predict?  Talk at U Toronto's Rotman School of Ma...Big Data - To Explain or To Predict?  Talk at U Toronto's Rotman School of Ma...
Big Data - To Explain or To Predict? Talk at U Toronto's Rotman School of Ma...
 

Inference with big data: SCECR 2012 Presentation

  • 1. Inference with Big Data A Superpower Approach Galit Shmuéli Indian School of Business CECR Mohit Dayal Mingfeng Lin Lalita Reddi Hank Lucas Bhim Pochiraju
  • 2. Big data studies (in information systems) increasingly common # IS papers with n>10,000 (2004-2010)
  • 3. Large-study IS papers: How Big? “over 10,000 publicly available feedback text comments… in eBay” The Nature and Role of Feedback Text Comments in Online Marketplaces Pavlou & Dimoka, ISR 2006 For our analysis, we have … 784,882 [portal visits] Household-Specific Regressions Using Clickstream Data Goldfarb & Lu, Statistical Science 2006 “51,062 rare coin auctions that took place… on eBay” The Sound of Silence in Online Feedback Dellarocas & Wood, Management Science 2006 “We collected data on … [175,714] reviews from Amazon” Examining the Relationship Between Reviews and Sales Forman et al., ISR 2008 108,333 used vehicles offered in the wholesale automotive market Electronic vs. Physical Market Mechanisms Overby & Jap, Management Science 2009 “we use… 3.7 million records, encompassing transactions for the Federal Supply Service (FSS) of the U.S. Federal government in fiscal year 2000 Using Transaction Prices to Re-Examine Price Dispersion in Electronic Markets Ghose & Yao, ISR 2011
  • 5. Apply small sample approach to Big Data studies?
  • 7. Magnify effects Separate signal from noise
  • 8. Artwork: Running the numbers by Chris Jordan (www.chrisjordan.com) 426,000 cell phones retired in the US every day
  • 9. Power = Prob(detect H1 effect) = f(sample size, effect size, α, noise)
  • 10. Rare events Stronger validity Small & complex effects The Promise
  • 11. Statistical Technology Hypotheses Data Exploration Models Model Validation Inference
  • 12. Chapter 1: With Mohit Dayal & Lalita Reddi (ISB) DATA VIZ: “BIG DATA” CHARTS
  • 13. Scaling Up Data Visualization Missing values
  • 16. BIG DATA (SUPERPOWER) APPROACH: Charts based on aggregation; interactive viz (zoom & pan, filter, etc.)
  • 17. BIG DATA AND SMALL-SAMPLE INFERENCE
  • 18. Simple hypotheses (H1: β₁ > 0); few hypotheses; few control variables; assumptions? f(y) = β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂ + ⋯ + βₖXₖ + ε. Which model? What data? Sign + statistical significance; model fit; robustness. f(y) = β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂ + ⋯ + βₖXₖ
  • 19. What doesn’t scale up? (wrong conclusions) What are missed opportunities?
  • 20. Chapter 2: With Hank Lucas (UMD) & Mingfeng Lin (UoA) TOO BIG TO FAIL: LARGE SAMPLES AND FALSE DISCOVERIES
  • 21. small p-values*** are not interesting
  • 22. p-value ~ proximity of sample to H0 = f(effect size, sample size, noise). H0: β = 0. Large-sample result: deflated p-values
  • 23. Auctions for digital cameras, Aug ’07 - Jan ’08 [thanks to Wolfgang Jank for the data!]. ln(Price) = β₀ + β₁·ln(minimumBid) + β₂·reserve + β₃·ln(sellerFeedback) + β₄·Duration + β₅·controls + ε. H1: Higher minimum bids lead to higher final prices (β₁ > 0). H2: Auctions with a reserve price will sell for higher prices (β₂ > 0). H3: Duration affects price (β₄ ≠ 0). H4: The higher the seller feedback, the higher the price (β₃ > 0). n = 341,136
  • 26. “In a large sample, we can obtain very large t statistics with low p-values for our predictors, when, in fact, their effect on Y is very slight” Applied Statistics in Business & Economics Doane & Seward
  • 27. BIG DATA (SUPERPOWER) APPROACH: Focus on effect size (ignore p-values); use subsamples for robustness: “results quantitatively similar”
  • 29. With big data, we’re in the realm of asymptotic behaviour 𝑛→∞
  • 30. Violated assumptions: less tinkering. Table columns: assumption violation, coefficient bias, standard errors, redundant diagnostic tests. Rows (entries in slide order): Under-specification: all, bias | Endogeneity*: all, 2SLS is worse, instrument strength (Sargan) | E(ε) = 0: β₀ bias, lack-of-fit | Non-normality: Anderson-Darling | Heteroscedasticity: bias, Breusch-Pagan | Over-specification | Serial dependence: bias, Durbin-Watson | Multicollinearity: increase, significant correlations | Influential outliers: leverage (multiple testing). *IV estimators only have desirable asymptotic, not finite sample, properties
  • 31. BIG DATA (SUPERPOWER) APPROACH: Focus on bias-related assumptions; avoid statistical tests (p-value challenge)
  • 32. Chapter 4: With Bhimsankaram Pochiraju & Mohit Dayal (ISB) COMPLEX EFFECTS & HETEROGENEITY
  • 33. With Big Data: Detect small (but important) effects Detect rare events (in rare minorities)
  • 34. Complex hypotheses (H1: β₃ > 2); fewer assumptions; fixed effects, control variables. f(y) = β₀ + β₁X₁ + β₂X₁² + β₃X₂X₃ + ⋯ + βₖXₖ + ε. Which model? Heterogeneous: clustering/mixtures, sub-samples. What data? Propensity scores, 2SLS. Predictive magnitude; model fit; robustness. f(y) = β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂ + ⋯ + βₖXₖ
  • 35. Test complex hypotheses: moderators, nonlinear relationships, multiple categories, control variables. “The rovers have a magnifying camera… that scientists can use to carefully look at the fine structure of a rock.” Quantify subtle effects: specific measures (20 eBay categories); “Moderator variables are difficult to detect” (Aguinis, 1994); low R², yet non-zero coefficients
  • 37. Stepwise Selection: OLS with stepwise (AIC measure); logistic with variable selection (RELR). Candidate terms: all independent variables, all control variables, quadratic terms of continuous variables, 2-way interactions. Choose software carefully (R: “out of memory”)
  • 38. Heterogeneity: CART • Identify non-linearities and interactions • Does not identify different models • Challenge: independent variables vs. control variables
  • 39. Clustering 1. Cluster all independent and control variables 2. Fit separate regression models to each cluster • Popular in risk analytics • Fast, easy • Does not guarantee distinct relationships
  • 40. Finite Mixture Regression: search for k separate regressions. Convergence issues on the entire dataset. Of 10 subsamples (n=30K), converged for seven
  • 42. Improve model validation, comparison, and generalization: internal & external validity; robustness across subsamples: non-random, random (overlapping/non)
  • 43. Improve predictive validation: holdout set, training set
  • 45. Clark Kent ≤ Superman
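
The "deflated p-values" point from the slides on large samples and false discoveries can be illustrated with a small simulation. The sketch below is not from the talk: it assumes a hypothetical model y = 0.01·x + noise with standard normal x and noise, fits a one-predictor OLS slope, and uses a normal approximation for the two-sided p-value. The function name `slope_test`, the effect size, and the sample sizes are all illustrative choices.

```python
import math
import random

def slope_test(n, beta=0.01, seed=0):
    """Simulate y = beta*x + noise (assumed model) and test H0: slope = 0.

    Returns (estimated slope, standard error, two-sided p-value).
    The p-value uses a normal approximation to the t statistic.
    """
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [beta * x + rng.gauss(0, 1) for x in xs]

    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))

    b1 = sxy / sxx                      # OLS slope
    b0 = my - b1 * mx                   # OLS intercept
    rss = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(rss / (n - 2) / sxx) # standard error of the slope
    z = b1 / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value, normal approx.
    return b1, se, p

# Same tiny true effect, increasingly large samples (illustrative sizes).
SIZES = (100, 10_000, 1_000_000)
results = {n: slope_test(n) for n in SIZES}

for n, (b1, se, p) in results.items():
    print(f"n={n:>9,}  slope={b1:+.4f}  se={se:.4f}  p={p:.3g}")
```

The true effect is identical in every run; only the standard error shrinks with n, which is what drives the p-value toward zero. Reading the slope and its standard error rather than the p-value is one way to follow the "focus on effect size" advice in this setting.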