Inference with big data: SCECR 2012 Presentation
1. Inference with Big Data: A Superpower Approach
Galit Shmuéli
Indian School of Business
CECR
Mohit Dayal Mingfeng Lin
Lalita Reddi Hank Lucas
Bhim Pochiraju
2. Big data studies (in information systems) increasingly common
[Chart: number of IS papers with n > 10,000 (2004-2010)]
3. Large-study IS papers: How Big?
“over 10,000 publicly available feedback text comments… in eBay”
The Nature and Role of Feedback Text Comments in Online Marketplaces
Pavlou & Dimoka, ISR 2006
For our analysis, we have … 784,882 [portal visits]
Household-Specific Regressions Using Clickstream Data
Goldfarb & Lu, Statistical Science 2006
“51,062 rare coin auctions that took place… on eBay”
The Sound of Silence in Online Feedback
Dellarocas & Wood, Management Science 2006
“We collected data on … [175,714] reviews from Amazon”
Examining the Relationship Between Reviews and Sales
Forman et al., ISR 2008
108,333 used vehicles offered in the wholesale automotive market
Electronic vs. Physical Market Mechanisms
Overby & Jap, Management Science 2009
“we use… 3.7 million records, encompassing transactions for the Federal Supply Service (FSS) of the U.S. Federal government in fiscal year 2000”
Using Transaction Prices to Re-Examine Price Dispersion in Electronic Markets
Ghose & Yao, ISR 2011
22. p-value ~ proximity of sample to H0
= f(effect size, sample size, noise)
H0: b=0
Large sample result:
Deflated p-values
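The dependence of the p-value on sample size can be illustrated with a small calculation. This is a minimal sketch, assuming a large-sample normal approximation for a single estimated effect; the effect size (0.01 standard deviations) and the function name `two_sided_p` are illustrative choices, not values from the study.

```python
import math

def two_sided_p(effect, noise_sd, n):
    """Two-sided p-value for H0: effect = 0 under a normal approximation."""
    standard_error = noise_sd / math.sqrt(n)
    z = effect / standard_error
    # P(|Z| > |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical effect: one hundredth of a standard deviation of noise.
effect, noise_sd = 0.01, 1.0
p_values = {n: two_sided_p(effect, noise_sd, n) for n in (100, 10_000, 341_136)}

for n, p in p_values.items():
    print(f"n = {n:>7,}: p = {p:.2g}")
```

The effect and the noise are held fixed; only n changes, yet the p-value moves from clearly non-significant to far below any conventional threshold.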
23. auctions for digital cameras Aug ’07 - Jan ’08
[thanks to Wolfgang Jank for the data!]
$\ln(\text{Price}) = b_0 + b_1 \ln(\text{minimumBid}) + b_2\,\text{reserve} + b_3 \ln(\text{sellerFeedback}) + b_4\,(\text{Duration}) + b_5\,(\text{controls})$
H1: Higher minimum bids lead to higher final prices (b1>0)
H2: Auctions with reserve price will sell for higher prices (b2>0)
H3: Duration affects price (b4≠0)
H4: The higher the seller feedback, the higher the price (b3>0)
n=341,136
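With a log-transformed price, coefficient magnitudes translate into percentage changes, which is one way to judge how large an effect is in practice. The sketch below assumes that reserve is a 0/1 indicator and uses made-up coefficient values purely for illustration; they are not estimates from the auction data.

```python
import math

def pct_change_log_log(b, pct_increase):
    """Proportional change in Y when X rises by pct_increase,
    for a term b * ln(X) in a model of ln(Y)."""
    return (1.0 + pct_increase) ** b - 1.0

def pct_change_indicator(b):
    """Proportional change in Y when a 0/1 indicator switches on,
    for a term b * D in a model of ln(Y)."""
    return math.exp(b) - 1.0

# Hypothetical coefficients, chosen only to show the arithmetic.
b1_example = 0.10   # coefficient on ln(minimumBid)
b2_example = 0.05   # coefficient on the reserve indicator (assumed 0/1)

print(f"10% higher minimum bid: {pct_change_log_log(b1_example, 0.10):+.2%} price")
print(f"reserve price present:  {pct_change_indicator(b2_example):+.2%} price")
```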
26. “In a large sample, we can obtain very large t
statistics with low p-values for our predictors,
when, in fact, their effect on Y is very slight”
Applied Statistics in Business & Economics
Doane & Seward
27. BIG DATA (SUPERPOWER) APPROACH:
Focus on effect size (ignore p-values)
Subsamples for robustness: “results quantitatively similar”
35. Test complex hypotheses
Moderators
Nonlinear relationships
Multiple categories
Control variables
The rovers have a magnifying camera… that
scientists can use to carefully look at the fine
structure of a rock
Quantify subtle effects
Specific measures (20 eBay categories)
“Moderator variables are difficult to detect” -Aguinis, 1994
Low R², yet non-zero coefficients
37. Stepwise Selection
OLS with Stepwise (AIC measure)
Logistic with variable selection (RELR)
All independent variables
All control variables
Quadratic terms of continuous variables
2-way interactions
Choose software carefully (R: “out of memory”)
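As a rough illustration of OLS with AIC-based stepwise selection over main effects, quadratic terms, and 2-way interactions, here is a minimal forward-selection sketch in pure Python. It is not the procedure or software used in the study: the data are synthetic, the variable names (`x1`, `x2`, `x3`) are hypothetical, and only forward steps are taken.

```python
import math
import random

def ols_rss(X, y):
    """Residual sum of squares of a least-squares fit (X includes the intercept column)."""
    n, k = len(X), len(X[0])
    # Augmented normal equations [X'X | X'y]
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((y[r] - sum(b * x for b, x in zip(beta, X[r]))) ** 2 for r in range(n))

def forward_stepwise(columns, y):
    """Add the candidate that lowers AIC most; stop when no candidate lowers it."""
    n = len(y)

    def aic(names):
        X = [[1.0] + [columns[m][r] for m in names] for r in range(n)]
        return n * math.log(ols_rss(X, y) / n) + 2 * (len(names) + 1)

    selected, best = [], aic([])
    while True:
        candidates = [(aic(selected + [m]), m) for m in columns if m not in selected]
        if not candidates:
            break
        score, name = min(candidates)
        if score >= best:
            break
        selected.append(name)
        best = score
    return selected

# Synthetic data: y depends on x1 and on the x1*x2 interaction only.
random.seed(0)
n = 400
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [random.gauss(0, 1) for _ in range(n)]
columns = {
    "x1": x1, "x2": x2, "x3": x3,
    "x1_sq": [a * a for a in x1],
    "x1_x2": [a * b for a, b in zip(x1, x2)],
}
y = [1 + 2 * a + 1.5 * a * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

selected = forward_stepwise(columns, y)
print("selected terms:", selected)
```

AIC can also admit terms with no real effect, so the selected set is best read as a list of candidates to examine rather than a final model.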
38. Heterogeneity: CART
• Identify non-linearities and interactions
• Does not identify different models
• Challenge: independent variables vs. control variables
39. Clustering
1. Cluster all independent and control variables
2. Fit separate regression models to each cluster
• Popular in risk analytics
• Fast, easy
• Does not guarantee distinct relationships
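The two steps can be sketched as follows, using a small k-means written in pure Python and a simple per-cluster slope. Everything here is an assumption made for illustration: the data are synthetic, there is one independent variable and one control, and the clusters are deliberately built to have different slopes (1 and 3). On real data nothing guarantees such a difference.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Basic k-means with deterministic farthest-first initial centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def slope(xs, ys):
    """Least-squares slope of y on x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Synthetic rows (x, control, y): two segments with different x -> y slopes.
random.seed(1)
rows = []
for true_slope, control_center in ((1.0, 0.0), (3.0, 10.0)):
    for _ in range(200):
        x = random.gauss(0, 1)
        control = random.gauss(control_center, 0.5)
        rows.append((x, control, true_slope * x + random.gauss(0, 0.5)))

# Step 1: cluster on the independent and control variables (not on y).
labels = kmeans([(x, c) for x, c, _ in rows], k=2)

# Step 2: fit a separate regression within each cluster.
cluster_slopes = {}
for j in sorted(set(labels)):
    xs = [r[0] for r, lab in zip(rows, labels) if lab == j]
    ys = [r[2] for r, lab in zip(rows, labels) if lab == j]
    cluster_slopes[j] = slope(xs, ys)
    print(f"cluster {j}: n = {len(xs)}, slope = {cluster_slopes[j]:.2f}")
```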
42. Improve model validation, comparison, and generalization
Internal & external validity
Robustness across subsamples: non-random or random (overlapping/non-overlapping)
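A robustness check across random subsamples might look like the sketch below, which refits a simple slope on disjoint and on overlapping random subsamples and compares the estimates. The data are synthetic, and the true slope (0.05), the subsample sizes, and the helper names are illustrative assumptions.

```python
import random

def slope(xs, ys):
    """Least-squares slope of y on x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def disjoint_subsamples(n, k, rng):
    """Shuffle the row indices and cut them into k non-overlapping parts."""
    idx = list(range(n))
    rng.shuffle(idx)
    size = n // k
    return [idx[i * size:(i + 1) * size] for i in range(k)]

def overlapping_subsamples(n, k, size, rng):
    """Draw k subsamples independently, so the same row may appear in several."""
    return [rng.sample(range(n), size) for _ in range(k)]

rng = random.Random(42)
n = 20_000
xs = [rng.gauss(0, 1) for _ in range(n)]
ys = [0.05 * x + rng.gauss(0, 1) for x in xs]  # small true effect

def fit(indices):
    return slope([xs[i] for i in indices], [ys[i] for i in indices])

disjoint = disjoint_subsamples(n, 5, rng)
overlapping = overlapping_subsamples(n, 5, 4_000, rng)
disjoint_slopes = [fit(s) for s in disjoint]
overlapping_slopes = [fit(s) for s in overlapping]

print("disjoint:   ", [round(b, 3) for b in disjoint_slopes])
print("overlapping:", [round(b, 3) for b in overlapping_slopes])
```

If the estimates vary little across subsamples relative to their size, the result is less likely to hinge on a particular slice of the data.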