The document presents a new approach for measuring business proximity between firms using topic modeling. It aims to overcome limitations of existing approaches by developing a data-driven, scalable method that provides finer-grained analysis with limited data requirements. The approach applies latent Dirichlet allocation to uncover topics from company descriptions in the CrunchBase dataset. Business proximity is then measured as the cosine similarity between the topic distributions of firm pairs. The method is shown to outperform a baseline of using common industry membership and provides a validated measure of firms' technological and business relatedness.
Towards a better measure of business proximity: Topic modeling for industry intelligence
1. MISQ Workshop, Leuven, Belgium, August 2015
August 13th 2015
Zhan (Michael) Shi, Arizona State University
Gene Moo Lee*, University of Texas at Arlington
Andrew B. Whinston, University of Texas at Austin
* presenter
Business proximity: motivation
• To measure firms’ dyadic relatedness in the spaces of product, market, and technology
• Essential in competitive/industry intelligence
• Building block in the strategy and industrial organization fields
• Existing methods
  • Common industry membership (Wang and Zajac 2007)
  • Patent holdings (Stuart 1998, Mowery et al. 1998)
  • Geographic distance (Mitsuhashi and Greve 2009)
• These approaches have strong data requirements
  • Such data are typically scarce for early-stage high-tech startups
Our Big Data approach
• Our approach: a unified framework that integrates
  • Machine learning (LDA topic model)
  • Statistical network model (ERGM)
  • Big Data technologies (Cloud, NoSQL, Condor)
• Outperforming existing approaches
  • Automatic processing (vs. manual inspection)
  • Dynamic industry definition (vs. static)
  • Finer granularity (vs. discrete)
  • Relaxed data requirement (vs. patent, location)
Main contributions
1. Propose a transformative data-analytic framework for understanding the dynamic startup landscape
2. Construct an explicit network structure for understanding firm interactions
3. Implement a business intelligence (BI) system for competitive intelligence in the U.S. high-tech industry
Roadmap
1. CrunchBase Data
2. Data-Analytics based Business Proximity
3. Empirical Validation
4. Empirical Application on M&A Analysis
5. Industry Intelligence System
6. Conclusion and implication
CrunchBase data
• CrunchBase: open database (“Wikipedia”) of the high-tech industry
• Data collection period: April 2013 to April 2015
• 24,382 U.S. high-tech companies (1.4% public; 5.7 years old on average)
• Fields: HQ location, CB-defined industry sector, key personnel, M&A, investments, business summary
• States: CA, NY, MA, TX
• Industries: software, web, e-commerce, ad, mobile
Data: networked business
• M&A: 1,689 total
  • cross-state: 62.6%
  • cross-sector: 63.6%
  • top 10 buyers: 14.3% (skewed)
• Investments: 531 total
• Job mobility: 19K total
Our approach on business proximity
• Objectives: data-driven, scalable, finer granularity, limited data requirements
• Approach: topic modeling [Blei et al. 2003]
  • unsupervised learning to discover latent “topics” from a large collection of documents
[Figure: 24K company descriptions → LDA → industry-wide topics and each company’s topics]
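The topic-discovery step above can be illustrated with a small sketch. The slides do not say which inference algorithm or library was used, so this is an assumption: a standard collapsed Gibbs sampler for LDA, written in plain Python. The function name `lda_gibbs`, the hyperparameter values, and the toy descriptions are hypothetical and only show the shape of the input (tokenized company descriptions) and the output (one topic vector per company).

```python
import random

def lda_gibbs(docs, num_topics, alpha=0.1, eta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns one topic-proportion vector per document.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]        # count of topic k in doc d
    topic_word = [[0] * V for _ in range(num_topics)]   # count of word v in topic k
    topic_total = [0] * num_topics                      # total words in topic k
    assign = []                                         # topic of each word occurrence

    # Random initial topic for every word occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                v, k = word_id[w], assign[d][n]
                # Remove the current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + eta) / (topic_total[t] + V * eta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][n] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions (each vector sums to 1).
    thetas = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        thetas.append([(doc_topic[d][t] + alpha) / denom for t in range(num_topics)])
    return thetas

# Hypothetical toy descriptions; the deck uses about 24K CrunchBase summaries.
descriptions = [
    "online video streaming music platform".split(),
    "music streaming mobile app".split(),
    "solar energy storage battery".split(),
    "wind solar energy grid".split(),
]
topic_vectors = lda_gibbs(descriptions, num_topics=2)
```

Each entry of `topic_vectors` plays the role of a company’s topic distribution. At the scale of the deck (about 24K descriptions, 50 topics) an optimized library implementation would be the more practical choice.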
Business proximity from topic model
• Business proximity p_b(i, j) between firms i and j
• Cosine similarity of the topic vectors T_i and T_j
• Range: 0 (no commonality) to 1 (same business components)
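As a minimal sketch of this definition, the proximity can be computed as p_b(i, j) = (T_i · T_j) / (‖T_i‖ ‖T_j‖). Because topic proportions are non-negative, the value stays between 0 and 1, matching the range above. The function name `business_proximity` and the three example firms are hypothetical, chosen only for illustration.

```python
import math

def business_proximity(t_i, t_j):
    """Cosine similarity between two topic vectors of equal length."""
    if len(t_i) != len(t_j):
        raise ValueError("topic vectors must have the same length")
    dot = sum(a * b for a, b in zip(t_i, t_j))
    norm_i = math.sqrt(sum(a * a for a in t_i))
    norm_j = math.sqrt(sum(b * b for b in t_j))
    if norm_i == 0 or norm_j == 0:
        raise ValueError("topic vectors must be non-zero")
    return dot / (norm_i * norm_j)

# Hypothetical 3-topic distributions for three firms.
firm_a = [0.7, 0.2, 0.1]
firm_b = [0.6, 0.3, 0.1]
firm_c = [0.0, 0.1, 0.9]

close_pair = business_proximity(firm_a, firm_b)  # similar topic mix
far_pair = business_proximity(firm_a, firm_c)    # mostly different topics
```

Here `close_pair` is larger than `far_pair`, since firms A and B put most of their weight on the same topic while firm C does not.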
Business topic model
[Figure: LDA plate diagram. Labels: observed business descriptions; per-word business topic assignment; per-firm business topic distribution; business topics; topic parameter; proportions parameter. K: number of topics; D: number of companies; N: number of words]
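The diagram labels correspond to the standard (smoothed) LDA generative process of Blei et al. 2003, which can be written out as below. The symbols are assumptions, since the slide gives only descriptive labels: η for the topic parameter, α for the proportions parameter, β_k for the business topics, θ_d for the per-firm topic distribution, z_{d,n} for the per-word topic assignment, and w_{d,n} for the observed words of a description. N is read here as the number of words in a description.

```latex
\begin{align*}
\beta_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d), & n &= 1,\dots,N \\
w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Multinomial}(\beta_{z_{d,n}}), & n &= 1,\dots,N
\end{align*}
```

Only the words w_{d,n} are observed. The per-firm distributions θ_d are inferred and serve as the topic vectors used in the proximity measure.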
LDA topic model with CrunchBase
• Example topics (from the complete list of 50): video/music, energy, sports, healthcare