Whenever you do something that nobody has tried before, you often run into difficulties that, well, nobody has found before! In this talk we walk you through the data, the modelling, the problems we encountered and the solutions we found while working with Graydon on predicting whether different kinds of companies will relocate. For this use case, we analyzed 10 years of data for two million companies using around 100 descriptors / features, and produced a predictive model using decision trees and random forests on Google Cloud Platform.
4. Relocation Prediction use case
Problem:
businesses, schools, hospitals, etc. move locations over time
(growth, bankruptcy, new markets, etc.)
Can we predict if they will relocate?
- To where?
- When?
- Why?
=> For now, we focus only on
relocation probability
5. For businesses we have historical Corporate Data:
- Company size, credit rating, relocation, etc...
=> Can company characteristics predict relocation?
- Useful information for service providers, realtors, city councils,
investors and developers
=> Investigatory POC: 6-week study
- Limit the scope to determine if relocation can be predicted, and
if so, which properties can be a signal
6. (Big) Data
We encountered some challenges:
- Monthly data from branches of 2 million companies, going back
10 years… ~ 300 million rows
- Dispersed data: where/how should it be gathered?
- Monthly data too granular: how to aggregate?
- Client did not have a suitable platform for data handling and
analysis...
7. Data & Modeling Considerations
- High dimensional time series data
- Preserve the temporal granularity to maximize information
- Neural Networks?
- LSTM or CNNs?
- NN design/exploration time > available time
- Simplify data and modeling due to time constraints
8. Preparing the Data
- Step 1: Collect the data on an appropriate platform:
- Set up Google Cloud Platform in one week
- Step 2: Aggregate the data
- From monthly to yearly: predict relocation from yearly aggregates
- Choose how to deal with categorical variables (see the sketch below)
- Subsequent steps: spawn virtual machine(s) on GCP for modeling
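A minimal sketch of the aggregation step in pandas, assuming a hypothetical monthly snapshot table with company_id, snapshot_date, employees, credit_rating, and a relocated_this_month flag (the actual Graydon schema and aggregation rules are not shown in the talk):

    import pandas as pd

    # Hypothetical monthly snapshot table: one row per company per month.
    monthly = pd.read_csv("monthly_snapshots.csv",
                          parse_dates=["snapshot_date"])
    monthly["year"] = monthly["snapshot_date"].dt.year

    # Aggregate monthly rows down to one row per company per year.
    yearly = (monthly
              .groupby(["company_id", "year"])
              .agg(employees_mean=("employees", "mean"),
                   credit_rating_last=("credit_rating", "last"),
                   has_relocated=("relocated_this_month", "max"))
              .reset_index())

    # One way to deal with categorical variables: one-hot encoding.
    yearly = pd.get_dummies(yearly, columns=["credit_rating_last"],
                            prefix="rating")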
9. Summary Statistics
- Final dataset: 75 features from one year and a ‘has_relocated’
target from the following year
- 2 million entries per year
- ~5% relocation (imbalanced dataset)
- Goal: Build a model that predicts ‘has_relocated’ better than the
trivial baseline: always predicting ‘no relocation’ is already 95%
accurate on this imbalanced data (see the sketch below)
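A quick back-of-the-envelope check of why raw accuracy is a misleading goal here:

    # ~5% of companies relocate, so a trivial "never relocates" model
    # already reaches ~95% accuracy; the model has to beat that bar.
    n_total = 2_000_000                   # entries per year (from the slides)
    n_relocated = int(0.05 * n_total)     # ~5% positives
    baseline_accuracy = (n_total - n_relocated) / n_total
    print(baseline_accuracy)              # 0.95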
10. Modeling step 1: Exploring Models
- Apply binary classification algorithms: SVM, logistic regression,
decision trees (DT), random forests (RF)
- Choose models with the best performance: AUC, Cohen’s kappa
- DTs and RFs did best
- Apply sampling techniques to improve the models (see the sketch below)
- Tune model parameters
- Validate
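A sketch of what this exploration loop could look like with scikit-learn and imbalanced-learn, assuming X and y hold the yearly features and ‘has_relocated’ target; the undersampler, parameter grid, and scorers are illustrative choices, not the exact setup from the talk:

    from imblearn.under_sampling import RandomUnderSampler
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.metrics import make_scorer, cohen_kappa_score

    # X, y: yearly feature matrix and 'has_relocated' target (hypothetical).
    # Undersample the majority class to balance the training data.
    X_bal, y_bal = RandomUnderSampler(random_state=0).fit_resample(X, y)

    # Explore DT parameters via grid search with 5-fold CV, scored on AUC.
    grid = GridSearchCV(
        DecisionTreeClassifier(random_state=0),
        param_grid={"max_depth": [3, 5, 10, None],
                    "min_samples_leaf": [1, 10, 100]},
        scoring="roc_auc",
        cv=5)
    grid.fit(X_bal, y_bal)

    # Cross-check the tuned tree with Cohen's kappa.
    kappa = cross_val_score(grid.best_estimator_, X_bal, y_bal,
                            scoring=make_scorer(cohen_kappa_score), cv=5)
    print(grid.best_params_, grid.best_score_, kappa.mean())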
11. Modeling step 2: Results
[ROC curve: TPR vs. FPR; AUC = 0.66]
Best DT model produced by undersampling the data, 5-fold CV, and DT
parameters explored via grid search
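For reference, a ROC curve like the one on this slide can be drawn with scikit-learn; model, X_test, and y_test are hypothetical names for a tuned classifier and a held-out split, following the sketch above:

    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve, roc_auc_score

    # model, X_test, y_test: tuned classifier and held-out split (hypothetical).
    probs = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, probs)

    plt.plot(fpr, tpr, label=f"DT (AUC = {roc_auc_score(y_test, probs):.2f})")
    plt.plot([0, 1], [0, 1], "k--", label="chance")
    plt.xlabel("FPR")
    plt.ylabel("TPR")
    plt.legend()
    plt.show()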
12. Modeling Results: Features
The most important features influencing the ‘has_relocated’ target
were related to:
- Company financial assessments and health
- Company age
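A sketch of how such importances can be read off a fitted tree via scikit-learn’s impurity-based feature_importances_; X, y, and feature_names are hypothetical stand-ins for the yearly dataset:

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    # X, y, feature_names: yearly features, target, and column names
    # (hypothetical stand-ins for the actual dataset).
    model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)
    importances = pd.Series(model.feature_importances_, index=feature_names)
    print(importances.sort_values(ascending=False).head(10))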
14. Validation
How well can yearly models predict the next year’s relocation?
… in general, rather well
[Plot: AUC per year]
15. Validation
How well can yearly models predict the next year’s relocation?
… in general, rather well
… except for 2016 (?)
[Plot: AUC per year]
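A minimal walk-forward loop matching this validation scheme, assuming a hypothetical data dict mapping each (consecutive) year to its (X, y) pair:

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import roc_auc_score

    # data: dict mapping year -> (X, y) for that year (hypothetical,
    # assumes consecutive years). Train on year t, score on year t + 1.
    for year in sorted(data)[:-1]:
        X_train, y_train = data[year]
        X_next, y_next = data[year + 1]
        model = DecisionTreeClassifier(max_depth=5, random_state=0)
        model.fit(X_train, y_train)
        auc = roc_auc_score(y_next, model.predict_proba(X_next)[:, 1])
        print(f"{year} -> {year + 1}: AUC = {auc:.2f}")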
16. Takeaways
- Company properties can be indicative of whether a company will relocate
- Yearly aggregated data is sufficient for high-level indications of
relocation
- More granular modeling (e.g. with NNs) may provide additional
information
- Possible to perform a successful POC on big data within 6 weeks
on GCP
17. Future work
Had we had more time, we would have:
- done full time-series modeling (NN, hierarchical modeling, etc.)
- automated prediction, given company characteristics
- investigated the anomalous year (2016)
- made use of the modeling results