- Andrey Belas is an AI Solution Architect at SMART business, an expert in machine learning, and a public speaker
- He created and mentors the SMART Data Science Academy and is responsible for the technical development of the data science team and architecture of all data science projects at SMART business
- He has Microsoft certifications in areas like Big Data and Advanced Analytics, Cloud Data Science with Azure Machine Learning, and Developing SQL Data Models
- He has experience in domains like Deep Learning, Computer Vision, AI in Forecasting, AI in Marketing, and Risk Management
Andrii Belas "Modern approaches to working with categorical data in machine learning"
2. Andrey Belas, AI Solution Architect, SMART business
Expert in machine learning, public speaker.
Creator and mentor of the SMART Data Science Academy; responsible
for the technical development of the data science team and the
architecture of all data science projects at SMART business.
Microsoft Certified Professional in:
Big Data and Advanced Analytics
Cloud Data Science with Azure Machine Learning
Developing SQL Data Models
Experience:
Deep Learning
Computer Vision
AI in Forecasting
AI in Marketing
Risk Management
Business Intelligence
9. Tabular data
• Basic type of data: spreadsheet, relational database, financial reports…
• Credit scoring
• Pricing
• Recommendation systems
• Sales forecasting
• Customer churn
• Fraud detection
10. Let’s assume that preparation is done
• Business Understanding – identify project objectives
• Data Understanding – collect and review data
• Data Preparation – select and cleanse data
• Modelling – manipulate data and draw conclusions
• Evaluation – evaluate model and conclusions
• Deployment – apply conclusions to business
11. Feature types
• Numeric – could be any number (age, salary …)
• Categorical – takes one value from a small set of possibilities
(gender, occupation)
• Other types (text, images, audio …)
12. Let’s start modeling…
• But machine learning models can only learn from numeric values (mostly)
• Random forest – some implementations support only a limited number of categories
• XGBoost – numeric only
• So should we drop non-numeric features?
• No! They potentially have predictive power.
15. Machine Learning: Modern
• Label encoding assigns an arbitrary order that has no correlation with the target
• Trees struggle with high-cardinality categorical variables because trees have
limited depth
• We want to use the target to generate features – target encoding
• Mean encoding is the most common variant
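The contrast between label encoding and mean encoding can be sketched in a few lines of pandas. This is a minimal illustration, not the speaker's code: the `city` and `target` columns and their values are made up for the example.

```python
import pandas as pd

# Toy data (made up for illustration): one categorical feature, one binary target.
df = pd.DataFrame({
    "city": ["Kyiv", "Kyiv", "Lviv", "Lviv", "Lviv", "Odesa"],
    "target": [1, 0, 1, 1, 0, 1],
})

# Label encoding: each category gets an integer code. The order is
# alphabetical here, so it carries no information about the target.
df["city_label"] = df["city"].astype("category").cat.codes

# Mean (target) encoding: each category is replaced by the mean of the
# target observed for that category.
category_means = df.groupby("city")["target"].mean()
df["city_mean"] = df["city"].map(category_means)

print(df)
```

Computing the means on the same rows that are later used for training leaks the target into the feature, which is why the plain version shown here overfits easily.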
17. Machine Learning: Modern
• Easy to overfit; use regularization or special packages
• Try some modern libraries with built-in encodings:
• LightGBM
• CatBoost
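One common form of regularization for mean encoding is additive smoothing, which pulls the estimate for rare categories toward the global mean. The sketch below is an assumption about what "regularization" could look like here, not the method from the talk; the function name `smoothed_mean_encoding`, the `weight` parameter, and the data are hypothetical.

```python
import pandas as pd


def smoothed_mean_encoding(categories: pd.Series, target: pd.Series,
                           weight: float = 10.0) -> pd.Series:
    """Blend each category's mean target with the global mean.

    `weight` acts like a number of pseudo-observations at the global mean:
    the fewer rows a category has, the closer its encoding stays to it.
    """
    global_mean = target.mean()
    stats = target.groupby(categories).agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + weight * global_mean) / (
        stats["count"] + weight
    )
    return categories.map(smoothed)


# Toy data (made up for illustration).
city = pd.Series(["Kyiv", "Kyiv", "Lviv", "Lviv", "Lviv", "Odesa"])
target = pd.Series([1, 0, 1, 1, 0, 1])

encoded = smoothed_mean_encoding(city, target, weight=2.0)
print(encoded)
```

"Odesa" appears only once with target 1, so its raw mean would be 1.0; with smoothing it moves toward the global mean instead. In practice the statistics would usually be computed on training folds only, so that a row's own target never contributes to its encoding.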
21. Deep Learning: Embedding
• Inspired by NLP (word embeddings, word2vec), but not yet covered in
textbooks
• We will use an embedding layer to treat categorical variables!
• This approach allows relationships between categories to be captured
• There may be patterns for cities that are geographically near each other, and
for cities of similar socio-economic status, etc.
• Much lower dimensionality
22. Deep Learning: Word2Vec
• Note the difference between the first
two rows and the rest
• The first dimension captures
something related to being a dog,
and the second dimension captures
youthfulness
• We definitely won’t one-hot encode
a vocabulary today
23. Deep Learning: Embedding
• Much smaller
• Learned from data
• Latent (hidden) features – can visualize them
• Can then be reused as pretrained embeddings (for shops, for example) – transfer learning
27. Deep Learning: Embedding only
• Doesn’t cover interactions with other variables
• Multiple categorical variables can cause problems
• Solution – multimodal (multi-input) neural networks!
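The multi-input idea can be sketched as a forward pass in plain NumPy: each categorical input gets its own embedding table, and the embedded vectors are concatenated with the numeric features before the dense layers. Everything here is assumed for illustration (the `city` and `shop` inputs, the layer sizes, the random weights); a real model would be built and trained in a deep learning framework such as Keras.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: two categorical inputs and two numeric inputs.
n_cities, n_shops = 5, 3      # number of distinct categories per input
city_dim, shop_dim = 2, 2     # embedding sizes (chosen arbitrarily)
n_numeric, n_hidden = 2, 4

# Parameters that training would learn; random values stand in for them here.
city_emb = rng.normal(size=(n_cities, city_dim))
shop_emb = rng.normal(size=(n_shops, shop_dim))
w_hidden = rng.normal(size=(city_dim + shop_dim + n_numeric, n_hidden))
b_hidden = np.zeros(n_hidden)
w_out = rng.normal(size=n_hidden)


def forward(city_ids, shop_ids, numeric):
    """Embed each categorical input, concatenate with numeric features,
    then apply one ReLU hidden layer and a linear output."""
    x = np.concatenate(
        [city_emb[city_ids], shop_emb[shop_ids], numeric], axis=1
    )
    hidden = np.maximum(0.0, x @ w_hidden + b_hidden)
    return hidden @ w_out


# A batch of three rows (made-up values).
city_ids = np.array([0, 4, 2])
shop_ids = np.array([1, 1, 0])
numeric = np.array([[0.5, 1.0], [0.1, 0.0], [0.9, 2.0]])

predictions = forward(city_ids, shop_ids, numeric)
print(predictions.shape)
```

Because the hidden layer sees the embeddings and the numeric features together, it can model interactions between them, which an embedding-only model cannot.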
35. Deep Learning: Embeddings
• An embedding is a lookup in an array (looking something up in an array is
mathematically identical to a matrix product with a one-hot encoded matrix,
but much more efficient)
• Embeddings are amazing!
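The equivalence between an array lookup and a one-hot matrix product can be checked directly. A minimal NumPy sketch, with an arbitrary random embedding matrix and made-up category ids:

```python
import numpy as np

rng = np.random.default_rng(0)

n_categories, embedding_dim = 6, 3
# Stand-in for a learned embedding matrix: one row per category.
embedding_matrix = rng.normal(size=(n_categories, embedding_dim))

category_ids = np.array([2, 0, 5, 2])

# Route 1: plain array lookup, as an embedding layer does.
looked_up = embedding_matrix[category_ids]

# Route 2: one-hot encode the ids, then multiply by the same matrix.
one_hot = np.eye(n_categories)[category_ids]
multiplied = one_hot @ embedding_matrix

print(np.allclose(looked_up, multiplied))
```

The lookup touches only the selected rows, while the matrix product multiplies by a mostly-zero matrix whose width grows with the number of categories.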
36. Useful links
• https://ru.coursera.org/learn/competitive-data-science - Coursera course on
modern ML
• http://contrib.scikit-learn.org/categorical-encoding/ - Python package
• https://github.com/bfgray3/cattonum - R library
• https://keras.io/getting-started/functional-api-guide/ - Keras functional API
• https://www.kaggle.com/colinmorris/embedding-layers - full example