If there is one crucial step in building ML models, it is data preparation: the process of transforming raw data into a state where machine learning algorithms can be run to reveal insights and make predictions. Data preparation requires analysis and depends on the nature of the problem and the particular algorithms. Because it relies on knowledge and experience, it cannot be fully automated, which makes the role of the data scientist key to success.
ML is trendy, and Microsoft already has more than 10 services to support it. We will focus on tools like Azure ML Workbench and Python for data preparation, review some common tricks for approaching data, and experiment in Azure ML Studio.
3. About me
• Software Architect @
o 16+ years professional experience
• Microsoft Azure MVP
• External Expert Eurostars-Eureka, InnoFund Denmark
• External Expert Horizon 2020
• Business Interests
o Web Development, SOA, Integration
o IoT, Machine Learning, Computer Intelligence
o Security & Performance Optimization
• Contact
o ivelin.andreev@icb.bg
o www.linkedin.com/in/ivelin
o www.slideshare.net/ivoandreev
4. Agenda
• Microsoft Tools for ML
• The Data Science Process (Step by Step)
• Data Preparation
• DEMO
5. ML tools in the Microsoft World
Data preparation
Building models
Consuming models
6. Machine Learning and Microsoft
• Azure ML – integrated, end-to-end data science and advanced analytics
• Microsoft ML related services/tools
• Highlights
o Built on open source technologies (Jupyter Notebook, Spark, Python, Docker)
o Execute experiments in isolated environments
o GPU-enabled VMs
DEPRECATED (now called “Machine Learning Service”, in preview):
• Azure ML Workbench
• Azure ML Experimentation Service
• Azure ML Model Management Service
MAINTAINED AND IMPROVED:
• Azure ML Studio
• Data Science VM
• Azure Databricks
• Visual Studio Code Tools for AI
• Microsoft Cognitive Services, LUIS.ai
• Libraries for Apache Spark (MMLSpark)
• Cognitive Toolkit (CNTK)
• ML Services for SQL Server (R, Python)
• Azure Batch AI Training
7. Azure ML Workbench
• Desktop application (Windows, macOS) with built-in Jupyter Notebook services and Git integration
• End-to-end process support
o Model development and experimentation (Python)
o Powerful inspectors for data analysis
o Data transformations by example
o Model history and deployment
• Easy to use, but resource hungry
* Replaced in the Sept 24, 2018 release to make way for an improved architecture
(ref. to Azure ML SDK for Python or Azure Databricks for big datasets)
8. Azure ML Studio
• Visual workspace to build, test and deploy ML solutions
• Highlights
o Cross-browser drag and drop, no programming
o Rich set of modules
o Fits beginners and advanced users
o Unlimited extensibility (R Script, Python Script)
o Enterprise grade cloud service (SLA 99.95%)
o ML REST web services consumption
o Jupyter Notebook
o Azure AI Gallery (9000+ samples)
• At what price?
o Free plan available (10GB storage, 2 web services, 1000 requests/month)
o $10 seat/month + $1 experiment/hour
o Recommended: $100/month (unlimited storage, 10 web services, 100K requests)
9. Azure Data Science VM
• Pre-configured cloud environment for AI & Data Science
• Highlights
o Fully operational environment
o 50+ tools for DEV, ML, Big Data and data management
o Windows and Linux (Ubuntu/CentOS)
o Updated every few months
o On-demand elastic capacity
o GPU optimized VMs for deep learning
o Up to 4x GPUs (NVIDIA K80 or V100)
o Up to 128 vCPUs, up to 6,144 GiB RAM
• At what price?
o From $11.76/month to $14,314/month
10. Azure ML Service (preview)
• Cloud-based environment to develop, train, test, deploy, manage, and track ML models
• Highlights
o Model management
o Distributed deep learning
o Version control and reproducibility
o Hybrid deployment (Local, Cloud, Edge)
o Automated modeling and tuning (algorithm and parameters)
o Latest open source technologies (TensorFlow, PyTorch, Jupyter, Docker)
o Scale up or out with large GPU-enabled clusters in the cloud
• At what price?
o From $23.51/month to $29,143.94/month
11. The purpose of ML modelling is:
• Generate predictions
• Understand true relations
12. Machine Learning Challenges
• Asking the right questions
• Typically 1 Model = 1 Question
• Requires training data
o Real-world data is messy (wrong or missing data)
o Feature engineering transforms to predictive features
o Feature extraction (e.g. IP address -> population density)
o Feature selection for informative features
• Overfitting model
o “Kicks ass” while training,
o but fails badly on real predictions
• Model validation
o “Sense” how well model works on new data
13. The Data Scientist Job
• Appealing
o 64% believe they are working in this century’s “sexiest” job
• In demand
o 90% contacted at least once a month with a job offer
o 50% weekly, 30% several times/week; 35% have <2y experience
• The dark side…
o All models are wrong, some are useful
o 80% of the time is data preparation
o Real-life, not academic problems
o Non-linear hypothesis testing
o No full automation
• No one cares how you do it
• Presentation is the key
15. Data Understanding (Titanic Dataset)
• Mosaic plot
o Categorical distribution
o Visualizes the relation between X and Y
o Strong relation = Y-splits are far apart
o Conclusion: Women have higher survival rate
• Box plot
o Continuous distribution of a numeric variable
o IQR = middle 50%
o Identify outliers outside [Q1 − 1.5·IQR; Q3 + 1.5·IQR]
o Conclusion: High fares have higher survival rate
• Scatter plot
o How much one variable determines another
o Conclusion: Infants and men 25-45 y have higher survival rate
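The box-plot outlier rule above is easy to apply directly. A minimal sketch with NumPy (the fare values are made up for illustration, not taken from the actual Titanic dataset):

```python
import numpy as np

def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]

# Example: fares with one extreme value
fares = [7.25, 8.05, 13.0, 26.0, 35.5, 512.33]
print(iqr_outliers(fares))  # only the extreme fare is flagged
```

The same bounds are what a box plot draws as its whiskers, so points outside them are exactly the dots plotted beyond the whiskers.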
16. Data Preprocessing
• Make features usable
o Numerical
o Categorical (e.g. week day)
o PCA dimensionality reduction
o Dummy variables
• Handle missing data
• Normalize data
o Standard range of numerical scale (e.g. from [-1000;1000] -> [0;1] or [-1;1])
o Value range influences the importance of the feature compared to others
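A minimal preprocessing sketch with pandas, assuming a toy frame (the column names and values are invented for illustration): impute missing data, normalize a wide numeric range to [0;1], and encode a categorical feature as dummy variables.

```python
import pandas as pd

# Toy frame: a categorical column, a missing value, a wide numeric range
df = pd.DataFrame({
    "weekday": ["Mon", "Tue", "Mon", "Wed"],
    "balance": [-1000.0, None, 250.0, 1000.0],
})

# Handle missing data: impute with the column median
df["balance"] = df["balance"].fillna(df["balance"].median())

# Normalize [-1000;1000] -> [0;1] so scale does not dominate other features
lo, hi = df["balance"].min(), df["balance"].max()
df["balance"] = (df["balance"] - lo) / (hi - lo)

# Dummy variables for the categorical feature
df = pd.get_dummies(df, columns=["weekday"])
print(df)
```

Median imputation and min-max scaling are only one of several reasonable choices; mean imputation or standardization to zero mean/unit variance are equally common.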
17. Feature Engineering
Increase predictive power by creating features from raw data
• Features closely related to the target (predicting default -> debt/balance ratio)
• Easier interpretation (Date to Year/Month/Day/Hour)
• Lag features to “look back” before the date (1, 2,… N days ago)
• Rolling aggregates – smoothing over a time window
• Categorical features
o Identify discrete features
Check the Azure team data science process:
https://docs.microsoft.com/en-gb/azure/machine-learning/team-data-science-process/create-features
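Date decomposition, lag features, and rolling aggregates can all be sketched with pandas (the sales series below is invented for illustration):

```python
import pandas as pd

sales = pd.DataFrame({
    "date": pd.date_range("2018-01-01", periods=6, freq="D"),
    "units": [10, 12, 9, 15, 14, 20],
})

# Date parts are easier to interpret than a raw timestamp
sales["month"] = sales["date"].dt.month
sales["weekday"] = sales["date"].dt.dayofweek

# Lag features "look back" 1 and 2 days before each date
sales["units_lag1"] = sales["units"].shift(1)
sales["units_lag2"] = sales["units"].shift(2)

# Rolling aggregate: 3-day mean smooths the series over a time window
sales["units_roll3"] = sales["units"].rolling(window=3).mean()
print(sales)
```

Note that the first rows of the lag and rolling columns are NaN because there is no history to look back on; those rows are typically dropped before training.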
18. Digital Media Feature Engineering
Note: All information is encoded in the digital media
• Images
o Step 1: Colour statistics, EXIF metadata, edges, shapes
o Step 2: Extract knowledge into a fixed set of numeric characteristics
• Text
o Step 1:
• Bag-of-words, N-grams, term frequency, topic modelling, stemming
• Named entity recognition (e.g. Wikipedia)
o Step 2: Extract knowledge into a fixed set of numeric characteristics
19. Feature Selection – select the most predictive features
• ML handles x1000 params, but not all params are equal
• Adding features – a common approach to increase accuracy
• Poor performance – correlated features could lead to poor model performance
• Overfitting – learning relations in too much detail may lead to overfitting
20. Selecting Good Features
• Motivation
o Not only prediction but identification of predictive features
o Computational costs are related to number of features
o Limit external sensors and data sources
• Approach
o Trying all combinations of features? (that would be infeasible)
• Methods
o Forward selection & Backward elimination
o Filter – independent from the ML algorithm
o Embedded – built-in search for predictive features in the ML algorithm
o Wrapper – measure feature usefulness while training
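A filter method scores each feature independently of any downstream ML algorithm. A minimal sketch using absolute Pearson correlation with the target as the score (the synthetic data is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x_informative = rng.normal(size=n)
x_noise = rng.normal(size=n)
# The target depends only on the first feature
y = 2.0 * x_informative + rng.normal(scale=0.1, size=n)

X = np.column_stack([x_informative, x_noise])

# Filter method: rank features by |Pearson correlation| with y,
# without training any model
scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
ranking = np.argsort(scores)[::-1]
print(ranking)
```

Filters are cheap but blind to feature interactions; wrapper and embedded methods catch those at a higher computational cost.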
21. Tuning Model Parameters
• Model parameters control inner behaviour
o The more sophisticated the algorithm, the more parameters
o e.g. Locally Deep SVM with kernel: kernel type, kernel coefficient
• How does parameter tuning work?
1. Choose a metric for evaluation (AUC for classification, R² for regression, etc.)
2. Select parameters for optimization
3. Define a grid as Cartesian product between arrays
4. For each combination, cross-validate on training set
5. Select the parameters for the best evaluation
Note: Expected improvement is 3%-8%
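The five steps above map directly onto scikit-learn's `GridSearchCV` (assuming scikit-learn is available; the dataset and parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Steps 2-3: parameters to optimize; the grid is the Cartesian
# product of the arrays (3 x 2 = 6 combinations)
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Step 1: AUC as the evaluation metric; steps 4-5: cross-validate
# every combination and keep the best one
search = GridSearchCV(SVC(), param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The grid grows multiplicatively with each added parameter, which is why randomized or Bayesian search is often preferred for larger spaces.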
22. False Alarms
False alarms have serious impact
• Degraded confidence in the system
• Loss of revenue
• Loss of brand image
23. Performance Metrics
• Regression model
o Root Mean Squared Error (RMSE)
o Coefficient of Determination, R² ∈ [0;1]
• Multi-class classification model
o Confusion matrix
• Binary classification model
o Accuracy based on correct answers
o Area under ROC curve (AUC)
o Threshold
o Precision = TP / (TP + FP)
o Recall = TP / (TP + FN)
o F1 score (balances Precision and Recall)
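The binary-classification formulas above can be computed directly from the counts of true/false positives and false negatives (the labels below are invented for illustration):

```python
# Toy labels: 4 positives, 6 negatives in y_true
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Count the confusion-matrix cells for the positive class
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)   # TP / (TP + FP)
recall = tp / (tp + fn)      # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```

With 3 true positives, 1 false positive and 1 false negative, precision and recall are both 0.75, and F1 (their harmonic mean) is also 0.75.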
24. Handling Imbalanced Data
• Imbalanced: more examples of one class than others (0.001%)
• Errors are not the same
o Prediction of minority class (failures) is more important
o Asymmetric cost (false negative can cost more than false positive)
• Compromised performance of standard ML algorithms
o For 1% minority class, Accuracy of 99% does not mean useful model
o PR-curve is better for imbalanced data
• Oversampling
o SMOTE (Synthetic Minority Oversampling Technique) allows better learning
o Generates examples combining features of the target class with features of its neighbours
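The SMOTE idea can be sketched in a few lines of NumPy: each synthetic example lies on the segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration of the interpolation step, not the full algorithm (in practice the `imbalanced-learn` library provides a complete implementation).

```python
import numpy as np

def smote_like(minority, n_new, k=2, seed=0):
    """SMOTE-style sketch: interpolate between a minority sample
    and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # Distances from sample i to every minority sample
        d = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip self at index 0
        j = rng.choice(neighbours)
        gap = rng.random()                    # position along the segment
        out.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(out)

# Three minority samples (invented), four synthetic ones
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
synthetic = smote_like(minority, n_new=4)
print(synthetic)
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the minority region instead of merely duplicating existing rows, which is what "allows better learning".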
25. Takeaways
Team Data Science Process
https://azure.microsoft.com/en-gb/documentation/learning-paths/data-science-process/
ML in the Microsoft World
https://docs.microsoft.com/en-us/azure/machine-learning/
Python for AI
https://wiki.python.org/moin/PythonForArtificialIntelligence
Data Science Blog
https://data-flair.training/blogs/category/machine-learning/
Starter Books