DECEMBER 15
GLOBAL AI BOOTCAMP IS POWERED BY:
The Data Science Process in ML
How to Apply It and When do We Need It?
About me
• Software Architect @
o 16+ years professional experience
• Microsoft Azure MVP
• External Expert Horizon 2020
• External Expert Eurostars-Eureka, InnoFund Denmark
• Business Interests
o Web Development, SOA, Integration
o IoT, Machine Learning, Computer Intelligence
o Security & Performance Optimization
• Contact
ivelin.andreev@icb.bg
www.linkedin.com/in/ivelin
www.slideshare.net/ivoandreev
AGENDA
Major Tools
The Purpose of ML
AI as a Service
Iterative ML Process
Takeaways
Demo
Machine Learning and Microsoft
• Azure ML: integrated, end-to-end data science and advanced analytics
• Microsoft ML related services/tools
• Highlights
o Built on open source technologies (Jupyter Notebook, Spark, Python, Docker)
o Execute experiments in isolated environments and GPU-enabled VMs
DEPRECATED (now called “Machine Learning Service”, preview)
• Azure ML Workbench
• Azure ML Experimentation Service
• Azure ML Model Management Service
MAINTAINED AND IMPROVED
• Azure ML Studio
• Data Science VM
• Azure Databricks
• Cognitive Toolkit (CNTK)
• Azure Batch AI Training
• Visual Studio Code Tools for AI
• Microsoft Cognitive Services, LUIS.ai
• Libraries for Apache Spark (MMLSpark)
• ML Services for SQL Server (R, Python)
Azure ML Workbench
Desktop application (Windows, macOS) with
• Built-in Jupyter Notebook services and Git integration
• End-to-end process support
o Model development and experimentation (Python)
o Powerful inspectors for data analysis
o Data transformations by example
o Model history and deployment
• Easy to use, but resource hungry
* Replaced in the September 24, 2018 release to make way for an improved architecture
(see the Azure ML SDK for Python, or Azure Databricks for big datasets)
Azure ML Studio
• Visual workspace to build, test and deploy ML solutions
• Highlights
o Cross-browser drag and drop, no programming
o Rich set of modules
o Fits beginners and advanced users
o Unlimited extensibility (R Script, Python Script)
o Enterprise grade cloud service (SLA 99.95%)
o ML REST web services consumption
o Jupyter Notebook
o Azure AI Gallery (9000+ samples)
• At what price?
o Free plan available (10GB storage, 2 web services, 1000 requests/month)
o $10 seat/month + $1 experiment/hour
Azure Data Science VM
• Pre-configured cloud environment for AI & Data Science
• Highlights
o Fully operational environment
o 50+ tools for development, ML, big data and data management
o Windows and Linux (Ubuntu/CentOS)
o Updated every few months
o On-demand elastic capacity
o GPU-optimized VMs for deep learning
o Up to 4 GPUs (NVIDIA K80 or V100)
o Up to 128 vCPUs and up to 6’144 GiB RAM
• At what price?
o From $11.76/month to $14’314/month
Azure ML Service (preview)
• Cloud-based environment to develop, train, test, deploy, manage, and track ML models
• Highlights
o Model management
o Distributed deep learning
o Version control and reproducibility
o Hybrid deployment (local, cloud, edge)
o Automated ML (data prep, algorithm, parameters)
o Latest open source technologies (TensorFlow, PyTorch, Jupyter, Docker)
o Scale up or out with large GPU-enabled clusters in the cloud
• At what price?
o From $23.51/month to $29’143.94/month
The purpose of ML modelling is:
• Generate predictions
• Understand true relations
Machine Learning Challenges
• Asking the right questions
• Typically 1 Model = 1 Question
• Requires training data
o Real-world data is messy (wrong or missing data)
o Feature engineering transforms raw data into predictive features
o Feature extraction (e.g. IP address -> population density)
o Feature selection for informative features
• Overfitting model
o “Kicks ass” during training,
o but fails badly on real predictions
• Model validation
o “Sense” how well model works on new data
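The overfitting and validation points above can be illustrated with a small sketch. It is not from the deck: the toy data set (one feature, 20% flipped labels) and the 1-nearest-neighbour model are assumptions chosen because such a model memorises its training data, so it scores perfectly on data it has seen and noticeably worse on new data.

```python
import random

def nearest_label(train, x):
    """1-nearest-neighbour prediction: copy the label of the closest training point."""
    return min(train, key=lambda row: abs(row[0] - x))[1]

def accuracy(train, rows):
    """Share of rows whose label the model (built from `train`) predicts correctly."""
    hits = sum(1 for x, y in rows if nearest_label(train, x) == y)
    return hits / len(rows)

def make_rows(n, rng):
    """Toy data (assumed): label is 1 when x > 0.5, with 20% of labels flipped as noise."""
    rows = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.2:
            y = 1 - y
        rows.append((x, y))
    return rows

rng = random.Random(0)  # fixed seed so the run is repeatable
train_rows = make_rows(200, rng)
test_rows = make_rows(200, rng)

train_acc = accuracy(train_rows, train_rows)  # the model has memorised these points
test_acc = accuracy(train_rows, test_rows)    # held-out data reveals the memorised noise
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Comparing the two numbers is the simplest form of model validation: only the held-out score says anything about new data.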
Users’ expectations:
• Engaging experience
• Effortless interaction
• High performance
• Relevant content
Businesses aim:
• Provide high value
• Faster and at low cost
o Data science talent
o Powerful infrastructure
o Continuous improvement
The developer role is to
bridge the gap:
Artificial Intelligence as a Service (AIaaS)
Def: Artificial intelligence off the shelf
• Bots and NLP – commands and guidance
• Cognitive APIs – speech, vision, translation, knowledge
• ML frameworks – build own model w/o infrastructure (e.g. Azure ML Service)
• Fully managed ML – templates, deployment, drag-drop (e.g. Azure ML Studio)
• Innovation w/o upfront costs and expertise
• Usability – easy learning curve
• Scalability – start PoC, grow big
• Flexi cost – know what you pay for
• Share data with vendors
• Data regulations (e.g. GDPR)
• Reduced transparency
• Breaking changes
The AIaaS market is expected to grow from $1.5 billion (2018) to $10.9 billion (2023)
(ResearchAndMarkets, April 2018)
1 year ago this was not as
achievable as it is now.
Some Key Azure AIaaS
Computer Vision
• Advanced algorithms for processing images for information
Face API
• Detect and analyze facial attributes
Custom Vision API
• Build, deploy, improve custom image classifiers (on tags)
LUIS.ai
• Apply custom ML intelligence
to conversational natural language
Custom Decision (experimental)
• Learn behavioural patterns of users
The Data Scientist Job
• Appealing
o 64% believe they are working in this century’s “sexiest” job
• In demand
o 90% are contacted at least once a month with a job offer
o 50% weekly, 30% several times a week; 35% have less than 2 years of experience
• The dark side…
o All models are wrong, some are useful
o 80% of the time is data preparation
o Real-life, not academic, problems
o Non-linear hypothesis testing
o No full automation
• No one cares how you do it
Automated ML (AML)
AML is a recommender system for ML pipelines that reaches good accuracy in less time
• Problem: Complexity scales faster than the time available
• Highlights
o Designed to not look at customer data
o Only each pipeline result is sent to automated ML service
o Data pre-processing, algorithm experimentation, hyperparameter tuning
• How it Works
o Select task type: classification (11 algorithms), regression (9), forecasting (9)
o Specify labeled data source and format (Numpy array, Pandas dataframe)
o Configure target for training (local, remote VM, AML Compute)
o Set AML configuration
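    from azureml.train.automl import AutoMLConfig

    # F_Train / F_Label: feature matrix and label vector prepared beforehand
    automl_classifier = AutoMLConfig(
        task='classification',
        primary_metric='AUC_weighted',
        max_time_sec=12000,
        iterations=50,
        X=F_Train,
        y=F_Label,
        n_cross_validations=2)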
https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train
Iterative ML Process
Data Understanding (Titanic Dataset)
• Mosaic plot
o Categorical distribution
o Visualizes the relation between X and Y
o Strong relation = Y-splits are far apart
o Conclusion: Women have higher survival rate
• Box plot
o Continuous distribution of a numeric variable
o IQR = middle 50%
o Outliers fall outside [Q1 - 1.5·IQR; Q3 + 1.5·IQR]
o Conclusion: Passengers with high fares have a higher survival rate
• Scatter plot
o How much one variable determines another
o Conclusion: Infants and men aged 25-45 have a higher survival rate
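As a rough illustration of the box-plot rule, the sketch below computes the quartiles, the IQR and the 1.5·IQR fences with the standard library. The fare values are made up for the example and are not taken from the Titanic dataset; the function name is hypothetical.

```python
import statistics

def iqr_outliers(values):
    """Return (low_fence, high_fence, outliers) using the 1.5 * IQR rule."""
    q1, _median, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1                                         # spread of the middle 50%
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return low, high, [v for v in values if v < low or v > high]

# Made-up ticket fares (assumption, for illustration only)
fares = [7.25, 7.9, 8.05, 13.0, 14.5, 26.0, 27.7, 31.0, 52.0, 512.3]
low, high, outliers = iqr_outliers(fares)
print(f"fences: [{low:.2f}; {high:.2f}], outliers: {outliers}")
```

Note that different quartile conventions give slightly different fences; `statistics.quantiles` uses its default "exclusive" method here.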
Data Preprocessing
• Make features usable
o Numerical
o Categorical (e.g. weekday)
o PCA dimensionality reduction
o Dummy variables
• Handle missing data
• Normalize data
o Standard range of numerical scale (e.g. from [-1000;1000] -> [0;1] or [-1;1])
o The value range influences the importance of a feature compared to the others
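Two of the preprocessing steps, normalization to a standard range and dummy variables for a categorical feature, can be sketched in a few lines. The helper names and sample values are illustrative assumptions; a real project would more likely use a library transformer.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant column: nothing to scale
        return [new_min for _ in values]
    span = new_max - new_min
    return [new_min + (v - lo) * span / (hi - lo) for v in values]

def one_hot(values):
    """Dummy variables: one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]

raw = [-1000, -500, 0, 250, 1000]
scaled_01 = min_max_scale(raw)               # range [0; 1]
scaled_pm1 = min_max_scale(raw, -1.0, 1.0)   # range [-1; 1]

categories, dummies = one_hot(["Mon", "Tue", "Mon"])
print(scaled_01, scaled_pm1, categories, dummies)
```

After scaling, a feature measured in thousands no longer dominates one measured in fractions simply because of its units.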
Feature Engineering, Feature Extraction
Increase predictive power by creating features from raw data
• Features closely related to the target (predicting default -> debt/balance ratio)
• Easier interpretation (date to year/month/day/hour)
• Lag features to “look back” before the date (1, 2, … N days ago)
• Categorical features - identify discrete features
• Rolling aggregates - smoothing over a time window
• See the Azure Team Data Science Process:
https://docs.microsoft.com/en-gb/azure/machine-learning/team-data-science-process/create-features
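A minimal sketch of three of the ideas above: splitting a date into parts, lag features and a rolling aggregate. The series `daily_sales` and the helper names are hypothetical; with pandas the same result would normally come from `shift` and `rolling`.

```python
from datetime import datetime

def lag(series, n):
    """Value observed n steps earlier; None where there is no history yet."""
    return ([None] * n + list(series))[:len(series)]

def rolling_mean(series, window):
    """Mean over the last `window` observations; None until the window is full."""
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = series[i + 1 - window:i + 1]
            out.append(sum(chunk) / window)
    return out

# Date decomposition: one timestamp becomes several easier-to-interpret features
ts = datetime(2018, 12, 15, 9, 30)
date_parts = {"year": ts.year, "month": ts.month, "day": ts.day, "hour": ts.hour}

daily_sales = [10, 12, 9, 15, 20, 18]   # made-up values
lag_1 = lag(daily_sales, 1)             # yesterday's value
lag_2 = lag(daily_sales, 2)             # value two days ago
roll_3 = rolling_mean(daily_sales, 3)   # smoothed over a 3-day window
print(date_parts, lag_1, lag_2, roll_3)
```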
Digital Media Feature Engineering
Note: All information is encoded in the digital media
• Images
o Step 1: Colour statistics, EXIF metadata, edges, shapes
o Step 2: Extract knowledge into a fixed set of numeric characteristics
• Text
o Step 1:
• Bag of words, N-grams, term frequency, topic modelling, stemming
• Named entity recognition (e.g. Wikipedia)
o Step 2: Extract knowledge into a fixed set of numeric characteristics
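For text, "a fixed set of numeric characteristics" can be as simple as term counts over a shared vocabulary. The sketch below is an assumption-laden illustration (whitespace tokenisation, no stemming, two made-up sentences) of how N-grams and term frequency turn documents of different lengths into vectors of equal length.

```python
from collections import Counter

def tokenize(text):
    """Naive tokeniser (assumption): lower-case and split on whitespace."""
    return text.lower().split()

def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined into one term."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def vectorize(docs, n=1):
    """Map each document to a fixed-length vector of term counts."""
    counts = [Counter(ngrams(tokenize(d), n)) for d in docs]
    vocab = sorted(set().union(*counts))          # shared, ordered vocabulary
    return vocab, [[c[term] for term in vocab] for c in counts]

docs = ["the ship sank", "the ship sailed the sea"]
vocab, vectors = vectorize(docs)            # unigram term frequencies
bi_vocab, bi_vectors = vectorize(docs, 2)   # bigram term frequencies
print(vocab, vectors)
```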
Feature Selection - select the
most predictive features
For many ML problems, having
a lot of data is a good thing;
but it can sometimes be a curse
Selecting Good Features
• Motivation
o Not only prediction but identification of predictive features
o Computational costs are related to number of features
o Limit external sensors and data sources
• Approach
o Trying all combinations of features? (That would be infeasible.)
• Methods
o Forward selection & Backward elimination
o Filter - Independent from the ML algorithm
o Embedded – Built-in search for predictive features in ML algorithm
o Wrapper – Measure feature usefulness while ML training
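As one possible illustration of the filter approach, the sketch below ranks features by the absolute Pearson correlation with the label, without training any model. The feature names and values are invented for the example, and correlation is only one of several scores a filter could use.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 when either side has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_features(columns, target):
    """Filter method: score every feature independently of any ML algorithm."""
    scores = {name: abs(pearson(col, target)) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up columns (assumption): one informative, one noisy, one constant
columns = {
    "fare": [7, 8, 13, 26, 52, 80],
    "cabin_number": [3, 9, 1, 7, 2, 8],
    "constant": [1, 1, 1, 1, 1, 1],
}
survived = [0, 0, 0, 1, 1, 1]
ranking = rank_features(columns, survived)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

A wrapper method would instead retrain the model for each candidate feature set, which is more accurate but far more expensive.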
Tuning Model Parameters
• Model parameters control inner behaviour
o The more sophisticated the algorithm, the more parameters
o e.g. Locally Deep SVM with kernel
o Kernel type, kernel coefficient
• How does parameter tuning work?
1. Choose metric for evaluation (AUC for classification, R² for regression, etc.)
2. Select parameters for optimization
3. Define a grid as Cartesian product between arrays
4. For each combination, cross-validate on training set
5. Select the parameters for the best evaluation
Note: Expected improvement is 3%-8%
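The five tuning steps can be sketched as a plain grid search. Everything specific here is an assumption made for illustration: the toy data, the small k-nearest-neighbour classifier, its two parameters, and the use of accuracy rather than AUC as the metric to keep the code short.

```python
import itertools
import random

def knn_predict(train, x, k, weighted):
    """k-nearest-neighbour vote on 1-D points; optionally weight votes by 1/distance."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = {}
    for xi, yi in nearest:
        w = 1.0 / (abs(xi - x) + 1e-9) if weighted else 1.0
        votes[yi] = votes.get(yi, 0.0) + w
    return max(votes, key=votes.get)

def cross_val_accuracy(rows, k, weighted, folds=5):
    """Step 1 (metric) and step 4: mean accuracy over `folds` train/validation splits."""
    scores = []
    for f in range(folds):
        valid = rows[f::folds]                                    # every folds-th row
        train = [r for i, r in enumerate(rows) if i % folds != f]
        hits = sum(knn_predict(train, x, k, weighted) == y for x, y in valid)
        scores.append(hits / len(valid))
    return sum(scores) / folds

# Toy training set (assumed): label is 1 when x > 0.5, with 15% label noise
rng = random.Random(1)
rows = []
for _ in range(150):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.15:
        y = 1 - y
    rows.append((x, y))

grid = {"k": [1, 5, 15], "weighted": [False, True]}              # step 2: parameters
combos = list(itertools.product(grid["k"], grid["weighted"]))    # step 3: Cartesian product
results = {c: cross_val_accuracy(rows, *c) for c in combos}      # step 4: cross-validate each
best_params = max(results, key=results.get)                      # step 5: best evaluation
print(best_params, round(results[best_params], 3))
```

The grid grows multiplicatively with every added parameter, which is why only a few parameters are usually tuned at once.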
Appropriate Algorithms are
Determined by Data
Types of Algorithms
• Linear Algorithms
• Classification - classes separated by straight line
• Support Vector Machine – wide gap from line
• Regression – linear relation variables-label
• Non-Linear Algorithms
• Decision Trees and Jungles - divide space into regions
• Neural Networks – complex and irregular boundaries
• Special Algorithms
• Ordinal Regression – ranked values (e.g. finishing position in a race)
• Poisson Regression - discrete distribution (e.g. number of events)
• Bayesian – normal distribution of errors (bell curve)
False Alarms
False alarms have serious impact
• Degraded confidence in the system
• Loss of revenue
• Loss of brand image
Performance Metrics
• Regression model
o Root Mean Squared Error (RMSE)
o Coefficient of Determination, R² ∈ [0;1]
• Multi-class classification model
o Confusion matrix
• Binary classification model
o Accuracy based on correct answers
o Area under ROC curve (AUC)
o Threshold
o Precision = TP / (TP + FP)
o Recall = TP / (TP + FN)
o Cost-Balanced (F1)
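The binary-classification formulas above, plus RMSE for regression, can be checked by hand on a small example. The label vectors below are made up; F1 is computed as the harmonic mean of precision and recall.

```python
import math

def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and the metrics derived from them (positive class = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision, "recall": recall, "f1": f1,
    }

def rmse(y_true, y_pred):
    """Root mean squared error for a regression model."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # made-up labels
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]   # made-up predictions
metrics = binary_metrics(y_true, y_pred)
error = rmse([3.0, 5.0], [1.0, 5.0])
print(metrics, error)
```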
Handling Imbalanced Data
• Imbalanced: far more examples of one class than of the others (the minority can be as rare as 0.001%)
• Errors are not the same
o Prediction of minority class (failures) is more important
o Asymmetric cost (false negative can cost more than false positive)
• Compromised performance of standard ML algorithms
o With a 1% minority class, 99% accuracy does not mean the model is useful
o PR-curve is better for imbalanced data
• Oversampling
o SMOTE – allows better learning
o Generate examples combining features of target with features of neighbours
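The oversampling idea can be sketched as follows. This is a simplified, SMOTE-like illustration and not the reference algorithm or any library's implementation: each synthetic example is placed at a random position on the line between a minority sample and one of its nearest minority neighbours. The function name, parameters and sample points are assumptions.

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by interpolating towards near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding the point itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic

# Made-up minority class (e.g. failures), two features per example
failures = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.2), (1.1, 2.1)]
new_points = smote_like(failures, n_new=6)
balanced_minority = failures + new_points
print(len(balanced_minority), new_points[0])
```

Because every synthetic point lies between two real minority examples, the class is enlarged without simply duplicating rows.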
Takeaways
• Team Data Science Process
o https://azure.microsoft.com/en-gb/documentation/learning-paths/data-science-process/
• ML in the Microsoft World
o https://docs.microsoft.com/en-us/azure/machine-learning/
• Python for AI
o https://wiki.python.org/moin/PythonForArtificialIntelligence
• Data Science Blog
o https://data-flair.training/blogs/category/machine-learning/
• Starter Books
o Free e-books download link:
https://www.manning.com/books/exploring-data-science
Azure ML Studio
Azure ML Workbench

The Data Science Process - Do we need it and how to apply?

  • 1. DECEMBER 15 GLOBAL AI BOOTCAMP IS POWERED BY: The Data Science Process in ML How to Apply It and When do We Need It?
  • 2. Thanks to our Sponsors: Global Sponsor: Venue Sponsor:
  • 3. About me • Software Architect @ o 16+ years professional experience • Microsoft Azure MVP • External Expert Horizon 2020 • External Expert Eurostars-Eureka, InnoFund Denmark • Business Interests o Web Development, SOA, Integration o IoT, Machine Learning, Computer Intelligence o Security & Performance Optimization • Contact ivelin.andreev@icb.bg www.linkedin.com/in/ivelin www.slideshare.net/ivoandreev
  • 4. AGENDA Major Tools The Purpose of ML AI as a Service Iterative ML Process Takeways Demo
  • 5. Machine Learning and Microsoft • Azure ML integrated, end-to-end data science and advanced analytics • Microsoft ML related services/tools • Highlights o Built on open source technologies (Jupyter Notebook, Spark, Python, Docker) o Execute experiments in isolated environments and GPU-enabled VMs DEPRECATED MAINTAINED AND IMPROVED • (Azure ML Workbench) • Azure ML Studio • Visual Studio Code Tools for AI • (Azure ML Experimentation Service) • Data Science VM • Microsoft Cognitive Services, LUIS.ai • (Azure ML Model Management Service) • Azure Databricks • Libraries for Apache Spark (MMLSpark) Now called • Cognitive Toolkit (CNTK) • ML Services for SQL Server (R, Python) “Machine Learning Service” (preview) • Azure Batch AI Training
  • 6. Azure ML Workbench Desktop application (Windows, macOS) with • Built-in Jupyter Notebook services and Git integration • End-to-end process support o Model development and experimentation (Python) o Powerful inspectors for data analysis o Data transformations by example o Model history and deployment • Easy to use and resource hungry  * Replaced in Sept 24 2018 release to make way for an improved architecture (ref. to Azure ML SDK for Python or Azure Databricks for big datasets)
  • 7. Azure ML Studio • Visual workspace to build, test and deploy ML solutions • Highlights o Cross-browser drag and drop, no programming o Rich set of modules o Fits beginners and advanced users o Unlimited extensibility (R Script, Python Script) o Enterprise grade cloud service (SLA 99.95%) o ML REST web services consumption o Jupyter Notebook o Azure AI Gallery (9000+ samples) • At what price? o Free plan available (10GB storage, 2 web services, 1000 requests/month) o $10 per seat/month + $1 per experiment hour
  • 8. Azure Data Science VM • Pre-configured cloud environment for AI & Data Science • Highlights o Fully operational environment o 50+ tools for development, ML, big data and data management o Windows and Linux (Ubuntu/CentOS) o Updated every few months o On-demand elastic capacity o GPU optimized VMs for deep learning o Up to 4x NVIDIA K80 or V100 GPUs o Up to 128 vCPU, up to 6,144 GiB RAM • At what price? o From $11.76/month to $14,314/month
  • 9. Azure ML Service (preview) • Cloud-based environment to develop, train, test, deploy, manage, and track ML models • Highlights • Model management • Distributed deep learning • Version control and reproducibility • Hybrid deployment (Local, Cloud, Edge) • Automated ML (data prep, algorithm, parameters) • Latest open source technologies (TensorFlow, PyTorch, Jupyter, Docker) • Scale up or out with large GPU-enabled clusters in the cloud • At what price? • From $23.51/month to $29,143.94/month
  • 10. The purpose of ML modelling is: • Generate predictions • Understand true relations
  • 11. Machine Learning Challenges • Asking the right questions • Typically 1 Model = 1 Question • Requires training data o Real-world data is messy (wrong or missing data) o Feature engineering transforms raw data into predictive features o Feature extraction (e.g. IP address -> population density) o Feature selection keeps the informative features • Overfitting model o “Kicks ass” while training, o fails badly on real predictions • Model validation o “Sense” how well the model works on new data
  • 12. Users’ expectations: • Engaging experience • Effortless interaction • High performance • Relevant content Businesses aim: • Provide high value • Faster and at low cost o Data science talent o Powerful infrastructure o Continuous improvement The developer role is to bridge the gap:
  • 13. Artificial Intelligence as a Service (AIaaS) Def: Artificial intelligence off the shelf • Bots and NLP – commands and guidance • Cognitive APIs – speech, vision, translation, knowledge • ML frameworks – build own model w/o infrastructure (e.g. Azure ML Service) • Fully managed ML – templates, deployment, drag-drop (e.g. Azure ML Studio) Pros: • Innovation w/o upfront costs and expertise • Usability – easy learning curve • Scalability – start with a PoC, grow big • Flexible cost – know what you pay for Cons: • Share data with vendors • Data regulations (e.g. GDPR) • Reduced transparency • Breaking changes
  • 14. AIaaS market expected to grow from $1.5 billion (2018) to $10.9 billion (2023) (ResearchAndMarkets, April 2018). One year ago this was not as achievable as it is now.
  • 15. Some Key Azure AIaaS Computer Vision • Advanced algorithms for processing images for information Face API • Detect and analyze facial attributes Custom Vision API • Build, deploy, improve custom image classifiers (on tags) LUIS.ai • Apply custom ML intelligence to conversational natural language Custom Decision (experimental) • Learn behavioural patterns of users
  • 16. The Data Scientist Job • Appealing o 64% believe they are working in this century’s “sexiest” job • In demand o 90% are contacted with a job offer at least once a month o 50% weekly, 30% several times a week; 35% have <2 years of experience • The dark side… o All models are wrong, some are useful o 80% of the time is data preparation o Real life, not academic problems o Non-linear hypothesis testing o No full automation • No one cares how you do it
  • 17. Automated ML (AML) AML is a recommender system for ML pipelines that reaches good accuracy in less time • Problem: Complexity scales faster than the time available • Highlights o Designed to not look at customer data o Only the result of each pipeline is sent to the automated ML service o Data pre-processing, algorithm experimentation, hyperparameter tuning • How it Works o Select algorithm: classification(11), regression(9), forecasting(9) o Specify labeled data source and format (Numpy array, Pandas dataframe) o Configure target for training (local, remote VM, AML Compute) o Set AML configuration: automl_classifier = AutoMLConfig(task='classification', primary_metric='AUC_weighted', max_time_sec=12000, iterations=50, X=F_Train, y=F_Label, n_cross_validations=2) https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train
  • 19. Data Understanding (Titanic Dataset) • Mosaic plot o Categorical distribution o Visualizes the relation between X and Y o Strong relation = Y-splits are far apart o Conclusion: Women have a higher survival rate • Box plot o Continuous distribution of a numeric variable o IQR = middle 50% o Outliers are values outside [Q1 - 1.5 IQR; Q3 + 1.5 IQR] o Conclusion: High fares have a higher survival rate • Scatter plot o How much one variable determines another o Conclusion: Infants and men aged 25-45 have a higher survival rate
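The box-plot outlier rule from the slide above can be sketched in a few lines of standard-library Python. This is a minimal illustration only: the fare values are hypothetical (not taken from the Titanic dataset), the helper names are made up for this example, and the "inclusive" quartile convention is an assumption, since plotting tools differ in how they compute Q1 and Q3.

```python
import statistics


def iqr_fences(values):
    """Return the (lower, upper) box-plot fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    # method="inclusive" is one of several quartile conventions;
    # other conventions give slightly different fences.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values):
    """Return the values that lie outside the fences."""
    lower, upper = iqr_fences(values)
    return [v for v in values if v < lower or v > upper]


# Hypothetical fares for illustration only.
fares = [7.25, 8.05, 10.5, 13.0, 26.0, 30.0, 35.5, 512.33]
print(find_outliers(fares))
```

Whether a flagged value is an error or a genuine extreme (a very expensive ticket, for instance) still has to be judged against the domain.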
  • 20. Data Preprocessing • Make features usable o Numerical o Categorical (e.g. week day) o PCA dimensionality reduction o Dummy variables • Handle missing data • Normalize data o Standard range of numerical scale (e.g. from [-1000;1000] -> [0;1] or [-1;1]) o The value range influences the importance of a feature compared to the others
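As a rough sketch of two of the preprocessing steps named on this slide, the snippet below rescales a numeric feature to a target range (min-max normalization) and turns a categorical feature into dummy variables. The function names and sample values are assumptions made for illustration; in practice a library such as pandas or scikit-learn would normally do this.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to new_min.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


def dummy_variables(values):
    """One-hot encode a categorical feature.

    Returns the sorted category names and one 0/1 row per input value.
    """
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


print(min_max_scale([-1000, 0, 1000]))            # rescaled to [0; 1]
print(min_max_scale([-1000, 0, 1000], -1.0, 1.0))  # rescaled to [-1; 1]
print(dummy_variables(["Mon", "Tue", "Mon"]))
```

The minimum and maximum should be learned from the training set only and then reused on new data, otherwise information from the test set leaks into the model.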
  • 21. Feature Engineering, Feature Extraction Increase predictive power by creating features from raw data • Features closely related to the target (predict default –> debt / balance ratio) • Easier interpretation (Date to Year/Month/Day/Hour) • Lag features to “look back” before the date (1, 2,… N days ago) • Categorical features - identify discrete features • Rolling aggregates - smoothing over a time window • Check the Azure Team Data Science Process https://docs.microsoft.com/en-gb/azure/machine-learning/team-data-science-process/create-features
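Lag features and rolling aggregates, as listed on this slide, can be illustrated with plain Python lists. This is only a sketch under assumptions: the series is a hypothetical daily measurement, the helper names are invented for the example, and the rolling window here includes the current observation (a choice that would leak the target if the series being smoothed is the label itself).

```python
def lag_feature(series, lag):
    """Value observed `lag` steps earlier; None where no history exists yet."""
    if lag <= 0:
        raise ValueError("lag must be a positive integer")
    return [None if i < lag else series[i - lag] for i in range(len(series))]


def rolling_mean(series, window):
    """Mean over the last `window` observations, current one included."""
    result = []
    for i in range(len(series)):
        if i + 1 < window:
            result.append(None)  # not enough history for a full window
        else:
            chunk = series[i + 1 - window : i + 1]
            result.append(sum(chunk) / window)
    return result


# Hypothetical daily readings.
daily = [10, 20, 30, 40]
print(lag_feature(daily, 1))
print(rolling_mean(daily, 2))
```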
  • 22. Digital Media Feature Engineering Note: All information is encoded in the digital media • Images o Step 1: Colour statistics, EXIF metadata, edges, shapes o Step 2: Extract knowledge into a fixed set of numeric characteristics • Text o Step 1: • Bagging, N-grams, term frequency, topic modelling, stemming • Named entity recognition (e.g. Wikipedia) o Step 2: Extract knowledge into a fixed set of numeric characteristics
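For text, "a fixed set of numeric characteristics" can be as simple as term frequencies over a fixed vocabulary. The sketch below shows N-gram extraction and a term-frequency vector; the whitespace tokenizer, the vocabulary, and the function names are simplifying assumptions for illustration, not the method used in the talk.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-token sequences, in order of appearance."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def term_frequency_vector(text, vocabulary):
    """Relative frequency of each vocabulary term in the text.

    The output always has len(vocabulary) entries, so texts of any
    length map to a fixed-size numeric feature vector.
    """
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    counts = Counter(tokens)
    total = len(tokens) or 1       # avoid division by zero on empty text
    return [counts[term] / total for term in vocabulary]


tokens = "the cat saw the dog".split()
print(ngrams(tokens, 2))
print(term_frequency_vector("The cat saw the dog", ["the", "cat", "bird"]))
```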
  • 23. Feature Selection - select the most predictive features For many ML problems, having a lot of data is a good thing; but it can sometimes be a curse
  • 24. Selecting Good Features • Motivation o Not only prediction but identification of predictive features o Computational costs are related to the number of features o Limit external sensors and data sources • Approach o Trying all combinations of features? (That would be infeasible) • Methods o Forward selection & Backward elimination o Filter – Independent of the ML algorithm o Embedded – Built-in search for predictive features in the ML algorithm o Wrapper – Measure feature usefulness while training the ML model
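Forward selection avoids trying all feature combinations by greedily adding one feature at a time. The sketch below assumes a caller-supplied `score` function that trains and evaluates a model on a given feature subset (higher is better); the toy scoring function and the feature names are hypothetical and exist only to make the example runnable.

```python
def forward_selection(candidates, score, max_features=None):
    """Greedy forward selection.

    Starting from no features, repeatedly add the candidate that improves
    `score(feature_list)` the most; stop when nothing improves the score.
    """
    selected = []
    best_score = score(selected)
    remaining = list(candidates)
    while remaining and (max_features is None or len(selected) < max_features):
        trials = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(trials, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no remaining candidate helps
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


# Hypothetical stand-in for "train a model and return validation accuracy (%)".
GAIN = {"age": 10, "fare": 5, "ticket_id": -2}

def toy_score(features):
    return 50 + sum(GAIN[f] for f in features)


print(forward_selection(["ticket_id", "fare", "age"], toy_score))
```

With n candidates this evaluates on the order of n² subsets instead of 2ⁿ, at the price of possibly missing features that are only useful in combination.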
  • 25. Tuning Model Parameters • Model parameters control inner behaviour o The more sophisticated the algorithm, the more parameters o e.g. Locally Deep SVM with kernel: kernel type, kernel coefficient • How does parameter tuning work? 1. Choose a metric for evaluation (AUC for classification, R² for regression, etc.) 2. Select parameters for optimization 3. Define a grid as the Cartesian product of the value arrays 4. For each combination, cross-validate on the training set 5. Select the parameters with the best evaluation Note: Expected improvement is 3%-8%
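The five tuning steps above map directly onto a small grid-search loop. The following is a sketch under assumptions: `fit_and_score` is a hypothetical callback that trains on the training indices and returns the chosen metric on the validation indices, the parameter names are illustrative, and the folds are simple contiguous blocks (real data is usually shuffled or stratified first).

```python
from itertools import product
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, valid
        start += size


def grid_search(param_grid, fit_and_score, n_samples, k=3):
    """Cross-validate every point of the parameter grid and keep the best."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    # Step 3: the grid is the Cartesian product of the value arrays.
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        # Step 4: cross-validate this combination on the training set.
        scores = [fit_and_score(params, train, valid)
                  for train, valid in k_fold_indices(n_samples, k)]
        avg = mean(scores)
        # Step 5: remember the combination with the best average metric.
        if avg > best_score:
            best_params, best_score = params, avg
    return best_params, best_score


# Hypothetical scorer standing in for real model training.
def toy_fit_and_score(params, train_idx, valid_idx):
    return 0.9 if params == {"kernel": "rbf", "coef": 1} else 0.8


grid = {"kernel": ["linear", "rbf"], "coef": [0.1, 1, 10]}
print(grid_search(grid, toy_fit_and_score, n_samples=30))
```

The grid grows multiplicatively with every added parameter (here 2 × 3 = 6 combinations, each trained k times), which is why only a few parameters are usually tuned at once.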
  • 27. Types of Algorithms • Linear Algorithms • Classification - classes separated by a straight line • Support Vector Machine – wide gap from the line • Regression – linear relation between variables and label • Non-Linear Algorithms • Decision Trees and Jungles - divide space into regions • Neural Networks – complex and irregular boundaries • Special Algorithms • Ordinal Regression – ranked values (e.g. race) • Poisson Regression - discrete distribution (e.g. nr. of events) • Bayesian – normal distribution of errors (bell curve)
  • 28. False Alarms False alarms have serious impact • Degraded confidence in the system • Loss of revenue • Loss of brand image
  • 29. Performance Metrics • Regression model o Root Mean Squared Error (RMSE) o Coefficient of Determination, R² ∈ [0;1] • Multi-class classification model o Confusion matrix • Binary classification model o Accuracy based on correct answers o Area under ROC curve (AUC) o Threshold o Precision = TP / (TP + FP) o Recall = TP / (TP + FN) o F1 score (balances precision and recall)
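The metrics on this slide are short formulas, so a small sketch may help make them concrete. The function names and the counts used in the example are illustrative assumptions; F1 is computed as the harmonic mean of precision and recall, and R² uses the usual 1 − SS_res / SS_tot definition.

```python
import math


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical counts: 8 hits, 2 false alarms, 8 misses.
print(precision_recall_f1(tp=8, fp=2, fn=8))
print(rmse([1, 2, 3], [1, 2, 5]), r_squared([1, 2, 3], [1, 2, 5]))
```

Computed this way, R² can fall below 0 on new data when the model predicts worse than simply using the mean of the labels.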
  • 30. Handling Imbalanced Data • Imbalanced: far more examples of one class than of the others (e.g. a 0.001% minority class) • Errors are not the same o Prediction of the minority class (failures) is more important o Asymmetric cost (a false negative can cost more than a false positive) • Compromised performance of standard ML algorithms o For a 1% minority class, an accuracy of 99% does not mean a useful model o PR-curve is better for imbalanced data • Oversampling o SMOTE – allows better learning o Generate examples combining features of the target with features of its neighbours
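The idea behind SMOTE, generating a synthetic minority example somewhere between a real example and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration, not the reference SMOTE implementation: the function name, the Euclidean distance, the default `k`, and the sample points are all assumptions.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    example and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    rng = random.Random(seed)  # seeded for reproducibility
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class examples with two numeric features.
failures = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
print(smote_like(failures, n_new=4))
```

Oversampling should be applied to the training folds only; synthetic points in the validation data would make the evaluation look better than it is.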
  • 31. Takeaways • Team Data Science Process o https://azure.microsoft.com/en-gb/documentation/learning-paths/data-science-process/ • ML in the Microsoft World o https://docs.microsoft.com/en-us/azure/machine-learning/ • Python for AI o https://wiki.python.org/moin/PythonForArtificialIntelligence • Data Science Blog o https://data-flair.training/blogs/category/machine-learning/ • Starter Books o Free e-books download link: https://www.manning.com/books/exploring-data-science
  • 32. Azure ML Studio, Azure ML Workbench