SlideShare ist ein Scribd-Unternehmen logo
1 von 28
Machine Learning
Presented By
Deepak Sharma
1
Topics Covered
▶ K Means
▶ Linear Regression
▶ Logistic Regression
▶ Decision Tree
▶ Random Forest
2
Machine Learning
Machine learning is an application of artificial intelligence
(AI) that provides systems the ability to automatically learn
and improve from experience without being explicitly
programmed.
Unsupervised learning is a type of machine learning
algorithm used to draw inferences from datasets consisting
of input data without labelled responses.
Supervised learning is inferring from labelled training data.
Unsupervised and Supervised Learning
3
4
K Means
▶ k-means is one of the simplest unsupervised learning algorithms that solve the
well known clustering problem.
▶ The algorithm classify a given data set through a certain number of clusters
(assume k clusters) fixed apriori.
▶ Algorithmic steps for k-means clustering
▶ Let X = {x1,x2,x3,……..,xn} be the set of data points and V = {v1,v2,…….,vc} be the set
of centers.
▶ 1) Randomly select ‘c’ cluster centers.
▶ 2) Calculate the eucledian distance between each data point and cluster centers.
▶ 3) Assign the data point to the cluster center whose distance from the cluster center is
minimum of all the cluster centers..
▶ 4) Recalculate the new cluster center
▶ 5) Recalculate the distance between each data point and new obtained cluster centers.
▶ 6) If no data point was reassigned then stop, otherwise repeat from step 3).
5
Optimal Value of K - Elbow Method
6
Advantage of K Means
▶ 1) Fast, robust and easier to understand and implement.
▶ 2) Gives best result when data set are distinct or well separated from each other.
▶ 1) The learning algorithm requires apriori specification of the number of cluster
centers.
▶ 2) The use of Exclusive Assignment - If there are two highly overlapping data then
k-means will not be able to resolve that there are two clusters.
▶ 3) Euclidean distance measures can unequally weight underlying factors.
▶ 4) The learning algorithm provides the local optima of the squared error function.
Disadvantage of K Means
7
Business Application
1) Document Classification - Cluster documents in multiple categories based on tags, topics,
and the content of the document.
2) Customer Segmentation - Clustering helps marketers improve their customer base, work
on target areas, and segment customers based on purchase history, interests, or activity
monitoring
3) Insurance Fraud Detection - Utilizing past historical data on fraudulent claims, it is
possible to isolate new claims based on its proximity to clusters that indicate fraudulent
patterns.
4) Cyber Profile Criminals -The idea of cyber profiling is derived from criminal profiles,
which provide information on the investigation division to classify the types of criminals
who were at the crime scene.
8
Linear Regression
▶ Linear regression is the simplest and most widely used statistical technique for
predictive modeling.
▶ It basically gives us an equation, where we have our features as independent
variables, on which our target variable is dependent upon.
▶ Linear regression equation looks like
▶ Here, we have Y as our dependent variable (Sales), X’s are the independent
variables and all thetas are the coefficients.
▶ Coefficients are basically the weights assigned to the features, based on their
importance.
▶ Residuals - The difference between the values predicted by us and the observed
values
9
Linear Regression
10
Explained & Unexplained Variation
11
Key Assumptions of Regression
The regression has five key assumptions:
▶ Linear relationship
▶ Multivariate normality
▶ No or little multicollinearity
▶ No auto-correlation
▶ Homoscedasticity
12
Goodness of Fit
▶ R Square - This statistic indicates the percentage of the
variance in the dependent variable that the independent
variables explain collectively.
▶ R-squared measures the strength of the relationship
between your model and the dependent variable on a
convenient 0 – 100% scale.
▶ Adjusted R Square - The adjusted R-squared is a modified
version of R-squared that has been adjusted for the
number of predictors in the model.
▶ The adjusted R-squared increases only if the new term
improves the model more than would be expected by
chance.
▶ It decreases when a predictor improves the model by less
than expected by chance.
13
Business Application
▶ A factory manager can create a statistical model to understand the impact of
oven temperature on the working of instrument
▶ A call centre can optimize customer handling operation by understanding the
impact of number of customers on wait time
▶ A hotel can forecast the number of staff required as per the demand fluctuate
14
Logistic Regression
▶ It’s a classification algorithm
▶ it predicts the probability of occurrence of an event by fitting data to a logit
function.
▶ Mathematically, logistic regression estimates a multiple linear regression
function defined as:
▶ this is interpreted as taking input log-odds and having output probability.
15
Sigmoid Function
16
Model Validation
▶ Confusion Matrix
17
ROC -Receiver Operating Characteristics
▶ The ROC curve shows the true positive rate (i.e., recall) against the false
positive rate.
▶ The False Positive Rate is the ratio of negative instances that are incorrectly
classified as positive. It is equal to one minus the true negative rate.
▶ Specificity - The True Negative rate is called specificity
▶ The ROC curve plots sensitivity (recall) versus 1-specificity
18
Decision Tree
▶ Decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that
is mostly used in classification problems. It works for both categorical and continuous input and
output variables.
▶ Categorical Variable Decision Tree: Decision Tree which has categorical target variable then it
called as categorical variable decision tree. Example:- In above scenario of student problem, where
the target variable was “Student will play cricket or not” i.e. YES or NO.
▶ Continuous Variable Decision Tree: Decision Tree has continuous target variable then it is called as
Continuous Variable Decision Tree.
19
Terminology Used in Decision Tree
20
How does a tree
decide where to
split?
▶ Gini Index - if we select two items
from a population at random then
they must be of same class and
probability for this is 1 if population is
pure.
▶ It works with categorical target
variable “Success” or “Failure”.
▶ It performs only Binary splits
▶ Higher the value of Gini higher the
homogeneity.
▶ CART (Classification and
Regression Tree) uses Gini method
to create binary splits.
21
22
Information Gain
▶ How much information is required to explain the disorganisation in the system
▶ Here p and q is probability of success and failure respectively in that node.
Entropy
23
Advantage of Decision Tree
▶ Easy to Understand
▶ Useful in Data exploration
▶ Less data cleaning required
▶ Over fitting
▶ Not fit for continuous variables
Disadvantage of Decision Tree
24
Bias-Variance Tradeoff
25
Ensemble
▶ Ensemble methods is a machine learning technique that combines several base
models in order to produce one optimal predictive model
▶ Bagging - Here idea is to create several subsets of data from training sample
chosen randomly with replacement.
▶ Now, each collection of subset data is used to train their decision trees. As a
result, we end up with an ensemble of different models. Average of all the
predictions from different trees are used
▶ Boosting - In this technique, learners are learned sequentially with early learners
fitting simple models to the data and then analyzing data for errors.
Types of Ensemble Method
26
Random Forest
▶ Random Forest is an extension over bagging.
▶ It takes one extra step where in addition to taking the random subset of data, it
also takes the random selection of features rather than using all features to grow
trees.
▶ Handles higher dimensionality data very well.
▶ Handles missing values and maintains accuracy for missing data.
▶ Since final prediction is based on the mean predictions from subset trees, it won’t
give precise values for the regression model.
Advantages of Random Forest
Disadvantages of Random Forest
27
Thank You
28

Weitere ähnliche Inhalte

Ähnlich wie Machine Learning Approach.pptx

MIS637_Final_Project_Rahul_Bhatia
MIS637_Final_Project_Rahul_BhatiaMIS637_Final_Project_Rahul_Bhatia
MIS637_Final_Project_Rahul_Bhatia
Rahul Bhatia
 
Top 20 Data Science Interview Questions and Answers in 2023.pptx
Top 20 Data Science Interview Questions and Answers in 2023.pptxTop 20 Data Science Interview Questions and Answers in 2023.pptx
Top 20 Data Science Interview Questions and Answers in 2023.pptx
AnanthReddy38
 

Ähnlich wie Machine Learning Approach.pptx (20)

Ds for finance day 3
Ds for finance day 3Ds for finance day 3
Ds for finance day 3
 
CUSTOMER CHURN PREDICTION
CUSTOMER CHURN PREDICTIONCUSTOMER CHURN PREDICTION
CUSTOMER CHURN PREDICTION
 
A detailed analysis of the supervised machine Learning Algorithms
A detailed analysis of the supervised machine Learning AlgorithmsA detailed analysis of the supervised machine Learning Algorithms
A detailed analysis of the supervised machine Learning Algorithms
 
IRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET- A Detailed Study on Classification Techniques for Data MiningIRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET- A Detailed Study on Classification Techniques for Data Mining
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
 
MIS637_Final_Project_Rahul_Bhatia
MIS637_Final_Project_Rahul_BhatiaMIS637_Final_Project_Rahul_Bhatia
MIS637_Final_Project_Rahul_Bhatia
 
Second subjective assignment
Second  subjective assignmentSecond  subjective assignment
Second subjective assignment
 
Machine learning - session 3
Machine learning - session 3Machine learning - session 3
Machine learning - session 3
 
Machine_Learning_Trushita
Machine_Learning_TrushitaMachine_Learning_Trushita
Machine_Learning_Trushita
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learning
 
Optimal Model Complexity (1).pptx
Optimal Model Complexity (1).pptxOptimal Model Complexity (1).pptx
Optimal Model Complexity (1).pptx
 
Identifying and classifying unknown Network Disruption
Identifying and classifying unknown Network DisruptionIdentifying and classifying unknown Network Disruption
Identifying and classifying unknown Network Disruption
 
CASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGES
CASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGESCASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGES
CASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGES
 
Data Analytics Using R - Report
Data Analytics Using R - ReportData Analytics Using R - Report
Data Analytics Using R - Report
 
Decision trees
Decision treesDecision trees
Decision trees
 
Top 20 Data Science Interview Questions and Answers in 2023.pptx
Top 20 Data Science Interview Questions and Answers in 2023.pptxTop 20 Data Science Interview Questions and Answers in 2023.pptx
Top 20 Data Science Interview Questions and Answers in 2023.pptx
 
Intrusion Tolerance for Networked Systems through Two-level Feedback Control
Intrusion Tolerance for Networked Systems through Two-level Feedback ControlIntrusion Tolerance for Networked Systems through Two-level Feedback Control
Intrusion Tolerance for Networked Systems through Two-level Feedback Control
 
Presentation1.pptx
Presentation1.pptxPresentation1.pptx
Presentation1.pptx
 
Diabetes Prediction Using Machine Learning
Diabetes Prediction Using Machine LearningDiabetes Prediction Using Machine Learning
Diabetes Prediction Using Machine Learning
 
Perfomance Comparison of Decsion Tree Algorithms to Findout the Reason for St...
Perfomance Comparison of Decsion Tree Algorithms to Findout the Reason for St...Perfomance Comparison of Decsion Tree Algorithms to Findout the Reason for St...
Perfomance Comparison of Decsion Tree Algorithms to Findout the Reason for St...
 

Kürzlich hochgeladen

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
Lars Albertsson
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
JohnnyPlasten
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
shivangimorya083
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
shivangimorya083
 

Kürzlich hochgeladen (20)

100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 

Machine Learning Approach.pptx

  • 2. Topics Covered ▶ K Means ▶ Linear Regression ▶ Logistic Regression ▶ Decision Tree ▶ Random Forest 2
  • 3. Machine Learning Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labelled responses. Supervised learning is inferring from labelled training data. Unsupervised and Supervised Learning 3
  • 4. 4
  • 5. K Means ▶ k-means is one of the simplest unsupervised learning algorithms that solve the well known clustering problem. ▶ The algorithm classify a given data set through a certain number of clusters (assume k clusters) fixed apriori. ▶ Algorithmic steps for k-means clustering ▶ Let X = {x1,x2,x3,……..,xn} be the set of data points and V = {v1,v2,…….,vc} be the set of centers. ▶ 1) Randomly select ‘c’ cluster centers. ▶ 2) Calculate the eucledian distance between each data point and cluster centers. ▶ 3) Assign the data point to the cluster center whose distance from the cluster center is minimum of all the cluster centers.. ▶ 4) Recalculate the new cluster center ▶ 5) Recalculate the distance between each data point and new obtained cluster centers. ▶ 6) If no data point was reassigned then stop, otherwise repeat from step 3). 5
  • 6. Optimal Value of K - Elbow Method 6
  • 7. Advantage of K Means ▶ 1) Fast, robust and easier to understand and implement. ▶ 2) Gives best result when data set are distinct or well separated from each other. ▶ 1) The learning algorithm requires apriori specification of the number of cluster centers. ▶ 2) The use of Exclusive Assignment - If there are two highly overlapping data then k-means will not be able to resolve that there are two clusters. ▶ 3) Euclidean distance measures can unequally weight underlying factors. ▶ 4) The learning algorithm provides the local optima of the squared error function. Disadvantage of K Means 7
  • 8. Business Application 1) Document Classification - Cluster documents in multiple categories based on tags, topics, and the content of the document. 2) Customer Segmentation - Clustering helps marketers improve their customer base, work on target areas, and segment customers based on purchase history, interests, or activity monitoring 3) Insurance Fraud Detection - Utilizing past historical data on fraudulent claims, it is possible to isolate new claims based on its proximity to clusters that indicate fraudulent patterns. 4) Cyber Profile Criminals -The idea of cyber profiling is derived from criminal profiles, which provide information on the investigation division to classify the types of criminals who were at the crime scene. 8
  • 9. Linear Regression ▶ Linear regression is the simplest and most widely used statistical technique for predictive modeling. ▶ It basically gives us an equation, where we have our features as independent variables, on which our target variable is dependent upon. ▶ Linear regression equation looks like ▶ Here, we have Y as our dependent variable (Sales), X’s are the independent variables and all thetas are the coefficients. ▶ Coefficients are basically the weights assigned to the features, based on their importance. ▶ Residuals - The difference between the values predicted by us and the observed values 9
  • 11. Explained & Unexplained Variation 11
  • 12. Key Assumptions of Regression The regression has five key assumptions: ▶ Linear relationship ▶ Multivariate normality ▶ No or little multicollinearity ▶ No auto-correlation ▶ Homoscedasticity 12
  • 13. Goodness of Fit ▶ R Square - This statistic indicates the percentage of the variance in the dependent variable that the independent variables explain collectively. ▶ R-squared measures the strength of the relationship between your model and the dependent variable on a convenient 0 – 100% scale. ▶ Adjusted R Square - The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. ▶ The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. ▶ It decreases when a predictor improves the model by less than expected by chance. 13
  • 14. Business Application ▶ A factory manager can create a statistical model to understand the impact of oven temperature on the working of instrument ▶ A call centre can optimize customer handling operation by understanding the impact of number of customers on wait time ▶ A hotel can forecast the number of staff required as per the demand fluctuate 14
  • 15. Logistic Regression ▶ It’s a classification algorithm ▶ it predicts the probability of occurrence of an event by fitting data to a logit function. ▶ Mathematically, logistic regression estimates a multiple linear regression function defined as: ▶ this is interpreted as taking input log-odds and having output probability. 15
  • 18. ROC -Receiver Operating Characteristics ▶ The ROC curve shows the true positive rate (i.e., recall) against the false positive rate. ▶ The False Positive Rate is the ratio of negative instances that are incorrectly classified as positive. It is equal to one minus the true negative rate. ▶ Specificity - The True Negative rate is called specificity ▶ The ROC curve plots sensitivity (recall) versus 1-specificity 18
  • 19. Decision Tree ▶ Decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It works for both categorical and continuous input and output variables. ▶ Categorical Variable Decision Tree: Decision Tree which has categorical target variable then it called as categorical variable decision tree. Example:- In above scenario of student problem, where the target variable was “Student will play cricket or not” i.e. YES or NO. ▶ Continuous Variable Decision Tree: Decision Tree has continuous target variable then it is called as Continuous Variable Decision Tree. 19
  • 20. Terminology Used in Decision Tree 20
  • 21. How does a tree decide where to split? ▶ Gini Index - if we select two items from a population at random then they must be of same class and probability for this is 1 if population is pure. ▶ It works with categorical target variable “Success” or “Failure”. ▶ It performs only Binary splits ▶ Higher the value of Gini higher the homogeneity. ▶ CART (Classification and Regression Tree) uses Gini method to create binary splits. 21
  • 22. 22
  • 23. Information Gain ▶ How much information is required to explain the disorganisation in the system ▶ Here p and q is probability of success and failure respectively in that node. Entropy 23
  • 24. Advantage of Decision Tree ▶ Easy to Understand ▶ Useful in Data exploration ▶ Less data cleaning required ▶ Over fitting ▶ Not fit for continuous variables Disadvantage of Decision Tree 24
  • 26. Ensemble ▶ Ensemble methods is a machine learning technique that combines several base models in order to produce one optimal predictive model ▶ Bagging - Here idea is to create several subsets of data from training sample chosen randomly with replacement. ▶ Now, each collection of subset data is used to train their decision trees. As a result, we end up with an ensemble of different models. Average of all the predictions from different trees are used ▶ Boosting - In this technique, learners are learned sequentially with early learners fitting simple models to the data and then analyzing data for errors. Types of Ensemble Method 26
  • 27. Random Forest ▶ Random Forest is an extension over bagging. ▶ It takes one extra step where in addition to taking the random subset of data, it also takes the random selection of features rather than using all features to grow trees. ▶ Handles higher dimensionality data very well. ▶ Handles missing values and maintains accuracy for missing data. ▶ Since final prediction is based on the mean predictions from subset trees, it won’t give precise values for the regression model. Advantages of Random Forest Disadvantages of Random Forest 27