Machine Learning Class Assignment 2
Msc Data Analytics
Trushita Redij
Student ID: 10504099
Dublin Business School
Supervisor: Abhishek Kaushik
Dublin Business School
Assignment Submission Sheet
Msc Data Analytics
Student Name: Trushita Redij
Student ID: 10504099
Programme: Msc Data Analytics
Year: 2019
Supervisor: Abhishek Kaushik
Submission Due Date: 17/12/2019
Project Title: Machine Learning Class Assignment 2
Word Count: 1653
Page Count: 11
Contents

1 Definition
  1.1 Data Preprocessing
  1.2 Data Preprocessing Steps
    1.2.1 Data Quality Assessment
    1.2.2 Feature Aggregation
    1.2.3 Feature Sampling
    1.2.4 Dimensionality Reduction
    1.2.5 Feature Encoding
2 Definition
  2.1 Decision tree
  2.2 Entropy
  2.3 Information Gain
3 Chinese Restaurant Algorithm
  3.0.1 Working
4 Building Models using Supervised Learning Approach
  4.1 Data Collection
  4.2 Data Preprocessing
  4.3 Implementation
    4.3.1 Regression Model
    4.3.2 Classification Model
1 Definition
1.1 Data Preprocessing
Data Preprocessing is one of the most significant steps in Machine Learning. In this
step the raw data is transformed or encoded so that the machine can parse it for further
implementation.
Figure 1: Caption
Raw data contains many discrepancies, inconsistencies, errors and missing values, which
need to be handled before the data is parsed by the machine.
1.2 Data Preprocessing Steps
Figure 2: Data Preprocessing Steps
1.2.1 Data Quality Assessment
Raw data is often fetched from multiple sources in different formats, so it is important
to structure the data prior to processing. Various factors affect data quality, such as
human error, faulty measuring devices or redundant data-collection methods. In this step
we primarily focus on enhancing the quality of the data by fixing the issues below:
1. Missing values: eliminate or replace the missing values. The most common method
is substituting the median, mean or mode value of the feature.
2. Inconsistent values: deal with inconsistent cells, where data from another column
may have been merged in or split out. Understanding the datatype of every variable
is therefore necessary.
3. Duplicate values: the dataset might contain duplicated rows or columns, which must
be removed to avoid bias when the machine learning algorithm is applied.
1.2.2 Feature Aggregation
This step aggregates features to derive aggregated values and reduce the number of
objects, thereby minimizing memory and time consumption. Aggregation gives a
higher-level view of the data using groups, which are more stable.
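A minimal aggregation sketch with pandas, assuming hypothetical daily sales records: collapsing many daily rows into one row per store reduces the object count and yields the more stable, higher-level view described above.

```python
import pandas as pd

# Hypothetical daily sales records; names are illustrative only.
daily = pd.DataFrame({
    "store": ["S1", "S1", "S2", "S2"],
    "sales": [100, 150, 80, 120],
})

# Aggregate to one row per store: fewer objects, more stable values.
per_store = daily.groupby("store", as_index=False)["sales"].sum()
```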
1.2.3 Feature Sampling
Sampling is used to derive the subset of the dataset that we will analyse. A sampling
algorithm reduces the dataset's size without losing the properties of the original dataset.
This step selects an appropriate sample size and strategy. There are two types of
sampling: with replacement and without replacement.
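Both sampling variants can be sketched with pandas (toy data; the dataset here is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"x": range(100)})

# Without replacement: each row can appear at most once in the subset.
without = df.sample(n=10, replace=False, random_state=0)

# With replacement: the same row may be drawn more than once.
with_rep = df.sample(n=10, replace=True, random_state=0)
```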
1.2.4 Dimensionality Reduction
Raw datasets have many features, which need to be reduced to derive significant output.
Dimensionality reduction shrinks the feature set using feature selection or subset
selection, thereby reducing the complexity of the dataset.
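One common technique for this step (not used in the assignment itself, shown only as an illustration) is principal component analysis, which projects the features onto a smaller number of components:

```python
import numpy as np
from sklearn.decomposition import PCA

# 100 samples with 5 features (synthetic stand-in data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Project the 5 original features onto 2 principal components.
reduced = PCA(n_components=2).fit_transform(X)
```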
1.2.5 Feature Encoding
This step transforms the data into a machine-readable format. For nominal (categorical)
data a one-to-one mapping is done, which retains the meaning of the feature. For numeric
variables on an interval or ratio scale a simple mathematical transformation can be used.
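A short encoding sketch with pandas, assuming a hypothetical nominal feature: one-hot encoding maps each category to its own indicator column, preserving the fact that the categories have no order.

```python
import pandas as pd

# Hypothetical nominal feature.
df = pd.DataFrame({"colour": ["red", "green", "red"]})

# One-hot encode: one indicator column per category, no implied ordering.
encoded = pd.get_dummies(df, columns=["colour"])
```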
2 Definition
2.1 Decision tree
In the decision tree learning approach a predictive model is built from observations and
conclusions. Observations about an item are represented in the branches and conclusions
about the item's target are represented in the leaves (Wik19a).
There are two types of decision tree:
• Classification tree: the target takes a discrete set of values. Labels are defined
by the leaves and conjunctions of features are represented by the branches.
• Regression tree: the target variable takes a continuous set of values.
Figure 3: Decision Tree diagram
The source set is split on classification features into subsets, which form the child
nodes. The process is repeated recursively on each derived subset and is called recursive
partitioning. The recursion stops when the values in a subset all match the target
variable. This top-down approach is a greedy algorithm.
2.2 Entropy
Entropy is a measure of the number of ways in which a system may be arranged, often
taken to be a measure of "disorder" (Wik19c).
In machine learning, entropy can be described as a measure of uncertainty, impurity
and randomness.
It controls how a decision tree splits, thereby affecting the tree's decision boundaries.
For a binary class the formula is:
Entropy = -(p(0) * log(p(0)) + p(1) * log(p(1)))
Figure 4: Entropy Equation
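The binary entropy formula above can be computed directly; a small sketch (using log base 2, so the result is in bits):

```python
import math

def entropy(p0: float) -> float:
    """Binary entropy H = -(p0*log2(p0) + p1*log2(p1)), in bits."""
    p1 = 1.0 - p0
    h = 0.0
    for p in (p0, p1):
        if p > 0:  # by convention, 0 * log(0) = 0
            h -= p * math.log2(p)
    return h
```

A 50/50 split is maximally disordered (entropy 1 bit), while a pure node has entropy 0.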
2.3 Information Gain
Information gain is defined as the conditional expected value of the Kullback–Leibler
divergence of the univariate probability distribution of one variable from the conditional
distribution of this variable given the other one (Wik19b).
Figure 5: Information Gain
• It measures the amount of "information" a feature provides with respect to a class.
• It is a prominent factor used in the implementation of the decision tree algorithm.
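In decision tree terms, information gain is the entropy of the parent node minus the size-weighted entropy of the child nodes produced by a split. A minimal sketch:

```python
import math

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the splits."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# A split that separates the two classes perfectly recovers all the entropy.
gain = information_gain([0, 0, 1, 1], [[0, 0], [1, 1]])
```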
3 Chinese Restaurant Algorithm
The CRP (Chinese Restaurant Process) algorithm is useful when we have a collection of
observations and want to partition them into groups. It is named after the workings of
Chinese restaurants in San Francisco.
Observation: a customer (C) entering the restaurant.
Group (G): a collection of observations.
Assumption 1: the restaurant has limitless capacity.
Assumption 2: every group (G) corresponds to a table (T).
A new customer sits at an occupied table with probability proportional to the number of
customers already seated there (popular tables are preferred), and at a new, unoccupied
table with probability proportional to the concentration parameter.
Figure 6: Chinese Restaurant
3.0.1 Working
Statement: suppose that there are currently N customers sitting in the restaurant.
Zi: indicator variable (gives the table number of the i-th customer).
Z: the vector of table assignments, Z = (Z1, Z2, ..., Zn).
Algorithm:
Figure 7: Chinese Restaurant Algorithm
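The seating rule can be simulated directly. The sketch below (parameter names are my own) draws each customer's table with weight equal to the table's current occupancy, or weight alpha for a fresh table:

```python
import random

def crp(n_customers, alpha, seed=0):
    """Simulate table assignments under the Chinese Restaurant Process.

    Each customer joins an occupied table with probability proportional to
    its current occupancy, or opens a new table with probability
    proportional to the concentration parameter alpha.
    """
    rng = random.Random(seed)
    tables = []       # tables[k] = number of customers at table k
    assignments = []  # assignments[i] = table index of customer i
    for _ in range(n_customers):
        # Weights: occupancy for each existing table, alpha for a new one.
        weights = tables + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)   # open a new table
        else:
            tables[k] += 1
        assignments.append(k)
    return assignments, tables

assignments, tables = crp(50, alpha=1.0)
```

The first customer always opens table 0; larger alpha values tend to produce more, smaller groups.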
4 Building Models using Supervised Learning Approach
The proposed dataset highlights All Island Population which includes Northern Ireland
and Republic of Ireland.
4.1 Data Collection
Data Source: https://data.gov.ie/dataset/all-island-population-sa.
This file contains variables from the Population Theme that were produced by AIRO
using data from the census unit at the CSO and the Northern Ireland Research and
Statistics Agency. The data was developed under the Evidence Based Planning theme
of the Ireland Northern Cross Border Cooperation Observatory and the CrosSPlaN-2
funded research programme.
Number of rows: 23026
Number of columns: 30
4.2 Data Preprocessing
• Remove null and missing values: the dataset had no null or missing values.
• Convert string type to numeric type: a few numeric variables had string
datatype, which needed to be converted to integer.
• Visualize the dataset: understanding the dataset using visualizations such as
histograms, plots and graphs.
Figure 8: Histogram
• Rescaling the dataset: to prepare the data for implementation we used MinMax-
Scaler to rescale the data.
• Plotting correlation: to understand the correlation between the variables and
drop those which are highly correlated.
The correlation coefficient is an index that ranges from -1 to 1. A value of 0 means no
correlation; +1 indicates perfect positive correlation and -1 indicates perfect negative
correlation.
Figure 9: Correlation Heatmap
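The rescaling and correlation steps can be sketched as follows, on a tiny hypothetical frame standing in for the census features (the real dataset has 30 columns):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric features; MALE is exactly half of TOTPOP here,
# so the two columns are perfectly correlated.
df = pd.DataFrame({"TOTPOP": [120, 340, 560], "MALE": [60, 170, 280]})

# MinMaxScaler maps every column into [0, 1].
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# Pairwise Pearson correlations; values near +1 or -1 flag redundant
# features that are candidates for dropping.
corr = scaled.corr()
```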
4.3 Implementation
4.3.1 Regression Model
We used the linear regression approach, taking 'Fertility rate' as the predictor
(independent variable) and 'Total Population' as the target (dependent) variable. We
intend to study the effect of fertility rate on the total population of the island, which
includes Northern Ireland and the Republic of Ireland.
Steps:
• Training a linear regression model: we assign the X array to the features and
the Y array to the target variable, 'TOTPOP'.
• Train test split: to create a model that can be used on new data, we split the
dataset into train data, on which we fit the linear regression, and test data, on
which we test our algorithm.
• Creating and training the model: we imported LinearRegression from
sklearn.linear_model.
• Predictions from our model: we used the test dataset to predict the output.
• Visualise the prediction.
• Evaluation: the mean of the target variable is 277.92, while the RMSE on the
test data is 123.41. Although the RMSE is less than the mean, it is still large
relative to it, so a linear model is not efficient on this dataset.
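The regression pipeline above can be sketched end to end with scikit-learn. The data here is synthetic (the real fertility-rate and TOTPOP columns are not reproduced), so the numbers differ from the report's:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic stand-in: a fertility-rate-like predictor and a
# population-like target (hypothetical, not the assignment's data).
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 3.0, size=(200, 1))            # predictor
y = 150 * X[:, 0] + rng.normal(0, 10, size=200)     # noisy target

# Split, fit, predict, and score with RMSE.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
rmse = mean_squared_error(y_test, pred) ** 0.5
```

Comparing the RMSE to the scale of the target (for instance its mean) gives a rough sense of whether the linear fit is adequate.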
4.3.2 Classification Model
Steps:
• Defining features and target: we assign the X array to the features 'TOTPOP',
'MALE', 'FEMALE' and the Y array to the target variable, 'Country'.
Figure 10: Linear Regression
• Train test split: to create a model that can be used on new data, we split the
dataset into train data, on which we apply classification, and test data, on which
we test our algorithm.
• Testing accuracy of different classifiers: we tested the accuracy of
DecisionTreeClassifier, KNeighborsClassifier, GaussianNB and SVM.
• Selecting the best-fit classifier and training the model: we selected the KNN
classifier to train our model, as it showed the highest accuracy on both the train
set and the test set.
• Evaluation: confusion matrix, precision, recall and F1 score are the most commonly
used evaluation metrics. The confusion matrix and classification report methods of
sklearn.metrics were used to evaluate the model.
The KNN algorithm classified the records in the test set with 80 percent accuracy.
• Comparing error rate with the K value: to find the best value of K, we plotted
the error rate against each K value for the dataset.
Figure 11: Error Rate K Value
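The classification steps can be sketched with scikit-learn on synthetic two-class data (standing in for the two country labels; the real census features are not reproduced here):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Two well-separated synthetic clusters, one per class label.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(4, 1, (100, 3))])
y = np.array([0] * 100 + [1] * 100)

# Split, fit a KNN classifier, and evaluate on the held-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
pred = knn.predict(X_test)

acc = accuracy_score(y_test, pred)
cm = confusion_matrix(y_test, pred)

# Error-rate-vs-K sweep, as in the plot above: refit for each K.
errors = [1 - KNeighborsClassifier(n_neighbors=k)
              .fit(X_train, y_train).score(X_test, y_test)
          for k in range(1, 11)]
```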
References
[Wik19a] Wikipedia contributors, "Decision tree learning — Wikipedia, the free
encyclopedia," 2019, [Online; accessed 17-December-2019]. Available:
https://en.wikipedia.org/w/index.php?title=Decision_tree_learning&oldid=926138607
[Wik19b] Wikipedia contributors, "Information gain in decision trees — Wikipedia,
the free encyclopedia," 2019, [Online; accessed 18-December-2019]. Available:
https://en.wikipedia.org/w/index.php?title=Information_gain_in_decision_trees&oldid=930926162
[Wik19c] Wikipedia contributors, "Introduction to entropy — Wikipedia, the free
encyclopedia," 2019, [Online; accessed 18-December-2019]. Available:
https://en.wikipedia.org/w/index.php?title=Introduction_to_entropy&oldid=926007171
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 

ML Class Assignment Insights

  • 1. Machine Learning Class Assignment 2
Msc Data Analytics
Trushita Redij
Student ID: 10504099
Dublin Business School
Supervisor: Abhishek Kaushik
  • 2. Dublin Business School
Assignment Submission Sheet
Msc Data Analytics
Student Name: Trushita Redij
Student ID: 10504099
Programme: Msc Data Analytics
Year: 2019
Supervisor: Abhishek Kaushik
Submission Due Date: 17/12/2019
Project Title: Machine Learning Class Assignment 2
Word Count: 1653
Page Count: 11
  • 3. Machine Learning Class Assignment 2
Trushita Redij
10504099
Contents
1 Definition
  1.1 Data Preprocessing
  1.2 Data Preprocessing Steps
    1.2.1 Data Quality Assessment
    1.2.2 Feature Aggregation
    1.2.3 Feature Sampling
    1.2.4 Dimensionality Reduction
    1.2.5 Feature Encoding
2 Definition
  2.1 Decision Tree
  2.2 Entropy
  2.3 Information Gain
3 Chinese Restaurant Algorithm
  3.0.1 Working
4 Building Models using Supervised Learning Approach
  4.1 Data Collection
  4.2 Data Preprocessing
  4.3 Implementation
    4.3.1 Regression Model
    4.3.2 Classification Model
  • 4. 1 Definition
1.1 Data Preprocessing
Data preprocessing is one of the most significant steps in machine learning. In this step the raw data is transformed or encoded so that the machine can parse it for further implementation.
Figure 1: Caption
Raw data contains many discrepancies, inconsistencies, errors and missing values, which need to be handled before it is parsed by the machine.
1.2 Data Preprocessing Steps
Figure 2: Data Preprocessing Steps
  • 5. 1.2.1 Data Quality Assessment
Raw data is often fetched from multiple sources in different formats, so it is important to structure the data prior to processing. Various factors affect data quality, such as human error, faulty measuring devices or redundancy in the methods of collecting data. In this step we primarily focus on enhancing the quality of the data by fixing the issues below:
1. Missing values: eliminate or replace missing values. The most common method is substituting the median, mean or mode value of the feature.
2. Inconsistent values: deal with inconsistent cells, where data may have been merged from another column or split across columns. Understanding the datatype of every variable is therefore necessary.
3. Duplicate values: the dataset might contain duplicated rows or columns, which need to be removed to avoid bias when applying a machine learning algorithm.
1.2.2 Feature Aggregation
This step aggregates features to derive aggregated values and reduce the number of objects, thereby minimizing memory and time consumption. Aggregation gives us a higher-level view of the data using groups, which are more stable.
1.2.3 Feature Sampling
Sampling is used to derive the subset of the dataset that we will analyze. A sampling algorithm reduces the dataset's size without losing the properties of the original dataset. This step selects an appropriate sample size and strategy. There are two types of sampling: with replacement and without replacement.
1.2.4 Dimensionality Reduction
Raw datasets have many features, which need to be reduced to derive significant output. Dimensionality reduction shrinks the feature set using feature selection or subset selection, thereby reducing the complexity of the dataset.
1.2.5 Feature Encoding
This step transforms the data into a machine-readable format. For nominal (categorical) data a one-to-one mapping is done, which helps retain the meaning of the feature. For numeric variables measured on interval or ratio scales, simple mathematical transformations can be used.
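The quality-assessment and encoding steps above can be sketched in a few lines of pandas. The column names and values here are purely illustrative, not taken from any particular dataset:

```python
import pandas as pd

# A small illustrative frame (hypothetical data, not the assignment's dataset)
df = pd.DataFrame({
    "age": [25, None, 40, 31],                       # missing value to impute
    "city": ["Dublin", "Cork", "Dublin", "Galway"],  # nominal feature to encode
})

# 1. Missing values: substitute the feature's median
df["age"] = df["age"].fillna(df["age"].median())

# 2. Duplicate rows: drop exact duplicates to avoid bias
df = df.drop_duplicates()

# 3. Feature encoding: map each nominal category to an integer code
df["city_code"] = df["city"].astype("category").cat.codes
```

After these steps every column is numeric and free of missing values, which is the machine-readable form the slide describes.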
  • 6. 2 Definition
2.1 Decision Tree
In the decision tree learning approach a predictive model is built that goes from observations about an item, represented in the branches, to conclusions about the item's target value, represented in the leaves (Wik19a). There are two types of decision tree:
• Classification tree: the target variable takes a discrete set of values. Leaves carry the class labels and branches represent the conjunctions of features leading to those labels.
• Regression tree: the target variable takes a continuous set of values.
Figure 3: Decision Tree diagram
The source set is split into subsets based on classification features; each subset constitutes a child node. The process is repeated recursively on each derived subset and is called recursive partitioning. The recursion terminates when the values in a subset all match the target variable. This top-down approach is an example of a greedy algorithm.
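As an illustration of the recursive-partitioning idea, a classification tree can be fit with scikit-learn in a few lines; the iris dataset stands in here as example data:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Fit a classification tree on the classic iris dataset
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X, y)

# Each internal node splits on the feature that maximises information gain;
# max_depth=3 stops the greedy recursion early to keep the tree readable
print(clf.score(X, y))
```

Limiting the depth is one common way to conclude the recursion before every leaf is pure.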
  • 7. 2.2 Entropy
Entropy is a measure of the number of ways in which a system may be arranged, often taken to be a measure of "disorder" (Wik19c). In machine learning, entropy can be viewed as a measure of uncertainty and impurity in the data. It controls how a decision tree splits, thereby affecting the decision boundaries of the tree.
For a binary target, the formula for entropy is:
Entropy = -(p(0) * log(p(0)) + p(1) * log(p(1)))
Figure 4: Entropy Equation
2.3 Information Gain
Information gain is defined as the conditional expected value of the Kullback–Leibler divergence of the univariate probability distribution of one variable from the conditional distribution of this variable given the other one (Wik19b).
Figure 5: Information Gain
• It measures the quantity of "information" a feature conveys about a class.
• It is a prominent factor used in the implementation of decision tree algorithms.
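The binary entropy formula and the information-gain computation can be checked with a small self-contained sketch (plain Python, log base 2):

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits (log base 2)."""
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the weighted entropy of the two children."""
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

# A perfectly mixed parent has entropy 1 bit; a pure split recovers all of it
parent = [0, 0, 1, 1]
print(entropy(parent))                           # 1.0
print(information_gain(parent, [0, 0], [1, 1]))  # 1.0
```

A split whose children are as mixed as the parent yields zero gain, which is why decision tree algorithms prefer splits maximising this quantity.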
  • 8. 3 Chinese Restaurant Algorithm
The Chinese Restaurant Process (CRP) algorithm is useful when we have a collection of observations and want to partition them into groups. Its metaphor is a Chinese restaurant in San Francisco with a seemingly limitless seating capacity.
Observation: a customer (C) entering the restaurant.
Group (G): a collection of observations.
Assumption 1: the restaurant has limitless capacity.
Assumption 2: every group (G) corresponds to a table (T).
Each arriving customer prefers to sit at a popular (already occupied) table, with probability proportional to the number of customers already seated there; otherwise the new customer sits at an unoccupied table.
Figure 6: Chinese Restaurant
  • 9. 3.0.1 Working
Statement: suppose that there are currently N customers sitting in the restaurant.
Zi: indicator variable describing the table number of the i-th customer.
Vector of table assignments: Z = (Z1, Z2, ..., ZN).
Algorithm:
Figure 7: Chinese Restaurant Algorithm
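A minimal simulation of the seating process can make the working concrete. This sketch assumes a standard CRP with a concentration parameter alpha (the parameter name is ours; the slides do not fix a symbol for it):

```python
import random

def crp(n_customers, alpha, seed=0):
    """Simulate the Chinese Restaurant Process.

    Returns the table assignment Z_i for each of n_customers;
    alpha controls how readily new tables are opened.
    """
    random.seed(seed)
    tables = []       # tables[k] = number of customers seated at table k
    assignments = []  # the vector Z = (Z1, ..., ZN)
    for i in range(n_customers):
        # P(join existing table k) proportional to tables[k] (popularity);
        # P(open a new table)      proportional to alpha.
        weights = tables + [alpha]
        k = random.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)  # the customer opens a new table
        else:
            tables[k] += 1
        assignments.append(k)
    return assignments

z = crp(100, alpha=1.0)
```

Because table k attracts new customers in proportion to its occupancy, popular tables grow faster — the "rich get richer" behaviour the slides describe.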
  • 10. 4 Building Models using a Supervised Learning Approach
The proposed dataset describes the All Island Population, which covers Northern Ireland and the Republic of Ireland.
4.1 Data Collection
Data source: https://data.gov.ie/dataset/all-island-population-sa. This file contains variables from the Population Theme produced by AIRO using data from the census unit at the CSO and the Northern Ireland Research and Statistics Agency. The data was developed under the Evidence Based Planning theme of the Ireland Northern Cross Border Cooperation Observatory and the CrosSPlaN-2 funded research programme.
No. of rows: 23026. No. of columns: 30.
4.2 Data Preprocessing
• Remove null and missing values: the dataset had no null or missing values.
• Convert string type to numeric type: a few numeric variables were stored as strings and needed to be converted to integers.
• Visualize the dataset: understand the dataset using visualizations such as histograms, plots and graphs.
Figure 8: Histogram
• Rescale the dataset: to prepare the data for implementation we used MinMaxScaler to rescale the features.
• Plot correlations: to understand the correlation between the variables and drop the variables that are highly correlated.
  • 11. The correlation coefficient is an index that ranges from -1 to 1. A value of 0 means there is no correlation; a value of 1 indicates perfect positive correlation and -1 perfect negative correlation.
Figure 9: Correlation Heatmap
4.3 Implementation
4.3.1 Regression Model
We used a linear regression approach with 'Total Population' as the label (dependent variable) and 'Fertility Rate' as the predictor. We intend to study the effect of the fertility rate on the total population of the island, which includes Northern Ireland and the Republic of Ireland.
Steps:
• Training a linear regression model: we assign the X array to the features and the Y array to the target variable, 'TOTPOP'.
• Train/test split: to create a model that can be used on new data we split the dataset into train data, on which we fit the linear regression, and test data, on which we evaluate it.
• Creating and training the model: from sklearn.linear_model we imported LinearRegression.
• Predictions from our model: we used the test dataset to predict the output.
• Visualise the prediction.
• Evaluation: the mean of the target variable is 277.92, while the RMSE on the test data is 123.41. Although the RMSE is smaller than the mean, it is still large relative to it, so a linear model is not an efficient fit for this dataset.
4.3.2 Classification Model
Steps:
• Training a classification model: we assign the X array to the features 'TOTPOP', 'MALE' and 'FEMALE', and the Y array to the target variable, 'Country'.
  • 12. Figure 10: Linear Regression
• Train/test split: to create a model that can be used on new data we split the dataset into train data, on which we fit the classifiers, and test data, on which we evaluate them.
• Testing the accuracy of different classifiers: we tested the accuracy of DecisionTreeClassifier, KNeighborsClassifier, GaussianNB and SVM.
• Selecting the best-fit classifier and training the model: we selected the KNN classifier to train our model, as it achieved the highest accuracy on both the train set and the test set.
• Evaluation: the confusion matrix, precision, recall and F1 score are the most commonly used evaluation metrics. The confusion_matrix and classification_report methods of sklearn.metrics were used to evaluate the model. The KNN algorithm classified the records in the test set with 80 percent accuracy.
• Comparing the error rate with the K value: to find the best value of K we plotted the error rate against the corresponding K values.
Figure 11: Error Rate vs K Value
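The classifier-comparison step can be sketched as follows. Since the census dataset is not bundled here, the iris dataset stands in, and the accuracy values are illustrative only:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Compare the four classifiers named above on a stand-in dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "GaussianNB": GaussianNB(),
    "SVM": SVC(),
}
# Fit each model on the train split and score it on the held-out test split
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
```

Picking the classifier with the highest test-set score, as the slides do with KNN, follows directly from this dictionary.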
  • 13. References
[Wik19a] Wikipedia contributors, "Decision tree learning — Wikipedia, the free encyclopedia," 2019, [Online; accessed 17-December-2019]. Available: https://en.wikipedia.org/w/index.php?title=Decision_tree_learning&oldid=926138607
[Wik19b] Wikipedia contributors, "Information gain in decision trees — Wikipedia, the free encyclopedia," 2019, [Online; accessed 18-December-2019]. Available: https://en.wikipedia.org/w/index.php?title=Information_gain_in_decision_trees&oldid=930926162
[Wik19c] Wikipedia contributors, "Introduction to entropy — Wikipedia, the free encyclopedia," 2019, [Online; accessed 18-December-2019]. Available: https://en.wikipedia.org/w/index.php?title=Introduction_to_entropy&oldid=926007171