This document is a machine learning class assignment submitted by Trushita Redij to their supervisor Abhishek Kaushik at Dublin Business School. The assignment discusses data preprocessing techniques, decision trees, the Chinese Restaurant algorithm, and building supervised learning models. Specifically, linear regression and KNN classification models are implemented on population data from Ireland to predict total population and classify countries.
1 Definition
1.1 Data Preprocessing
Data preprocessing is one of the most significant steps in machine learning. In this step the raw data is transformed or encoded so that the machine can parse it for further implementation.
Raw data has many discrepancies, inconsistencies, errors and missing values which need to be handled before it is parsed by the machine.
1.2 Data Preprocessing Steps
Figure 2: Data Preprocessing Steps
1.2.1 Data Quality Assessment
Raw data is often fetched from multiple sources in different formats, so it becomes important to structure the data prior to processing. Various factors are responsible for poor data quality, such as human error, faulty measuring devices or redundancy in the methods of collecting data. In this step we primarily focus on enhancing the quality of the data by fixing the issues mentioned below:
1. Missing values: Eliminating or replacing the missing values. The most common method in this scenario is substituting the median, mean or mode value of the feature.
2. Inconsistent values: Dealing with inconsistent data cells, where data from another column may have been merged in or the data may have been split. Understanding the datatype of all the variables is therefore necessary.
3. Duplicate values: The dataset might contain duplicated rows or columns, which need to be removed to avoid bias when implementing a machine learning algorithm (a small sketch of these clean-up steps follows this list).
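As a minimal pandas sketch of these clean-up steps (the file name raw_data.csv is purely illustrative and not the assignment dataset):

```python
import pandas as pd

# Load the raw data (file name is illustrative).
df = pd.read_csv("raw_data.csv")

# 1. Missing values: substitute the column median for numeric gaps.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# 2. Inconsistent values: inspect the datatype of every variable.
print(df.dtypes)

# 3. Duplicate values: drop duplicated rows to avoid bias.
df = df.drop_duplicates()
```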
1.2.2 Feature Aggregation
This step performs aggregation on the feature to derive aggregated values and reduce the
number of objects thereby minimizing consumption of memory and time. Aggregation
helps us build a higher level view of data using groups which are more stable.
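A small sketch of feature aggregation with pandas; the DataFrame and the 'region'/'population' columns are invented here purely for illustration:

```python
import pandas as pd

# Illustrative data: one row per small area, aggregated up to region level.
df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "population": [1200, 800, 1500, 700],
})

# Grouping gives a higher-level, more stable view with fewer objects.
aggregated = df.groupby("region", as_index=False)["population"].sum()
print(aggregated)
```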
1.2.3 Feature Sampling
Sampling it used to derive subset of dataset that we will be analyzing. Sampling algorithm
helps in reducing dataset’s size without reducing the properties of original dataset. This
steps selects the appropriate sampling size and strategy. There are two types of sampling
one with replacement and without replacement.
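A brief sketch of both sampling strategies using pandas' sample method on an invented dataset:

```python
import pandas as pd

df = pd.DataFrame({"value": range(100)})  # illustrative dataset

# Sampling without replacement: each row is drawn at most once.
subset_without = df.sample(n=20, replace=False, random_state=42)

# Sampling with replacement: the same row may be drawn more than once.
subset_with = df.sample(n=20, replace=True, random_state=42)
```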
1.2.4 Dimensionality Reduction
Raw Dataset’s have many features, which needs to be reduced to derive significant output.
Dimensionality reduction is used to reduce the feature size by using feature selection or
subset selection thereby reducing the complexity of dataset.
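One possible sketch of feature (subset) selection with scikit-learn; the synthetic data and the choice of SelectKBest are assumptions for illustration, not the method used later in this report:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Illustrative data: 200 samples, 20 features, only 4 of them informative.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=4, random_state=0)

# Keep the 4 features that score highest against the target.
selector = SelectKBest(score_func=f_classif, k=4)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)  # (200, 20) -> (200, 4)
```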
1.2.5 Feature Encoding
This steps transforms the data to machine readable format. For continues nominal data
one to one mapping is done which helps to retain the meaning of feature. For numeric
variables having intervals or ratios simple mathematical transformation can be used.
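A short sketch of encoding a nominal feature with pandas; the 'country' column is invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"country": ["Ireland", "Northern Ireland", "Ireland"]})

# One-to-one mapping of a nominal feature into machine-readable columns
# (one-hot encoding preserves the meaning of each category).
encoded = pd.get_dummies(df, columns=["country"])
print(encoded)
```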
2 Definition
2.1 Decision tree
In the decision tree learning approach a predictive model is built from observations and conclusions. Observations about an item are represented in the branches and conclusions about the item's target are represented in the leaves (Wik19a).
There are two types of decision tree:
• Classification tree: These tree models take a discrete set of values. Labels are defined by the leaves and conjunctions of features are represented by the branches.
• Regression tree: The target variable takes a continuous set of values.
Figure 3: Decision Tree diagram
The source set is split into subsets based on classification features; the subsets form the child nodes. The process is repeated recursively on each derived subset and is called recursive partitioning. The recursion concludes when all the values in a subset share the same value of the target variable. This top-down approach is termed a greedy algorithm.
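A minimal scikit-learn sketch of recursive partitioning with a classification tree; the iris dataset is used purely as an illustrative stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative dataset with a discrete target (three flower species).
data = load_iris()
X, y = data.data, data.target

# Splits are chosen greedily, top-down, using entropy as the criterion.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

# The printed branches are the observations; the leaves are the conclusions.
print(export_text(tree, feature_names=data.feature_names))
```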
2.2 Entropy
Entropy is a measure of the number of ways in which a system may be arranged, often taken to be a measure of "disorder" (Wik19c).
In machine learning, entropy can be described as a measure of impurity, disorder and uncertainty.
It controls how a decision tree splits the data, thereby affecting the decision boundaries of the tree. For a binary target, Entropy = -(p(0) * log(p(0)) + p(1) * log(p(1))). The general formula for entropy is:
Figure 4: Entropy Equation
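A tiny Python sketch of this formula for a discrete distribution (log base 2 is assumed, so the result is in bits):

```python
import math

def entropy(probabilities):
    """Shannon entropy of a discrete probability distribution, in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# A pure node has zero entropy; a 50/50 split is maximally impure.
print(entropy([1.0, 0.0]))   # 0.0
print(entropy([0.5, 0.5]))   # 1.0
```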
2.3 Information Gain
Information gain is termed as the conditional expected value of the Kullback–Leibler
divergence of the univariate probability distribution of one variable from the conditional
distribution of this variable given the other one (Wik19b).
Figure 5: Information Gain
• It measures the amount of "information" a feature provides about a class.
• It is a prominent factor used in the implementation of the decision tree algorithm.
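A small sketch of information gain computed as the parent's entropy minus the weighted entropy of the child subsets; the labels are invented for illustration:

```python
import math
from collections import Counter

def entropy_of(labels):
    """Entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(parent_labels, child_splits):
    """Parent entropy minus the size-weighted entropy of each child split."""
    total = len(parent_labels)
    weighted = sum(len(child) / total * entropy_of(child) for child in child_splits)
    return entropy_of(parent_labels) - weighted

# Splitting a 50/50 node into two pure halves gains a full bit of information.
print(information_gain(["yes", "yes", "no", "no"], [["yes", "yes"], ["no", "no"]]))
```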
3 Chinese Restaurant Algorithm
The Chinese Restaurant Process (CRP) algorithm is useful when we have a collection of observations and want to partition them into groups. It is based on the metaphor of a Chinese restaurant in San Francisco with seemingly limitless seating capacity.
Observation: a customer (C) entering the restaurant.
Group (G): a collection of observations.
Assumption 1: the restaurant has limitless capacity.
Assumption 2: every group (G) corresponds to a table (T).
A new customer prefers to join a popular (already occupied) table, with probability proportional to the number of customers already seated there, and sits at a new, unoccupied table with the remaining probability.
Figure 6: Chinese Restaurant
3.0.1 Working
Statement: Suppose that there are currently N customers seated in the restaurant.
Zi: indicator variable (describes the table number of the i-th customer).
Vector of table assignments: Z = (Z1, Z2, ..., ZN).
Algorithm:
Figure 7: Chinese Restaurant Algorithm
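Since the algorithm itself appears only as a figure, the following is a minimal simulation sketch of the standard Chinese Restaurant Process; the concentration parameter alpha is introduced here for illustration and is not taken from the figure:

```python
import random

def chinese_restaurant_process(num_customers, alpha=1.0, seed=0):
    """Seat customers one by one; returns the table assignments Z_1..Z_N."""
    rng = random.Random(seed)
    table_counts = []   # number of customers at each existing table
    assignments = []    # Z_i for each customer i
    for n in range(num_customers):
        # An occupied table k is chosen with probability count_k / (n + alpha),
        # and a brand-new table with probability alpha / (n + alpha).
        weights = table_counts + [alpha]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(table_counts):
            table_counts.append(1)      # open an unoccupied table
        else:
            table_counts[table] += 1    # join a popular table
        assignments.append(table)
    return assignments

print(chinese_restaurant_process(10))
```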
4 Building Models using Supervised Learning Approach
The proposed dataset highlights the All-Island Population, which includes Northern Ireland and the Republic of Ireland.
4.1 Data Collection
Data Source: https://data.gov.ie/dataset/all-island-population-sa.
This file contains variables from the Population Theme that was produced by AIRO using data from the census unit at the CSO and the Northern Ireland Statistics and Research Agency. This data was developed under the Evidence Based Planning theme of the Ireland Northern Cross Border Cooperation Observatory and the CrosSPlaN-2 funded research programme.
No. of rows: 23026
No. of columns: 30
4.2 Data Preprocessing
• Remove null and missing values: The dataset had no null or missing values.
• Convert string type to numeric type: A few numeric variables had a string datatype, which needed to be converted to integer.
• Visualize the dataset: Understanding the dataset using visualizations like histograms, plots and graphs.
Figure 8: Histogram
• Rescaling the dataset: To prepare the data for implementation we used MinMaxScaler to rescale the data.
• Plotting correlation: To understand the correlation between the variables and drop the variables which are highly correlated (a sketch of the rescaling and correlation steps follows Figure 9 below).
The correlation coefficient is an index that ranges from -1 to 1. There is no correlation when the value is 0; a value of 1 indicates perfect positive correlation and a value of -1 indicates perfect negative correlation.
Figure 9: Correlation Heatmap
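A sketch of the rescaling and correlation steps; the CSV file name and the use of seaborn for the heatmap are assumptions rather than details taken from the report:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("all_island_population.csv")   # file name is illustrative
numeric = df.select_dtypes(include="number")

# Rescale every numeric column into the [0, 1] range.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(numeric),
                      columns=numeric.columns)

# Plot the correlation heatmap to spot highly correlated variables to drop.
sns.heatmap(scaled.corr(), cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```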
4.3 Implementation
4.3.1 Regression Model
We used the linear regression approach, considering 'Total Population' as the label (dependent variable) and 'Fertility rate' as the independent variable. We intend to study the effect of the fertility rate on the total population of the island, which includes Northern Ireland and the Republic of Ireland.
Steps:
• Defining Features and Target: We assign the X array to the feature and the Y array to the target variable, 'TOTPOP'.
• Train Test Split: In order to create a model which can be used on new data, we split the dataset into train data, on which we apply linear regression, and test data, on which we test our algorithm.
• Creating and Training the Model: We imported LinearRegression from sklearn.linear_model and fitted it on the train data.
• Predictions from our Model: We used the test dataset to predict our output.
• Visualise the prediction
• Evaluation: The mean of the target variable is 277.92, while the RMSE on the test data is 123.41. Although the RMSE is lower than the mean of the target variable, it is still large relative to it, so using a linear model on the given dataset is not efficient. A sketch of these steps is shown after this list.
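A sketch of the regression workflow above; the CSV file name and the exact column names ('Fertility rate', 'TOTPOP') are assumptions that may differ from the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

df = pd.read_csv("all_island_population.csv")   # file name is illustrative

X = df[["Fertility rate"]].values   # independent variable
y = df["TOTPOP"].values             # target (label)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print("RMSE on the test data:", rmse)
```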
4.3.2 Classification Model
Steps:
• Defining Features and Target: We assign the X array to the features 'TOTPOP', 'MALE' and 'FEMALE', and the Y array to the target variable, 'Country'.
Figure 10: Linear Regression
• Train Test Split: In order to create a model which can be used on new data, we split the dataset into train data, on which we apply classification, and test data, on which we test our algorithm.
• Testing the accuracy of different classifiers: We tested the accuracy of 'DecisionTreeClassifier', 'KNeighborsClassifier', 'GaussianNB' and 'SVM'.
• Selecting the best-fit classifier and training the model: We selected the KNN classifier to train our model as it showed the highest accuracy on both the train set and the test set.
• Evaluation: Confusion matrix, precision, recall and F1 score are the most commonly used evaluation metrics. The confusion_matrix and classification_report methods of sklearn.metrics were used to evaluate the model. The KNN algorithm classified the records in the test set with 80 percent accuracy.
• Comparing Error Rate with the K Value: To find the best value of K, we plotted the error rate against a range of K values for the dataset, as shown in the sketch after Figure 11.
Figure 11: Error Rate K Value
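A sketch of the classification workflow and the error-rate-versus-K comparison; the file name and column names are assumptions based on the report's description:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

df = pd.read_csv("all_island_population.csv")   # file name is illustrative

X = df[["TOTPOP", "MALE", "FEMALE"]].values
y = df["Country"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)

print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))

# Error rate for K = 1..40 to find the best value of K.
errors = [np.mean(KNeighborsClassifier(n_neighbors=k)
                  .fit(X_train, y_train)
                  .predict(X_test) != y_test)
          for k in range(1, 41)]
plt.plot(range(1, 41), errors, marker="o")
plt.xlabel("K value")
plt.ylabel("Error rate")
plt.show()
```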
References
[Wik19a] Wikipedia contributors, "Decision tree learning — Wikipedia, the free encyclopedia," 2019, [Online; accessed 17-December-2019]. Available: https://en.wikipedia.org/w/index.php?title=Decision_tree_learning&oldid=926138607
[Wik19b] Wikipedia contributors, "Information gain in decision trees — Wikipedia, the free encyclopedia," 2019, [Online; accessed 18-December-2019]. Available: https://en.wikipedia.org/w/index.php?title=Information_gain_in_decision_trees&oldid=930926162
[Wik19c] Wikipedia contributors, "Introduction to entropy — Wikipedia, the free encyclopedia," 2019, [Online; accessed 18-December-2019]. Available: https://en.wikipedia.org/w/index.php?title=Introduction_to_entropy&oldid=926007171