SHIVAM PAWAR 5492083
Data and Data Preprocessing
Problem 1: Types of attributes
Q 1) Classify the following attributes as nominal, ordinal, interval, ratio:
(a) Rating of an Amazon product by a person on a scale of 1 to 5 – Ordinal
The ratings have a meaningful order (a 5 is better than a 4), but the gaps between consecutive
ratings are not guaranteed to be equal, so differences and ratios of ratings are not meaningful.
Hence the ordinal scale is the appropriate one for this case.
(b) The Internet Speed – Ratio
Internet speed has a true zero point (0 Mbps means no data is transferred), and ratios are
meaningful: a 100 Mbps connection is twice as fast as a 50 Mbps one. It is therefore a ratio
attribute rather than an interval one.
(c) Number of customers in a store – Ratio
The count has a true zero (an empty store), and both differences and ratios are meaningful: a
store with 40 customers has twice as many as a store with 20. Hence the ratio scale applies.
(d) UCF Student ID – Nominal
The ID only identifies a student. Its numeric value carries no order or magnitude (it says
nothing about academic position), so the only meaningful comparison is whether two IDs are equal.
(e) Distance – Ratio
Distance has a true zero point (zero distance means no separation), and ratios are meaningful:
10 km is twice as far as 5 km. Hence the ratio scale applies.
(f) Letter grade (A, B, C, D) – Ordinal
The grades have a meaningful order (an A is considered higher than a B in academic standards),
but the difference between two grades is not a fixed quantity, so the ordinal scale applies.
(g) The temperature at Orlando – Interval
Temperature in degrees Celsius or Fahrenheit is an interval attribute: differences between
readings are meaningful, but 0 degrees does not mean an absence of the property, so ratios are
not. For example, 20 degrees is not twice as hot as 10 degrees.
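The interval-versus-ratio distinction in (g) can be checked with a little arithmetic. The sketch below is only an illustration: the two readings (20 and 10 degrees Celsius) and the helper names are assumptions, not part of the assignment. It shows that the ratio of two temperatures changes with the unit, while the difference in Celsius matches the difference in Kelvin.

```python
def celsius_to_fahrenheit(c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32


def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvin."""
    return c + 273.15


# Two illustrative readings (assumed values, not data from the assignment).
warm_c, cool_c = 20.0, 10.0

# The "ratio" depends on the unit, so it carries no physical meaning
# on an interval scale such as Celsius or Fahrenheit.
ratio_c = warm_c / cool_c
ratio_f = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cool_c)
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

# Differences survive the shift from Celsius to Kelvin.
diff_c = warm_c - cool_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

print(ratio_c, ratio_f, ratio_k, diff_c, diff_k)
```

Kelvin has a true zero, so only the Kelvin ratio is physically meaningful. Counts, distances and speeds behave like Kelvin in this respect.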
Problem 2: Exploring Data Pre-processing Techniques
Q1) (Reproduce): Please read, understand, run the code and reproduce the model accuracies.
Please briefly explain whether you can reproduce the classification accuracies of 'Support Vector
Machines', 'KNN', 'Logistic Regression', 'Random Forest', 'Naive Bayes', 'Perceptron', 'Stochastic
Gradient Descent', 'Linear SVC', 'Decision Tree'.
The given Kaggle Titanic notebook follows a workflow of Classifying, Correlating, Converting,
Completing, Correcting, Creating and Charting in order to prepare the data for the algorithms.
The main aim is to predict whether a passenger survived. Initially the notebook takes
‘PassengerId’, ‘Survived’, ‘Pclass’, ‘Name’, ‘Sex’, ‘Age’, ‘SibSp’, ‘Parch’, ‘Ticket’, ‘Fare’, ‘Cabin’ and
‘Embarked’ as the columns used to explore the data. Later, after running a few scenarios, a few
features such as Fare, Ticket and Cabin are removed, because removing them makes little
difference to predicting survival.
I re-ran the code with the same machine learning models and obtained the same accuracies on
every run for all the algorithms except Stochastic Gradient Descent. Stochastic Gradient
Descent is an iterative algorithm that visits the training samples in a random order. Unless
the random seed is fixed, the order differs on each run, so the fitted model and its score also
differ from run to run.
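The effect of the random sample order can be sketched in plain Python. This is a minimal illustration, not the notebook's code: the toy data set, the function name `sgd_accuracy` and its parameters are all hypothetical, and the update rule is a simple perceptron-style step. The sample order is the only source of randomness, so fixing the seed makes a run repeatable.

```python
import random

# Hypothetical toy data: 2-D points with labels +1 / -1 (not the Titanic data).
POINTS = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0), (1.0, 1.5), (3.0, 2.5)]
LABELS = [1, -1, 1, -1, -1, 1]


def sgd_accuracy(points, labels, seed, epochs=3, lr=0.1):
    """Fit a linear classifier with perceptron-style SGD updates and
    return its accuracy (in percent) on the data it was trained on."""
    rng = random.Random(seed)  # all randomness comes from this seed
    w0, w1, b = 0.0, 0.0, 0.0
    order = list(range(len(points)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a random order each epoch
        for i in order:
            (x0, x1), y = points[i], labels[i]
            if y * (w0 * x0 + w1 * x1 + b) <= 0:  # misclassified sample
                w0 += lr * y * x0
                w1 += lr * y * x1
                b += lr * y
    correct = sum(
        1 for (x0, x1), y in zip(points, labels)
        if y * (w0 * x0 + w1 * x1 + b) > 0
    )
    return round(100.0 * correct / len(points), 2)


run_a = sgd_accuracy(POINTS, LABELS, seed=0)
run_b = sgd_accuracy(POINTS, LABELS, seed=0)  # same seed, same score
other_runs = [sgd_accuracy(POINTS, LABELS, seed=s) for s in range(1, 6)]  # may differ
print(run_a, run_b, other_runs)
```

In scikit-learn the corresponding control is the `random_state` parameter of `SGDClassifier`. Passing a fixed integer there should make the Stochastic Gradient Descent score repeatable as well.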
Sample accuracies for the algorithms are as below:
Sample 1:
Random Forest - 86.76
Decision Tree - 86.76
KNN - 74.47
Support Vector Machines - 83.84
Logistic Regression - 80.36
Linear SVC - 79.12
Perceptron - 78.00
Naive Bayes - 72.28
Stochastic Gradient Descent - 51.63
Sample 2:
Random Forest - 86.76
Decision Tree - 86.76
KNN - 74.47
Support Vector Machines - 83.84
Logistic Regression - 80.36
Linear SVC - 79.12
Stochastic Gradient Descent - 78.68
Perceptron - 78.00
Naive Bayes - 72.28
Q2) (Improve): Is the data pre-processing process proposed in the Kaggle post the best pre-
processing solution? If yes, please explain why. If not, can you leverage what you learned in the
class and your previous experiences to improve data processing, to obtain better accuracies for all
these classification models? Describe what is your improved data pre-processing, and what are
your improved accuracies?
As stated in the first question, the algorithms and data processing techniques used in the
given Kaggle Titanic notebook are well documented. The workflow follows six steps. After
understanding and defining the problem, we acquire the training and testing data. Then we
prepare and cleanse the data. Next we analyse and explore the data. Finally we model and
predict the possible scenarios to solve the problem, which in turn supplies the result.
The notebook's workflow starts with the features mentioned in the previous question
(‘PassengerId’, ‘Survived’, ‘Pclass’, ‘Name’, ‘Sex’, ‘Age’, ‘SibSp’, ‘Parch’, ‘Ticket’, ‘Fare’,
‘Cabin’ and ‘Embarked’). Later, after refining the data, a few features such as Ticket, Fare,
Cabin, Embarked and Parch are dropped to increase the accuracy of the algorithms. New features
such as AgeBand and IsAlone are also added, which improved the results.
The technique I used was to change the values in AgeBand, which increased the best accuracy
(Random Forest and Decision Tree) from 86.76 to 90.46. The original notebook uses wide
intervals between the age values that define AgeBand. I narrowed these intervals and added new
bands. With these bands the code ran faster and reached higher accuracy in my runs.
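The kind of change described above can be sketched as follows. The report does not list the band boundaries, so both sets of edges here are assumptions chosen for illustration (five wide bands versus nine narrower ones), and `age_band` is a hypothetical helper rather than the notebook's own code.

```python
from bisect import bisect_left

# Assumed edges for illustration only; the actual values are not given in the report.
COARSE_EDGES = [16, 32, 48, 64]                # five wide bands
FINE_EDGES = [8, 16, 24, 32, 40, 48, 56, 64]   # nine narrower bands


def age_band(age, edges):
    """Map an age to an ordinal band index.

    Bands are right-inclusive: with edges [16, 32], ages up to 16 fall in
    band 0, ages above 16 and up to 32 in band 1, and older ages in band 2.
    """
    return bisect_left(edges, age)


ages = [4, 16, 22, 30, 47, 70]
coarse = [age_band(a, COARSE_EDGES) for a in ages]
fine = [age_band(a, FINE_EDGES) for a in ages]
print(coarse)  # ages 22 and 30 share a band here
print(fine)    # ages 22 and 30 fall in different bands here
```

Narrower bands keep more of the age information, which gives the models more to split on. If the reported accuracy is computed on the training data, part of the gain may come from fitting that data more closely, so it is worth checking the effect on held-out data too.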
The accuracies of the algorithms after making these changes in the code are listed below.
Sample 1:
Random Forest - 90.46
Decision Tree - 90.46
KNN - 87.09
Support Vector Machines - 85.63
Perceptron - 80.02
Linear SVC - 78.79
Logistic Regression - 78.45
Naive Bayes - 77.89
Stochastic Gradient Descent - 74.97
Sample 2:
Random Forest - 89.56
Decision Tree - 89.56
KNN - 87.21
Support Vector Machines - 85.07
Linear SVC - 78.45
Logistic Regression - 78.23
Perceptron - 78.23
Stochastic Gradient Descent - 77.89
Naive Bayes - 77.67
For Sample 2, I changed the AgeBand values again, which produced the accuracies listed above.
The links for Sample 2 are:
https://www.kaggle.com/code/nikhithakonda/titanic-data-science-solutions/edit
https://www.kaggle.com/code/nikhithakonda/titanic-data-science-solutions
Problem 3: Distance/Similarity Measures
Given the four boxes shown in the following figure, answer the following questions. In the
diagram, numbers indicate the lengths and widths and you can consider each box to be a vector of
two real numbers, length and width. For example, the top left box would be (2,1), while the
bottom right box would be (3,3). Restrict your choices of similarity/distance measure to Euclidean
distance and correlation.
Which proximity measure would you use to group the boxes based on their shapes (length-width
ratio)?
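For grouping the boxes based on their shapes (length-width ratio) we need to use correlation,
because it does not change when a box is scaled up or down. The standard (Pearson) correlation
formula is:
$$ r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[ n \sum x^2 - \left( \sum x \right)^2 \right] \left[ n \sum y^2 - \left( \sum y \right)^2 \right]}} $$
where n = 2, as each box is a vector of two values (length and width), and x and y are the two
boxes being compared. The boxes are Box 1 (2,1), Box 2 (1,1), Box 3 (6,3) and Box 4 (3,3).
For box 1 and box 3, Sigma x = 3, Sigma y = 9, Sigma xy = 15, Sigma x^2 = 5 and Sigma y^2 = 45,
so r = (30 - 27) / sqrt(1 * 9) = 1. The two boxes have the same 2:1 shape.
For box 1 and box 2, Sigma x and Sigma y are 3 and 2 and Sigma xy = 3, so the numerator is
2 * 3 - 3 * 2 = 0. The term for box 2 in the denominator is 2 * 2 - 2^2 = 0 as well, so the
formula gives 0/0 and the correlation is undefined rather than 0.
The same happens for every pair that includes box 2 or box 4 (box 1 and box 4, box 2 and
box 3, box 2 and box 4, box 3 and box 4): a square box has equal length and width, so its
values have zero variance and the denominator is 0.
So correlation puts box 1 and box 3 together (r = 1). Box 2 and box 4 share the 1:1 ratio, but
with only two values per box their correlation cannot be computed.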
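The correlation values can be checked with a short script. This is a sketch under stated assumptions: it uses the standard Pearson formula, the box numbering Box 1 (2,1), Box 2 (1,1), Box 3 (6,3), Box 4 (3,3), and a hypothetical helper `pearson` that returns `None` when the denominator is zero (one of the two boxes has equal length and width, so its values have zero variance).

```python
import math
from itertools import combinations

BOXES = {"Box 1": (2, 1), "Box 2": (1, 1), "Box 3": (6, 3), "Box 4": (3, 3)}


def pearson(x, y):
    """Pearson correlation of two equal-length vectors.

    Returns None when either vector has zero variance, because the
    formula then divides by zero and the correlation is undefined.
    """
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator_sq = (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    if denominator_sq == 0:
        return None
    return numerator / math.sqrt(denominator_sq)


correlations = {
    (a, b): pearson(BOXES[a], BOXES[b]) for a, b in combinations(BOXES, 2)
}
for pair, r in correlations.items():
    print(pair, r)
```

Only the pair Box 1 and Box 3 gives a defined value, and it equals 1. Every pair that includes a square box (Box 2 or Box 4) comes back as `None`, since two-component vectors with equal entries have no variance.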
Which proximity measure would you use to group the boxes based on their size?
To group the boxes based on their size we need to use the Euclidean distance, since it grows
with the difference in length and width.
The values of the boxes are as below:
Box 1(2,1) ; Box 2(1,1); Box 3(6,3) ; Box 4(3,3)
Substituting these values into the Euclidean distance formula gives the distance between each
pair of boxes. The smallest distance is 1, between box 1 and box 2. For box 2 and box 4 the
distance is sqrt(8), approximately 2.83, and for box 3 and box 4 it is 3.
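The pairwise Euclidean distances can be computed with a short script. It is a minimal sketch that assumes the box numbering above and the usual two-dimensional formula, d = sqrt((x1 - y1)^2 + (x2 - y2)^2). The names `BOXES`, `euclidean` and `distances` are illustrative.

```python
import math
from itertools import combinations

BOXES = {"Box 1": (2, 1), "Box 2": (1, 1), "Box 3": (6, 3), "Box 4": (3, 3)}


def euclidean(p, q):
    """Euclidean distance between two (length, width) vectors."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


# Distance for every unordered pair of boxes.
distances = {
    (a, b): euclidean(BOXES[a], BOXES[b]) for a, b in combinations(BOXES, 2)
}
for pair, d in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, round(d, 2))

closest_pair = min(distances, key=distances.get)
```

Sorting the pairs by distance shows which boxes are nearest in size. Box 1 and Box 2 are the closest pair, at distance 1.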
