wk5ppt2_Iris

Which flower species is it?
Building Models with Data
What makes this a good learning dataset?
• Clean dataset available: 4 numeric attributes with no missing values
• Target is 3 different species of flowers. Multi-class classification.
• Well known dataset
What software do I need?
• IDE to run Python
• Online: https://repl.it
• Code Editor: VS Code https://code.visualstudio.com/download
• Data Science Platform: Anaconda https://www.anaconda.com/distribution/
Data Source
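• Iris dataset as a .csv file with no header row (the link below points to Kaggle's Titanic competition and appears to be carried over from the companion Titanic deck)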
• https://www.kaggle.com/c/titanic
import pandas as pd

def load_data(url):
    '''
    Loads data into Python environment.
    Parameters: url of a .csv file with no header row
    Returns: dataframe
    '''
    # column names for the four measurements and the species label
    variables = ['sepal_len', 'sepal_w', 'petal_len',
                 'petal_w', 'class']
    df = pd.read_csv(url, names=variables)
    return df
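To make the loading step concrete, here is a minimal sketch of what `pd.read_csv(..., names=variables)` does with a headerless file. The three rows are illustrative sample rows held in memory (an assumption for this example, not the deck's data source), and `sample_csv` is a hypothetical name.

```python
import io

import pandas as pd

# Illustrative rows in the headerless layout that load_data expects
# (assumption: four measurements followed by the species label).
sample_csv = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

variables = ['sepal_len', 'sepal_w', 'petal_len', 'petal_w', 'class']

# names= supplies the header, so the first data row is not consumed as one
df = pd.read_csv(sample_csv, names=variables)

print(df.shape)
print(df.dtypes)
```

Because the file has no header row, leaving out `names=` would turn the first flower into column labels and silently drop one instance.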
Exploratory Data Analysis
1. Summary statistics
2. Data visualization
3. Data processing
Summary Statistics
• # of rows, # of features
• Frequency distribution
• Number of missing values

Variable        # of missing values
Sepal Width     0
Sepal Length    0
Petal Width     0
Petal Length    0
def summary_statistics(df):
    '''
    Generates summary statistics: the number of rows and columns,
    descriptive statistics, class frequencies, and missing-value counts.
    Parameters: dataframe
    Returns: none
    '''
    # shape (the column count includes the 'class' label column)
    print('Shape of dataframe: %d instances and %d columns' %
          (df.shape[0], df.shape[1]))
    # description
    print(df.describe())
    # class frequency
    print(df.groupby('class').size())
    # missing values
    print(df.isnull().sum())
    return
Flower Species    # of instances
setosa            50
versicolor        50
virginica         50

The Base Rate is 0.33 (50 of 150 flowers): the accuracy of always guessing the same species. Our model has to beat that.
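The base rate follows directly from the class counts in the frequency table. A small sketch of the arithmetic, using only the standard library; the variable names are hypothetical:

```python
from collections import Counter

# Class counts taken from the frequency table above
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50

counts = Counter(labels)
# With a three-way tie, most_common returns the first class encountered
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a "model" that always predicts the most frequent class
base_rate = majority_count / len(labels)
# Share of flowers that such a model would get wrong
null_error_rate = 1 - base_rate

print(f"Base rate: {base_rate:.2f}")
print(f"Null error rate: {null_error_rate:.2f}")
```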
Data Visualization
• Box Plot
• Histogram
• Scatter plot
• Correlation table
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

def visualize(df):
    '''
    Visualizes data using a box plot, histogram, scatter
    matrix, and correlation matrix.
    Parameters: dataframe
    Returns: none
    '''
    # box plot of each numeric feature
    df.plot(kind='box', subplots=True, layout=(2,2),
            showfliers=True, sharex=False, sharey=False)
    plt.show()
    # histogram - distribution
    df.hist()
    plt.show()
    # scatter matrix
    scatter_matrix(df)
    plt.show()
    print()
    # correlation matrix of the numeric columns ('class' holds strings)
    corr = df.corr(numeric_only=True)
    # style.background_gradient() only renders in a notebook, so print instead
    print(corr)
    return
Data Processing
• The dataset is already clean, so minimal processing is needed.
• All features will be selected
• Split into training and test sets
Split Data Set
• We have a small data set, so later on we will use 10-fold cross-validation to
get a more reliable estimate of model performance.
from sklearn import model_selection

def split_train_test(df):
    '''
    Splits available data into 80% training set, 20% test
    set.
    Parameters: dataframe
    Returns: training set - features and output, test set -
    features and output
    '''
    # 80% training set, 20% test set
    array = df.values
    X = array[:,0:4].astype(float)  # four numeric features
    Y = array[:,4]                  # species label
    n_test = 0.2
    seed = 7
    X_train, X_test, Y_train, Y_test = model_selection.train_test_split(
        X, Y, test_size=n_test, random_state=seed)
    return X_train, X_test, Y_train, Y_test
def k_fold_validation(models, X_train, Y_train):
    '''
    Performs 10-fold cross-validation and prints the mean and standard
    deviation of the accuracies.
    Parameters: list of (name, model) pairs, training set - features and output
    Returns: names, models, means, and standard deviations
    '''
    results = []
    means = []
    stds = []
    names = []
    scoring = 'accuracy'
    seed = 7
    for name, model in models:
        # newer scikit-learn releases require shuffle=True when random_state is set
        kfold = model_selection.KFold(n_splits=10, shuffle=True,
                                      random_state=seed)
        cv_results = model_selection.cross_val_score(model, X_train,
                                                     Y_train, cv=kfold,
                                                     scoring=scoring)
        results.append(cv_results)
        means.append(cv_results.mean())
        stds.append(cv_results.std())
        names.append(name)
        msg = '%s: %f (%f)' % (name, cv_results.mean(),
                               cv_results.std())
        print(msg)
    return names, models, means, stds
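To show what the 80/20 split and the 10 folds mean in terms of row counts, here is a standard-library sketch. It uses consecutive, unshuffled folds for simplicity, which is an assumption of this example; `k_fold_indices` is a hypothetical helper, not part of scikit-learn.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) lists for consecutive, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


n_total = 150                    # flowers in the dataset
n_test = n_total * 20 // 100     # 20% held out as the test set
n_train = n_total - n_test       # rows left for cross-validation

folds = list(k_fold_indices(n_train, 10))
for fold_number, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {fold_number}: fit on {len(train_idx)} rows, "
          f"validate on {len(val_idx)} rows")
```

Each training row is used for validation exactly once, so the ten accuracies together cover the whole training set while the test set stays untouched.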
Model Building = Equation
• Multi-Class Classification with only numeric variables
• Logistic Regression
• Linear Discriminant Analysis
• K Nearest Neighbor
• Decision Tree
• Random Forest
• Naïve Bayes
• Support Vector Machine
Model Building code
• Logistic Regression
• Linear Discriminant Analysis
• K Nearest Neighbor
• Decision Tree
• Random Forest
• Naïve Bayes
• Support Vector Machine
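from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def build_model(X_train, Y_train):
    '''
    Builds the list of candidate models: Logistic Regression, Linear
    Discriminant Analysis, KNN, Decision Tree, Random
    Forest, Naive Bayes, and Support Vector Machine.
    The models are not fitted here; the training set is unused.
    Parameters: training set - features and output
    Returns: list of (name, model) pairs
    '''
    models = []
    # one-vs-rest logistic regression (the multi_class argument is
    # deprecated in newer scikit-learn releases)
    models.append(('LR',
                   LogisticRegression(solver='liblinear',
                                      multi_class='ovr')))
    models.append(('LDA', LinearDiscriminantAnalysis()))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('RF',
                   RandomForestClassifier(n_estimators = 100,
                                          max_depth=5)))
    models.append(('NB', GaussianNB()))
    models.append(('SVM', SVC(gamma='auto')))
    return models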
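As a rough end-to-end sketch of how one entry from the model list passes through the split and 10-fold cross-validation steps, the example below uses scikit-learn's bundled copy of Iris (`load_iris`) instead of the deck's .csv loader. That substitution, the choice of KNN, and the lowercase variable names are assumptions made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled dataset stands in for the .csv file used in the deck
X, y = load_iris(return_X_y=True)

# Same 80/20 split and seed as split_train_test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7)

# 10 shuffled folds over the training set only
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(KNeighborsClassifier(), X_train, y_train,
                         cv=kfold, scoring='accuracy')

print('KNN: %f (%f)' % (scores.mean(), scores.std()))
```

The test set is not touched here; it is kept back for the final evaluation step.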
Estimation
• Gini Impurity
• The dimensions of the petal are more predictive than those of the sepal.

Feature         Gini Importance
Petal Width     0.46
Petal Length    0.42
Sepal Length    0.09
Sepal Width     0.03
def gini_impurity(models, X_train, Y_train, X_test, df):
    '''
    Examines feature importance using Gini impurity.
    Parameters: models, training set, test set, dataframe
    Returns: none
    '''
    # the random forest is the fifth entry in the models list
    random_forest = models[4][1]
    # names of the four feature columns
    keys = df.keys()[[0,1,2,3]]
    random_forest.fit(X_train, Y_train)
    # print (importance, feature) pairs, most important first
    print(sorted(zip(map(lambda x: round(x, 4),
                         random_forest.feature_importances_), keys),
                 reverse=True))
    return
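The importances printed by the random forest are based on how much each feature's splits reduce Gini impurity. A minimal sketch of that underlying quantity, assuming a hypothetical split that isolates all 50 setosa flowers from the other two species; `gini` is a hypothetical helper name:

```python
def gini(counts):
    """Gini impurity of a node: 1 minus the sum of squared class shares."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Class counts as [setosa, versicolor, virginica]
parent = [50, 50, 50]   # full dataset before any split
left = [50, 0, 0]       # hypothetical branch holding only setosa
right = [0, 50, 50]     # hypothetical branch holding the other two species

n = sum(parent)
# Impurity after the split, weighted by the share of rows in each branch
weighted_children = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
impurity_decrease = gini(parent) - weighted_children

print(f"parent impurity:   {gini(parent):.4f}")
print(f"after the split:   {weighted_children:.4f}")
print(f"impurity decrease: {impurity_decrease:.4f}")
```

A feature that produces large decreases like this across many trees ends up with a high importance score.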
Model Evaluation
• Run on training set
• Performance metric: Accuracy
• The Base Rate is the baseline accuracy we would get by predicting every flower to be setosa. Only
algorithms that beat this base rate will be considered.
• Base Rate = 0.33 (Null Error Rate = 1 - 0.33 = 0.67)
• Visualize in: Error Bars
Model Evaluation
• Error Bars show the mean cross-validated accuracy of each model, plus or minus one standard deviation.
def evaluate_error_bar(names, models, means, stds):
    '''
    Compares accuracy values with an error bar graph.
    Parameters: array of names, models, means, stds
    Returns: none
    '''
    # error bar: marker at the mean accuracy, bar of one standard deviation
    fig = plt.figure()
    fig.suptitle('Algorithm Comparison')
    ax = fig.add_subplot(1, 1, 1)
    # plotting against the names labels the x-axis ticks directly
    ax.errorbar(names, means, yerr=stds, linestyle='None',
                marker='^')
    ax.set_ylim(0.92, 1)
    plt.show()
    return
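Each error bar summarizes ten fold accuracies as a mean and a standard deviation. The sketch below reproduces that summary with the standard library. The fold accuracies are hypothetical values chosen for illustration, not results from the deck.

```python
import statistics

# Hypothetical accuracies from 10 folds (assumption, not measured results)
fold_accuracies = [1.0, 0.92, 1.0, 0.92, 1.0, 1.0, 0.83, 1.0, 1.0, 0.92]

mean_accuracy = statistics.mean(fold_accuracies)
# Population standard deviation, matching NumPy's default for .std()
std_accuracy = statistics.pstdev(fold_accuracies)

# The marker sits at the mean; the bar spans one standard deviation each way
lower = mean_accuracy - std_accuracy
upper = mean_accuracy + std_accuracy

print(f"mean = {mean_accuracy:.3f}, std = {std_accuracy:.3f}")
print(f"error bar from {lower:.3f} to {upper:.3f}")
```

Since accuracy cannot exceed 1, the upper end of a bar can extend past any achievable value when the mean is close to 1.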
Explanation
• Run on test set
• Performance metric: Accuracy, Recall, Precision, F1 score
• Visualize in: Confusion Matrix and Classification Report
import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

def test_set(X_train, Y_train, X_test, Y_test, models):
    '''
    Fits the random forest on the training set and runs it on the
    test set. Prints the confusion matrix and classification report.
    Parameters: training set and test set, list of (name, model) pairs
    Returns: none
    '''
    for name, model in models:
        # only the random forest is evaluated on the test set
        if name == 'RF':
            model.fit(X_train, Y_train)
            pred = model.predict(X_test)
            print('\n\n\n%s Accuracy: %.2f' % (name,
                  accuracy_score(Y_test, pred)))
            labels = np.unique(Y_test)
            confusion = confusion_matrix(Y_test, pred,
                                         labels=labels)
            print('\nConfusion Matrix:')
            print(pd.DataFrame(confusion, index=labels,
                               columns=labels))
            print('\nClassification Report:')
            print(classification_report(Y_test, pred))
    return
Explanation
Accuracy = 0.87
• Better than Base Rate = 0.33
• Precision
• Precision for setosa is perfect (1.00). This means that whenever the model
predicted setosa, it was right.
• Recall
• Recall for setosa is also perfect (1.00). This means that we correctly
identified all setosa flowers.
• F1 Score
• Harmonic mean of precision and recall. Here we see that we do a
better job at identifying setosa (F1 = 1.00) than the other two flower
species (F1 = 0.83 and 0.82).

Confusion Matrix (rows: Actual Class, columns: Predicted Class)

              setosa    versicolor    virginica
setosa        7         0             0
versicolor    0         10            2
virginica     0         2             9

Classification Report

              Precision    Recall    F1_Score
Setosa        1.00         1.00      1.00
Versicolor    0.83         0.83      0.83
Virginica     0.82         0.82      0.82
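The figures in the classification report can be re-derived from the confusion matrix on this slide. A short standard-library sketch of that arithmetic, with hypothetical variable names:

```python
labels = ["setosa", "versicolor", "virginica"]

# Confusion matrix from the slide: rows are actual, columns are predicted
confusion = [
    [7, 0, 0],
    [0, 10, 2],
    [0, 2, 9],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(labels)))
accuracy = correct / total

report = {}
for i, label in enumerate(labels):
    true_positives = confusion[i][i]
    predicted_as_label = sum(confusion[r][i] for r in range(len(labels)))
    actually_label = sum(confusion[i])
    precision = true_positives / predicted_as_label
    recall = true_positives / actually_label
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (precision, recall, f1)

print(f"accuracy = {accuracy:.2f}")
for label, (precision, recall, f1) in report.items():
    print(f"{label:<12} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The four versicolor and virginica flowers that are confused with each other account for all of the errors; setosa is separated perfectly.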