2. What makes this a good learning dataset?
• Clean dataset available: 4 numeric attributes with no missing values
• Target is 3 different species of flowers. Multi-class classification.
• Well-known dataset
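These properties can be checked directly. The following is a minimal sketch, assuming the copy of the Iris data bundled with scikit-learn (`load_iris`) rather than the CSV file loaded elsewhere in this deck; the variable names are illustrative.

```python
from sklearn.datasets import load_iris

# Assumption: use scikit-learn's bundled Iris data as a stand-in for the CSV file.
iris = load_iris(as_frame=True)
df = iris.frame  # four feature columns plus an integer 'target' column

n_features = df.shape[1] - 1              # every column except the target
n_missing = int(df.isnull().sum().sum())  # total missing cells
n_classes = df['target'].nunique()        # number of distinct species

print(n_features, n_missing, n_classes)
```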
3. What software do I need?
• IDE to run Python
• Online: https://repl.it
• Code Editor: VS Code https://code.visualstudio.com/download
• Data Science Platform: Anaconda https://www.anaconda.com/distribution/
4. Data Source
• Titanic dataset on Kaggle
• https://www.kaggle.com/c/titanic
import pandas as pd

def load_data(url):
    '''
    Loads data into the Python environment.
    Parameters: url of a header-less .csv file
    Returns: dataframe
    '''
    # The file has no header row, so the column names are supplied here.
    variables = ['sepal_len', 'sepal_w', 'petal_len',
                 'petal_w', 'class']
    df = pd.read_csv(url, names=variables)
    return df
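To see what the `names` argument does without downloading anything, here is a small sketch that reads two header-less rows from an in-memory buffer. The buffer and the two sample rows are assumptions for illustration; `pd.read_csv` accepts a file-like object as well as a URL.

```python
import io
import pandas as pd

# Two sample rows in a header-less layout: four measurements, then the class.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)
columns = ['sepal_len', 'sepal_w', 'petal_len', 'petal_w', 'class']

# names= assigns column labels and tells pandas the first row is data.
df = pd.read_csv(sample, names=columns)
print(df)
```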
6. Summary Statistics
• # of rows, # of features
• Frequency distribution
• Number of missing values

Variable        # of missing values
Sepal Width     0
Sepal Length    0
Petal Width     0
Petal Length    0
def summary_statistics(df):
    '''
    Generates summary statistics: the # of instances & features,
    the 5 # summary, class frequencies, and missing-value counts.
    Parameters: dataframe
    Returns: none
    '''
    # shape
    print('Shape of dataframe: %d instances and %d features' %
          (df.shape[0], df.shape[1]))
    # description
    print(df.describe())
    # class frequency
    print(df.groupby('class').size())
    # missing values
    print(df.isnull().sum())
    return
Flower Species   # of instances
setosa           50
versicolor       50
virginica        50

The Base Rate is 0.33 (50 / 150). Our model has to beat that.
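The base rate is simply the share of the most frequent class. A minimal sketch of that arithmetic, using the class counts from the frequency table (the variable names are illustrative):

```python
import pandas as pd

# Class counts taken from the species frequency table.
counts = pd.Series({'setosa': 50, 'versicolor': 50, 'virginica': 50})

# Accuracy of always guessing the most frequent class.
base_rate = counts.max() / counts.sum()
print('Base rate: %.2f' % base_rate)
```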
8. Data Processing
• Dataset is really neat, so minimal processing needed.
• All features will be selected
• Split into training and test sets
9. Split Data Set
• We have a small data set, so later on we will use 10-fold cross-validation to
get a more reliable estimate of model performance.
from sklearn import model_selection

def split_train_test(df):
    '''
    Splits available data into 80% training set, 20% test set.
    Parameters: dataframe
    Returns: training set - features and output, test set -
             features and output
    '''
    # 80% training set, 20% test set
    array = df.values
    X = array[:, 0:4]  # four numeric features
    Y = array[:, 4]    # species label
    n_test = 0.2
    seed = 7
    X_train, X_test, Y_train, Y_test = model_selection.train_test_split(
        X, Y, test_size=n_test, random_state=seed)
    return X_train, X_test, Y_train, Y_test
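With 150 rows, a 20% test set holds 30 flowers. The sketch below shows the resulting sizes, assuming scikit-learn's bundled Iris data as a stand-in for the dataframe. The `stratify` argument is an optional addition that split_train_test does not use; it keeps the three species equally represented in both sets.

```python
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: bundled Iris data instead of the CSV-based dataframe.
X, y = load_iris(return_X_y=True)

# stratify=y is an optional extra: it preserves the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y)

print(X_train.shape, X_test.shape)
print(Counter(y_test))
```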
def k_fold_validation(models, X_train, Y_train):
    '''
    Performs 10-fold cross-validation and prints the mean and standard
    deviation of accuracies.
    Parameters: array of models, training set - features and output
    Returns: names, models, means, and standard deviations
    '''
    results = []
    means = []
    stds = []
    names = []
    scoring = 'accuracy'
    seed = 7
    for name, model in models:
        # random_state only has an effect when shuffle=True; recent
        # scikit-learn versions raise an error if it is set without it.
        kfold = model_selection.KFold(n_splits=10, shuffle=True,
                                      random_state=seed)
        cv_results = model_selection.cross_val_score(
            model, X_train, Y_train, cv=kfold, scoring=scoring)
        results.append(cv_results)
        means.append(cv_results.mean())
        stds.append(cv_results.std())
        names.append(name)
        msg = '%s: %f (%f)' % (name, cv_results.mean(),
                               cv_results.std())
        print(msg)
    return names, models, means, stds
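As a compact illustration of the same idea, the sketch below runs 10-fold cross-validation for two of the models. It assumes scikit-learn's bundled Iris data and, for brevity, cross-validates on the full data rather than on a separate training set.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: bundled Iris data, full data set instead of a training split.
X, y = load_iris(return_X_y=True)
models = [('LDA', LinearDiscriminantAnalysis()),
          ('KNN', KNeighborsClassifier())]

# Ten shuffled folds; each fold is held out once for scoring.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

scores = {}
for name, model in models:
    cv_results = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
    scores[name] = cv_results
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
```

Each model yields ten accuracy values, one per held-out fold; the mean and standard deviation summarize them.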
10. Model Building = Equation
• Multi-Class Classification with only numeric variables
• Logistic Regression
• Linear Discriminant Analysis
• K Nearest Neighbor
• Decision Tree
• Random Forest
• Naïve Bayes
• Support Vector Machine
11. Model Building code
• Logistic Regression
• Linear Discriminant Analysis
• K Nearest Neighbor
• Decision Tree
• Random Forest
• Naïve Bayes
• Support Vector Machine
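from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

def build_model(X_train, Y_train):
    '''
    Builds the list of candidate models: Logistic Regression, Linear
    Discriminant Analysis, KNN, Decision Tree, Random
    Forest, Naive Bayes, and Support Vector Machine.
    Parameters: training set - features and output (not used here;
                the models are fitted later)
    Returns: list of (name, model) tuples
    '''
    models = []
    models.append(('LR',
                   LogisticRegression(solver='liblinear',
                                      multi_class='ovr')))
    models.append(('LDA', LinearDiscriminantAnalysis()))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('RF',
                   RandomForestClassifier(n_estimators=100,
                                          max_depth=5)))
    models.append(('NB', GaussianNB()))
    models.append(('SVM', SVC(gamma='auto')))
    return models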
12. Estimation
• Gini Impurity
• The dimensions of the petal are more predictive than those of the sepal.

Feature         Gini Index
Petal Width     0.46
Petal Length    0.42
Sepal Length    0.09
Sepal Width     0.03
def gini_impurity(models, X_train, Y_train, X_test, df):
    '''
    Examines feature importance using Gini impurity.
    Parameters: models, training set, test set, dataframe
    Returns: none
    '''
    # The Random Forest is the fifth entry in the models list.
    random_forest = models[4][1]
    # Names of the four feature columns (the fifth column is the class).
    keys = df.keys()[[0, 1, 2, 3]]
    random_forest.fit(X_train, Y_train)
    # Pair each importance with its feature name, largest first.
    print(sorted(zip(map(lambda x: round(x, 4),
                         random_forest.feature_importances_), keys),
                 reverse=True))
    return
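For context, the Gini impurity of a node is 1 minus the sum of the squared class proportions. scikit-learn's `feature_importances_` for a random forest are impurity-based: they reflect how much each feature reduces impurity across the trees, normalized to sum to 1. The helper below is a hypothetical illustration of the impurity formula itself, not part of the pipeline.

```python
def gini(counts):
    """Gini impurity 1 - sum(p_i ** 2) for a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

root = gini([50, 50, 50])  # all three species mixed equally
pure = gini([50, 0, 0])    # only one species present

print(round(root, 4), round(pure, 4))
```

A split is useful when it moves the child nodes from the mixed case toward the pure case.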
13. Model Evaluation
• Run on training set
• Performance metric: Accuracy
• Base Rate is the baseline accuracy if we predicted every flower as being setosa. Only
algorithms that beat this base rate will be considered.
• Base Rate = 0.33 (Null Error Rate = 1 - 0.33 = 0.67)
• Visualize in: Error Bars
14. Model Evaluation
• Error bars show the mean cross-validation accuracy of each model, plus or minus one standard deviation.
import matplotlib.pyplot as plt

def evaluate_error_bar(names, models, means, stds):
    '''
    Compare accuracy values with Error Bar graph.
    Parameters: array of names, models, means, stds
    Returns: none
    '''
    # error bar: mean accuracy +/- one standard deviation per model
    fig = plt.figure()
    fig.suptitle('Algorithm Comparison')
    ax = fig.add_subplot(1, 1, 1)
    # String x-values are plotted as categories, so the model names
    # become the tick labels.
    ax.errorbar(names, means, yerr=stds, linestyle='None',
                marker='^')
    ax.set_ylim(0.92, 1)
    plt.show()
    return
15. Explanation
• Run on test set
• Performance metric: Accuracy, Recall, Precision, F1 score
• Visualize in: Confusion Matrix and Classification Report
import numpy as np
import pandas as pd
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report)

def test_set(X_train, Y_train, X_test, Y_test, models):
    '''
    Fits the Random Forest model on the training data and runs the
    test data through it. Prints the accuracy, confusion
    matrix, and classification report.
    Parameters: training set and test set, array of models
    Returns: none
    '''
    for name, model in models:
        if name == 'RF':
            model.fit(X_train, Y_train)
            pred = model.predict(X_test)
            print('\n\n\n%s Accuracy: %.2f' % (name,
                  accuracy_score(Y_test, pred)))
            labels = np.unique(Y_test)
            confusion = confusion_matrix(Y_test, pred,
                                         labels=labels)
            print('\nConfusion Matrix:')
            # Rows are actual classes, columns are predicted classes.
            print(pd.DataFrame(confusion, index=labels,
                               columns=labels))
            print('\nClassification Report:')
            print(classification_report(Y_test, pred))
    return
16. Explanation
Confusion Matrix Classification Report
Accuracy = 0.87
• Better than Base Rate = 0.33
• Precision
• Precision for setosa is perfect (1.00). This means that if the model
predicted that the flower species is setosa, then it is always right.
• Recall
• Recall for setosa is perfect (1.00). This means that we correctly
identified all setosa flowers.
• F1 Score
• Harmonic mean of precision and recall. Here we see that we do a
better job at identifying setosa (F1 = 1.00) than the other two flower
species (F1 = 0.83 and 0.82).
Confusion Matrix (rows: actual class, columns: predicted class)

              setosa   versicolor   virginica
setosa             7            0           0
versicolor         0           10           2
virginica          0            2           9

Classification Report

              Precision   Recall   F1 Score
Setosa             1.00     1.00       1.00
Versicolor         0.83     0.83       0.83
Virginica          0.82     0.82       0.82
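The reported metrics can be recomputed by hand from the confusion matrix. The sketch below uses NumPy and the counts shown in the matrix; the variable names are illustrative.

```python
import numpy as np

labels = ['setosa', 'versicolor', 'virginica']
# Rows are actual classes, columns are predicted classes.
cm = np.array([[7, 0, 0],
               [0, 10, 2],
               [0, 2, 9]])

correct = np.diag(cm)                  # correctly classified per class
accuracy = correct.sum() / cm.sum()
precision = correct / cm.sum(axis=0)   # column sums: predicted per class
recall = correct / cm.sum(axis=1)      # row sums: actual per class
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print('Accuracy: %.2f' % accuracy)
for name, p, r, f in zip(labels, precision, recall, f1):
    print('%-10s %.2f %.2f %.2f' % (name, p, r, f))
```

Because the two misclassifications between versicolor and virginica are symmetric (2 each way), precision and recall coincide for those classes.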