Testing is how we estimate how well our machine learning models will perform out in the real world. The basics may seem obvious, but specific test metrics can help you emphasize performance on the parts of your application that are most important.
The previous part in this series (found here: https://www.youtube.com/watch?v=ahqWq6Gkwbw, https://blog.anant.us/spark-and-cassandra-for-machine-learning-data-pre-processing/) discussed data pre-processing methods. This part focuses on how we test the efficacy of our machine learning models to estimate how well they might generalize to real data.
The first part (found here: https://blog.anant.us/spark-and-cassandra-for-machine-learning-setup/) helps set up the environment we work in and discusses why we might want to use Spark and Cassandra here.
Code for the environment can be found here: https://github.com/HadesArchitect/CaSpark
Extra Notebooks and Datasets not included above can be found here: https://github.com/anomnaco/CaSparkExtension
Webinar Recording: https://youtu.be/mHFUJGntk78
Follow Us and Reach Us At:
Anant:
https://www.anant.us/home
Cassandra.Link:
https://cassandra.link/
Email:
solutions@anant.us
LinkedIn:
https://www.linkedin.com/organization...
Twitter:
https://twitter.com/anantcorp
Eventbrite:
https://www.eventbrite.com/o/anant-10...
Facebook:
https://www.facebook.com/AnantCorp/
Machine Learning with Spark and Cassandra - Testing
1. Machine Learning with Spark and Cassandra - Testing
Tests for Binary Classification Models, Regression Models, and Multi-class Classification Models
2. Series: Machine Learning with Spark and Cassandra
● Environment Setup
● Data Pre-processing
● Testing
● Validation
● Model Selection Tests
4. ● Tests are a statistical measure of how well our models work.
● Calculated by running a model on held-out data with known properties and comparing model predictions to known labels.
● Work differently for different types of ML models.
● An attempt to capture the potential performance on data the model will see in day-to-day operation.
6. When to test?
● Whenever we have a trained model, we can start testing. Depending on what we find and where we are, the test can have us proceeding on to next steps or returning to previous ones.
○ Sometimes we go back to tune the parameters of our model.
○ Sometimes we may want to pick a new algorithm to train altogether.
○ Other times we move forwards to more complex testing strategies or onwards to deployment.
● The same calculations for test statistics can also be a part of the mathematical process for training our model.
8. What data to test on.
● Always test on held-out data, never the same data that was used to train the model.
○ ML algorithms often involve optimizing those same statistics on the training dataset, so testing on the training set completely fails to tell us how the model generalizes to real data.
● Multiple methods exist for choosing the data to be held out; the choice should always be made randomly.
○ The simplest method is to split the data into two random chunks, train on one, and then test on the other.
○ We can also split into three chunks: one for training, one for testing, and one for final validation.
○ More complex schemes exist, to be covered next time in the talk on validation.
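The simplest scheme above can be sketched in plain Python as a random two-way split. This is an illustration, not the slides' code; in Spark one would typically reach for `DataFrame.randomSplit` instead, and the function name and fraction here are our own choices.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Randomly partition a dataset into a training chunk and a held-out test chunk."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = rows[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

data = list(range(100))
train, test = train_test_split(data)
print(len(train), len(test))  # 80 20
```

The shuffle is what makes the split random; slicing a sorted dataset would risk putting systematically different examples in the test set.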
10. ● Binary classifiers predict a value which has a boolean typing. Sometimes this means detecting the presence or absence of a particular thing, other times picking between two categories.
● In order to test our binary classification models we use something called a confusion matrix. It categorizes each prediction based on what value we predicted and what the actual value is.
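A confusion matrix can be tallied in a few lines. This is a minimal sketch assuming boolean labels encoded as 0/1; the function name and the cell keys are illustrative, not from the slides.

```python
def confusion_matrix(actual, predicted):
    """Count binary predictions into the four confusion-matrix cells."""
    cells = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p and a:
            cells["tp"] += 1      # predicted positive, actually positive
        elif p and not a:
            cells["fp"] += 1      # predicted positive, actually negative
        elif not p and not a:
            cells["tn"] += 1      # predicted negative, actually negative
        else:
            cells["fn"] += 1      # predicted negative, actually positive
    return cells

actual    = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
print(confusion_matrix(actual, predicted))  # {'tp': 2, 'fp': 1, 'tn': 2, 'fn': 1}
```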
11. ● We use these values to compute more meaningful metrics.
● The most commonly used is accuracy, computed as correct predictions divided by all predictions. It's a general measure of how likely we are to correctly predict a given example.
● Recall is computed as the number of correctly identified positive values divided by the number of actual positive values. It measures how well our model detects the presence of positive values.
● Precision is calculated as the number of correctly identified positive values divided by the number of positive predictions. It measures the reliability of a positive prediction.
● We can use recall and precision to calculate a composite value, the F1 score, their harmonic mean. If either recall or precision is low, the F1 score will also be small; it emphasizes the cost of incorrect predictions.
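The four metrics above fall directly out of the confusion-matrix counts. A minimal sketch (the function name is ours; we assume non-zero denominators for readability):

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, recall, precision, and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)            # correct positives / actual positives
    precision = tp / (tp + fp)         # correct positives / predicted positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, recall, precision, f1

acc, rec, prec, f1 = binary_metrics(tp=40, fp=10, tn=45, fn=5)
print(acc, prec)  # 0.85 0.8
```

Note how the harmonic mean drags F1 toward the smaller of precision and recall, which is exactly why a model can't hide a weak one behind a strong other.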
13. ● Regression models estimate functions and produce predictions in the form of scalar values. Classification tests do not work for them; instead we use the difference between predicted and actual values as a simple error metric.
● Adding raw error values without extra processing is a bad idea, since errors in different directions can cancel out.
● Instead we use metrics like the sum of squared errors (SSE), a simple measure that captures error over the entire test set.
● We can also use mean squared error (MSE), which in some cases is better since it is independent of the number of examples in the test set.
● Root mean squared error (RMSE) is sometimes preferable since it is expressed in the same units as our predictions rather than units squared, but still maintains many of the statistical features of the MSE.
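The three squared-error measures build on each other, so one small function can show all of them. A sketch with made-up numbers (the function name is illustrative):

```python
import math

def regression_errors(actual, predicted):
    """SSE, MSE, and RMSE for a set of regression predictions."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    sse = sum(squared)            # grows with the size of the test set
    mse = sse / len(squared)      # independent of the test-set size
    rmse = math.sqrt(mse)         # back in the units of the predictions
    return sse, mse, rmse

actual    = [3.0, 5.0, 2.0, 7.0]
predicted = [2.5, 5.0, 3.0, 6.0]
print(regression_errors(actual, predicted))  # (2.25, 0.5625, 0.75)
```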
14. ● We sometimes prefer absolute error measures to squared error measures, which we calculate by taking the absolute value of each error rather than squaring it.
● Large error values, and therefore outliers, are emphasized more by squared error measures.
● The absolute value function is not differentiable at zero, however, which makes gradients difficult to calculate.
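The outlier-sensitivity difference is easy to demonstrate: two prediction sets with the same total absolute error get the same MAE, but the one containing a single large outlier gets a much larger RMSE. A sketch with made-up data:

```python
import math

def mae(actual, predicted):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual  = [1.0, 1.0, 1.0, 1.0]
clean   = [1.5, 0.5, 1.5, 0.5]   # four small errors of 0.5
outlier = [1.0, 1.0, 1.0, 3.0]   # one large error of 2.0

# MAE treats both cases identically (total absolute error is 2.0 each):
print(mae(actual, clean), mae(actual, outlier))    # 0.5 0.5
# RMSE penalizes the single outlier more heavily:
print(rmse(actual, clean), rmse(actual, outlier))  # 0.5 1.0
```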
16. ● Multiclass classifiers predict a value that can take more than two, but still finitely many, possible values.
● We test them by building confusion matrices, similar to binary classification, but these cannot be turned directly into test metrics.
● We build an n-by-n grid, where n is the number of possible classes, and place each test result into a cell based on the predicted class and the actual class of the example.
● We can then turn that grid into n individual matrices, one for each class. We treat correct predictions of a particular class as true positives, and all other predictions are classified based on their relation to the class that the matrix is for.
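The grid and the per-class collapse can be sketched in plain Python. The function names and the toy labels are our own; the technique is the one described above: tally (actual, predicted) pairs, then reduce the grid to binary TP/FP/FN/TN counts for one class at a time.

```python
from collections import Counter

def multiclass_confusion(actual, predicted):
    """n-by-n confusion grid, keyed by (actual class, predicted class)."""
    return Counter(zip(actual, predicted))

def per_class_counts(grid, cls):
    """Collapse the n-by-n grid into binary TP/FP/FN/TN for one class."""
    tp = fp = fn = tn = 0
    for (a, p), count in grid.items():
        if a == cls and p == cls:
            tp += count
        elif p == cls:
            fp += count   # predicted this class, actually another
        elif a == cls:
            fn += count   # actually this class, predicted another
        else:
            tn += count   # neither actual nor predicted is this class
    return tp, fp, fn, tn

actual    = ["cat", "dog", "bird", "cat", "dog", "cat"]
predicted = ["cat", "cat", "bird", "cat", "dog", "bird"]
grid = multiclass_confusion(actual, predicted)
print(per_class_counts(grid, "cat"))  # (2, 1, 1, 2)
```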
17. ● From these per-class matrices, we can calculate our test metrics for each class. We can then combine these values in various ways based on what is important for our application.
● We can average the per-class scores, weighting each class's score equally (called a macro-average), or we can pool the per-class counts before computing the metric, which effectively weights each class by the number of examples it has (called a micro-average).
● A macro-average can act as a general score, though it may obscure very high or low performance on particular classes. If performance on a particular class is important, we may choose to micro-average or even look at the individual test scores.
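The two averaging strategies can be contrasted with precision as the metric. A sketch with made-up per-class counts (one large easy class, one small hard class) assuming non-zero denominators; the function name is ours:

```python
def precision_scores(class_counts):
    """Macro- and micro-averaged precision from per-class (tp, fp) counts."""
    # Macro: average the per-class precisions, each class weighted equally.
    macro = sum(tp / (tp + fp) for tp, fp in class_counts) / len(class_counts)
    # Micro: pool the counts first, so larger classes carry more weight.
    total_tp = sum(tp for tp, _ in class_counts)
    total_fp = sum(fp for _, fp in class_counts)
    micro = total_tp / (total_tp + total_fp)
    return macro, micro

counts = [(90, 10), (1, 9)]   # (tp, fp) per class
macro, micro = precision_scores(counts)
print(round(macro, 3), round(micro, 3))  # 0.5 0.827
```

The small class's poor precision (0.1) drags the macro-average down to 0.5, while the micro-average stays high because the large class dominates the pooled counts; this is the trade-off the bullet points above describe.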