4. These notes follow the LinuxAcademy course structure, which can be found here:
https://linuxacademy.com/course/aws-certified-machine-learning-specialty/. I would
recommend viewing the course for full and detailed explanations.
I have created these notes as part of my personal learning and hope to be able to
help and inspire others.
As I am also learning, there may well be mistakes. Please do reach out and let me
know so I can correct them.
Follow me on instagram: https://www.instagram.com/adnans_techie_studies/
All my notes are hosted here: https://adnan.study
Connect with me on LinkedIn: https://www.linkedin.com/in/adnanrashid1/
Overview
5.
6. • Machine learning gives computers the ability to learn without being explicitly programmed.
• It focuses on the development of programs that can access data and use it to learn for themselves.
• An example of ML is AlphaGo, which beat one of the best players of Go.
• Machine learning is when you load lots of data into a computer program and choose a model to
“fit” the data, which allows the computer (without your help) to come up with predictions.
• The way the computer makes the model is through algorithms, which can range from a simple
equation (like the equation of a line) to a very complex system of logic/math that gets the
computer to the best predictions.
What is Machine Learning?
Artificial Intelligence
• Advances in compute power have brought a new wave of artificial intelligence
• AI is being used to analyse big data
• Machine Learning is a subset of Artificial Intelligence
• Deep Learning is a subset of Machine Learning
7. [Figures: three plots with a HEIGHT axis, one of them marking the TESTING DATA]
This becomes our training data, which lets us infer height or weight when we have
one of the two values.
Once the data is plotted we can create a trend line that goes through the data set.
The trend line can then be used to predict the weight.
The blue line may appear to be a better fit for predicting weights than the straight line.
What is Machine Learning?
8. [Figure: plot with a HEIGHT axis; diagram labels: DATA, TRAIN, ALGORITHM, MODEL, PREDICTION, LINEAR REGRESSION]
We can use our test data to test the line and see how well it fits.
The differences are between the actual observed weights and the weights predicted by
the line. We then add those differences together to get the total difference between
the observed and predicted weights.
We could do the same for the curved line, which goes through the actual data. However,
it is overfit to our training data, which means it does not handle new data very well.
The green line is better to make predictions as it is more
generalised.
This green line represents our machine learning model.
This particular type of model is called linear regression.
We would use other models depending on what we are
trying to achieve i.e. Logistic Regression, Support Vector
Machines and Decision Trees.
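As a rough illustration of the straight-line model described above, the sketch below fits a least-squares line and measures the differences between the observed and predicted weights. The height and weight values are invented for the example, and the helper names `fit_line` and `predict` are hypothetical, not taken from the course.

```python
# Hypothetical height (cm) and weight (kg) pairs, invented for illustration.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
weights = [52.0, 58.0, 69.0, 77.0, 84.0]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(x, slope, intercept):
    """Predicted weight for a given height."""
    return slope * x + intercept


slope, intercept = fit_line(heights, weights)

# Differences between the observed weights and the line (the residuals).
residuals = [y - predict(x, slope, intercept) for x, y in zip(heights, weights)]
sum_squared = sum(r ** 2 for r in residuals)

print(slope, intercept, sum_squared)
```

A line that passes through every training point would give a `sum_squared` of zero on this data, yet could still predict new heights badly, which is the overfitting issue noted above.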
This is a very simplified view of what we are doing with machine learning.
We have only looked at two dimensions, which is easy to visualise, however when we
get beyond 3 it becomes more difficult. Having lots of dimensions is closer to
reality, and considering we cannot draw a 200-dimension graph, Machine Learning
can help towards solving these problems.
9. What is Deep Learning?
Deep learning is based on the principles of an organic brain, with the aim of getting machines to
learn in a similar way.
Neurons are chained together as a Neural Network with inputs and outputs.
Inside the neuron is an Activation Function
◦ A function of code
◦ It takes the inputs, decides what to do with the data, stores a value and passes it on
through the output
The neurons are usually all connected together
Examples of deep learning:
◦ Self-driving cars, object detection, decision making
◦ Object classification, visual search, face recognition
◦ Natural language processing, spam filters, Siri, Alexa, or Google Assistant
◦ Health care: MRI scans, CT scans, records analysis
10.
11. [Figure: machine learning lifecycle: COLLECT DATA, PROCESS DATA, SPLIT DATA, TRAIN, TEST, DEPLOY, INFER, PREDICTIONS, IMPROVE]
Table (columns annotated FEATURE and LABEL in the original figure):
AGE  STARSIGN  DRINKCOFFEE  LIKESCATS
20   3         YES          YES
25   1         NO           YES
33   2         NO           NO
42   6         YES          YES
The data collected in the real world will be in various different formats.
The first and primary task is to bring it into a format which our ML algorithm
will be able to understand.
In order to do this, we will also need to organise this data.
Machine Learning Lifecycle
This is a general process for a machine learning lifecycle.
We go through iterations of the lifecycle improving our inference.
Process Data
12. We may do the following to the data, depending on the data set.
Feature reduction
We want as much data as possible when training our model, however we don't want to
pass in data that is not related. This can be difficult, as you may be looking for
relationships in the data that you are not aware of.
Encoding
In the previous image, the star signs are numeric values, therefore the string
has been encoded. We could look up the data in a separate table.
Features and Labels
We are using these features to try and understand if people like penguins or not,
and this is our labelled data.
Feature Engineering
We may map the features down to between 0 and 1 so that features can be compared.
Formatting
The file format that we will use for providing the data to the ML algorithm.
We believe there is a relationship somewhere in this dataset, but we need a deep
understanding of the data.
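Mapping a feature into the 0 to 1 range, as mentioned under feature engineering above, is often done with min-max scaling. The sketch below is one common way to do it and is an assumption on my part, since the notes do not name a specific method; `min_max_scale` is a hypothetical helper.

```python
def min_max_scale(values):
    """Rescale numbers so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no spread to scale, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages = [20, 25, 33, 42]  # AGE column from the earlier example table
scaled_ages = min_max_scale(ages)
print(scaled_ages)
```

After scaling, a feature measured in years and one measured in centimetres sit on the same 0 to 1 scale, so neither dominates simply because of its units.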
Data
13. Split Data
Once we are happy with the dataset, we split it into sections:
◦ Training
◦ Validation
◦ Testing
Train
The algorithm:
◦ Can see and is directly influenced by the training data
◦ Uses, but is only indirectly influenced by, the validation data
◦ Does not see the testing data during training
Test
• Perform inference on the testing data to see how well the model fits
• Is it overfit?
The testing data wasn't used to train the model, because we are looking to see how well
the model works.
If the model is overfit, it will be really good at predicting data it has already seen.
However, what we want is to make inferences on similar data that it has not seen.
Deploy
Host the model in an execution environment according to the requirements:
◦ Batch
◦ As a service
◦ Infrastructure required
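The split into training, validation and testing data described above could be sketched as follows. The 70/15/15 proportions and the `split_dataset` helper are assumptions for illustration only; the notes do not specify any ratios.

```python
import random


def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle the rows and cut them into training, validation and testing sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed so the split is repeatable
    n_train = round(len(rows) * train_frac)
    n_val = round(len(rows) * val_frac)
    train = rows[:n_train]
    validation = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]  # never shown to the algorithm during training
    return train, validation, test


train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))
```

Shuffling before cutting matters when the rows are stored in some order (for example by date), otherwise the testing set may not resemble the training set.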
14. [Table: example labelled data]
AGE  STARSIGN  DRINKCOFFEE  LIKESCATS
20   3         YES          YES
25   1         NO           YES
33   2         NO           NO
42   6         YES          YES
29   3         YES          YES
[Figure: plot with a HEIGHT axis; groups labelled DOGS and CATS]
An example of supervised learning is the ability to show what it looks like when
someone likes cats.
This is another example of supervised data, where we can infer the weight based on the
height.
In this example we are providing different data, as opposed to just cats, which will
help train the model to be able to make inferences.
Supervised, Unsupervised, and Reinforcement
15. [Figure: reinforcement learning loop with labels SCORE, REWARD and ACTION]
In reinforcement learning we give the robot a reward based on its actions.
As an example, we want it to pick up a cat. If it selects one, we give it a reward
of +1, however if it picks up something else we remove that reward, in order to
try to reinforce a preference towards the cat.
Unsupervised learning involves finding relationships where we did not know there was one.
It is best used when we are trying to analyse data with lots of dimensions in order to
find relationships between the data points which we would not normally find using
conventional methods.
16. A variety of algorithms can be used, such as:
◦ Recurrent Neural Network
◦ Convolutional Neural Network
◦ Linear Regression
◦ Latent Dirichlet Allocation
◦ Support Vector Machines
Summary
Unsupervised learning involves looking for patterns when it is not initially evident
that there are any. It is best used with hundreds of dimensions, where it is not
possible to plot the data on graphs.
Supervised learning is where we use labelled data to classify unlabelled data using a
machine learning algorithm.
Reinforcement learning involves providing a reward when the model does something
correct and taking away the reward when it is incorrect. It involves a lot of trial and
error to get it right.
Supervised, Unsupervised and Reinforcement Learning
17. [Figures: two plots with a HEIGHT axis; one plot against SLOPE OF MODEL LINE]
Here we can get the sum of the residuals.
Some of the differences will be negative and some positive. If we square them before
adding them together, they will always be positive.
We can then add the squares of the residuals to get the sum and see how it differs
for different lines.
This line represents the machine learning model.
How do we know that this is the best line to fit the data?
We could have drawn it at different gradients.
This is a plot of the sum of the squared residuals vs the slope of the line.
The job of the machine learning algorithm is to find the lowest point of the parabola.
The bottom of this curve shows the line with the best fit, as it has the smallest
total difference.
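To make the curve described above concrete, the sketch below computes the sum of squared residuals for a handful of candidate slopes. The data points are invented so that they lie exactly on y = 2x, and the line is assumed to pass through the origin to keep the example to a single parameter.

```python
# Hypothetical data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def sum_squared_residuals(slope, xs, ys):
    """Sum of squared differences for the line y = slope * x (no intercept)."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys))


candidate_slopes = [0.0, 1.0, 2.0, 3.0, 4.0]
ssr_by_slope = {m: sum_squared_residuals(m, xs, ys) for m in candidate_slopes}

# The slope with the smallest sum sits at the bottom of the parabola.
best_slope = min(ssr_by_slope, key=ssr_by_slope.get)
print(ssr_by_slope, best_slope)
```

The sums rise symmetrically on either side of the best slope, which is the parabola shape the notes refer to.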
Optimization
18. [Figures: curves plotted against SLOPE OF MODEL LINE, annotated MINIMUM SLOPE, BEST FIT SLOPE FOR MODEL and STEP]
It is easy for us to see the bottom of the curve, but the computer needs to be able
to calculate it.
If you pick a point on the parabola you can calculate the slope at that point.
You can then tell whether you are heading towards a reduction or an increase in slope,
which tells you the gradient.
It is then possible to keep stepping until you get to the bottom of the graph.
This technique is called gradient descent.
Finding the bottom of the curve depends on the step size. If it is too large we might
miss the bottom of the graph, and if it is too small it would be inefficient.
This technique is used for Linear Regression, Logistic Regression and also Support
Vector Machines.
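A minimal sketch of gradient descent on the same kind of curve is shown below, assuming a one-parameter line y = slope * x and invented data on y = 2x. The step sizes of 0.01 and 0.04 are arbitrary choices to show a step that converges and one that overshoots.

```python
# Hypothetical data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def gradient(slope):
    """Derivative of sum((y - slope * x)^2) with respect to the slope."""
    return sum(-2 * x * (y - slope * x) for x, y in zip(xs, ys))


def gradient_descent(start_slope, step_size, steps):
    """Repeatedly step against the gradient, towards the bottom of the curve."""
    slope = start_slope
    for _ in range(steps):
        slope -= step_size * gradient(slope)
    return slope


found = gradient_descent(start_slope=0.0, step_size=0.01, steps=100)
overshoot = gradient_descent(start_slope=0.0, step_size=0.04, steps=20)
print(found)      # settles very close to 2.0
print(overshoot)  # the larger step jumps past the bottom and moves further away
```

With the larger step, every update lands on the opposite side of the minimum and further out than before, which is the "miss the bottom of the graph" case described above.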
19. Summary
• Sum of the residuals
◦ Draw a line through multiple data points and look at the difference between each
data point and the line
• Square the values
◦ We square the values because some are positive and some are negative (those
below the line).
◦ Once squared, we can add them to get an overall positive number: the sum of the
squared residuals
◦ We can then put those sums on a graph for a number of different lines and
see which one is the smallest, which is the line that best fits the data points
• Graph
◦ If we plot the sum of the squares vs the slope of the model line, we end up with a parabola.
The algorithm needs to find the lowest point.
◦ The bottom of the curve is where the slope of the curve is 0, and this is the best fit.
• Gradient Descent
◦ In order to find the bottom, the model picks a point, finds the gradient and
moves in the direction where it is less steep
◦ This technique is called gradient descent
• Learning Rate
◦ The step size sets the learning rate
◦ If the step size is too large it might miss the bottom of the graph, and too small is not
efficient.
• Important
◦ The other thing to bear in mind is that there might be multiple dips in the line
20. [Figure: plot with a HEIGHT axis]
Regularisation
• A technique used when our model does not fit real-world data that well.
• Looking at the graph, we can see that small differences can have a larger effect overall.
Your sample data may fit well, but real-world data generally does not fit so well
straight away.
Regularisation through regression
L1 Regularisation (Lasso Regression)
L2 Regularisation (Ridge Regression)
We apply regularisation when our model is overfit: it fits the training data really
well but does not generalise to real-world inference.
A method to fix this is to apply regularisation, and it is achieved through regression.
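The notes do not give formulas for L1 and L2 regularisation, so the sketch below uses the standard textbook penalties as an assumption: Ridge adds `lam * slope**2` to the sum of squared residuals and Lasso adds `lam * abs(slope)`. For a one-parameter line y = slope * x, both have a simple closed-form answer; `ridge_slope`, `lasso_slope` and `lam` are hypothetical names.

```python
import math

# Hypothetical data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

sxy = sum(x * y for x, y in zip(xs, ys))  # 60.0
sxx = sum(x * x for x in xs)              # 30.0


def ridge_slope(lam):
    """Slope minimising sum((y - m*x)^2) + lam * m^2 (L2 penalty)."""
    return sxy / (sxx + lam)


def lasso_slope(lam):
    """Slope minimising sum((y - m*x)^2) + lam * |m| (L1 penalty)."""
    shrunk = max(abs(sxy) - lam / 2, 0.0)
    return math.copysign(shrunk, sxy) / sxx


for lam in (0.0, 30.0, 60.0, 120.0):
    print(lam, ridge_slope(lam), lasso_slope(lam))
```

Both penalties pull the slope towards zero as `lam` grows; the L1 penalty can push it all the way to exactly zero, whereas the L2 penalty only shrinks it.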
21. [Figure: diagram labelled HYPERPARAMETERS, MODEL and PARAMETERS]
Hyperparameters
These are parameters we can set to tune a model.
Hyperparameters are external settings that we set before we train a model and that
influence how the training occurs.
Parameters are internal to the algorithm and get tuned during training.
Hyperparameters
◦ Learning rate
◦ Epochs
◦ Batch size
Learning Rate
◦ Determines the size of the step taken during gradient descent optimisation.
◦ It is usually set between 0 and 1.
Batch Size
◦ Batch size is the number of samples used to train at any one time.
◦ It could be all of the data, some of the data or a single sample.
◦ Another way to put it is batch, mini-batch or stochastic. It is often 32, 64 or 128.
◦ It is possible to calculate it based on infrastructure.
◦ It is also based on the amount of data you have.
◦ If you span multiple servers, you might use a batch size that splits over that
infrastructure.
Epochs
◦ The number of times the algorithm will process the entire data set.
◦ Each time it passes through the data, the intention is to improve the accuracy of the
algorithm.
◦ Common values are high numbers, as this is the number of times the algorithm will
pass over the data set.
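How batch size and epochs interact can be sketched with a bare training loop. No model is trained here; the loop only counts how many parameter updates would happen, and the values (10 samples, batch size 4, 3 epochs) are arbitrary assumptions.

```python
def iterate_batches(samples, batch_size):
    """Yield consecutive slices of the data, each at most batch_size long."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


samples = list(range(10))  # stand-in for 10 training samples
epochs = 3                 # full passes over the data set
batch_size = 4             # samples used per update

updates = 0
for epoch in range(epochs):
    for batch in iterate_batches(samples, batch_size):
        # A real trainer would compute the loss on this batch and
        # adjust the model's parameters here.
        updates += 1

print(updates)
```

Setting `batch_size` to `len(samples)` gives one update per epoch (batch), and setting it to 1 gives one update per sample (stochastic).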
22. [Figure: machine learning lifecycle: COLLECT DATA, PROCESS DATA, SPLIT DATA, TRAIN, TEST, DEPLOY, INFER, PREDICTIONS, IMPROVE]
[Figure: data split into TRAINING, VALIDATION and TESTING]
Cross Validation
Training data is seen by the training process and directly influences the model.
Validation data is not seen by the training process but indirectly influences the
model in order to tweak it.
The testing dataset is not seen by the training process but informs the user of success.
Cross validation is where we don't isolate the validation data. Instead we split
our training data into a number of partitions and use the different sections in turn
to perform the validation, to get a better fit for our model.
As a result we use all of the data for training and for validation, which is called
k-fold cross validation.
This technique can also be used to compare different algorithms and to validate
different data sets.
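The partitioning used in k-fold cross validation can be sketched as below. The helper `k_fold_splits` is hypothetical and works on row indices only; each fold takes one turn as the validation set while the rest is used for training.

```python
def k_fold_splits(n_samples, k):
    """Return k (training_indices, validation_indices) pairs."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    splits = []
    start = 0
    for fold in range(k):
        # Spread any leftover samples over the first few folds.
        size = fold_size + (1 if fold < remainder else 0)
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        splits.append((training, validation))
        start += size
    return splits


splits = k_fold_splits(n_samples=10, k=5)
for training, validation in splits:
    print(validation, training)
```

Across the five rounds every row is used for validation exactly once and for training four times.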
23.
24. [Table: example survey data]
NAME    COUNTRY    AGE  HEIGHT  STARSIGN  LIKECOFFEE
ADNAN   UK         33   170     VIRGO     YES
SARAH   BRAZIL     23   133     GEMINI    NO
ALEXA   USA        5    2       PISCES    NO
ALI     INDIA      39   175     SCORPIO   YES
SPARKY  AUSTRALIA  5    35      JEDI      YES
Feature Selection and Engineering
This example dataset can be used to understand if people like coffee or not.
The first thing to do is remove anything in the data set which has nothing to do
with the inference we are making. However, this requires specific domain knowledge in
order to establish whether we are taking away the correct features or not.
In this data set the name is not relevant and can therefore be removed. This also helps
make the algorithm more efficient, as it won't try to find a relationship between
someone's name and whether they like coffee. We need to be careful that we do not remove
a feature that would have been useful.
The result will be a model that is faster to train and also more accurate.
25. [Table: the data with NAME and STARSIGN removed]
COUNTRY    AGE  HEIGHT  LIKECOFFEE
UK         33   170     YES
BRAZIL     23   133     NO
USA        5    2       NO
INDIA      39   175     YES
AUSTRALIA  5    35      YES
[Figure: the full table from the previous slide, with an entry annotated LOOKS SUSPICIOUS]
[Table: AGE and HEIGHT combined into one engineered feature]
COUNTRY    AGEHEIGHT  LIKECOFFEE
UK         5.15       YES
BRAZIL     5.78       NO
USA        0.4        NO
INDIA      4.48       YES
AUSTRALIA  7          YES
[Table: a second example with a date and time feature]
HOUSE  CITY      DATE            COFFEECONSUMED
RED    BRISBANE  8/23/18 12:33   NO
GREEN  LONDON    9/12/18 07:10   YES
BLUE   DALLAS    10/22/18 10:50  NO
GREEN  LONDON    11/10/18 12:35  YES
RED    BRISBANE  12/09/18 16:07  YES
We may not be as interested in which day people drink coffee; the time of day might
be more relevant.
Another way to establish relevance is to check whether there is any correlation
between the label and the feature.
This also needs domain-level knowledge and trial and error.
Gaps and anomalies will also influence the data set, where we can either remove the
feature entirely or do imputation.
Another strategy is to engineer new features.
We may decide that there is a relationship between age and height and divide one by
the other to create a new column. It would then require running some training
to understand if it was effective.
It would also reduce the amount of data for the algorithm to analyse.
26. [Figure: DIMENSION REDUCTION, 3D scatter plot with SCORE axes]
Principal Component Analysis (PCA)
Three features are about the limit on a graph. Beyond that it starts to become
difficult to represent the data visually.
PCA looks for the aspects of the data which influence it the most, starting by finding
the central point of the data set.
Once we find that central point, the whole data set is moved so that it is centred
around the origin.
PCA makes it possible to see the relationships between the data.
It is an unsupervised algorithm used for dimension reduction.
Although we may lose some data, we need to maintain the principal components.
Using PCA it is possible to look at hundreds of dimensions.
27. [Figures: 3D scatter plots with SCORE 1 and SCORE 3 axes, annotated with PC1, PC2 and PC3]
PCA generally does this for us, but once the data set is captured we would need to
draw around it.
The longest length represents the largest variation of the data set, which is
principal component 1.
The next longest is principal component 2, followed by 3, which gives us the spread
of data which most influences the data set.
PCA looks for the aspects of the data which influence it the most, starting by finding
the central point of the data set.
Once we find that central point, the whole data set is moved so that it is centred
around the origin.
We do that by finding the mean value of score 1, score 2 and score 3.
We can then leave out the 3rd component.
We would expect our data to be spread across PC1.
PCA is often used as a data preprocessing step.
PC1 and PC2 are usually used to plot on graphs to see the relationship between the data.
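The centring step and the idea of PC1 as the direction of largest variation can be sketched in two dimensions, where the maths stays small enough to write out by hand. The four points are invented and lie on a straight line, so PC1 should follow that line exactly. This is a simplified illustration, not how a library implementation of PCA works internally.

```python
import math

# Hypothetical 2D data lying on the line y = 2x - 3.
points = [(2.0, 1.0), (3.0, 3.0), (4.0, 5.0), (5.0, 7.0)]
n = len(points)

# Step 1: find the central point (the mean) and move the data to the origin.
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
centred = [(x - mean_x, y - mean_y) for x, y in points]

# Step 2: measure how the centred data varies (2x2 covariance matrix).
var_x = sum(x * x for x, _ in centred) / (n - 1)
var_y = sum(y * y for _, y in centred) / (n - 1)
cov_xy = sum(x * y for x, y in centred) / (n - 1)

# Step 3: the largest eigenvalue of that matrix is the variance along PC1.
trace = var_x + var_y
det = var_x * var_y - cov_xy ** 2
largest = trace / 2 + math.sqrt(max(trace ** 2 / 4 - det, 0.0))

# Matching eigenvector (this shortcut assumes cov_xy is not zero).
direction = (cov_xy, largest - var_x)
length = math.hypot(*direction)
pc1 = (direction[0] / length, direction[1] / length)

explained = largest / trace  # share of total variation captured by PC1
scores = [x * pc1[0] + y * pc1[1] for x, y in centred]  # data projected on PC1
print(pc1, explained, scores)
```

Here PC1 captures all of the variation, so the second component could be dropped without losing anything; with real data some variation is always left in the components that are discarded.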
28. [Table: survey data with a missing AGE value]
NAME    COUNTRY    AGE  HEIGHT  STARSIGN  LIKECOFFEE
ADNAN   UK         33   170     VIRGO     YES
SARAH   BRAZIL          133     GEMINI    NO
ALEXA   USA        5    2       PISCES    NO
ALI     INDIA      39   175     SCORPIO   YES
SPARKY  AUSTRALIA  5    35      JEDI      YES
IMPUTE: MEAN, (33 + 5 + 39 + 5) / 4 = 20.5
NAME    COUNTRY    AGE  HEIGHT  STARSIGN  LIKECOFFEE
ADNAN   UK         33   170     VIRGO     YES
SARAH   BRAZIL     21   133     GEMINI    NO
ALEXA   USA        5    2       PISCES    NO
ALI     INDIA      39   175     SCORPIO   YES
SPARKY  AUSTRALIA  5    35      JEDI      YES
Missing Data
Imagine we had surveyed a number of people on the street and collected various data.
If we have missing data in that data set, we may need to calculate or impute a
value. One way we could do this is by taking the mean of all the other values
for that particular feature.
In this process, we are presenting the data for an ML algorithm to make an
inference, but we don't skew our data set by having no value or 0, which would
impact the ML model.
If we have too much data missing, it may be better to remove that feature entirely,
as it would be of little value, or to remove the row if that particular row is mostly
missing data.
We need domain-level knowledge to find outliers. You might have correct data that is
perhaps mixed up, i.e. animal ages and heights vs human ages and heights.
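The mean imputation described above can be sketched as follows, using the AGE column from the example with the one missing value represented as `None`. The helper name `impute_mean` is hypothetical.

```python
ages = [33, None, 5, 39, 5]  # AGE column; the second value is missing


def impute_mean(values):
    """Replace missing entries (None) with the mean of the known entries."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


imputed = impute_mean(ages)
print(imputed)
```

The filled-in value is (33 + 5 + 39 + 5) / 4 = 20.5, which the example table shows rounded to 21.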
29. We may have a dataset where we are looking for faults in a car engine. It could be that it is generally
fine, but there are a few reports of something faulty. As this is not a frequent occurrence, it is likely that
this particular data will become lost in a sea of other data. As a result, it may not be recognised by the
ML model.
There are a variety of strategies which can be taken to help:
• Try to source more data, because the thing you are looking for is not represented as well as you
would like.
• If it is not possible to get the data, another option is to oversample the data, but then faults will
likely look like whatever you have in your training data.
• We can synthesise the data to understand what can vary and affect the data set. That way the ML
algorithm can approximate the data.
• Finally, we can try a different algorithm. Often people use the same algorithm frequently because
they know that algorithm and understand it.
30. [Table: original data]
NAME    COUNTRY    AGE
ADNAN   UK         33
SARAH   BRAZIL     23
ALEXA   USA        5
ALI     INDIA      39
SPARKY  AUSTRALIA  5
[Table: one hot encoded countries, as shown in the figure]
COUNTRY    BRAZIL  AUSTRALIA  USA  UK
UK         0       0          0    1
BRAZIL     1       0          0    0
USA        0       0          1    0
AUSTRALIA  0       1          0    0
Label and One Hot Encoding
ML algorithms are mathematical constructs, therefore they do not work with strings
and instead need numbers.
So we can encode the names and also the countries as integers, which is label encoding.
The problem with doing this is that the ML algorithm won't understand that the number
is a country and that it does not need to look for a numerical relationship, but it
will still try to find one.
In this scenario one hot encoding comes into play, whereby new features are introduced into
the data set: each country becomes a feature, giving a table of 0's and 1's.
This matters because there should be no numerical relationship and no implied hierarchy
between the countries.
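The difference between label encoding and one hot encoding can be sketched as below, using the country names from the example. The helper names are hypothetical and the categories are simply sorted alphabetically, which is an arbitrary choice.

```python
countries = ["UK", "BRAZIL", "USA", "INDIA", "AUSTRALIA"]


def label_encode(values):
    """Map each distinct string to an integer."""
    categories = sorted(set(values))
    mapping = {category: i for i, category in enumerate(categories)}
    return [mapping[v] for v in values], categories


def one_hot_encode(values):
    """Turn each distinct string into its own 0/1 column."""
    categories = sorted(set(values))
    rows = [[1 if v == category else 0 for category in categories] for v in values]
    return rows, categories


labels, categories = label_encode(countries)
one_hot, _ = one_hot_encode(countries)

print(categories)
print(labels)   # integers imply an order between countries that does not exist
print(one_hot)  # exactly one 1 per row, so no order is implied
```

With label encoding, UK (3) looks "bigger" than BRAZIL (1); with one hot encoding each country is just a separate yes/no feature.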
31.
32. [Figure: YES/NO plotted against resting heart rate, axis from 65 to 110]
RESTINGHEARTRATE  70  88  65  89  78  61  69  98  82
LIKES CATS        Y   N   Y   N   Y   Y   N   N   N
Logistic Regression
In the example you can see someone's resting heart rate and whether they like cats or not.
You can visually see that a low heart rate indicates they do and a higher heart rate
indicates they do not.
Supervised ML algorithm
Data is provided along with example inferences.
It looks for patterns in the data set, with examples of what you are looking for, which
can only be a yes or no: a binary outcome.
Typically it is used to understand whether this data is an example of something or not.
33. [Figures: the resting heart rate data with a straight-line fit, and with a SIGMOID FUNCTION fit]
A way to do this would be to draw a line using linear regression to find the
best fit.
A problem with this is that there may be outliers which can skew the data
set and therefore lead to the wrong inferences.
Instead we could fit a sigmoid function, which does not skew the line like linear
regression but instead looks for the cut-off point between the yes and the no.
There are methods to fine tune this to understand what is most important.
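The sigmoid function and the cut-off idea can be sketched as below. The `cutoff` of 80 and the `steepness` of 0.5 are invented values for illustration; in logistic regression they would be learned from the training data rather than set by hand.

```python
import math


def sigmoid(z):
    """Squash any number into the range 0 to 1."""
    return 1 / (1 + math.exp(-z))


def probability_likes_cats(heart_rate, cutoff=80.0, steepness=0.5):
    """Hypothetical model: lower resting heart rates give a higher probability."""
    return sigmoid(-steepness * (heart_rate - cutoff))


def classify(heart_rate):
    """Turn the probability into a binary YES/NO answer."""
    return "YES" if probability_likes_cats(heart_rate) >= 0.5 else "NO"


for rate in (61, 78, 89, 98):
    print(rate, round(probability_likes_cats(rate), 3), classify(rate))
```

Because the curve flattens out at 0 and 1, a very extreme heart rate changes the output very little, which is why outliers skew this fit less than a straight line.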
34. [Table: example data, plotted in the figure with axes from 10 to 100 and 10 to 60]
LATITUDE  COFFEE CONSUMED
4         6
7         2
20        0
28        0
38        24
45        35
59        18
70        49
76        24
Linear Regression
The example data set might be the latitude of where you live on the planet and the
amount of coffee consumed.
We can then apply techniques to work out exactly where the line sits.
Although it does not go through any of the green points, it provides a generalised,
statistically valid answer.
Supervised model
We don't just provide the core data but also the output value that we would want
to infer later on.
An example inference is numeric, where the output is a range.
This can be used for financial forecasting, marketing effectiveness, risk evaluation
and more business-related tasks.
35. [Figures: two scatter plots (axes 10 to 100 and 10 to 40), one annotated SUPPORT VECTOR]
Support Vector Machines (SVM)
How do we best identify where we should draw lines? By identifying the
boundaries of our data sets.
We draw a hyperplane between the support vectors so that when we have a new data
point we can allocate it appropriately.
Supervised model
It is used to classify data.
It can be used for customer classification. As an example, if we already had a
classified data set we might want to identify the high value customers.
In this example we have 2 classifications, but we need to somehow draw a line in
order to classify new data points.
36. [Figure: example decision tree. ROOT NODE: LIKE WALKING (Y/N). INTERNAL NODE: LIKE RUNNING (Y/N). LEAF NODES: CAT PERSON, DOG PERSON. Decision types: BINARY (WALKS: Y/N), NUMERIC (DISTANCE in km) and CHOICE (COLOR: RED/GREEN), each leading to DOG or CAT]
Decision Trees
Supervised algorithm
We provide training data along with labels, which can be considered example inferences.
We ask the algorithm to look at this data and find the patterns. When we give it
unlabelled data, we ask it to discover how it fits.
It can be used for customer analysis and medical conditions.
Decision trees are essentially flow diagrams which have root nodes, internal
nodes and leaf nodes.
The root node is where things start. An internal node asks another question,
which flows down to another node, and the final node is the leaf node.
We don't need to have the same number of leaves across the branches.
Decision tree outputs can be binary or numeric, but are generally based on a
numeric question.
We can also use decision tree decision points to find out a choice, like a favourite colour.
37. [Figure: decision tree built from the table, with nodes RUNNING, WALKING and KMS]
LIKEWALKING  LIKERUNNING  KMSWALKED  FAVCOLOUR  TYPE
NO           YES          1          GREEN      DOG
NO           NO           2          BLUE       CAT
YES          YES          1          RED        DOG
YES          NO           1          GREEN      CAT
YES          NO           3          GREEN      DOG
YES          YES          4          BLUE       DOG
NO           NO           3          RED        CAT
Decision Trees
How do we start our root? We need to understand which feature aligns most
closely with the question we are asking.
In this example, when analysing the data set, we see that it is 'likes running' for who is a
dog person vs a cat person.
We then filter the data based on that feature and identify the next most
important feature, which in turn makes up the next node, and so on through the other
branches.
You may find that some of the features were not selected, as they had no correlation with the
question.
We won't see the actual decision tree when it is created, but we can give it new data and
categorise the data to see its behaviour.
38. [Figure: several decision trees with nodes RUNNING, WALKING and KMS; the majority output is DOG (DOG WINS)]
Random Forest
When you create a decision tree you need to know what question you will place in the root node.
For each node, the random forest considers 2 features chosen at random and follows down the branch.
We build the decision tree in this way and continue until we have a decision tree with a random
variance.
We repeat this entire process many times to build up a set of trees. We then run the new
data through all of the trees, look at the outputs and, based on the majority output,
label the new data.
Random forests are supervised algorithms.
We have pre-labelled the data and ask it to infer binary, classification or numeric outputs.
It is essentially a collection of decision trees.
The problem with a decision tree on its own is that it can be inaccurate.
Random forest is a way to make decision trees more accurate.
39. [Figures: scatter plots (axes 10 to 100) showing the centre points being moved; two panels labelled TOTAL VARIATION]
Unsupervised ML algorithm
Data is provided without example inferences.
It looks for patterns in the data set by grouping similar data points into clusters.
If we want to find 3 classes of data, the algorithm makes some random guesses and places 3 points
across the dataset.
It then goes through each data point and checks which centre point it is closest to, which groups
the data points by their closest centre. At this point the classification will be wrong, so it
moves each centre point to the middle of its class.
The algorithm then goes through the cycle again, including moving the centre points, until the
distributions make sense. We need to find an equilibrium where moving the centre points does not
affect the classification.
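The cycle described above (assign each point to its closest centre, move each centre to the middle of its points, repeat until nothing changes) can be sketched as below. The points are invented, and the starting centres are fixed rather than random so that the result is repeatable.

```python
import math


def closest(point, centres):
    """Index of the centre nearest to the point."""
    return min(range(len(centres)), key=lambda i: math.dist(point, centres[i]))


def k_means(points, centres, max_cycles=100):
    """Repeat assign-and-move until the centres stop changing."""
    assignments = []
    for _ in range(max_cycles):
        assignments = [closest(p, centres) for p in points]
        new_centres = []
        for i, centre in enumerate(centres):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                # Move the centre to the mean of the points assigned to it.
                new_centres.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centres.append(centre)  # no points assigned: leave it in place
        if new_centres == centres:
            break  # equilibrium: moving the centres no longer changes anything
        centres = new_centres
    return centres, assignments


# Hypothetical data with two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centres, assignments = k_means(points, centres=[(0.0, 0.0), (10.0, 10.0)])
print(centres, assignments)
```

With random starting centres the algorithm can settle on different clusters from run to run, so in practice it is usually run several times.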
K-Means
40. [Figure: ELBOW PLOT of reduction in variation against number of CLUSTERS (K), 1 to 8]
To find out how many clusters to use, we can graph the number of clusters against the
reduction in variation.
The first cluster variation will be at 0, and as we increase the number of clusters we will
eventually see an elbow in the plot, where the variation does not change much.
So after a certain number of clusters it is ineffective to add more, as the change in
variation is minimal.
41. [Figure: scatter plot (axes 10 to 100) with clusters and a new data point]
K-nearest neighbours takes into account the number of nearest neighbours to consider.
The 'k' is a number: how many neighbouring data points to take into account in order to
classify the new data point.
It should be large enough to reduce the influence of outliers, however small
enough that small clusters do not get overlooked.
Supervised algorithm
Used for classifying new data using data that is already classified.
K-means would have found some clusters within the dataset already, however the
challenge is to know which class to associate the new data point with.
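Classifying a new data point by looking at its k nearest neighbours could be sketched as below. The labelled points are invented and `knn_classify` is a hypothetical helper; it takes a majority vote among the k closest points.

```python
import math
from collections import Counter


def knn_classify(new_point, labelled_points, k):
    """Label a new point by majority vote among its k nearest neighbours."""
    by_distance = sorted(labelled_points, key=lambda item: math.dist(new_point, item[0]))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical data that has already been classified.
labelled = [
    ((1.0, 1.0), "cat"), ((1.0, 2.0), "cat"), ((2.0, 1.0), "cat"),
    ((8.0, 8.0), "dog"), ((9.0, 8.0), "dog"), ((8.0, 9.0), "dog"),
]

print(knn_classify((2.0, 2.0), labelled, k=3))
print(knn_classify((7.0, 7.0), labelled, k=3))
```

Using an odd k for a two-class problem avoids tied votes.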
K-Nearest Neighbour
42. [Figure: documents made up of words, with the words grouped into topics]
Latent Dirichlet Allocation (LDA)
Unsupervised Algorithm
It is used for classification and sentiment analysis.
It is a description of the way documents are constructed. If you have a number of
documents, those documents are made up of a number of different topics, along with
multiple words which can also be in multiple topics.
LDA does not understand what is written in the document, but it does statistical
analysis to get some idea of the content.
There are data preparation steps which are done before any processing, which involve removing
'stop words' such as 'and'. These words do not help towards understanding the content.
We then apply stemming, so that words such as learned, learning and learn are all condensed into a
single word, i.e. learn. Once this is complete, we can tokenise the words into an array.
Finally we choose the number of topics we want LDA to find, and this is K.
So we take all the words in our array and, if we select 3 topics to find, the algorithm randomly
assigns a topic number to each of the words.
43. [Table: how often each word appears in each topic. The figure is shown twice; the first copy labels the PYTHON row as LAMBDA]
WORD                    TOPIC 1  TOPIC 2  TOPIC 3
MACHINELEARNING         22       33       43
FUNRUN                  32       34       23
DEEPLEARNING            44       23       34
PYTHON                  51       43       23
STORAGE                 33       64       54
ARTIFICIALINTELLIGENCE  45       33       23
[Table: how often each topic appears in each document]
DOCUMENT         TOPIC 1  TOPIC 2  TOPIC 3
STORAGE          123      23       34
MACHINELEARNING  43       143      45
LAMBDA           24       35       132
[Worked example from the figure: 51 x 24 = 1224, 43 x 35 = 1505, 23 x 132 = 3036]
We then count how often each word appears in each topic.
Once that is complete, we check each document and how often each topic appears there.
We take the number of times a word appears in a topic and the number of times that topic
appears in a particular document, and multiply them together.
Whichever one comes out highest, we reallocate the word to that topic.
This happens as many times as necessary, until all the topics and words are settled
across the documents.
We can then see what those documents are mostly about.
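The multiply-and-reallocate step can be sketched with the numbers from the worked example in the figure (a word with topic counts 51, 43, 23 in a document with topic counts 24, 35, 132). This follows the simplified description in these notes; it is not a full LDA implementation, which would work with probabilities and random sampling.

```python
word_topic_counts = [51, 43, 23]       # how often the word appears in topics 1 to 3
document_topic_counts = [24, 35, 132]  # how often topics 1 to 3 appear in the document

# Multiply the two counts together for each topic.
scores = [w * d for w, d in zip(word_topic_counts, document_topic_counts)]

# Reallocate the word to whichever topic scores highest (topics numbered from 1).
best_topic = scores.index(max(scores)) + 1

print(scores, best_topic)
```

Even though the word appears least often in topic 3 overall, the document is so strongly about topic 3 that the word is reallocated there.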
44.
45. [Figure: a neuron with inputs, WEIGHTS, a BIAS and an ACTIVATION FUNCTION]
Neural Networks
On the left hand side is the input layer, followed by some hidden layers and then
an output layer.
Data is processed at each layer of the network and activated in order to get an
inference.
On the first layer, which is the input layer, we need to load data into all of the
inputs. As an example, if it was an image, each pixel would be loaded into one of
the inputs.
Random values are then allocated to the connections from the input neurons, and these
are referred to as weights.
These are the factors used to adjust the values before they get to the next layer.
Each input is multiplied by its weight and the results are added together to give a
value for the next neuron.
We also add a bias to the sum, and the result is passed to an activation function.
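The calculation for a single neuron (multiply each input by its weight, add the results, add the bias, apply the activation function) can be sketched as below. The input, weight and bias values are invented, and ReLU is used as the activation function as an assumption.

```python
def relu(x):
    """Activation function: pass positive values through, turn negatives into 0."""
    return max(0.0, x)


def neuron_output(inputs, weights, bias, activation=relu):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    weighted_sum = sum(i * w for i, w in zip(inputs, weights))
    return activation(weighted_sum + bias)


# Hypothetical values: three inputs, three weights and one bias.
output = neuron_output([1.0, 2.0, 3.0], [0.5, -0.25, 0.1], bias=0.2)
print(output)
```

In a full network this output becomes one of the inputs to every neuron in the next layer.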
46. [Figure: ACTIVATION FUNCTIONS: plots of RELU, SIGMOID and TANH; network diagram with weights (w) and biases (b); HOW CORRECT AM I?]
Three types of activation function are:
• ReLU
◦ Does not pass on any negative values (they become 0)
• Sigmoid
◦ Generally places values between 0 and 1
• Tanh
◦ Is similar to Sigmoid but also trends to negative 1 on the y axis.
If we plot the x value on the function, the y value is the activation that is passed on.
We do not tend to use Sigmoid or Tanh generally; ReLU is most commonly used.
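The three activation functions listed above can be written out in a few lines each. These are the standard mathematical definitions, sketched here with Python's `math` module.

```python
import math


def relu(x):
    """Negative inputs become 0; positive inputs pass through unchanged."""
    return max(0.0, x)


def sigmoid(x):
    """Output always lies between 0 and 1."""
    return 1 / (1 + math.exp(-x))


def tanh(x):
    """Output always lies between -1 and 1."""
    return math.tanh(x)


for x in (-5.0, 0.0, 2.5):
    print(x, relu(x), round(sigmoid(x), 4), round(tanh(x), 4))
```

For large negative inputs, ReLU gives exactly 0, Sigmoid approaches 0 and Tanh approaches -1.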
The bias is there to prevent our neuron from being deactivated. If the result was 0
then it would not influence anything - the more neurons you have turned off, the less
effective the network is.
At this point the output will be wrong, because the weights are still random. This
pass through the network is called forward propagation.
47. FORWARD PROPAGATION
b
w
N w
b w
LOSSFUNCTION
s w
E H w b s
EEEL EEE
b
BACK PROPAGATION
Once we get to the end we apply a loss function, which is an evaluation of how far the calculated
output is from the expected answer.
Back propagation then works backwards through the network. It uses gradient descent and a learning
rate to update the weights and biases so that the loss is reduced.
One full pass of forward and back propagation over the training data is an epoch, and repeating
this over many epochs is how the network learns.
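The loop of forward propagation, loss and weight updates can be sketched with a single neuron that has one weight and one bias. This is a simplified illustration, not a full back propagation implementation: it assumes a mean squared error loss, no activation function, and made-up training data that follows y = 2x + 1.

```python
def mean_squared_error(samples, weight, bias):
    """Loss function: average squared gap between prediction and known answer."""
    return sum((weight * x + bias - y) ** 2 for x, y in samples) / len(samples)


def train(samples, learning_rate=0.05, epochs=1000):
    """Repeat forward propagation and a gradient descent update for each epoch."""
    weight, bias = 0.0, 0.0  # fixed start for repeatability; a real network starts random
    n = len(samples)
    for _ in range(epochs):
        # Forward propagation: make a prediction for every sample.
        errors = [weight * x + bias - y for x, y in samples]
        # Gradients of the mean squared error with respect to weight and bias.
        grad_weight = 2 * sum(e * x for e, (x, _) in zip(errors, samples)) / n
        grad_bias = 2 * sum(errors) / n
        # Update step: move against the gradient, scaled by the learning rate.
        weight -= learning_rate * grad_weight
        bias -= learning_rate * grad_bias
    return weight, bias


# Hypothetical training data generated from y = 2x + 1.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
weight, bias = train(samples)
print(weight, bias, mean_squared_error(samples, weight, bias))
```

With a learning rate that is too large the updates overshoot and the loss grows instead of shrinking, which is why the learning rate is a value we tune.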
48. [Figure: an image classifier with the outputs CAT and DOG]
Convolutional Neural Networks (CNN)
Supervised Algorithm
Mainly used for classification, mostly
image classification and image detection.
The hidden layers inside the network are
known as the convolutional layers.
Images generally have particular
characteristics such as edges, feathers,
eyes and a beak if it was a penguin, for
example.
The different layers in the network will
work towards identifying these different
characteristics.
For an image, we would use a
convolutional filter.
We would take the first 9 pixels (3 x 3),
apply the filter to them and write the
result into a new image. We continue
this across the whole image.
This particular filter detects edges.
We can use multiple filters, and we can
reuse filters that were pre-trained by others,
which is called transfer learning.
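Sliding a 3 x 3 filter across an image can be sketched in plain Python as below. The image and the vertical-edge filter are made-up values for illustration; a real CNN learns its filter values during training and uses optimised library code.

```python
def apply_filter(image, kernel):
    """Slide a square filter over the image (no padding, one pixel at a time)."""
    size = len(kernel)
    out_rows = len(image) - size + 1
    out_cols = len(image[0]) - size + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Multiply each pixel in the window by the matching filter value and sum.
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(size)
                for j in range(size)
            )
            row.append(total)
        output.append(row)
    return output


# Hypothetical filter that responds to a bright-to-dark vertical edge.
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

# Hypothetical 4 x 5 greyscale image: bright on the left, dark on the right.
image = [
    [10, 10, 10, 0, 0],
    [10, 10, 10, 0, 0],
    [10, 10, 10, 0, 0],
    [10, 10, 10, 0, 0],
]

feature_map = apply_filter(image, kernel)
print(feature_map)  # [[0, 30, 30], [0, 30, 30]]
```

The flat region gives 0 and the windows that straddle the edge give a large value, which is how the new image highlights where the edge is.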
49. [Figure: a table of HOUR against ACTIVITY fed into an ML model that is not an RNN]
Recurrent Neural Network (RNN)
Supervised Algorithm
This can be used for stock predictions,
time series data and voice recognition.
This robot helps in various scenarios.
Let's say we do these repeated activities at
various times during the day. There is a linear
relationship here between the hour and the activity.
However, the next day we miss an activity at a
particular time and all the times after it
change.
The Not RNN model is not going to handle
this very well.
There is a pattern to the activities.
So on the left would be the input layer and
we would map it to the next layer.
We would imagine they all have a weight, but
the key part is that whatever the output is
becomes an input on the next round.
50. [Figure: an ML model with a memory loop that feeds its output back in]
The main thing is that we take the output and feed it back into the model, so it has a
memory of previous predictions which can influence future predictions.
Recurrent neural networks (RNN) can remember a bit.
Long short-term memory (LSTM) can remember a lot.
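Feeding the output back in as an input can be sketched with a single recurrent neuron. This is a toy illustration with hand-picked weights, not a trained RNN or an LSTM; setting the recurrent weight to 0 removes the memory so the two behaviours can be compared.

```python
import math


def run_recurrent_neuron(sequence, w_input, w_recurrent, bias=0.0):
    """Process a sequence one step at a time, feeding each output into the next step."""
    previous_output = 0.0
    outputs = []
    for value in sequence:
        previous_output = math.tanh(
            w_input * value + w_recurrent * previous_output + bias
        )
        outputs.append(previous_output)
    return outputs


# Hypothetical sequence: a single event followed by nothing.
sequence = [1.0, 0.0, 0.0]

with_memory = run_recurrent_neuron(sequence, w_input=1.0, w_recurrent=0.8)
without_memory = run_recurrent_neuron(sequence, w_input=1.0, w_recurrent=0.0)

print(with_memory)     # later outputs are still influenced by the first input
print(without_memory)  # later outputs forget the first input immediately
```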
51.
52. [Figure: three candidate algorithms: SVM, Decision Trees and Logistic Regression]
Confusion Matrix
Ability to visualise the output from the testing that we do
We can apply different algorithms to our data but the question would be, which algorithm is
best suited to our desired inference?
53. [Figure: confusion matrices, with model predictions as rows and known truths as columns]
                         KNOWN TRUTHS
                         LIKES DOGS        LIKES CATS (does not like dogs)
Predicted LIKES DOGS     True positives    False positives
Predicted LIKES CATS     False negatives   True negatives

SVM                      LIKES DOGS        LIKES CATS
Predicted LIKES DOGS     120               98
Predicted LIKES CATS     109               200

LOGISTIC REGRESSION      LIKES DOGS        LIKES CATS
Predicted LIKES DOGS     240               40
Predicted LIKES CATS     45                202
We can split our data into training and testing data and use Logistic Regression, SVM or
Decision Trees.
As we have labelled data, we can push the testing data through the models and get a result,
but we want to establish which is best suited to our scenario.
One of the tools for doing this is called a confusion matrix. This matrix maps the model
predictions on one side against the known truths on the other, so that we can see the accuracy.
You would see TP vs FP and FN vs TN.
Simply put, a false positive is where the model predicted they like dogs when they do not, and
a false negative is where the model predicted they do not like dogs when they do.
We would produce this confusion
matrix for each of the different
algorithms to be able to see
which algorithm performs
better.
It's not always clear which one
is better unless we
understand our question in
more detail; we would then
choose based on our
particular use case.
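Counting the four cells of a confusion matrix from labelled test data can be sketched as below. The six known truths and predictions are made-up values, and "dog" is arbitrarily treated as the positive class.

```python
def confusion_matrix(known_truths, predictions, positive="dog"):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for truth, predicted in zip(known_truths, predictions):
        if predicted == positive:
            # The model said positive: correct if the truth is positive too.
            counts["TP" if truth == positive else "FP"] += 1
        else:
            # The model said negative: a miss if the truth was actually positive.
            counts["FN" if truth == positive else "TN"] += 1
    return counts


# Hypothetical test data: what people really like vs what the model predicted.
known_truths = ["dog", "dog", "cat", "cat", "dog", "cat"]
predictions = ["dog", "cat", "cat", "dog", "dog", "cat"]

matrix = confusion_matrix(known_truths, predictions)
print(matrix)  # {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 2}
```

Running the same function on the predictions of each algorithm gives one matrix per algorithm to compare.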
54. [Figure: confusion matrix of predictions (YES / NO) against known truths (YES / NO), annotated with Sensitivity = TP / (TP + FN) and Specificity = TN / (TN + FP)]
Sensitivity and Specificity
Sensitivity is also known as the True Positive Rate (TPR) or Recall.
Specificity is also known as the True Negative Rate (TNR).
TPR is the correct positives out of the
actual positives.
TNR is the correct negatives out of the
actual negatives.
Banks are more interested in the sensitivity score since they are looking for fraudulent activities.
It is more important to catch fraud than to avoid falsely flagging a transaction - a false flag can be
fixed, for example by unblocking the account if it was not fraud. Therefore the ML model should favour higher sensitivity.
This is similar in medical scenarios too: if it turns out to be a false identification, the doctor
can use additional methods to verify.
Specificity matters, for example, when we have a child watching videos on YouTube. False positives
(unsuitable videos classified as suitable) are not acceptable: we can put up with videos that would
have been suitable not being shown, but displaying unsuitable content will cause issues.
Sensitivity = True Positives / (True
Positives + False Negatives)
The closer the sensitivity value is
to 1, the fewer actual positives the model misses.
Specificity = True Negatives / (True
Negatives + False Positives)
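As a worked example, the two formulas can be applied to the logistic regression confusion matrix from the earlier slide (240 true positives, 40 false positives, 45 false negatives, 202 true negatives). The function names are only for illustration.

```python
def sensitivity(true_positives, false_negatives):
    """True Positive Rate / Recall: correct positives out of the actual positives."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """True Negative Rate: correct negatives out of the actual negatives."""
    return true_negatives / (true_negatives + false_positives)


# Counts read from the logistic regression confusion matrix in these notes.
tp, fp, fn, tn = 240, 40, 45, 202

print(f"sensitivity = {sensitivity(tp, fn):.3f}")
print(f"specificity = {specificity(tn, fp):.3f}")
```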
55. Accuracy and precision
Accuracy is the proportion of all the predictions that were correct
Precision is the proportion of predicted positives that were actually positive
We need to be careful and technically precise about how we frame the question we are asking of
the model.
Accuracy = (TP + TN) / Total
Precision = TP / (TP + FP)
An accuracy of 100% usually means the model is overfit and needs to be more generalised.
A precision of 1 means there are no false positives
We can calculate the accuracy and precision for Logistic Regression and for Decision Trees, for example,
and then see the difference between them
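Comparing two algorithms on accuracy and precision can be sketched as below. It uses the SVM and logistic regression confusion matrices from the earlier slide, since those are the counts available in these notes; the dictionary layout is just one way to organise them.

```python
def accuracy(tp, fp, fn, tn):
    """Proportion of all predictions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    """Proportion of predicted positives that were actually positive."""
    return tp / (tp + fp)


# Counts read from the confusion matrices earlier in these notes.
models = {
    "SVM": {"tp": 120, "fp": 98, "fn": 109, "tn": 200},
    "Logistic Regression": {"tp": 240, "fp": 40, "fn": 45, "tn": 202},
}

scores = {
    name: {
        "accuracy": accuracy(**counts),
        "precision": precision(counts["tp"], counts["fp"]),
    }
    for name, counts in models.items()
}

for name, result in scores.items():
    print(f"{name}: accuracy={result['accuracy']:.3f} precision={result['precision']:.3f}")
```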
56. [Figure: logistic regression graphs of the probability of liking coffee (y axis) against values from 10 to 100 (x axis), with a cut-off line that is moved up to increase specificity, and a confusion matrix of predictions against known truths (likes coffee / does not like coffee)]
ROC/AUC
If we consider a logistic regression
graph for a binary situation, i.e. likes
coffee vs does not, we can then model
this behaviour. It must, however, be
binary, and we also need to identify
where the cut-off is actually located.
If we move the line up, we are
increasing specificity, which means we do
not want any negatives incorrectly classified as positive.
If we move it down then we are
increasing sensitivity: we do not mind if
some people are captured as false
positives, as long as the true positives are captured,
because we can address the false positives with
further checks and balances later.
The question is where we draw this line, and that depends on what we want to show.
The other consideration is where the best balance between sensitivity and specificity lies.
Going to either extreme is not useful, as the model would then always return the same result.
The confusion matrix can be used to
identify where that line should be, by
counting TP and TN, and likewise
FN and FP, at each position.
In this example there is some test data
that has been labelled as likes to drink
coffee vs does not.
Everything on the right of the vertical line
will be classified as liking coffee and
everything on the left as not liking it.
57. [Figure: three copies of the likes coffee / does not like coffee graph with the cut-off line at different heights, each alongside its confusion matrix]
We could have repeated the above at a variety of different cut-off points.
Here we can see we correctly identified
all 5 as liking coffee and we got 3
true negatives.
We did however end up with 2 false
positives and 0 false negatives.
In this example, if we now move the
horizontal line up, we can see the
results of the confusion matrix
change.
We misclassified one point as
negative and one point as positive.
In this example we move the horizontal
line further up.
We ended up capturing 3 true
positives and all 5 of the true negatives.
We did end up with 2 false negatives
and no false positives.
We now have a selection of confusion matrices. We now need to understand what we
do with these.
What is the best cut-off point across all our data?
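Moving the cut-off and rebuilding the confusion matrix each time can be sketched as below. The ten probabilities are made-up values, chosen so that the three cut-offs reproduce the counts described above (5 people who like coffee and 5 who do not); the cut-off values themselves are also assumptions.

```python
def confusion_at_cutoff(probabilities, likes_coffee, cutoff):
    """Classify everything at or above the cut-off as 'likes coffee' and count the cells."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for probability, truth in zip(probabilities, likes_coffee):
        predicted = probability >= cutoff
        if predicted:
            counts["TP" if truth else "FP"] += 1
        else:
            counts["FN" if truth else "TN"] += 1
    return counts


# Hypothetical model outputs: first five people like coffee, last five do not.
probabilities = [0.95, 0.90, 0.80, 0.55, 0.45, 0.60, 0.40, 0.30, 0.20, 0.10]
likes_coffee = [True] * 5 + [False] * 5

results = {
    cutoff: confusion_at_cutoff(probabilities, likes_coffee, cutoff)
    for cutoff in (0.35, 0.50, 0.70)
}

for cutoff, counts in results.items():
    print(cutoff, counts)
```

The low cut-off catches every coffee drinker at the cost of false positives, and the high cut-off removes the false positives at the cost of missing some coffee drinkers.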
58. [Figure: confusion matrix (5 true positives, 2 false positives, 0 false negatives, 3 true negatives) giving True Positive Rate (sensitivity) = 5 / (5 + 0) = 1 and False Positive Rate = 2 / (2 + 3) = 0.4; ROC graphs of TPR against FPR marking the best model with max sensitivity, the best model with max specificity, and the AUC]
This is where ROC / AUC comes
into play.
If we have the FPR and
TPR from our calculations of the
confusion matrix, we can then
plot our results on a graph.
The line at the top is the ROC
(Receiver Operating Characteristic) curve.
The point where the upper
slope meets the flat line at the top is the cut-off point
for max sensitivity, and the start of the slope
is the best model for max specificity.
In both cases we need to identify where
on the graph the curve effectively changes
direction.
AUC is the area under the curve and it represents
how well the model overall distinguishes
between the different classes. The larger
the area under the curve, the better it is at distinguishing.
ROC is useful to understand the balance between sensitivity and
specificity, and the AUC to understand the overall separability between the classes.
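Turning a set of confusion matrices into ROC points and an AUC value can be sketched as below. It uses the three coffee matrices described in these notes, adds the two end points (0, 0) and (1, 1) as an assumption so the curve spans the whole axis, and estimates the area with the trapezoid rule.

```python
def roc_point(tp, fp, fn, tn):
    """Return (FPR, TPR) for one confusion matrix."""
    true_positive_rate = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    return false_positive_rate, true_positive_rate


def area_under_curve(points):
    """Trapezoid-rule area under the points, ordered by FPR."""
    ordered = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# The three confusion matrices from the coffee example (high, middle, low cut-off).
matrices = [
    {"tp": 3, "fp": 0, "fn": 2, "tn": 5},
    {"tp": 4, "fp": 1, "fn": 1, "tn": 4},
    {"tp": 5, "fp": 2, "fn": 0, "tn": 3},
]

# Assumed end points: classify nothing as positive, and classify everything as positive.
points = [(0.0, 0.0)] + [roc_point(**m) for m in matrices] + [(1.0, 1.0)]
auc = area_under_curve(points)

print(points)
print(f"AUC = {auc:.2f}")
```

A model that guesses at random would sit on the diagonal with an AUC of about 0.5, so the closer the value is to 1 the better the classes are separated.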
59. [Figure: Gini impurity worked example]
Gini impurity = 1 - (probability of dog)^2 - (probability of cat)^2
Sample of the data:
WALKING   RUNNING   COLOR   TYPE
NO        YES       GREEN   DOG
NO        NO        BLUE    CAT
YES       YES       RED     DOG
YES       NO        GREEN   CAT
YES       NO        GREEN   DOG
YES       YES       BLUE    DOG
NO        NO        RED     CAT
Split on LIKES WALKING:
Yes: 120 people (Dog 97, Cat 23), Gini impurity 0.310
No: 98 people (Dog 30, Cat 68), Gini impurity 0.425
Weighted average = (Yes likes walking / all people) x Gini impurity (Yes) + (No likes walking / all people) x Gini impurity (No)
FEATURE           WEIGHTED GINI IMPURITY
LIKES WALKING     0.362
LIKES RUNNING     0.384
FAVORITE COLOR    0.371
Gini Impurity
In decision trees, the algorithm goes through the data looking for the feature that gives
the best split. This can be calculated in various ways and Gini impurity is one of them.
What splits the data best? We need to look at each of the features.
Likes walking has the lowest weighted Gini impurity, so it best separates people who like dogs
from people who like cats. We will use likes walking as our root node.
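The Gini calculation for the likes walking split can be reproduced with the counts from the slide (Yes: 97 dog, 23 cat; No: 30 dog, 68 cat). The function names are only for illustration.

```python
def gini_impurity(class_counts):
    """1 minus the sum of squared class probabilities for one branch."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def weighted_gini(branches):
    """Average the branch impurities, weighted by how many people fall in each branch."""
    everyone = sum(sum(branch) for branch in branches)
    return sum(sum(branch) / everyone * gini_impurity(branch) for branch in branches)


# (dog, cat) counts for people who like walking and people who do not.
likes_walking_yes = (97, 23)
likes_walking_no = (30, 68)

print(round(gini_impurity(likes_walking_yes), 3))  # 0.31
print(round(gini_impurity(likes_walking_no), 3))   # 0.425
print(round(weighted_gini([likes_walking_yes, likes_walking_no]), 3))  # 0.362
```

Repeating this for every feature and picking the lowest weighted value is how the root node is chosen.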
60. [Figure: F1 as a combination of recall and precision]
F1 = 2 / (1 / Recall + 1 / Precision) = 2 x (Recall x Precision) / (Recall + Precision)
              LOGISTIC REGRESSION   DECISION TREES
SENSITIVITY   0.842                 0.864
SPECIFICITY   0.835                 0.824
ACCURACY      0.839                 0.844
PRECISION     0.857                 0.839
F1 Score
F1 is a combination of recall and precision; it takes both the false positives and
the false negatives into consideration in the calculation
F1 score is often a better measure of a model than accuracy•
Accuracy = (TP + TN) / Total•
Whenever you see F1, it is discussed in terms of recall rather than sensitivity, which is why it is mentioned in that•
manner here.
If you have an uneven class distribution then F1 tends to be a better way to analyse the model•
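The sketch below computes F1 from confusion matrix counts and shows why it can be more informative than accuracy on uneven classes. The imbalanced counts (10 positives in 1,000 records) are made-up values for illustration.

```python
def accuracy(tp, fp, fn, tn):
    """Proportion of all predictions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp, fp, fn):
    """2 x (recall x precision) / (recall + precision), or 0 if nothing was found."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * (recall * precision) / (recall + precision)


# Logistic regression counts from the confusion matrix earlier in these notes.
balanced_f1 = f1_score(tp=240, fp=40, fn=45)

# Hypothetical imbalanced data: only 10 of 1,000 records are positive and
# the model finds just one of them.
imbalanced_accuracy = accuracy(tp=1, fp=0, fn=9, tn=990)
imbalanced_f1 = f1_score(tp=1, fp=0, fn=9)

print(f"balanced F1 = {balanced_f1:.3f}")
print(f"imbalanced accuracy = {imbalanced_accuracy:.3f}, F1 = {imbalanced_f1:.3f}")
```

In the imbalanced case the accuracy looks excellent even though nine out of ten positives were missed, whereas the F1 score stays low.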
61.
62. AWS Services, ML and DL Frameworks
An algorithm such as CNN along with a framework such as MXNet, put together, make up•
the model, which is then trained to create inferences
TensorFlow has been developed by Google and powers suggested videos, spam filtering etc.•
AWS have done considerable work with MXNet and SageMaker. MXNet is very good at scaling•
across cloud infrastructure.
PyTorch is the runner-up to TensorFlow and is well established in machine learning, and SciKit-Learn is an easier•
framework to use and natively has support for many algorithms.
TensorFlow
Ability to put placeholders for values in the tf graph when running•
various models
PyTorch
Deep learning framework built on top of Python•
PyTorch is often considered better for recurrent neural networks•
TensorFlow has a graph so it can see where a value came from for back•
propagation
PyTorch needs to keep track of what happens so it can improve the model•
The autograd feature stores where calculations come from•
MXNet
Used most by SageMaker•
Shares an architecture similar to PyTorch rather than TensorFlow•
NDArray is similar to the NumPy array (np array) and is the tensor for MXNet•
MXNet is aware of the hardware it runs on, so it can see GPU and CPU•
We need it to record and watch the tensor for back•
propagation and we do this with autograd
SciKit-Learn
It has a number of datasets built into it already, such as the Digits dataset•
Getting the data, formatted and enough of it, is the biggest challenge in•
ML
63. AWS Services
[Figure: AWS Glue Data Catalog; Kinesis, S3, Athena (SQL) and SageMaker; Glue schemas over several S3 buckets; camera video through Kinesis Video Streams to Rekognition; mobile, SNS, Lambda and Kinesis Data Streams]
Glue
AWS Glue Crawler can create a database•
definition from the data stored in S3
Glue does not store any data but makes•
connections to data stores, including over JDBC or to
DynamoDB
It can perform some ETL tasks and has some•
ML capabilities
The ML algorithm can de-duplicate table•
record sets
We can load up CSVs and essentially grab•
the schema of the dataset
We can also glue together different•
datasets to have a single view
Athena
It provides an SQL interface into S3•
Source data from multiple S3 locations•
Athena looks at the schema of the data,•
which comes from Glue
We can do feature engineering from the•
original dataset to then use for analysis
or to train our algorithm
Kinesis
Used for ingesting large amounts of data, which might come•
from a few or many data sources
It consists of Video Streams, Data Streams, Data Firehose and•
Data Analytics
Video Streams allows streaming video from•
connected devices for analytics, ML and other
processing.
Data Streams is a catch-all, general endpoint•
to ingest large quantities of data, which we might
send on to EC2 instances that do the logic or to
other services like Spark on EMR. However it is
more complex to configure.
Data Firehose is an endpoint to stream data into•
S3, Redshift, Elasticsearch (ES) or Splunk.
Data Analytics can process streaming data from•
Kinesis Streams or Firehose at scale using SQL.
64. [Figure: Kinesis Firehose, Athena, SageMaker and other services around S3; an EMR cluster with a master node, core nodes and task nodes]
S3
Cost effective storage for large amounts of data•
Structured data•
CSV◦
JSON◦
Unstructured data•
Text files◦
Images◦
Data lake•
Add data from many sources◦
Define the data schema at the time of◦
analysis
Much lower cost than data warehouse◦
solutions
Unsuitable for transactional systems◦
Needs cataloguing before analysis◦
QuickSight
Business Intelligence (BI) tool•
Visualise data from many sources•
Dashboards◦
Email reports◦
Embedded reports◦
End user targeted•
EMR
Managed service for hosting massively•
parallel compute tasks.
Integrates with the S3 storage service•
Petabyte scale•
Uses 'big data' tools like Spark, Hadoop,•
HBase
65.
66. Amazon Rekognition
Image moderation
Facial analysis
Celebrity recognition
Face comparison
Text in image
Use Cases
Create a filter to prevent inappropriate images being sent via a messaging platform. This can
include nudity or offensive text.
Enhance metadata catalog of an image library to include the number of people in each image
Scan an image library to detect instances of famous people
67. [Figure: S3 -> Lambda -> Rekognition (which GETs the video from S3) -> SNS -> SQS -> Lambda]
Amazon Rekognition Video
In this example we start off by storing a video in an S3 bucket
We have a Lambda function which is invoked by the new object event
The Lambda function calls Rekognition, which goes to the S3 bucket and gets the video
Rekognition processes the video and sends a message to an SNS topic on completion,
which is then written to an SQS queue.
Another Lambda function sees the message in the queue and goes to Rekognition to get the
results of the completed job.
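The first Lambda function in this flow might look roughly like the sketch below. It assumes the boto3 Rekognition call `start_label_detection` (one of several Rekognition Video start operations) and passes the client in as a parameter so the logic can be tried without AWS; the bucket, topic and role names and the `FakeRekognition` stand-in are hypothetical.

```python
from urllib.parse import unquote_plus


def start_video_analysis(event, rekognition, sns_topic_arn, role_arn):
    """Read the new S3 object from the event and start a Rekognition Video job."""
    s3_info = event["Records"][0]["s3"]
    bucket = s3_info["bucket"]["name"]
    # Object keys in S3 event notifications are URL-encoded.
    key = unquote_plus(s3_info["object"]["key"])
    response = rekognition.start_label_detection(
        Video={"S3Object": {"Bucket": bucket, "Name": key}},
        # Rekognition publishes the completion message to this SNS topic.
        NotificationChannel={"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
    )
    return response["JobId"]


class FakeRekognition:
    """Hypothetical stand-in for boto3.client('rekognition'), for local testing only."""

    def __init__(self):
        self.calls = []

    def start_label_detection(self, **kwargs):
        self.calls.append(kwargs)
        return {"JobId": "example-job-id"}


# Hypothetical S3 'new object' event with placeholder names.
event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-video-bucket"},
                "object": {"key": "uploads/my+video.mp4"}}}
    ]
}

fake_client = FakeRekognition()
job_id = start_video_analysis(
    event,
    fake_client,
    sns_topic_arn="arn:aws:sns:REGION:ACCOUNT_ID:example-topic",
    role_arn="arn:aws:iam::ACCOUNT_ID:role/example-role",
)
print(job_id)
```

Inside a real Lambda function the client would be created with `boto3.client("rekognition")` and the ARNs would come from configuration.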
Use Cases
Detect people of interest in a live video stream for a public safety application.
Create a metadata catalog for stock video footage library
Detect offensive content within videos uploaded to a social media platform
68. Amazon Polly
You can enter some plain text and it will be transformed into speech
Female or male voices
Custom lexicons, which give you the ability to define your own specific words and pronunciations.
SSML (Speech Synthesis Markup Language) allows you to add markup to change the way
something is spoken, i.e. you could apply an effect like 'whispered', which would say it in a whispered
tone.
There is a variety of languages like French, German, Hindi, Italian, Romanian etc.
Use Cases
Create accessibility tools to 'read' web content
Provide automatically generated announcements via a public address (PA) system.
Create an automated voice response (AVR) solution for a telephony system (including Connect)
Amazon Transcribe
You can either speak directly into the mic or pass it audio files, which are then transcribed to text
Use Cases
Create a call centre monitoring solution that integrates with other services to analyse caller
sentiment
Create a solution to enable text search of media with spoken words.
Provide a closed captioning solution for online video training
69. Amazon Translate
We can either pass in files or translate in real-time
There is a large variety of languages
Ability to add custom terminology also
Provides a variety of metrics, such as the ability to see successful request count, throttled request
count and character count along with others
Use Cases
Enhance an online customer chat application to translate conversations in real-time
Batch translate documents within a multilingual company.
Create a news publishing solution to convert posted stories to multiple languages
70. Amazon Lex
Automatic speech recognition (ASR)
Natural language understanding (NLU)
Use Cases
Create a chatbot that triages customer support requests directly on the product page of a website
Create an automated receptionist that directs people as they enter a building
Provide an interactive voice interface to any application
71. [Figure: speech uploaded to S3, then a chain of Lambda functions calling Amazon Transcribe and Amazon Comprehend]
AWS Step Functions
In this example we recorded some audio and uploaded it into S3.
We could then call a Lambda function based on an event, which would start an AWS Step Functions
workflow that then orchestrates the desired behaviour between the different services.
We trigger another Lambda function which in turn talks to Transcribe and kicks
off a job against the S3 bucket.
We could then use another function after a period of time to check whether the job has completed,
and based on the response we can decide what we want to do next.
Once we have the desired response we can then use another Lambda function to talk to Amazon
Comprehend, which allows you to extract key phrases, entities, sentiment and language, amongst other
things.
This data can then be stored in a database or used by an application
AWS Step Functions lets you coordinate multiple AWS services into serverless workflows.
It allows you to stitch together services such as Transcribe and Comprehend along with other Lambda
functions and services
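A workflow like this is described to Step Functions as a state machine definition in the Amazon States Language (JSON). The sketch below builds one possible definition as a Python dictionary. The state names, the Lambda ARNs, the 30 second wait and the `$.status` field returned by the checking function are all assumptions made for illustration.

```python
import json

# Placeholder ARN prefix: REGION and ACCOUNT_ID are not real values.
LAMBDA_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:"

state_machine = {
    "Comment": "Transcribe an uploaded audio file, then analyse the text",
    "StartAt": "StartTranscription",
    "States": {
        # Lambda that kicks off the Transcribe job against the S3 object.
        "StartTranscription": {
            "Type": "Task",
            "Resource": LAMBDA_ARN + "start-transcribe",
            "Next": "WaitForJob",
        },
        # Give the job some time before checking on it.
        "WaitForJob": {"Type": "Wait", "Seconds": 30, "Next": "CheckJobStatus"},
        # Lambda that asks Transcribe whether the job has finished.
        "CheckJobStatus": {
            "Type": "Task",
            "Resource": LAMBDA_ARN + "check-transcribe",
            "Next": "IsJobComplete",
        },
        # Decide what to do next based on the (assumed) status field.
        "IsJobComplete": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.status", "StringEquals": "COMPLETED", "Next": "AnalyseText"},
                {"Variable": "$.status", "StringEquals": "FAILED", "Next": "JobFailed"},
            ],
            "Default": "WaitForJob",
        },
        # Lambda that sends the transcript to Comprehend.
        "AnalyseText": {
            "Type": "Task",
            "Resource": LAMBDA_ARN + "comprehend-analysis",
            "End": True,
        },
        "JobFailed": {
            "Type": "Fail",
            "Error": "TranscriptionFailed",
            "Cause": "The transcription job did not complete",
        },
    },
}

definition = json.dumps(state_machine, indent=2)
print(definition)
```

The wait, check and choice states form the polling loop described above, so no Lambda function has to sit idle while Transcribe is working.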
72. [Figure: training data from S3, EFS or FSx passed in as channel parameters]
SageMaker Overview
Ability to build, train and deploy machine learning models quickly
It covers the entire machine learning workflow to label and prepare your data, choose an
algorithm, train the model, tune and optimise it for deployment, make predictions and take action.
Ability to pull in data from
various different sources, and we do
this using Channel Parameters.
AWS recommend the larger instance sizes for training
Some algorithms only support GPUs
GPU instances are more expensive, but faster
There is also managed spot training, where you can keep checkpoints of the model state in S3.
This can be up to 90% cheaper than on-demand instances
SageMaker --> Training Jobs•
In an S3 bucket we have a collection of data, i.e. cats and dogs◦
We could then pick an algorithm source, i.e. a built-in algorithm from SageMaker◦
We can also choose the type of algorithm, like Image classification◦
Then we need to decide how we wish to input the data, i.e. File or Pipe◦
Ability to select here the instance type and size, VPC and encryption◦
There is also the ability to set hyperparameters here such as batch size, minimum epochs etc. AWS◦
pre-populates a lot of this when doing it in the console.
We can then set our training data, validation data and output◦
Once you have the above, you will end up with a model which can then be used to make inferences.
73. SageMaker - Batch / Realtime
[Figure: real time, an application calls InvokeEndpoint on a SageMaker endpoint, with the model in S3 and the Docker image in ECR; batch, a batch transformation reads from S3 and writes its results to S3]
Real Time•
It is possible to do real-time inferences by allowing the application to invoke the SageMaker◦
Endpoint which would then call on the model.
Batch•
Batch Transform jobs◦
We put in the data that we want to get inferences from◦
We could then push that into our classification model, for example, to understand if we have◦
a high value customer
74. SageMaker - Deploy
SageMaker --> Models•
Once we have created our model, we can then set a container to host the model◦
SageMaker --> Endpoint Configuration•
We can then add our model to this endpoint configuration◦
Within this we specify the model and the instance type◦
SageMaker --> Endpoints•
We create a new endpoint here and use an existing configuration which was created earlier◦
At this point you could run a command, give it a new file and use the model to create an
inference.
aws sagemaker-runtime invoke-endpoint --endpoint-name catdog --body fileb://cat.png --profile
sandbox ./output.json
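The same call can be made from application code. The sketch below assumes the boto3 `sagemaker-runtime` client and its `invoke_endpoint` operation, and passes the client in as a parameter so the logic can be tried locally. The content type, the shape of the JSON response and the `FakeRuntimeClient` stand-in are assumptions based on the built-in image classification algorithm, so they may differ for other models.

```python
import io
import json


def classify_image(runtime_client, endpoint_name, image_bytes):
    """Send image bytes to a SageMaker endpoint and parse the JSON it returns."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/x-image",  # assumed content type for image input
        Body=image_bytes,
    )
    # The response body is a stream, so it has to be read before parsing.
    return json.loads(response["Body"].read())


class FakeRuntimeClient:
    """Hypothetical stand-in for boto3.client('sagemaker-runtime'), for local testing only."""

    def invoke_endpoint(self, **kwargs):
        self.last_request = kwargs
        # Assumed response shape: one probability per class.
        return {"Body": io.BytesIO(b"[0.9, 0.1]")}


client = FakeRuntimeClient()
probabilities = classify_image(client, "catdog", b"not-a-real-image")
print(probabilities)
```

In a real application the client would be created with `boto3.client("sagemaker-runtime")` and the image bytes would be read from a file such as cat.png.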