These notes follow the LinuxAcademy course structure, which can be found here: https://linuxacademy.com/course/aws-certified-machine-learning-specialty/. I would recommend viewing the course to gain full and detailed explanations.
I have created these notes as part of my personal learning and hope to be able to help and inspire others.
As I am also learning, there may well be mistakes; please do reach out and let me know so I can correct them.
Follow me on Instagram: https://www.instagram.com/adnans_techie_studies/
All my notes are hosted here: https://adnan.study
Connect with me on LinkedIn: https://www.linkedin.com/in/adnanrashid1/
Overview
• Machine learning provides the ability to learn without being explicitly programmed.
• It focuses on the development of programs that can access data and use it to learn for themselves.
• An example of ML is AlphaGo, which beat one of the best players of Go.
• Machine learning is when you load lots of data into a computer program and choose a model to "fit" the data, which allows the computer (without your help) to come up with predictions.
• The computer builds the model through algorithms, which can range from a simple equation (like the equation of a line) to a very complex system of logic/maths that gets the computer to the best predictions.
What is Machine Learning?
Artificial Intelligence
• Advancements in compute power have brought a new wave of artificial intelligence.
• AI is being used to analyse big data.
• Machine Learning is a subset of Artificial Intelligence.
• Deep Learning is a subset of Machine Learning.
[Figures: scatter plots of weight vs height, showing the training data and the testing data]
This would then become our training data for inferring height and weight, given that we have one of the values.
Once the data is plotted we can then create a trend line that goes through the data set.
The trend line can then be used to predict the weight.
The blue curved line may appear to be a better fit for predicting weights than the straight line.
What is Machine Learning?
[Figure: data → train → model → prediction pipeline using a linear regression algorithm]
We can use our test data to test the line and see how well it fits.
These differences would be the actual observed weights vs the predicted line. We then take those differences and add them up, giving the sum of the differences between the actual observed weights and the predicted weights.
We could do the same for the curved line which goes through the actual data; however, it is overfit to our training data, which means it does not handle new data very well.
The green line is better for making predictions, as it is more generalised.
This green line represents our machine learning model. This particular type of model is called linear regression. We would use other models depending on what we are trying to achieve, e.g. Logistic Regression, Support Vector Machines and Decision Trees.
This is a very simplified view of what we are doing with machine learning.
We have only looked at two dimensions, which is easy to visualise; however, when we get beyond three it becomes more difficult. Having lots of dimensions is much closer to reality, and considering we cannot draw a 200-dimension graph, machine learning can help towards solving these problems.
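As a minimal illustration of fitting such a trend line, the sketch below uses scikit-learn's LinearRegression on a small, hypothetical height/weight sample (the values are made up, not from the course):

# Fit a straight line to hypothetical height/weight data and predict an unseen value
import numpy as np
from sklearn.linear_model import LinearRegression

heights = np.array([[150], [160], [170], [180], [190]])   # feature: height in cm
weights = np.array([55, 62, 70, 78, 88])                  # label: weight in kg

model = LinearRegression()
model.fit(heights, weights)            # learn the trend line from the training data
print(model.predict([[175]]))          # infer the weight for a height we have not seen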
What is Deep Learning?
Deep learning is based on the principles of an organic brain, with the aim of getting machines to learn in a similar way.
Neurons are chained together as a neural network with inputs and outputs.
Inside the neuron is an activation function:
◦ A function of code
◦ It takes the inputs, decides what to do with the data, stores a value and passes that on through the output
The neurons are usually all connected together.
Examples of deep learning:
◦ Self-driving cars: object detection, decision making
◦ Object classification, visual search, face recognition
◦ Natural language processing: spam filters, Siri, Alexa, or Google Assistant
◦ Health care: MRI scans, CT scans, records analysis
[Figure: machine learning lifecycle — collect data → process data → split data → train → test → deploy → infer → predictions → improve]

Example dataset (AGE, STAR SIGN and DRINKS COFFEE are features; LIKES CATS is the label):
AGE | STAR SIGN | DRINKS COFFEE | LIKES CATS
20  | 3         | YES           | YES
25  | 1         | NO            | YES
33  | 2         | NO            | NO
42  | 6         | YES           | YES
The data collected in the real world will be in various different formats.
The first and primary task is to bring it into a format which our ML algorithm will be able to understand. In order to do this, we will also need to organise this data.
Machine Learning Lifecycle
This is a general process for a machine learning lifecycle.
We go through iterations of the lifecycle improving our inference.
Process Data
We may do the following to the data, depending on the data set.

Feature reduction
We want as much data as possible when training our model; however, we don't want to pass in data that is not related. This can be difficult, as you may be looking for relationships in the data that you are not aware of.

Encoding
In the previous image, the star signs are numeric values, therefore the string has been encoded. We could look up the data in a separate table.

Features and Labels
We are using these features to try and understand whether people like penguins or not, and this is our labelled data.

Feature Engineering
We may map the features down to between 0 and 1 so that features can be compared (see the scaling sketch below).

Formatting
The file format that we will use for providing the data to the ML algorithm.

We believe there is a relationship somewhere in this dataset, but we need a deep understanding of the data.
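A minimal sketch of the 0-to-1 feature mapping mentioned under Feature Engineering, using scikit-learn's MinMaxScaler (the column values are hypothetical):

# Scale each feature into the 0..1 range so different features can be compared
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[20, 3], [25, 1], [33, 2], [42, 6]])   # e.g. age and encoded star sign
scaled = MinMaxScaler().fit_transform(X)
print(scaled)                                        # every column now lies between 0 and 1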
Split Data
Once we are happy with the dataset, we would then split it into sections:
◦ Training
◦ Validation
◦ Testing

Train
The algorithm:
◦ Can see and is directly influenced by the training data
◦ Uses, but is only indirectly influenced by, the validation data
◦ Does not see the testing data during training

Test
We perform inference on the testing data to see how well the model fits, and whether it is overfit. This data wasn't used to train the model, because we are looking to see how well the model works. If the model is overfit, then it will be really good at predicting based on data it has already seen; however, what we want is to make inferences on similar, unseen data.

Deploy
Host the model in an execution environment according to the requirements:
◦ Batch
◦ As a service
◦ Infrastructure required
AGE | STAR SIGN | DRINKS COFFEE | LIKES CATS
20  | 3         | YES           | YES
25  | 1         | NO            | YES
33  | 2         | NO            | NO
42  | 6         | YES           | YES
29  | 3         | YES           | YES

[Figures: labelled images of dogs and cats; a height vs weight plot]
An example of supervised learning is the ability to show what it looks like when someone likes cats.
This is another example of supervised data, where we can infer the weight based on the height.
In this example we are providing different data, as opposed to just cats, which will help train the model to be able to make inferences.
Supervised, Unsupervised, and Reinforcement
[Figure: reinforcement learning - a robot takes an action and receives a score/reward]
In reinforcement learning we give the robot a reward based on its actions.
As an example, we want it to pick up a cat: if it does select one, we give it a reward of +1; however, if it picks up something else we remove that reward, in order to try to enforce a preference towards the cat.
Unsupervised learning involves finding relationships where we did not know there was one.
It is best used when we are trying to analyse data with lots of dimensions in order to find relationships between the data points that we would not normally find using conventional methods.
A variety of algorithms can be used, such as:
◦ Recurrent Neural Networks
◦ Convolutional Neural Networks
◦ Linear Regression
◦ Latent Dirichlet Allocation
◦ Support Vector Machines
Summary
Supervised learning is where we use labelled data to classify unlabelled data using a machine learning algorithm.

Unsupervised learning involves looking for patterns when it is not initially evident that there are any. It is best used with hundreds of dimensions, where it is not possible to plot the data on graphs.

Reinforcement learning involves providing a reward when the model does something correct and taking away the reward when it is incorrect. It involves a lot of trial and error to get it right.
[Figures: height vs weight plot with a fitted line and the residuals to each data point; the same plot for a line with a different slope]
Here we can get the sum of the residuals. Some of the differences will be negative and some positive, so we square each one before adding them together, such that they are always positive. We can then add the squared residuals to get the sum and see how it differs for different lines.
This line represents the machine learning model. How do we know that this is the best line to fit the model? We could have drawn it at different gradients.
This is a plot of the sum of the squared residuals vs the slope of the line. The job of the machine learning algorithm is to find the lowest point of the parabola. The bottom of this curve corresponds to the line with the best fit, as it has the least amount of difference.
Optimization
[Figures: parabola of the sum of squared residuals vs the slope of the model line, showing the minimum (the best-fit slope) and the steps taken down the curve towards it]
It is easy for us to see the bottom of the slope, but the computer needs to be able to calculate this. If you pick a point on the parabola you can then calculate the slope at that point. You can then tell whether you are heading towards a reduction or an increase in slope, in order to understand the gradient. It is then possible to keep stepping until you get to the bottom of the graph.
This technique is called gradient descent.
Finding the bottom of the line depends on the step size. If it is too large then we might miss the bottom of the graph; if it is too small then it is inefficient.
This technique is used for Linear Regression, Logistic Regression and also Support Vector Machines.
Summary
• Sum of the residuals
◦ Looking at the difference between the line drawn through multiple data points and each data point.
• Square the values
◦ We square the values, as some of them are positive and some negative due to being below the line.
◦ Once squared, we get an overall positive number: the sum of the squared residuals.
◦ We can then plot those sums for a number of different lines and see which one is the least; that is the line that best fits the data points.
• Graph
◦ If we plot the sum of the squares vs the slope of the model line, we end up with a parabola. The algorithm needs to find its lowest point.
◦ The bottom of the curve is where the slope is 0, and this is the best fit.
• Gradient Descent
◦ In order to discover the gradient, the model picks a point, finds the gradient there and moves in the direction where it is less steep (see the sketch after this list).
◦ This technique is called gradient descent.
• Learning Rate
◦ The step size sets the learning rate.
◦ If the step size is too large it might miss the bottom of the graph; too small is not efficient.
• Important
◦ The other thing to bear in mind is that there might be multiple dips in the line (local minima).
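A minimal gradient descent sketch for the single-slope case described above, written in plain NumPy; the data and learning rate are hypothetical:

import numpy as np

x = np.array([150., 160., 170., 180., 190.])   # hypothetical heights
y = np.array([55., 62., 70., 78., 88.])        # hypothetical weights
x = (x - x.mean()) / x.std()                   # normalise so the steps behave

slope, intercept, lr = 0.0, 0.0, 0.1           # start anywhere; lr is the step size
for step in range(200):
    error = slope * x + intercept - y
    d_slope = 2 * np.mean(error * x)           # gradient of the squared residuals w.r.t. the slope
    d_intercept = 2 * np.mean(error)           # gradient w.r.t. the intercept
    slope -= lr * d_slope                      # step downhill, against the gradient
    intercept -= lr * d_intercept

print(slope, intercept)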
[Figure: height vs weight plot where the curve fitted to the training data does not match new real-world data]
• A technique for when we don't see our dataset fit real-world data that well.
• Looking at the graph, we can see that small differences can have a larger effect overall.
Regularisation
Your sample data may fit well, but real-world data generally does not fit so well straight away.
We apply regularisation when our model is overfit: it fits the training data really well but does not generalise to real-world inference. A method to fix this is to apply regularisation, and it is achieved through regression:
◦ L1 regularisation (Lasso Regression)
◦ L2 regularisation (Ridge Regression)
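A minimal sketch of L1 (Lasso) and L2 (Ridge) regularised regression with scikit-learn; alpha controls how strongly the model is penalised for fitting the training data too closely (the data and alpha values are arbitrary):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

X = np.random.rand(50, 5)                               # hypothetical features
y = X @ np.array([3.0, 0.0, 0.0, 2.0, 0.0]) + 0.1 * np.random.rand(50)

lasso = Lasso(alpha=0.1).fit(X, y)    # L1: can shrink unhelpful weights to exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)    # L2: shrinks all weights towards 0
print(lasso.coef_)
print(ridge.coef_)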
[Figure: hyperparameters are set outside the model; parameters sit inside the model]
Hyperparameters
These are settings we can use to tune a model.
Hyperparameters are external settings we can set before we train a model; they influence how the training occurs.
Parameters are internal to the algorithm and get tuned during the training.
The main hyperparameters are:
◦ Learning rate
◦ Epochs
◦ Batch size

Learning Rate
◦ Determines the size of the step taken during gradient descent optimisation.
◦ It is set between 0 and 1.

Batch Size
◦ The number of samples used to train at any one time.
◦ It could be all of the data, some of the data or a single sample (also described as batch, mini-batch or stochastic). It is often 32, 64 or 128.
◦ It is possible to calculate it based on your infrastructure, and it is also based on the amount of data you have.
◦ If you span over multiple servers then you might use a batch size that splits over that infrastructure.

Epochs
◦ The number of times the algorithm will process the entire data set.
◦ Each time it passes through the data, the intention is to improve the accuracy of the algorithm.
◦ Common values are high numbers, i.e. the algorithm samples the data set many times.

(A minimal training-loop sketch showing these three hyperparameters follows.)
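import numpy as np

X = np.random.rand(256, 3)                       # hypothetical dataset
y = X @ np.array([1.5, -2.0, 0.5])
w = np.zeros(3)

learning_rate, batch_size, epochs = 0.1, 32, 10  # the three hyperparameters

for epoch in range(epochs):                      # one epoch = a full pass over the data
    for start in range(0, len(X), batch_size):   # one mini-batch at a time
        xb = X[start:start + batch_size]
        yb = y[start:start + batch_size]
        grad = 2 * xb.T @ (xb @ w - yb) / len(xb)
        w -= learning_rate * grad                # the step size is the learning rate

print(w)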
[Figure: the ML lifecycle with the split-data step expanded into training, validation and testing sets; with cross validation, the training data is partitioned and the validation section rotates through the partitions]
Cross Validation
Training data is seen by the training process and directly influences the model.
Validation data is not seen by the training process but indirectly influences the model in order to tweak it.
The testing dataset is not seen by the training process but informs the user of success.
Cross validation is where we don't isolate the validation data; instead we split our training data into a number of partitions and use the different sections in turn to perform the validation, to get a better fit for our model. As a result we use all of the data for training and for validation, which is called k-fold validation.
This technique can also be used to compare different algorithms and to validate different data sets.
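A minimal k-fold cross-validation sketch with scikit-learn; the dataset and classifier are stand-ins to show the mechanics:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv=5: the data is split into 5 partitions and each takes a turn as the validation set
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())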
NAME   | COUNTRY   | AGE | HEIGHT | STAR SIGN | LIKES COFFEE
ADNAN  | UK        | 33  | 170    | VIRGO     | YES
SARAH  | BRAZIL    | 23  | 133    | GEMINI    | NO
ALEXA  | USA       | 5   | 2      | PISCES    | NO
ALI    | INDIA     | 39  | 175    | SCORPIO   | YES
SPARKY | AUSTRALIA | 5   | 35     | JEDI      | YES
Feature Selection and Engineering
This example dataset can be used to understand whether people like coffee or not.
The first thing to do is remove anything in the data set which does not have anything to do with the inference we are making; however, this does require specific domain knowledge in order to establish whether we are removing the correct features or not.
In this data set the name is not relevant and can therefore be removed, which also helps make the algorithm more efficient, as it won't try and make a relationship between someone's name and whether they like coffee. We need to be careful we do not remove a feature that would have been useful.
The result will be a faster-trained model and also one that is more accurate.
COUNTRY   | AGE | HEIGHT | LIKES COFFEE
UK        | 33  | 170    | YES
BRAZIL    | 23  | 133    | NO
USA       | 5   | 2      | NO
INDIA     | 39  | 175    | YES
AUSTRALIA | 5   | 35     | YES
[Table: the original dataset including star sign; the row with AGE 5 and HEIGHT 2 is flagged as 'looks suspicious']
COUNTRY   | HEIGHT ÷ AGE | LIKES COFFEE
UK        | 5.15         | YES
BRAZIL    | 5.78         | NO
USA       | 0.4          | NO
INDIA     | 4.48         | YES
AUSTRALIA | 7            | YES
HOUSE | CITY     | DATE           | COFFEE CONSUMED
RED   | BRISBANE | 8/23/18 12:33  | NO
GREEN | LONDON   | 9/12/18 07:10  | YES
BLUE  | DALLAS   | 10/22/18 10:50 | NO
GREEN | LONDON   | 11/10/18 12:35 | YES
RED   | BRISBANE | 12/09/18 16:07 | YES
We may not be as interested in which day people drink coffee; instead the time might be of more relevance.
Another way to establish relevance is by checking whether there is any correlation between the label and the feature. This also needs domain-level knowledge, and trial and error.
Gaps and anomalies will also influence the data set; we can either remove the feature entirely or do imputation.
Another strategy is to engineer new features. We may decide that there is a relationship between age and height and divide one by the other to create a new column. It would then require running some training to understand whether it was effective. It would also reduce the amount of data for the algorithm to analyse.
[Figure: dimension reduction - a 3D scatter plot with axes score 1, score 2 and score 3]
Three features are about the limit on a graph. Beyond that it starts to become difficult to represent the data visually.
PCA looks for the aspects of the data which influence it the most, by finding the central point of the data set. Once we find that central point, the whole dataset is moved so that it is centred around the origin.
PCA gives us the ability to see the relationships between the data.
It is an unsupervised algorithm which performs dimension reduction. Although we may lose some data, we need to maintain the principal components. Using PCA it is possible to look at hundreds of dimensions.
Principal Component Analysis (PCA)
[Figures: the data centred around the origin on the score 1 / score 2 / score 3 axes; principal components PC1, PC2 and PC3 drawn through the data]
PCA generally does this for us, but once the data set is captured we would need to draw around it. The longest length represents the largest variation of the data set, which is principal component 1. The next longest is principal component 2, followed by 3, which gives us the spread of data that most influences the data set.
PCA looks for the aspects of the data which influence it the most by finding the central point of the data set. Once we find that central point, the data set is moved so that it is centred around the origin. We do that by finding the mean value of score 1, score 2 and score 3.
We can then leave out the third component, and we would expect our data to be spread across PC1.
PCA is often used as a data preprocessing step; PC1 and PC2 are usually used to plot on graphs to see the relationships within the data.
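A minimal PCA preprocessing sketch with scikit-learn, keeping only the first two principal components (the dataset is a stand-in):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)         # 4 original dimensions
pca = PCA(n_components=2)                 # keep PC1 and PC2 only
X_reduced = pca.fit_transform(X)          # data is centred, then projected

print(X_reduced.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)      # how much variation PC1 and PC2 capture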
NAME   | COUNTRY   | AGE | HEIGHT | STAR SIGN | LIKES COFFEE
ADNAN  | UK        | 33  | 170    | VIRGO     | YES
SARAH  | BRAZIL    |     | 133    | GEMINI    | NO
ALEXA  | USA       | 5   | 2      | PISCES    | NO
ALI    | INDIA     | 39  | 175    | SCORPIO   | YES
SPARKY | AUSTRALIA | 5   | 35     | JEDI      | YES

Impute the missing age with the mean of the others: (33 + 5 + 39 + 5) / 4 = 20.5 ≈ 21

NAME   | COUNTRY   | AGE | HEIGHT | STAR SIGN | LIKES COFFEE
ADNAN  | UK        | 33  | 170    | VIRGO     | YES
SARAH  | BRAZIL    | 21  | 133    | GEMINI    | NO
ALEXA  | USA       | 5   | 2      | PISCES    | NO
ALI    | INDIA     | 39  | 175    | SCORPIO   | YES
SPARKY | AUSTRALIA | 5   | 35     | JEDI      | YES
Missing Data
Imagine we had surveyed a number of people on the street and got various data. If we have missing data in that data set, we may need to calculate or impute a value. One way we could do this is by taking the mean of all the other values which are part of that particular feature.
In this process, we are presenting the data for an ML algorithm to make an inference, but we don't skew our data set by having no value or a 0, which would impact the ML model.
If we have too much data missing, it may be better to remove that feature entirely, as it would be of little value, or to remove the row if that particular row is just missing data.
We also need domain-level knowledge to find outliers; you might have correct data, but perhaps mixed up, i.e. animal age and height vs human age and height.
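A minimal mean-imputation sketch with scikit-learn's SimpleImputer; the column mirrors the missing-age example above:

import numpy as np
from sklearn.impute import SimpleImputer

ages = np.array([[33.0], [np.nan], [5.0], [39.0], [5.0]])   # one age is missing
imputer = SimpleImputer(strategy="mean")                    # replace NaN with the column mean
print(imputer.fit_transform(ages))                          # the missing age becomes 20.5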
We may have a dataset where we are looking for faults in a car engine. It could be that the engine is generally fine, but there are a few reports of something faulty. As this is not a frequent occurrence, it is likely that this particular data will become lost in a sea of other data. As a result, it may not be recognised by the ML model.
There are a variety of strategies which can be taken to help:
• Try and source more data, because the thing you are looking for is not represented as well as you would like.
• If it is not possible to get more data, another option is to oversample the data, but then faults are likely to look like whatever you already have in your training data.
• We can synthesise data to understand what can vary and affect the data set. That way the ML algorithm can approximate the data.
• Finally, we can try a different algorithm; often people use the same algorithm frequently because they know and understand it.
NAME   | COUNTRY   | AGE
ADNAN  | UK        | 33
SARAH  | BRAZIL    | 23
ALEXA  | USA       | 5
ALI    | INDIA     | 39
SPARKY | AUSTRALIA | 5

COUNTRY   | BRAZIL | AUSTRALIA | USA | UK
UK        | 0      | 0         | 0   | 1
BRAZIL    | 1      | 0         | 0   | 0
USA       | 0      | 0         | 1   | 0
AUSTRALIA | 0      | 1         | 0   | 0
Label and One Hot Encoding
ML algorithms are mathematical constructs, so they do not work with strings and instead need integers. We can therefore encode the names and also the countries, which is label encoding.
The problem with doing this is that the ML algorithm won't understand it is a country and does not need to look for a relationship; it will still try and find one.
This is where one hot encoding comes into play, whereby new features are introduced into the data set: each country becomes its own feature in a table of 0's and 1's. In this case it is important that there is no numerical relationship between the countries and no implied hierarchy between them.
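A minimal one-hot encoding sketch using pandas; each country becomes its own 0/1 column so no ordering or hierarchy is implied:

import pandas as pd

df = pd.DataFrame({"country": ["UK", "BRAZIL", "USA", "INDIA", "AUSTRALIA"],
                   "age": [33, 23, 5, 39, 5]})
encoded = pd.get_dummies(df, columns=["country"])   # one 0/1 column per country
print(encoded)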
[Figure: resting heart rate plotted against whether people like cats]
RESTING HEART RATE | 70 | 88 | 65 | 89 | 78 | 61 | 69 | 98 | 82
LIKES CATS         | Y  | N  | Y  | N  | Y  | Y  | N  | N  | N
In the example you can see someone's resting heart rate and whether they like cats or not.
You can visually see that a low heart rate tends to indicate that they do and higher heart rates indicate that they do not.
Supervised ML algorithm.
Data is provided along with example inferences.
It looks for patterns in the data set, with examples of what you are looking for, which can only be a yes or a no: a binary outcome.
Typically it is used to understand whether this data is an example of something or not.
Logistic Regression
[Figure: the same heart-rate data with a straight linear regression line vs a fitted sigmoid function]
One way to do this would be to draw a line using linear regression to find the best fit. A problem with this is that there may be outliers, which can skew the data set and therefore lead to the wrong inferences.
Instead we can fit a sigmoid function, which does not get skewed in the way the linear regression line does, but instead looks for the cut-off point between the yes and the no.
There are methods to fine-tune this to understand what is most important.
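A minimal logistic regression sketch on the heart-rate example above; scikit-learn fits the sigmoid-shaped decision function for us (the values are taken from the reconstructed table):

import numpy as np
from sklearn.linear_model import LogisticRegression

heart_rate = np.array([[70], [88], [65], [89], [78], [61], [69], [98], [82]])
likes_cats = np.array([1, 0, 1, 0, 1, 1, 0, 0, 0])   # Y = 1, N = 0

clf = LogisticRegression().fit(heart_rate, likes_cats)
print(clf.predict([[66], [95]]))         # low heart rate -> likes cats, high -> does not
print(clf.predict_proba([[75]]))         # probabilities near the cut-off point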
LATITUDE | COFFEE CONSUMED
4        | 6
7        | 2
20       | 0
28       | 0
38       | 24
45       | 35
59       | 18
70       | 49
76       | 24

[Figure: scatter plot of coffee consumed vs latitude with a fitted regression line]
Linear Regression
The example data set might be the latitude of where you live on the planet and the amount of coffee consumed.
We can then use techniques to understand exactly where that line sits. Although it does not go through any of the green points, it provides a generalised, statistically valid answer.
Supervised model.
We don't just provide the core data but also the output value that we would want to infer later on.
An example inference is numeric, where the output is a range.
This can be used for financial forecasting, marketing effectiveness, risk evaluation and other business-related problems.
[Figures: two classes of data points with the support vectors highlighted and a hyperplane drawn between them]
Support Vector Machines (SVM)
How do we best identify where we should draw the line? By identifying the boundaries of our data sets. We draw a hyperplane between the support vectors, so that when we have a new data point we can allocate it appropriately.
Supervised model.
It is used to classify data. It can be used for customer classification; as an example, if we already had a classified data set we might want to identify the high-value customers.
In this example we have 2 classifications, but we need to somehow draw a line in order to classify new data points.
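A minimal SVM classification sketch with scikit-learn; the two-class data is made up to mirror the example:

import numpy as np
from sklearn.svm import SVC

X = np.array([[10, 10], [15, 12], [12, 18], [70, 80], [75, 85], [80, 78]])
y = np.array([0, 0, 0, 1, 1, 1])            # two existing classifications

clf = SVC(kernel="linear").fit(X, y)        # fits a hyperplane between the support vectors
print(clf.support_vectors_)                 # the boundary points that define the hyperplane
print(clf.predict([[20, 20], [65, 70]]))    # allocate new data points to a class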
[Figure: a decision tree with a root node ('likes walking'), an internal node ('likes running') and leaf nodes ('cat person' / 'dog person'); decision points can be binary (walks: Y/N), numeric (distance: under or over 4 km) or a choice (colour: red/green)]
Decision Trees
Supervised algorithm.
We provide training data along with labels, which can be considered example inferences. We ask the algorithm to look at this data and find the patterns; when we give it unlabelled data, we ask it to discover how it fits.
It can be used for customer analysis and medical conditions.
Decision trees are essentially flow diagrams with root nodes, internal nodes and leaf nodes. The root node is where things start; an internal node asks another question, which flows down to another node, ending at a leaf node.
We don't need to have the same number of leaves across the branches.
Decision tree outputs can be binary or numeric, but a split is generally based on a numeric question. We can also use decision points to ask about a choice, like favourite colour.
LIKES WALKING | LIKES RUNNING | KMS WALKED | FAV COLOUR | TYPE
NO            | YES           | 1          | GREEN      | DOG
NO            | NO            | 2          | BLUE       | CAT
YES           | YES           | 1          | RED        | DOG
YES           | NO            | 1          | GREEN      | CAT
YES           | NO            | 3          | GREEN      | DOG
YES           | YES           | 4          | BLUE       | DOG
NO            | NO            | 3          | RED        | CAT

[Figure: the decision tree built from this data]
Decision Trees
How do we start our root node? We need to understand which feature aligns most closely with the question we are asking. In this example, when analysing the data set, we see that it is 'likes running' that best separates dog people from cat people.
We then filter the data based on that feature and identify the next most important feature, which in turn makes up the next node, and so on down the other branches.
You may find that some of the features are not selected, as they have no correlation with the question.
We won't see the actual decision tree when it is created, but we can give it new data and categorise that data to see its behaviour.
[Figure: several randomly built decision trees each classify the new data point; the majority result (dog) wins]
We repeat this entire process a random number of times, then run the new data through all of the trees in the ML algorithm, look at the outputs and, based on the majority output, label the new data.
Random Forest
When you create a decision tree you need to know what question you will place in the root node. A random forest will check 2 different features, chosen randomly, and follow down the branch. We build each decision tree in this way and continue until we have a collection of decision trees with random variance.
Random forests are supervised algorithms. We pre-label the data and ask it to infer binary, classification or numeric outputs.
It is essentially a collection of decision trees. The problem with a decision tree on its own is that it can be inaccurate; a random forest is a way to make decision trees more accurate.
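A minimal random forest sketch with scikit-learn; it builds many randomised decision trees and takes the majority vote, as described above (the dataset is a stand-in):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100)   # 100 randomised decision trees
forest.fit(X, y)
print(forest.predict(X[:3]))                        # the majority vote across the trees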
[Figures: scatter plots showing k-means centre points being moved to the middle of their clusters, and the total variation within the clusters]
Unsupervised ML algorithm.
Data is provided without example inferences; the algorithm looks for a chosen number of groupings (clusters) in the data set.
Typically it is used to segment data into classes when we do not already have labels.
If we want to find 3 classes of data, the algorithm makes some random guesses and places 3 centre points across the dataset.
It then goes through each data point and checks which centre point it is closest to. The next step is to figure out all of the closest data points for each centre. At this point the classification will be wrong, so it moves each centre point to the middle of its class.
The algorithm then goes through the cycle again, including moving the centre points, until the distributions make sense. We need to find an equilibrium where moving the centre point does not affect the classification.
K-Means
[Figure: elbow plot of the reduction in variation vs the number of clusters k]
To find out how many clusters to use, we can graph the number of clusters against the reduction in variation. With one cluster the reduction in variation is 0, and as we increase the number of clusters we eventually see an elbow in the plot where the variation does not change much.
So after a certain number of clusters it is ineffective to add more, as the additional reduction in variation is minimal.
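A minimal k-means and elbow-plot sketch with scikit-learn; inertia_ is the within-cluster variation that we would plot against k (the data is random, purely for illustration):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)                        # hypothetical unlabelled data

for k in range(1, 8):                             # try an increasing number of clusters
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    print(k, km.inertia_)                         # the variation flattens out at the 'elbow'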
[Figure: a scatter plot of already-clustered data with a new, unclassified data point]
K-Nearest Neighbour
K-nearest neighbours takes into account a number of nearest neighbours. The 'k' is that number: how many existing data points to take into account in order to classify the new data point.
It should be large enough to reduce the influence of outliers, but small enough that small clusters do not get overlooked.
Supervised algorithm.
Used for classifying data against data that has already been classified.
K-means may have found some clusters within the dataset already; the challenge is to know which class to associate a new data point with.
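A minimal k-nearest neighbours sketch with scikit-learn; n_neighbors is the k described above (the points are made up):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[10, 10], [12, 14], [11, 12], [80, 82], [85, 79], [82, 85]])
y = np.array([0, 0, 0, 1, 1, 1])                  # already-classified clusters

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[15, 13], [78, 80]]))          # each new point takes the majority class of its 3 neighbours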
[Figure: documents are made up of topics, topics are made up of words, and a word can appear in multiple topics]
Latent Dirichlet Allocation (LDA)
Unsupervised algorithm.
It is used for topic classification and sentiment analysis.
It is a description of the way documents are constructed: if you have a number of documents, those documents are made up of a number of different topics, along with multiple words which can also belong to multiple topics.
LDA does not understand what is written in the document, but it does statistical analysis to get some idea of the content.
There are data preparation steps which are done before any processing, which involve removing particular words, i.e. 'stop words' such as 'and'. These words do not help towards understanding the content.
We then apply stemming, so that words such as learned, learning and learn are all condensed into a single word, i.e. learn. Once this is complete, we can tokenise the words into an array.
Finally we choose the number of topics we want LDA to find, and this is k.
We then take all the words in our array; if we select 3 topics to find, the algorithm will randomly assign a topic number to every word.
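A minimal topic-modelling sketch using scikit-learn's LatentDirichletAllocation; this is a different implementation from the course's walkthrough and is shown only to illustrate the remove-stop-words, tokenise and choose-k flow (the documents are made up):

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["machine learning and deep learning on aws",
        "lambda storage and s3 storage",
        "fun run in the park"]

counts = CountVectorizer(stop_words="english").fit_transform(docs)   # remove stop words, tokenise
lda = LatentDirichletAllocation(n_components=3, random_state=0)      # k = 3 topics
doc_topics = lda.fit_transform(counts)
print(doc_topics)       # how strongly each document belongs to each topic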
[Tables: counts of how often each word (machine learning, fun run, deep learning, lambda, storage, artificial intelligence) appears in each topic, and how often each topic appears in each document; the two counts are multiplied, e.g. 51 x 24 = 1224, 43 x 35 = 1505, 23 x 132 = 3036, and the word is reassigned to the topic with the highest product]
We then calculate, for each word, how often it appears in each topic. Once that is complete, we check each document and how often each topic appears there.
We take the number of times a word appears in a topic and the number of times that topic appears in a particular document and multiply them together. Whichever topic comes out highest, we reallocate the word to that topic.
This happens as many times as necessary, until all the topics and words are settled across the documents. We can then see what those documents are mostly about.
[Figure: a single neuron - inputs are multiplied by weights, summed with a bias, and passed through an activation function]
Neural Networks
On the left-hand side is the input layer, then some hidden layers, and then an output layer. Data is processed at each layer of the network and activated in order to get an inference.
On the first layer, which is the input layer, we need to load data into all of the inputs. As an example, if it was an image, each pixel would be put into an input.
Random values are then allocated to the connections from the input neurons, and these are referred to as weights. These are the factors used to adjust the values before they get to the next layer. Each input is multiplied by its weight, the results are summed into the next neuron, and we also add a bias to the sum; this is then applied to an activation function.
[Figures: the ReLU, Sigmoid and Tanh activation function curves; a network of weights (w) and biases (b) whose output asks 'how correct am I?']
There are 3 common types of activation function:
• ReLU
◦ Does not pass through any negative values.
• Sigmoid
◦ Generally places values between 0 and 1.
• Tanh
◦ Is similar to Sigmoid but also trends to negative 1 on the y axis.
If we plot the x value on the function, the y value is what the activation function outputs. We do not tend to use Sigmoid or Tanh much; ReLU is most commonly used.
The bias is there to prevent our neuron from being deactivated. If the result was 0 then it would not influence anything, and the more neurons you have turned off, the less effective the network is.
At this point the output will be wrong, because everything is random; this pass through the network is called forward propagation.
[Figure: forward propagation through the weights and biases to a loss function, then back propagation of the error to update them]
Once we get to the end we compute a loss function, which is an evaluation of the calculations that were made.
Pushing the error back through the network is known as back propagation, and it uses gradient descent and the learning rate to reduce the loss by updating the weights and biases.
Each iteration of forward and back propagation is an epoch, and this is how the network learns.
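A minimal forward-propagation sketch in plain NumPy: inputs are multiplied by weights, a bias is added and the result passes through a ReLU activation (all values are random stand-ins):

import numpy as np

def relu(x):
    return np.maximum(0, x)              # negative values become 0

inputs = np.array([0.6, 0.2, 0.8])       # e.g. three pixel values
weights = np.random.rand(3, 4)           # random starting weights into a 4-neuron layer
bias = np.random.rand(4)                 # the bias keeps neurons from switching off

hidden = relu(inputs @ weights + bias)   # one layer of forward propagation
print(hidden)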
[Figure: an image being classified as cat or dog by a convolutional neural network]
Convolutional Neural Networks (CNN)
Supervised algorithm.
Mainly used for classification, mostly image classification and object detection.
The hidden layers inside the network are known as the convolutional layers.
Images generally have particular characteristics, such as edges, feathers, eyes and a beak if it was a penguin, for example. The different layers in the network work towards identifying these different characteristics.
For an image, we use a convolutional filter. We take the first 3 x 3 block of 9 pixels, apply the filter to it and write the calculated outcome into a new image, and we continue this across the whole image. This particular filter does edge detection. We can use multiple filters which have been pre-trained by others; this is called transfer learning.
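A minimal sketch of sliding a 3 x 3 edge-detection filter over an image, as described above (plain NumPy; the image and kernel values are illustrative):

import numpy as np

image = np.random.rand(8, 8)                       # stand-in greyscale image
kernel = np.array([[-1, -1, -1],                   # a simple edge-detection filter
                   [-1,  8, -1],
                   [-1, -1, -1]])

out = np.zeros((6, 6))                             # the output shrinks by (filter size - 1)
for i in range(6):
    for j in range(6):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)   # apply the filter to each 3 x 3 patch
print(out)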
[Figure: a table of hours and activities being fed into an ML model that is not an RNN]
Recurrent Neural Network (RNN)
Supervised algorithm.
This can be used for stock predictions, time series data and voice recognition.
There is a pattern to the activities. On the left would be the input layer, and we map it to the next layer. We would imagine they all have weights, but the key part is that whatever the output is becomes part of the input on the next round.
The robot in this example helps with various scenarios. Let's say we do these repeated activities at various times during the day; there is a linear relationship between the activities. However, the next day we miss an activity at a particular time and all the subsequent times change. A model that is not an RNN is not going to handle this very well.
[Figure: an ML model with memory, feeding its output back in as an input]
The main thing is that we take the output and feed it back into the model; it has a memory of previous predictions which influences future predictions.
Recurrent neural networks (RNN) can remember a bit.
Long short-term memory networks (LSTM) can remember a lot.
Confusion Matrix
The ability to visualise the output from the testing that we do.
We can apply different algorithms (SVM, Decision Trees, Logistic Regression) to our data, but the question is: which algorithm is best suited to our desired inference?
Model prediction vs known truths:

                      | KNOWN: LIKES DOGS | KNOWN: LIKES CATS
PREDICTED: LIKES DOGS | TRUE POSITIVES    | FALSE POSITIVES
PREDICTED: LIKES CATS | FALSE NEGATIVES   | TRUE NEGATIVES

SVM:
                      | LIKES DOGS | LIKES CATS
PREDICTED: LIKES DOGS | 120        | 98
PREDICTED: LIKES CATS | 109        | 200

LOGISTIC REGRESSION:
                      | LIKES DOGS | LIKES CATS
PREDICTED: LIKES DOGS | 240        | 40
PREDICTED: LIKES CATS | 45         | 202
We would compute this confusion matrix for each of the different algorithms to see which algorithm performs better. It is not always clear which one is better unless we understand our question in more detail; we then choose based on our particular use case.
We can split our data into training and testing data and use Logistic Regression, SVM or Decision Trees. As we have labelled data, we can push the testing data through the models and get a result, but we want to establish which is best suited to our scenario.
One of the tools for doing this is the confusion matrix. This matrix maps the model predictions on one side against the known truths, so that we can see the accuracy. You would see TP, FP, FN and TN.
Simply put, the model may have predicted that they like animals when they didn't, or that they don't like animals when they did.
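A minimal confusion-matrix sketch with scikit-learn; the truths and predictions are hypothetical:

from sklearn.metrics import confusion_matrix

truth = [1, 1, 0, 1, 0, 0, 1, 0]     # 1 = likes dogs, 0 = likes cats (known truths)
preds = [1, 0, 0, 1, 0, 1, 1, 0]     # model predictions

tn, fp, fn, tp = confusion_matrix(truth, preds).ravel()
print(tp, fp, fn, tn)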
Sensitivity = TP / (TP + FN)

               | KNOWN: YES      | KNOWN: NO
PREDICTED: YES | TRUE POSITIVES  | FALSE POSITIVES
PREDICTED: NO  | FALSE NEGATIVES | TRUE NEGATIVES

Specificity = TN / (TN + FP)
Sensitivity and Specificity
Sensitivity is the True Positive Rate (TPR), also called Recall. Specificity is the True Negative Rate (TNR).
TPR is the proportion of correct positives out of the actual positives. TNR is the proportion of correct negatives out of the actual negatives.
Banks are more interested in the sensitivity score, since they are looking for fraudulent activities. It is more important to catch fraud than to avoid falsely flagging it; the account can be unblocked if it turns out not to be fraud, for example. Therefore the ML model will have higher sensitivity. This is similar in medical scenarios: if it turns out to be a false identification, the doctor can use additional methods to verify.
Specificity matters, for example, when a child is watching videos on YouTube. False positives are not acceptable: we can put up with suitable videos not being shown, but displaying unsuitable content will cause issues.
Sensitivity = True Positives / (True Positives + False Negatives). The closer the sensitivity value is to 1, the more accurate it is.
Specificity = True Negatives / (True Negatives + False Positives).
Accuracy and Precision
Accuracy is the proportion of all predictions that were correctly identified.
Precision is the proportion of predicted positives that were correctly identified.
We need to be careful how we frame the question, in a technical manner, when it comes to identifying these.
Accuracy = (TP + TN) / Total
Precision = TP / (TP + FP)
An accuracy of 100% likely means the model is overfit and needs to be more generalised.
A precision of 1 is possible when there are no false positives.
We can calculate the accuracy and precision for Logistic Regression against Decision Trees, for example, and then see the difference between them.
[Figures: logistic regression curve of the probability of liking coffee against the feature value; moving the cut-off line up increases specificity, moving it down increases sensitivity]

                         | KNOWN: LIKES COFFEE | KNOWN: DOES NOT LIKE COFFEE
PREDICTED: LIKES COFFEE  | TRUE POSITIVE       | FALSE POSITIVE
PREDICTED: DOES NOT LIKE | FALSE NEGATIVE      | TRUE NEGATIVE
ROC/AUC
If we consider a logistic regression graph for a binary situation, i.e. likes coffee vs does not, we can model this behaviour. It must be binary, and we also need to identify where the cut-off is actually located.
If we move the line up, we are increasing specificity, which means we do not want any of the classifications to be incorrect. If we move it down then we are increasing sensitivity: we don't mind if some people are captured who are false positives, as at least they are captured, and we can address this with further checks and balances later.
The question is where we draw this line, and it depends on what we want to show. The other consideration is where the best balance lies between sensitivity and specificity; one extreme or the other is not going to be useful, as it will always return the same result.
The confusion matrix can be used to identify where that line should be: we count the TP and TN, and the same is done for FN and FP.
In this example there is some test data that has been labelled as likes to drink coffee vs does not. Everything on the right of the vertical line will be classified as liking coffee and everything on the left as not liking it.
[Figures: the same data with the cut-off line at three different heights, each producing a different confusion matrix]
We now have a selection of confusion matrices, and we need to understand what to do with them. What is the best cut-off point for all of our data? We could repeat the above at a variety of different points.
In the first example we moved the horizontal line further up: we captured 3 true positives and all 5 of the true negatives, but ended up with 2 false negatives and no false positives.
In the next example, moving the horizontal line changes the results of the confusion matrix: we misclassified a single point as negative and a single point as positive.
In the last example we correctly identified all 5 people as liking coffee and got 3 true negatives, but ended up with 2 false positives and 0 false negatives.
Example: with the cut-off giving TP = 5, FN = 0, FP = 2, TN = 3:
True positive rate (sensitivity) = TP / (TP + FN) = 5 / 5 = 1
False positive rate = FP / (FP + TN) = 2 / (2 + 3) = 0.4
[Figures: ROC curve of TPR vs FPR, marking the best model for maximum sensitivity, the best model for maximum specificity, and the area under the curve (AUC)]
ROC is useful for understanding the balance between sensitivity and specificity, and AUC for the overall separability between the classes.
The curve is the ROC (Receiver Operating Characteristic). The point where we go from the steep upper slope to the flat line is the cut-off point for maximum sensitivity; the start of the slope is the best model for maximum specificity. In both cases we need to identify where on the graph the points effectively change direction.
AUC is the area under the curve, and it represents how good the model is overall at distinguishing between the different classes. The larger the area under the curve, the better it is at distinguishing.
This is where ROC/AUC comes into play: if we take the FPR and TPR from our confusion matrix calculations, we can plot the results on a graph.
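A minimal ROC/AUC sketch with scikit-learn, using hypothetical truths and predicted probabilities:

from sklearn.metrics import roc_curve, roc_auc_score

truth = [0, 0, 1, 1, 0, 1, 1, 0]
probs = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3]   # model's probability of 'likes coffee'

fpr, tpr, thresholds = roc_curve(truth, probs)       # one (FPR, TPR) point per candidate cut-off
print(list(zip(fpr, tpr)))
print(roc_auc_score(truth, probs))                   # the area under the ROC curve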
Gini Impurity

Gini impurity = 1 - (probability of dog)² - (probability of cat)²

Example data:
LIKES WALKING | LIKES RUNNING | COLOUR | TYPE
NO            | YES           | GREEN  | DOG
NO            | NO            | BLUE   | CAT
YES           | YES           | RED    | DOG
YES           | NO            | GREEN  | CAT
YES           | NO            | GREEN  | DOG
YES           | YES           | BLUE   | DOG
NO            | NO            | RED    | CAT

Splitting all people on LIKES WALKING: 120 answered Yes (97 dog people, 23 cat people) and 98 answered No (30 dog people, 68 cat people).
Gini impurity (Yes branch) = 1 - (97/120)² - (23/120)² ≈ 0.310
Gini impurity (No branch) = 1 - (30/98)² - (68/98)² ≈ 0.425
Weighted Gini impurity = (120/218) x 0.310 + (98/218) x 0.425 ≈ 0.362

Weighted average Gini impurity values:
FEATURE          | GINI
LIKES WALKING    | 0.362
LIKES RUNNING    | 0.384
FAVOURITE COLOUR | 0.371
In decision trees, the algorithm goes through the data looking for the feature that gives the best split. This can be calculated in various ways, and Gini impurity is one of them.
What splits the data best? We need to look at each of the features.
Likes walking has the lowest weighted Gini impurity, so it best separates people who like dogs from people who like cats, and we will use likes walking as our root node.
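A minimal sketch of the weighted Gini impurity calculation walked through above:

def gini(dogs, cats):
    total = dogs + cats
    return 1 - (dogs / total) ** 2 - (cats / total) ** 2

# 'likes walking' split from the example: Yes -> 97 dogs / 23 cats, No -> 30 dogs / 68 cats
yes_gini = gini(97, 23)
no_gini = gini(30, 68)
weighted = (120 / 218) * yes_gini + (98 / 218) * no_gini
print(round(yes_gini, 3), round(no_gini, 3), round(weighted, 3))   # ~0.31, ~0.425, ~0.362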
F1 Score
F1 is a combination of recall and precision; it takes the false positives and false negatives into consideration in the calculation.

F1 = 2 x (Recall x Precision) / (Recall + Precision)

• The F1 score is a better way to measure performance than plain accuracy.
• Accuracy = (TP + TN) / Total
• Whenever you see F1, recall is used rather than sensitivity; that is why it is mentioned in that manner.
• If you have an uneven class distribution, then F1 has proven to be a better way to analyse performance.

Example comparison:
            | LOGISTIC REGRESSION | DECISION TREES
SENSITIVITY | 0.543               | 0.864
SPECIFICITY | 0.835               | 0.824
ACCURACY    | 0.839               | 0.844
PRECISION   | 0.857               | 0.839
AWS Services, ML and DL Frameworks
• An algorithm such as a CNN, along with a framework such as MXNet, put together make up the model, which is then trained to create inferences.
• TensorFlow was developed by Google and powers suggested videos, spam filtering etc.
• AWS have done considerable work with MXNet and SageMaker; MXNet is very good at scaling across cloud infrastructure.
• PyTorch is the runner-up to TensorFlow in established machine learning, and SciKit-Learn is an easier framework to use which natively supports many algorithms.

TensorFlow
• Ability to put placeholders for values in a TensorFlow graph, giving the capability to run various models.
• TensorFlow has a graph, so it can see where a value came from for back propagation.

PyTorch
• Deep learning framework built on top of Python.
• PyTorch is better for recurrent neural networks.
• PyTorch needs to keep track of what happened so it can improve the model; the autograd feature stores where calculations come from.

MXNet
• Used most by SageMaker.
• Shares an architecture similar to PyTorch rather than TensorFlow.
• NDArray is similar to NumPy's np array and is the tensor type for MXNet.
• MXNet is aware of the devices it runs on, so it can see the GPU and CPU.
• We need it to record and watch the tensor for when it comes to back propagation, and we do this with autograd.

SciKit-Learn
• It has a number of datasets built in already, e.g. the digits dataset (a quick sketch follows).
• Getting the data, formatted and in sufficient quantity, is the biggest challenge in ML.
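A minimal sketch of loading SciKit-Learn's built-in digits dataset and training a classifier on it (the choice of classifier is arbitrary):

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)                   # built-in handwritten-digits dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(clf.score(X_test, y_test))                      # accuracy on held-out data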
[Figures: the AWS Glue Data Catalog providing schemas to Athena, which queries data in S3 and feeds SageMaker; Kinesis Video Streams sending camera video to Rekognition; a mobile app sending data through Kinesis Data Streams to Lambda and SNS]
AWS Services

Athena
• It provides an SQL interface onto S3.
• It can source data from multiple S3 locations.
• Athena looks at the schema of the data, which comes from Glue.
• We can do feature engineering on the original dataset to then use for analysis or to train our algorithm.

Glue
• AWS Glue Crawler can create a database definition from the data stored in S3.
• Glue does not store any data, but makes connections, including JDBC or DynamoDB.
• It can perform some ETL tasks and has some ML capabilities; the ML algorithm can de-dupe table record sets.
• We can load up CSVs and essentially grab the schema of the dataset.
• We can also glue together different datasets to have a single view.

Kinesis
• Ingests large amounts of data, which might come from a few or many data points.
• Made up of Video Streams, Data Streams, Data Firehose and Data Analytics.
• Video Streams allows streaming video from connected devices for analytics, ML and other processing.
• Data Streams is a catch-all, general endpoint to ingest large quantities of data, which you might send to EC2 instances that do the logic, or to other services like Spark on EMR. However, it is more complex to configure.
• Data Firehose is an endpoint to stream data into S3, Redshift, Elasticsearch or Splunk.
• Data Analytics can process streaming data from Kinesis Streams or Firehose at scale using SQL.
[Figures: Kinesis Data Firehose delivering data to S3, queried by Athena and used by SageMaker and other services; an EMR cluster with a master node, core nodes and task nodes]
S3
• Cost-effective storage for large amounts of data.
• Structured data:
◦ CSV
◦ JSON
• Unstructured data:
◦ Text files
◦ Images
• Data lake:
◦ Add data from many sources.
◦ Define the data schema at the time of analysis.
◦ Much lower cost than data warehouse solutions.
◦ Unsuitable for transactional systems.
◦ Needs cataloguing before analysis.

QuickSight
• Business Intelligence (BI) tool.
• Visualise data from many sources:
◦ Dashboards
◦ Email reports
◦ Embedded reports
• End-user targeted.

EMR
• Managed service for hosting massively parallel compute tasks.
• Integrates with the S3 storage service.
• Petabyte scale.
• Uses 'big data' tools like Spark, Hadoop and HBase.
Amazon Rekognition
Image moderation
Facial analysis
Celebrity recognition
Face comparison
Text in image
Use Cases
Create a filter to prevent inappropriate images being sent via a messaging platform. This can
include nudity or offensive text.
Enhance metadata catalog of an image library to include the number of people in each image
Scan an image library to detect instances of famous people
[Figure: a video in S3 triggers Lambda, which calls Rekognition Video; Rekognition reads the video from S3 and notifies an SNS topic on completion, which writes to an SQS queue read by another Lambda]
Amazon Rekognition Video
In this example we start off by storing a video in an S3 bucket. A Lambda function is invoked by the new-object event. The Lambda function calls Rekognition, which goes to the S3 bucket and gets the data. Rekognition goes through the data and sends a message to an SNS topic on completion, which is written to an SQS queue. Another Lambda function sees the message in the queue and goes back to Rekognition to get the completed job.
Use Cases
Detect people of interest in a live video stream for a public safety application.
Create a metadata catalog for a stock video footage library.
Detect offensive content within videos uploaded to a social media platform.
Amazon Polly
You can enter plain text and it will be transformed into speech.
Female or male voices.
Custom lexicons: the ability to create your own specific words and pronunciations.
SSML (Speech Synthesis Markup Language) allows you to add syntax to change the way something is spoken, i.e. you could apply an effect like 'whispered', which would say it in a whispered tone.
There is a variety of languages, such as French, German, Hindi, Italian, Romanian etc.
Use Cases
Create accessibility tools to 'read' web content.
Provide automatically generated announcements via a public address (PA) system.
Create an automated voice response (AVR) solution for a telephony system (including Connect).
Amazon Transcribe
You can either speak directly into the mic or pass it audio files, which will be transcribed to text.
Use Cases
Create a call centre monitoring solution that integrates with other services to analyse caller
sentiment
Create a solution to enable text search of media with spoken words.
Provide a closed captioning solution for online video training
Amazon Translate
We can either pass in files or translate in real-time.
There is a large variety of languages
Ability to add custom terminology also
Provides a variety of metrics, such as the ability to see successful request count, throttled request
count and character count along with others
Use Cases
Enhance an online customer chat application to translate conversations in real-time
Batch translate documents within a multilingual company.
Create a news publishing solution to convert posted stories to multiple languages
Amazon Lex
Automatic speech recognition (ASR)
Natural language understanding (NLU)
Use Cases
Create a chatbot that triages customer support requests directly on the product page of a website.
Create an automated receptionist that directs people as they enter a building
Provide an interactive voice interface to any application
[Figure: speech files uploaded to S3 trigger a Lambda function, which starts a Step Functions workflow orchestrating further Lambda functions that call Amazon Transcribe and Amazon Comprehend]
AWS Step Functions
In this example we recorded some audio and uploaded it to S3. We can then call a Lambda function based on an event, which starts a Step Functions workflow that orchestrates the desired behaviour between the different services.
We trigger another Lambda function, which in turn speaks to Transcribe and kicks off a job against the S3 bucket. We can then use another function, after a period of time, to check whether the job has completed or not, and based on the response decide what we want to do next.
Once we have the desired response, we can use another Lambda function to speak to Amazon Comprehend, which allows us to extract key phrases, entities, sentiment and language, amongst other things. This data can then be stored in a database or used by an application.
AWS Step Functions lets you coordinate multiple AWS services into serverless workflows. It allows you to stitch together services such as Transcribe and Comprehend along with Lambda functions and other services.
[Figure: SageMaker channel parameters pulling training data from S3, EFS or FSx]
SageMaker Overview
The ability to build, train and deploy machine learning models quickly.
It covers the entire machine learning workflow: label and prepare your data, choose an algorithm, train the model, tune and optimise it for deployment, make predictions and take action.
It is able to pull in data from various different sources, and we do this using channel parameters.
AWS recommend the larger instance sizes for training. Some algorithms only support GPUs; GPU instances are more expensive, but faster.
There is also managed spot training, and you can keep checkpoints of the model state in S3. This is up to 90% cheaper than on-demand instances.
• SageMaker --> Training Jobs
◦ In an S3 bucket we have a collection of data, i.e. cats and dogs.
◦ We then pick an algorithm source, i.e. a built-in algorithm from SageMaker.
◦ We can also choose the type of algorithm, like image classification.
◦ Then we need to decide how we wish to input the data, i.e. File or Pipe mode.
◦ There is the ability to select the instance sizes, VPC and encryption here.
◦ There is also the ability to set hyperparameters here, such as batch size and minimum epochs. AWS pre-populates a lot of this when working in the console.
◦ We can then set our training data, validation data and output location.
Once you have the above, you will end up with a model which can then be used to make inferences.
[Figures: real-time inference - an application invokes a SageMaker endpoint backed by a model artifact in S3 and a container image in ECR; batch transform - SageMaker reads input data from S3, runs the model in a Docker container and writes the results back to S3]
SageMaker - Batch / Real-time
• Real-time
◦ It is possible to do real-time inference by allowing the application to invoke the SageMaker endpoint, which then calls on the model.
• Batch
◦ Batch Transform jobs.
◦ They take in the data that we want to get inferences from.
◦ We could then push that into our classification model, for example, to understand whether we have a high-value customer.

SageMaker - Deploy
• SageMaker --> Models
◦ Once we have created our model, we can then set a container to host the model.
• SageMaker --> Endpoint Configuration
◦ We can then add our model to this endpoint configuration.
◦ Within this we specify the model and the instance type.
• SageMaker --> Endpoints
◦ We create a new endpoint here and use an existing configuration which was created.
At this point you could run a command, give it a new file and use the model to create an inference:
aws sagemaker-runtime invoke-endpoint --endpoint-name catdog --body fileb://cat.png --profile sandbox ./output.json
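A hedged sketch of invoking the same endpoint from Python with boto3; the endpoint name is taken from the CLI example, while the file and content type are assumptions for illustration:

import boto3

runtime = boto3.client("sagemaker-runtime")
with open("cat.png", "rb") as f:                       # hypothetical input image
    response = runtime.invoke_endpoint(
        EndpointName="catdog",                         # endpoint name from the CLI example above
        ContentType="application/x-image",             # assumption: depends on the model container
        Body=f.read(),
    )
print(response["Body"].read())                         # the inference returned by the model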
Visit https://adnan.study for all notebooks

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 

AWS Certified Machine Learning Specialty

  • 9. What is Deep Learning? Deep learning is based on the principles of an organic brain with the aim to get machines to learn in a similar way. Neurons are chained together as a Neural Network with inputs and outputs. Inside the neuron is an Activation Function Function of code◦ It takes the inputs, decides what to do with the data, stores a value and passes it that on◦ through the output The neuron’s are usually all connected together Examples of deep learning: Self Driving Cars, Object detection, decision making◦ Object Classification, visual search, face recognition◦ Natural language processing, spam filters, Siri, Alexa, or Google Assistant◦ Health Care. MRI scans, CT scans, records analysis◦
  • 10.
  • 11. Machine Learning Lifecycle
[Diagram: Collect Data → Process Data → Split Data → Train → Test → Deploy → Infer → Improve Predictions, with an example dataset of features (Age, Star Sign, Drinks Coffee) and a label (Likes Cats).]
This is the general process for a machine learning lifecycle. We go through iterations of the lifecycle, improving our inferences.
Process Data
Data collected in the real world arrives in many different formats. The first task is to bring it into a format the ML algorithm can understand, which also means organising the data.
  • 12. We may do the following to the data, depending on the dataset.
Feature reduction: We want as much data as possible when training our model, but we do not want to pass in data that is unrelated to the inference. This can be difficult, because we may be looking for relationships in the data that we are not yet aware of.
Encoding: In the previous example the star signs are numeric values, so the strings have been encoded; we could look the values up in a separate table.
Features and labels: We are using the features to try to predict whether people like cats or not, and that column is our labelled data.
Feature engineering: We may map features onto a 0 to 1 range so that features on different scales can be compared.
Formatting: The file format we will use to provide the data to the ML algorithm.
We believe there is a relationship somewhere in this dataset, but we need a deep understanding of the data.
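A minimal sketch of these processing steps with pandas and scikit-learn; the column names and values are illustrative, not the slide's exact dataset.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "age": [20, 25, 33, 42],
    "star_sign": ["Virgo", "Gemini", "Pisces", "Scorpio"],
    "likes_cats": ["YES", "YES", "NO", "YES"],
})

# Encoding: turn the star sign strings into integer codes (label encoding).
df["star_sign"] = df["star_sign"].astype("category").cat.codes

# Feature engineering / normalisation: map numeric features onto a 0-1 range.
df[["age"]] = MinMaxScaler().fit_transform(df[["age"]])

# Features vs label: everything except the label column is a feature.
X = df.drop(columns="likes_cats")
y = df["likes_cats"].map({"YES": 1, "NO": 0})
print(X, y, sep="\n")
```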
  • 13. Split Data
Once we are happy with the dataset, we split it into sections: training, validation and testing.
The algorithm can see, and is directly influenced by, the training data. It uses, and is indirectly influenced by, the validation data. It does not see the testing data during training.
Train / Test
We perform inference on the testing data to see how well the model fits: is it overfit? This data was not used to train the model, because we want to see how well the model works on data it has not seen. If the model is overfit it will be very good at predicting on data it has already seen, but what we actually want is to make inferences on new, similar data.
Deploy
Host the model in an execution environment according to the requirements: batch, as a service, and whatever infrastructure is required.
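A hedged sketch of splitting data into training, validation and testing sets with scikit-learn; the 70/15/15 proportions and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic height/weight-style data stands in for the real dataset.
X = np.random.rand(100, 1)
y = 3 * X[:, 0] + np.random.randn(100) * 0.1

# First carve off 30%, then split that 30% evenly into validation and testing.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=42)
print(len(X_train), len(X_val), len(X_test))  # 70, 15, 15
```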
  • 14. Supervised, Unsupervised, and Reinforcement
[Diagrams: the labelled Likes Cats table, a height vs weight scatter plot, and a dogs vs cats classification plot.]
An example of supervised learning is showing the algorithm what it looks like when someone likes cats, using labelled examples.
Another example of supervised learning is inferring weight based on height.
In the classification example we provide labelled data for more than one class (dogs as well as cats), which helps train the model to make inferences between them.
  • 15. [Diagrams: an agent receiving a reward or penalty based on its actions, and an unlabelled scatter plot with clusters.]
In reinforcement learning we give the robot a reward based on its actions. For example, if we want it to pick up a cat and it selects one, we give it a reward of +1; if it picks up something else we take the reward away, to enforce a preference towards the cat.
Unsupervised learning involves finding relationships where we did not know there was one. It is best used when analysing data with many dimensions, to find relationships between data points that we would not normally find using conventional methods.
  • 16. Supervised, Unsupervised, Reinforcement Learning
A variety of algorithms can be used, such as: Recurrent Neural Networks, Convolutional Neural Networks, Linear Regression, Latent Dirichlet Allocation and Support Vector Machines.
Summary
Supervised learning uses labelled data to train a machine learning algorithm that can then classify unlabelled data.
Unsupervised learning looks for patterns when it is not initially evident that there are any. It is best used with hundreds of dimensions, where it is not possible to plot the data on graphs.
Reinforcement learning provides a reward when the agent does something correct and takes the reward away when it is incorrect. It involves a lot of trial and error to get right.
  • 17. Optimization
[Plots: height vs weight with candidate fit lines, and the sum of squared residuals against the slope of the model line.]
The fitted line represents the machine learning model, but how do we know this is the best line to fit the model? We could have drawn lines at different gradients.
For each candidate line we can take the residuals, the differences between the observed values and the line. Some of the differences will be negative and some positive, so we square them before adding them up, making them all positive. The sum of the squared residuals lets us compare how different lines fit.
Plotting the sum of squared residuals against the slope of the line gives a parabola. The job of the machine learning algorithm is to find the lowest point of that parabola: the bottom of the curve is the line with the best fit, as it has the smallest total difference.
  • 18. [Plots: the parabola of sum of squared residuals vs slope of the model line, showing the minimum, the step size, and the best-fit slope.]
It is easy for us to see the bottom of the curve, but the computer needs to calculate it. If you pick a point on the parabola you can calculate the slope at that point, which tells you whether you are heading towards a reduction or an increase in gradient. You can then keep stepping downhill until you reach the bottom of the graph.
This technique is called gradient descent. Finding the bottom of the curve depends on the step size: if it is too large we might step over the minimum, and if it is too small the search is inefficient.
Gradient descent is used for Linear Regression, Logistic Regression and also Support Vector Machines.
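A minimal gradient descent sketch in NumPy, fitting a straight line by stepping down the squared-error curve; the synthetic data and the learning rate of 0.01 are assumptions.

```python
import numpy as np

# Fit y = w*x + b by gradient descent on the mean squared residuals.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(0, 1, 50)    # synthetic height/weight-style data

w, b = 0.0, 0.0
learning_rate = 0.01                        # the step size
for epoch in range(2000):                   # each pass over the data is an epoch
    error = (w * x + b) - y                 # residuals of the current line
    grad_w = 2 * np.mean(error * x)         # slope of the error curve w.r.t. w
    grad_b = 2 * np.mean(error)             # slope of the error curve w.r.t. b
    w -= learning_rate * grad_w             # step downhill
    b -= learning_rate * grad_b

print(round(w, 2), round(b, 2))             # roughly recovers 2.5 and 1.0
```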
  • 19. Summary
Sum of the residuals: look at the difference between each data point and the line drawn through the data.
Square the values: some residuals are positive and some negative (points below the line), so we square them to get an overall positive number, the sum of the squared residuals. Taking a number of different lines and comparing these sums shows which line best fits the data points.
Graph: plotting the sum of the squares against the slope of the model line gives a parabola. The algorithm needs to find the lowest point; the bottom of the curve, where the slope is 0, is the best fit.
Gradient descent: to discover that minimum, the model picks a point, calculates the gradient and moves in the direction where the curve is less steep. This technique is called gradient descent.
Learning rate: the step size sets the learning rate. If the step size is too large it might miss the bottom of the graph; too small is not efficient.
Important: bear in mind that there might be multiple dips (local minima) in the curve.
  • 20. Regularisation
[Plot: an overfit curve through the training points compared with a more generalised line.]
Regularisation is a technique for when our model does not fit real-world data very well. Looking at the graph, a small difference in the input can have a larger effect overall. Your sample data may fit well, but real-world data generally does not fit so well straight away.
We apply regularisation when our model is overfit: it fits the training data really well but does not generalise to real-world inference. A method to fix this is to apply regularisation, achieved through regression:
L1 regularisation (Lasso regression)
L2 regularisation (Ridge regression)
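A hedged sketch comparing ordinary least squares with L1 (Lasso) and L2 (Ridge) regularisation in scikit-learn; the alpha values and synthetic data are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))                        # few samples, several features
y = 4 * X[:, 0] + rng.normal(scale=0.5, size=30)    # only the first feature matters

plain = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)    # L1: can shrink irrelevant weights to exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)    # L2: shrinks weights towards 0

print(plain.coef_.round(2))
print(lasso.coef_.round(2))
print(ridge.coef_.round(2))
```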
  • 21. Hyperparameters
Hyperparameters are external settings we can set before we train a model; they influence how the training occurs. Parameters are internal to the algorithm and get tuned during training.
Common hyperparameters are the learning rate, epochs and batch size.
Learning rate: determines the size of the step taken during gradient descent optimisation. It is set between 0 and 1.
Batch size: the number of samples used to train at any one time. It could be all of the data, some of the data, or a single sample (batch, mini-batch or stochastic). It is often 32, 64 or 128, and can be chosen based on the amount of data and the infrastructure; if training spans multiple servers you might use a batch size that splits across that infrastructure.
Epochs: the number of times the algorithm processes the entire dataset. Each pass through the data is intended to improve the accuracy of the algorithm. Common values are high numbers.
  • 22. Cross Validation
[Diagram: the ML lifecycle with the dataset split into training, validation and testing partitions.]
Training data is seen by the training process and directly influences the model. Validation data is not seen by the training process but indirectly influences the model, in order to tweak it. The testing dataset is not seen by the training process and informs the user of success.
With cross validation we do not isolate a separate validation set; instead we split our training data into a number of partitions and use a different partition for validation on each pass, to get a better fit for our model. As a result we use all the data for both training and validation; this is called k-fold cross validation.
The technique can also be used to compare different algorithms and to validate different datasets.
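A minimal k-fold cross validation sketch with scikit-learn; k = 5 and the iris dataset are assumptions used only to make the example self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Each fold takes a turn as the validation partition; the rest is training data.
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())
```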
  • 23.
  • 24. Feature Selection and Engineering
[Example dataset: Name, Country, Age, Height, Star Sign, Likes Coffee.]
This example dataset can be used to work out whether people like coffee or not. The first thing to do is remove anything in the dataset that has nothing to do with the inference we are making; this requires domain knowledge to establish whether we are removing the right features. In this dataset the name is not relevant, so it can be removed, which also makes the algorithm more efficient because it will not try to find a relationship between someone's name and whether they like coffee. We need to be careful not to remove a feature that would have been useful. The result is a model that trains faster and is also more accurate.
  • 25. [Example datasets: the coffee survey with a suspicious row highlighted, an engineered Age/Height feature, and a second dataset of House, City, Date and Coffee Consumed.]
We may not be as interested in which day people drink coffee; the time of day might be more relevant.
Another way to establish relevance is to check whether there is any correlation between the label and the feature. This also needs domain-level knowledge and trial and error.
Gaps and anomalies will also influence the dataset; we can either remove the feature entirely or perform imputation.
Another strategy is to engineer new features. We may decide there is a relationship between age and height and divide one by the other to create a new column. It would then require running some training to understand whether it was effective. It also reduces the amount of data the algorithm has to analyse.
  • 26. Principal Component Analysis (PCA)
[Plot: a three-dimensional scatter of Score 1, Score 2 and Score 3.]
Three features are about the limit for a graph; beyond that it becomes difficult to represent the data visually.
PCA is an unsupervised algorithm used for dimension reduction. It looks for the aspects of the data that influence it the most by finding the central point of the dataset; once that central point is found, the whole dataset is moved so that it is centred around the origin. PCA lets us see the relationships in the data. Although we may lose some information, we keep the principal components, and using PCA it is possible to look at hundreds of dimensions.
  • 27. [Plots: the centred dataset with principal component axes PC1, PC2 and PC3 drawn through it.]
PCA finds the central point of the dataset by taking the mean of Score 1, Score 2 and Score 3, then moves the data so it is centred around the origin.
PCA generally does this for us; conceptually, once the data is centred we draw a box around it. The longest axis represents the largest variation in the dataset, which is principal component 1; the next longest is principal component 2, followed by PC3. These give us the directions of spread that most influence the dataset.
We can then leave out the third component; we would expect our data to be spread mostly along PC1. PC1 and PC2 are usually used to plot on graphs to see the relationships in the data. PCA is often used as a data preprocessing step.
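A hedged PCA preprocessing sketch: centre the data, keep the first two principal components, and use them for plotting or as model input. The three "score" columns are synthetic stand-ins.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
scores = rng.normal(size=(200, 3))           # stand-in for Score 1, 2 and 3

pca = PCA(n_components=2)                    # keep PC1 and PC2, drop PC3
reduced = pca.fit_transform(scores)          # data is centred internally

print(reduced.shape)                         # (200, 2)
print(pca.explained_variance_ratio_)         # how much spread each PC captures
```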
  • 28. Missing Data
[Example: the survey dataset with a missing Age value imputed with the mean of the other ages, (33 + 5 + 39 + 5) / 4 = 20.5.]
Imagine we had surveyed a number of people on the street and collected various data. If a value is missing from the dataset we may need to calculate, or impute, a value. One way to do this is to take the mean of all the other values for that particular feature. This way we present complete data to the ML algorithm without skewing the dataset with an empty value or a 0, which would impact the model.
If too much data is missing it may be better to remove the feature entirely, as it would be of little value, or to remove the row if only that row is missing data.
We also need domain-level knowledge to find outliers: the data might be correct but mixed up, for example animal ages and heights recorded alongside human ages and heights.
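A minimal imputation sketch with scikit-learn, filling a missing age with the mean of the others; the specific ages mirror the slide's example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

ages = np.array([[33.0], [np.nan], [5.0], [39.0], [5.0]])    # one age is missing
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(ages).ravel())   # nan replaced with (33+5+39+5)/4 = 20.5
```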
  • 29. We may have a dataset where we are looking for faults on a car engine. Generally the engine is fine, and there are only a few reports of something faulty. Because this is not a frequent occurrence, that data is likely to become lost in a sea of other data, and as a result it may not be recognised by the ML model.
There are a variety of strategies that can help:
Try to source more data, because the thing you are looking for is not represented as well as you would like.
If it is not possible to get more data, another option is to oversample the minority data, but then faults will tend to look like whatever you already have in your training data.
We can synthesise data, understanding what can vary in the dataset, so that the ML algorithm can approximate it.
Finally, we can try a different algorithm; people often reuse the same algorithm simply because they know and understand it.
  • 30. Label and One-Hot Encoding
[Example: the Name/Country/Age table, and a one-hot encoded Country table of 0s and 1s.]
ML algorithms are mathematical constructs, so they do not work with strings; the values need to be integers. We can encode the names and countries as integers, which is label encoding. The problem with doing this is that the ML algorithm will not understand that it is a country and does not need to look for a numerical relationship; it will still try to find one.
This is where one-hot encoding comes into play: new features are introduced into the dataset, each country becomes its own feature, and the table is filled with 0s and 1s. The important point is that there is now no numerical relationship, and no implied hierarchy, between the countries.
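A minimal one-hot encoding sketch with pandas; each country becomes its own 0/1 feature, so no numerical relationship is implied between countries.

```python
import pandas as pd

df = pd.DataFrame({"country": ["UK", "Brazil", "USA", "India", "Australia"]})
print(pd.get_dummies(df, columns=["country"]))   # one 0/1 column per country
```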
  • 31.
  • 32. Logistic Regression
[Plot: resting heart rate against a yes/no "likes cats" outcome, with the example data table.]
Logistic regression is a supervised ML algorithm: data is provided along with example inferences. It looks for patterns in the dataset given examples of what you are looking for, where the outcome can only be yes or no, a binary outcome. Typically it is used to decide whether the data is an example of something or not.
In the example you can see people's resting heart rates and whether they like cats. Visually, a low heart rate indicates that they do and a higher heart rate indicates that they do not.
  • 33. [Plots: a straight linear-regression fit skewed by outliers compared with a sigmoid function fitted to the same data.]
One way to do this would be to draw a line using linear regression to find the best fit. The problem is that outliers can skew the line and therefore lead to the wrong inferences.
Instead we can fit a sigmoid function, which is not skewed in the same way; it looks for the cut-off point between the yes and no classes. There are methods to fine-tune this to understand what is most important.
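A hedged sketch of fitting that sigmoid cut-off with scikit-learn's LogisticRegression; the heart-rate numbers here are illustrative rather than the slide's exact data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

heart_rate = np.array([[61], [65], [69], [70], [78], [82], [88], [89], [98]])
likes_cats = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0])   # 1 = yes, 0 = no

model = LogisticRegression().fit(heart_rate, likes_cats)
print(model.predict([[67], [95]]))           # low rate -> yes, high rate -> no
print(model.predict_proba([[75]]))           # probability near the cut-off point
```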
  • 34. Linear Regression
[Plot and table: latitude of where you live against the amount of coffee consumed, with a fitted trend line.]
Linear regression is a supervised model: we provide not just the core data but also the output value that we want to infer later on. The inference is numeric, so the output is a range. It can be used for financial forecasting, marketing effectiveness, risk evaluation and other business problems.
In the example, we can use techniques to work out exactly where the line sits. Although it does not go through any of the actual points, it provides a generalised, statistically valid answer.
  • 35. Support Vector Machines (SVM)
[Plots: two classes of points with support vectors and a separating hyperplane between them.]
SVM is a supervised model used to classify data, for example customer classification: if we already have a classified dataset we might want to identify the high-value customers.
In this example we have two classifications, and we need to work out where best to draw the line that separates them. We identify the boundaries of the datasets (the support vectors) and draw a hyperplane between them, so that when we have a new data point we can allocate it appropriately.
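A hedged SVM sketch: fit a linear hyperplane between two synthetic classes of points and classify new ones; the cluster positions are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
class_a = rng.normal(loc=[20, 20], scale=3, size=(20, 2))
class_b = rng.normal(loc=[60, 60], scale=3, size=(20, 2))
X = np.vstack([class_a, class_b])
y = np.array([0] * 20 + [1] * 20)

model = SVC(kernel="linear").fit(X, y)
print(model.support_vectors_)                # boundary points used for the hyperplane
print(model.predict([[25, 25], [55, 58]]))   # allocate new points to a class
```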
  • 36. Decision Trees
[Diagram: a decision tree with a root node ("likes walking"), internal nodes, and leaf nodes (cat person / dog person), plus examples of binary, numeric and choice splits.]
Decision trees are a supervised algorithm: we provide training data along with labels, which can be considered example inferences. We ask the algorithm to look at this data and find the patterns, so that when we give it unlabelled data it can discover how it fits. They can be used for customer analysis and medical conditions.
A decision tree is essentially a flow diagram with a root node, internal nodes and leaf nodes. The root node is where things start; an internal node asks another question, which flows down to further nodes; a leaf node is an end point. Branches do not need to have the same number of leaves.
Decision tree outputs can be binary or numeric, and splits are generally based on a numeric question, but decision points can also use a choice, such as favourite colour.
  • 37. Decision Trees
[Example dataset: Likes Walking, Likes Running, Kms Walked, Favourite Colour, Type (cat or dog person), and the tree built from it.]
How do we start our root node? We need to work out which feature aligns most closely with the question we are asking. In this example, when analysing the dataset, we see that it is 'likes running' that best splits dog people from cat people. We then filter the data based on that feature and identify the next most important feature, which becomes the next node, and continue down the other branches. You may find some features are not selected at all because they have no correlation with the question.
We will not see the actual decision tree when it is created, but we can give it new data to categorise and observe its behaviour.
  • 38. Random Forest
[Diagram: several randomly varied decision trees voting on the same data point; the majority class (dog) wins.]
Random forests are supervised algorithms. We provide pre-labelled data and ask the model to infer binary, classification or numeric outputs. A random forest is essentially a collection of decision trees; the problem with a decision tree on its own is that it can be inaccurate, and a random forest is a way to make decision trees more accurate.
When building each tree, instead of knowing in advance which question to place in the root node, the random forest checks a random pair of features at each split and follows the branch down, building a tree with random variance. We repeat this entire process a number of times, then run new data through all of the trees and label it based on the majority output.
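A hedged sketch of a single decision tree versus a random forest on the same labelled data; the cat/dog-person features here are synthetic stand-ins for the slide's table.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
# Features: likes_walking, likes_running, kms_walked; label: 1 = dog person.
X = rng.integers(0, 5, size=(200, 3))
y = (X[:, 1] + rng.normal(scale=0.5, size=200) > 2).astype(int)

tree = DecisionTreeClassifier().fit(X, y)
forest = RandomForestClassifier(n_estimators=100).fit(X, y)  # many varied trees

print(tree.predict([[1, 4, 2]]))
print(forest.predict([[1, 4, 2]]))           # majority vote across the trees
```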
  • 39. K-Means
[Plots: an unlabelled scatter of points, three randomly placed centre points, and the clusters forming around them as total variation reduces.]
K-Means is an unsupervised ML algorithm used to find classes (clusters) within data that has not been labelled.
If we want to find 3 classes of data, the algorithm makes some random guesses and places 3 centre points across the dataset. It then goes through each data point and checks which centre point it is closest to. At this stage the classification will be wrong, so it moves each centre point to the middle of its class. The algorithm then goes through the cycle again, reassigning points and moving the centre points, until the distributions make sense. We need to find the equilibrium where moving the centre points no longer affects the classification.
  • 40. [Plot: an elbow plot of reduction in variation against the number of clusters, k.]
To work out how many clusters to use, we can graph the number of clusters against the reduction in variation. With a single cluster the reduction is 0, and as we increase the number of clusters we eventually see an elbow in the plot where the variation does not change much. After a certain number of clusters it is ineffective to add more, as the extra reduction in variation is minimal.
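A hedged elbow-plot sketch: fit k-means for several values of k and compare the within-cluster variation (inertia); the three synthetic clusters are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=c, scale=2, size=(50, 2))
               for c in ([10, 10], [40, 15], [25, 45])])

for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(model.inertia_, 1))       # inertia drops sharply until k = 3
```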
  • 41. K-Nearest Neighbours
[Plot: existing clusters with a new, unclassified data point.]
K-nearest neighbours is a supervised algorithm used to classify new data against data that has already been classified. K-means may have already found clusters within the dataset; the challenge is knowing which class to associate the new data point with.
The 'k' is the number of nearest neighbours to consider when classifying the new data point. It should be large enough to reduce the influence of outliers, but small enough that small clusters do not get overlooked.
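A minimal k-nearest-neighbours sketch: classify new points by the majority class among their k closest neighbours (k = 5 assumed, synthetic clusters).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(6)
cluster_a = rng.normal(loc=[20, 20], scale=3, size=(30, 2))
cluster_b = rng.normal(loc=[70, 60], scale=3, size=(30, 2))
X = np.vstack([cluster_a, cluster_b])
y = np.array([0] * 30 + [1] * 30)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict([[25, 22], [65, 58]]))     # assigns each new point to a cluster
```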
  • 42. Latent Dirichlet Allocation (LDA)
[Diagram: documents made up of topics, which are in turn made up of words.]
LDA is an unsupervised algorithm used for classification and sentiment analysis. It is a description of the way documents are constructed: if you have a number of documents, each is made up of a number of different topics, and each topic is made up of words, which can also appear in multiple topics. LDA does not understand what is written in a document, but it performs statistical analysis to get some idea of the content.
There are data analysis steps before any processing is done: removing particular words such as stop words like 'and', which do not help towards understanding the content, and applying stemming so that words such as learned, learning and learn are all condensed into a single word, learn. Once this is complete we tokenise the words into an array. Finally we choose the number of topics we want LDA to find; this is k. If we select 3 topics, the algorithm starts by randomly assigning a topic number to every word in the array.
  • 43. [Tables: counts of how often each word appears in each topic, and how often each topic appears in each document.]
We then calculate how often each word appears in each topic. Once that is complete, we check each document and how often each topic appears there. For each word, we take the number of times the word appears in a topic and multiply it by how many times that topic appears in the particular document; whichever topic comes out highest, the word is reallocated to that topic. This happens as many times as necessary until the topics and words settle across all the documents. We can then see what each document is mostly about.
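A hedged LDA sketch using scikit-learn: tokenise a few tiny documents into word counts and ask for k = 2 topics. Real use would also apply stemming; the documents here are made up.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "machine learning deep learning lambda storage",
    "lambda storage serverless storage compute",
    "deep learning machine learning artificial intelligence",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)  # drop stop words, tokenise
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

print(lda.transform(counts))    # how much of each topic appears in each document
```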
  • 44.
  • 45. Neural Networks
[Diagram: an input layer, hidden layers and an output layer, with weights, a bias and an activation function inside a neuron.]
On the left-hand side is the input layer, then some hidden layers, and then an output layer. Data is processed at each layer of the network and neurons are activated in order to produce an inference.
On the first layer, the input layer, we load data into all of the inputs; for an image, for example, each pixel would go into an input. Random values, referred to as weights, are allocated to the connections; these are the factors used to adjust the values before they reach the next layer. The inputs are multiplied by their weights and summed into the next neuron, a bias is added to the sum, and the result is passed through an activation function.
  • 46. Activation Functions
[Plots: ReLU, Sigmoid and Tanh activation functions, and a network diagram asking "how correct am I?".]
There are three common types of activation function:
ReLU: does not pass on any negative values.
Sigmoid: generally places values between 0 and 1.
Tanh: similar to Sigmoid but also trends to negative 1 on the y axis.
If we plot the x value on the function, the y value is what the activation function provides. We do not tend to use Sigmoid or Tanh much; ReLU is the most commonly used.
The bias is there to prevent our neuron from being deactivated: if the result were 0 it would not influence anything, and the more neurons that are turned off, the less effective the network is.
At this point the output will be wrong, because everything is random; this first pass is called forward propagation.
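A minimal forward-propagation sketch for a single neuron: weighted sum, plus bias, passed through an activation function. The input values, weights and bias are arbitrary.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)          # negative values become 0

def sigmoid(x):
    return 1 / (1 + np.exp(-x))      # squashes values to between 0 and 1

inputs = np.array([0.6, 1.2, 0.8])   # values arriving from the previous layer
weights = np.array([0.3, -0.5, 0.2])
bias = 0.3                           # keeps the neuron from being switched off

z = np.dot(inputs, weights) + bias   # weighted sum plus bias
print(relu(z), sigmoid(z), np.tanh(z))
```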
  • 47. [Diagram: forward propagation through the network, a loss function at the output, and back propagation of the error.]
Once we get to the end we compute a loss function, which is an evaluation of the calculations that were made. Back propagation then uses gradient descent and the learning rate to reduce that loss; it looks at how to update the weights and biases. Each iteration of forward and back propagation is an epoch, and this is how the network learns.
  • 48. Convolutional Neural Networks (CNN)
[Diagram: an image passing through convolutional layers to a cat/dog classification.]
CNNs are a supervised algorithm mainly used for classification, mostly image classification and object detection. The hidden layers inside the network are the convolutional layers. Images generally have particular characteristics, such as edges, feathers, eyes and a beak if it were a penguin, and the different layers in the network work towards identifying these different characteristics.
For an image we use a convolutional filter: we take the first 3 x 3 block of pixels, apply the filter to it, write the result into a new image, and continue this across the whole image. A particular filter might, for example, detect edges. We can also use multiple filters that have been pre-trained by others; this is called transfer learning.
  • 49. Recurrent Neural Networks (RNN)
[Diagram: a table of hours and activities fed into an ML model that is not an RNN.]
RNNs are a supervised algorithm that can be used for stock prediction, time series data and voice recognition.
Imagine a robot that helps with repeated activities at various times during the day; there is a pattern to the activities. On the left would be the input layer, mapped to the next layer, each connection with a weight; the key part is that whatever the output is becomes an input on the next round. If one day we miss an activity and all the following times shift, a model that is not an RNN is not going to handle this very well.
  • 50. [Diagram: the model output fed back in as an input, giving the network a memory.]
The main idea is that we take the output and feed it back into the model, so it has a memory of previous predictions that influences future predictions.
Recurrent neural networks (RNNs) can remember a little; Long Short-Term Memory networks (LSTMs) can remember a lot.
  • 51.
  • 52. Confusion Matrix
[Diagram: the same dataset fed into SVM, Decision Trees and Logistic Regression models.]
A confusion matrix is a way to visualise the output of the testing that we do. We can apply different algorithms to our data, but the question is: which algorithm is best suited to our desired inference?
  • 53. [Tables: confusion matrices (likes dogs vs likes cats) for Logistic Regression and SVM, with counts of true/false positives and negatives.]
We split our data into training and testing sets and could use Logistic Regression, SVM or Decision Trees. Because we have labelled data, we can push the testing data through each model and get a result, but we want to establish which model is best suited to our scenario. One of the tools for this is the confusion matrix.
The matrix maps the model's predictions against the known truths so that we can see the accuracy: true positives, false positives, false negatives and true negatives. Simply put, the model may have predicted someone likes animals when they did not, or predicted they do not like animals when they did.
We would build a confusion matrix for each of the algorithms to see which performs better. It is not always clear which one is better until we understand our question in more detail; we then choose based on our particular use case.
  • 54. Sensitivity and Specificity
[Diagram: a confusion matrix annotated with the formulas Sensitivity = TP / (TP + FN) and Specificity = TN / (TN + FP).]
Sensitivity is the True Positive Rate (TPR), also called Recall: the correct positives out of the actual positives. Specificity is the True Negative Rate (TNR): the correct negatives out of the actual negatives.
Sensitivity = True Positives / (True Positives + False Negatives). The closer the value is to 1, the more accurate it is.
Specificity = True Negatives / (True Negatives + False Positives).
Banks are more interested in the sensitivity score, since they are looking for fraudulent activity. It is more important to catch fraud than to avoid falsely flagging it: if it was not fraud, the account can be unblocked. A fraud model will therefore favour higher sensitivity. Medical scenarios are similar: if it turns out to be a false identification, the doctor can use additional methods to verify.
Specificity matters, for example, when a child is watching videos on YouTube. False positives are not acceptable: we can put up with suitable videos that were not shown, but displaying unsuitable content will cause issues.
  • 55. Accuracy and Precision
Accuracy is the proportion of all predictions that were correctly identified. Precision is the proportion of predicted positives that were correctly identified.
Accuracy = (TP + TN) / Total
Precision = TP / (TP + FP)
We need to be careful how we frame the question, in a technical manner, when it comes to choosing a metric. An accuracy of 100% likely means the model is overfit and needs to be more generalised. A precision of 1 is possible when there are no false positives.
We can calculate the accuracy and precision for Logistic Regression against Decision Trees, for example, and then see the difference between them.
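A hedged sketch of building a confusion matrix and deriving sensitivity, specificity, accuracy and precision from it; the labels and predictions are illustrative.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]      # known truths (1 = likes dogs)
y_pred = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]      # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                 # recall / true positive rate
specificity = tn / (tn + fp)                 # true negative rate
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
print(sensitivity, specificity, accuracy, precision)
```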
  • 56. ROC / AUC
[Plots: a sigmoid curve of probability of liking coffee with a movable cut-off line, and the corresponding confusion matrix.]
Consider a logistic regression curve for a binary situation, likes coffee vs does not; we can model this behaviour, but it must be binary, and we need to identify where the cut-off is located. If we move the cut-off line up we increase specificity, meaning we do not want any of the classifications to be incorrect. If we move it down we increase sensitivity: we do not mind capturing some false positives, as long as the true positives are captured, because we can address the false positives with further checks and balances later.
The question is where to draw this line, and it depends on what we want to show. The other consideration is where the best balance between sensitivity and specificity lies; going from one extreme to the other is not useful, as the model will always return the same result. The confusion matrix can be used at each candidate cut-off to count the TP, TN, FN and FP.
In this example, some labelled test data shows who likes coffee and who does not. Everything to the right of the vertical cut-off is classified as liking coffee and everything to the left as not liking it.
  • 57. [Plots and tables: the sigmoid curve with the cut-off placed at three different heights, and the confusion matrix for each.]
We now have a selection of confusion matrices; what do we do with them, and what is the best cut-off point for our data? We repeat the exercise at a variety of different cut-offs.
At one cut-off we correctly identified all 5 people who like coffee and got 3 true negatives, but ended up with 2 false positives and 0 false negatives.
Moving the line further up, we captured 3 true positives and all 5 true negatives, with 2 false negatives and no false positives.
At another position the results change again: we misclassified a single point in each direction, one false negative and one false positive.
  • 58. [Plot: the ROC curve of true positive rate against false positive rate with the area under the curve (AUC) shaded; for one cut-off, TPR = 5 / (5 + 0) = 1 and FPR = 2 / (2 + 3) = 0.4.]
The ROC (Receiver Operating Characteristic) curve is useful for understanding the balance between sensitivity and specificity, and the AUC for the overall separability between the classes. If we take the FPR and TPR calculated from each confusion matrix, we can plot the results.
The point where the curve stops rising steeply and flattens into the top line is the cut-off for maximum sensitivity; the start of the steep slope is the best model for maximum specificity. In both cases we are looking for where the points on the graph effectively change direction.
AUC is the area under the curve and represents how good the model is overall at distinguishing between the different classes: the larger the area under the curve, the better it is at distinguishing them.
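A hedged ROC/AUC sketch: sweep the cut-off over predicted probabilities, list the TPR/FPR pairs, and compute the area under the curve; the probabilities are made up.

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.75, 0.6, 0.45, 0.5, 0.4, 0.3, 0.2, 0.1]   # model probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
print(list(zip(thresholds.round(2), fpr.round(2), tpr.round(2))))
print(roc_auc_score(y_true, y_prob))         # closer to 1 = better separability
```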
  • 59. Gini Impurity
[Diagram: candidate splits on Likes Walking, Likes Running and Favourite Colour, with Gini impurity = 1 - (probability of dog)^2 - (probability of cat)^2 calculated for each leaf and combined as a weighted average: Likes Walking 0.362, Likes Running 0.384, Favourite Colour 0.371.]
In decision trees, the algorithm goes through the data looking for the feature that best splits it. This can be calculated in various ways, and Gini impurity is one of them: we calculate the impurity of each leaf produced by splitting on a feature, then take the weighted average of those values for that feature. Likes walking has the lowest weighted Gini impurity, so it best separates people who like dogs from people who like cats, and we use likes walking as our root node.
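A minimal helper matching the formula above, using the leaf counts from the slide's "likes walking" split; the result reproduces the 0.362 weighted impurity shown.

```python
def gini(dogs, cats):
    total = dogs + cats
    return 1 - (dogs / total) ** 2 - (cats / total) ** 2

def weighted_gini(left, right):
    n = sum(left) + sum(right)
    return (sum(left) / n) * gini(*left) + (sum(right) / n) * gini(*right)

# Leaf counts (dog people, cat people) after splitting on "likes walking".
print(round(weighted_gini((97, 23), (30, 68)), 3))   # 0.362
```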
  • 60. F1 Score
F1 is a combination of recall and precision; it takes the false positives and false negatives into account in the calculation.
F1 = 2 x (Recall x Precision) / (Recall + Precision)
The F1 score is a better way to measure a model than plain accuracy, Accuracy = (TP + TN) / Total, particularly if you have an uneven class distribution. Whenever you see F1 it is discussed in terms of recall rather than sensitivity, which is why it is described that way.
[Table: Logistic Regression - sensitivity 0.543, specificity 0.835, accuracy 0.839, precision 0.857; Decision Trees - sensitivity 0.864, specificity 0.824, accuracy 0.844, precision 0.839.]
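A minimal F1 sketch applying the formula above to the two rows of the comparison table, plus scikit-learn's f1_score on a small illustrative prediction set.

```python
from sklearn.metrics import f1_score

def f1(recall, precision):
    return 2 * (recall * precision) / (recall + precision)

print(round(f1(0.543, 0.857), 3))            # Logistic Regression row
print(round(f1(0.864, 0.839), 3))            # Decision Trees row

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
print(f1_score(y_true, y_pred))              # same measure computed from predictions
```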
  • 61.
  • 62. AWS Services, ML and DL Frameworks
An algorithm such as a CNN together with a framework such as MXNet make up the model, which is then trained to create inferences.
TensorFlow: developed by Google; powers suggested videos, spam filtering and more. It builds a computation graph, with the ability to put placeholders for values in the graph, so it can see where values came from for back propagation.
PyTorch: a deep learning framework built on top of Python, often considered better for recurrent neural networks, and the runner-up to TensorFlow in established machine learning. PyTorch needs to keep track of what happened so it can improve the model; the autograd feature stores where calculations come from.
MXNet: the framework used most by SageMaker; AWS has done considerable work with MXNet, and it is very good at scaling across cloud infrastructure. It shares an architecture more similar to PyTorch than TensorFlow. Its NDArray is similar to NumPy's ndarray and is MXNet's tensor. MXNet is aware of the processors it runs on, so it can see GPUs and CPUs. We need it to record and watch the tensors for back propagation, and we do this with autograd.
SciKit-Learn: an easier framework to use, with native support for many algorithms. It has a number of datasets built in already, such as the digits dataset, which helps because getting enough correctly formatted data is the biggest challenge in ML.
  • 63. AWS Services
[Diagrams: the Glue Data Catalog sitting between S3, Athena and SageMaker; Kinesis Video Streams feeding Rekognition, and Kinesis Data Streams feeding Lambda and SNS for a mobile app.]
Glue: the Glue Crawler can create a database definition from data stored in S3. Glue does not store any data itself but makes connections, including JDBC and DynamoDB. It can perform some ETL tasks and has some ML capabilities, such as de-duplicating table record sets. We can load up CSVs and essentially grab the schema of the dataset, and we can also glue together different datasets to get a single view.
Athena: provides an SQL interface into S3 and can source data from multiple S3 locations. Athena reads the schema of the data from Glue. We can do feature engineering on the original dataset and then use the result for analysis or to train our algorithm.
Kinesis: for ingesting large amounts of data from a few or many data points. It comes as Video Streams, Data Streams, Data Firehose and Data Analytics. Video Streams allows streaming video from connected devices for analytics, ML and other processing. Data Streams is the catch-all, general-purpose endpoint to ingest large quantities of data, which might be sent to EC2 instances running your own logic or to other services such as Spark on EMR; it is more complex to configure. Data Firehose is an endpoint to stream data into S3, Redshift, Elasticsearch or Splunk. Data Analytics can process streaming data from Kinesis Streams or Firehose at scale using SQL.
  • 64. [Diagram: Kinesis Firehose and other sources landing data in S3, queried by Athena, SageMaker and EMR (master, core and task nodes).]
S3: cost-effective storage for large amounts of data, both structured (CSV, JSON) and unstructured (text files, images). As a data lake you can add data from many sources and define the schema at the time of analysis, at a much lower cost than data warehouse solutions; however, it is unsuitable for transactional systems and needs cataloguing before analysis.
QuickSight: a Business Intelligence (BI) tool to visualise data from many sources, with dashboards, email reports and embedded reports. It is targeted at end users.
EMR: a managed service for hosting massively parallel compute tasks. It integrates with S3 for storage at petabyte scale and uses 'big data' tools like Spark, Hadoop and HBase.
  • 65.
  • 66. Amazon Rekognition
Capabilities include image moderation, facial analysis, celebrity recognition, face comparison and text in image.
Use cases:
Create a filter to prevent inappropriate images (including nudity or offensive text) being sent via a messaging platform.
Enhance the metadata catalog of an image library to include the number of people in each image.
Scan an image library to detect instances of famous people.
  • 67. Amazon Rekognition Video
[Diagram: S3 → Lambda → Rekognition Video → SNS → SQS → Lambda.]
In this example we start by storing a video in an S3 bucket. A Lambda function is invoked by the new-object event and calls Rekognition, which goes to the S3 bucket to get the data. When Rekognition finishes processing it sends a message to an SNS topic, which is written to an SQS queue. Another Lambda function sees the message in the queue and goes back to Rekognition to fetch the results of the completed job.
Use cases:
Detect people of interest in a live video stream for a public safety application.
Create a metadata catalog for a stock video footage library.
Detect offensive content within videos uploaded to a social media platform.
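A hedged boto3 sketch of the pipeline described above: one Lambda handler starts an asynchronous label-detection job, and a second fetches the results after the completion message arrives via SNS/SQS. The bucket parsing, topic ARN and role ARN are placeholders, not the course's actual setup.

```python
import boto3

rekognition = boto3.client("rekognition")

def start_handler(event, context):
    # Triggered by the S3 put event for the newly uploaded video.
    record = event["Records"][0]["s3"]
    response = rekognition.start_label_detection(
        Video={"S3Object": {"Bucket": record["bucket"]["name"],
                            "Name": record["object"]["key"]}},
        NotificationChannel={"SNSTopicArn": "arn:aws:sns:REGION:ACCOUNT:rekognition-done",
                             "RoleArn": "arn:aws:iam::ACCOUNT:role/rekognition-sns"},
    )
    return response["JobId"]

def results_handler(event, context):
    # In practice the JobId is parsed out of the SQS message body.
    job_id = event["JobId"]
    labels = rekognition.get_label_detection(JobId=job_id)
    return [item["Label"]["Name"] for item in labels["Labels"]]
```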
  • 68. Amazon Polly
You can enter plain text and it will be transformed into speech, with female or male voices. Custom lexicons give you the ability to create your own specific words and pronunciations. SSML (Speech Synthesis Markup Language) lets you add syntax to change the way something is spoken; for example, a 'whispered' effect would say it in a whispered tone. A variety of languages are supported, such as French, German, Hindi, Italian and Romanian.
Use cases: create accessibility tools to 'read' web content; provide automatically generated announcements via a public address (PA) system; create an automated voice response (AVR) solution for a telephony system (including Connect).
Amazon Transcribe
You can either speak directly into the microphone or pass it audio files, which will be transcribed to text.
Use cases: create a call centre monitoring solution that integrates with other services to analyse caller sentiment; enable text search of media with spoken words; provide a closed captioning solution for online video training.
• 69. Amazon Translate
• We can either pass in files or translate in real-time
• There is a large variety of languages
• Ability to add custom terminology
• Provides a variety of metrics, such as successful request count, throttled request count and character count, along with others
Use Cases
• Enhance an online customer chat application to translate conversations in real-time
• Batch translate documents within a multilingual company
• Create a news publishing solution to convert posted stories to multiple languages
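A short boto3 sketch of the real-time path; the language codes are just example choices:

import boto3

translate = boto3.client("translate")

# Real-time translation of a single string, e.g. one chat message.
result = translate.translate_text(
    Text="Your order has been dispatched and should arrive tomorrow.",
    SourceLanguageCode="en",
    TargetLanguageCode="de",
)
print(result["TranslatedText"])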
• 70. Amazon Lex
• Automatic speech recognition (ASR)
• Natural language understanding (NLU)
Use Cases
• Create a chatbot that triages customer support requests directly on the product page of a website
• Create an automated receptionist that directs people as they enter a building
• Provide an interactive voice interface to any application
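A minimal sketch of how a website chatbot might hand text to Lex, using the Lex (V1) runtime API; the bot name, alias and user ID are hypothetical:

import boto3

lex = boto3.client("lex-runtime")

# Send one user utterance to the bot and read back the matched intent and reply.
response = lex.post_text(
    botName="SupportTriageBot",   # placeholder bot name
    botAlias="prod",              # placeholder alias
    userId="web-user-42",         # any identifier that keeps the conversation state
    inputText="My order arrived damaged, what can I do?",
)
print(response["intentName"], "->", response["message"])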
• 71. AWS Step Functions
(Diagram: S3 → Lambda → Step Functions state machine orchestrating further Lambda functions that call Amazon Transcribe and Amazon Comprehend.)
• In this example we record some audio and upload it into S3
• An S3 event triggers a Lambda function, which starts an AWS Step Functions state machine that orchestrates the desired behaviour between the different services
• The state machine triggers another Lambda function, which talks to Transcribe and kicks off a transcription job against the S3 object
• After a period of time another function checks whether the job has completed, and based on the response we decide what to do next
• Once we have the transcript, another Lambda function passes it to Amazon Comprehend, which can extract key phrases, entities, sentiment and language, amongst other things
• This data can then be stored in a database or used by an application
• AWS Step Functions lets you coordinate multiple AWS services into serverless workflows. It allows you to stitch together services such as Transcribe and Comprehend along with Lambda functions and other services
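A minimal boto3 sketch of the Comprehend step at the end of that workflow, assuming the transcript text is already available as a Python string:

import boto3

comprehend = boto3.client("comprehend")
transcript = "The delivery was late but the support agent was extremely helpful."

# The kinds of analysis mentioned above: language, sentiment, entities, key phrases.
language = comprehend.detect_dominant_language(Text=transcript)["Languages"][0]["LanguageCode"]
sentiment = comprehend.detect_sentiment(Text=transcript, LanguageCode=language)["Sentiment"]
entities = comprehend.detect_entities(Text=transcript, LanguageCode=language)["Entities"]
key_phrases = comprehend.detect_key_phrases(Text=transcript, LanguageCode=language)["KeyPhrases"]

print(language, sentiment)
print([e["Text"] for e in entities])
print([p["Text"] for p in key_phrases])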
• 72. SageMaker Overview
(Diagram: channel parameters pulling training data from sources such as S3, EFS or FSx.)
• Ability to build, train and deploy machine learning models quickly
• It covers the entire machine learning workflow: label and prepare your data, choose an algorithm, train the model, tune and optimise it for deployment, make predictions and take action
• Ability to pull in data from various different sources, which we do using channel parameters
• AWS recommend the larger instance sizes for training
• Some algorithms only support GPUs
• GPU instances are more expensive, but faster
• There is also managed spot training, and you can keep checkpoints of the model state in S3. This can be up to 90% cheaper than on-demand instances
SageMaker --> Training Jobs
  ◦ In an S3 bucket we might have a collection of data, e.g. images of cats and dogs
  ◦ We then pick an algorithm source, e.g. a built-in algorithm from SageMaker
  ◦ We can also choose the type of algorithm, like image classification
  ◦ Then we decide how we wish to input the data, i.e. File or Pipe mode
  ◦ Ability to select the instance size, VPC and encryption
  ◦ There is also the ability to set hyperparameters here, such as mini-batch size, epochs etc. AWS pre-populates a lot of this when doing it in the console
  ◦ We then set our training data, validation data and output location
  ◦ Once you have the above, you will end up with a model which can then be used to make inferences (a training job sketch follows below)
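Roughly how the same training job could be created through the API rather than the console, as a hedged boto3 sketch; the bucket names, IAM role, training image URI and hyperparameter values are all placeholders you would swap for your own:

import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_training_job(
    TrainingJobName="catdog-image-classification",
    AlgorithmSpecification={
        # Placeholder: use the built-in image-classification container URI for your region.
        "TrainingImage": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/image-classification:1",
        "TrainingInputMode": "File",   # or "Pipe"
    },
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    HyperParameters={"num_classes": "2", "num_training_samples": "1000",
                     "epochs": "10", "mini_batch_size": "32"},
    InputDataConfig=[
        {"ChannelName": "train",
         "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix",
                                         "S3Uri": "s3://my-training-bucket/catdog/train/"}}},
        {"ChannelName": "validation",
         "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix",
                                         "S3Uri": "s3://my-training-bucket/catdog/validation/"}}},
    ],
    OutputDataConfig={"S3OutputPath": "s3://my-training-bucket/catdog/output/"},
    ResourceConfig={"InstanceType": "ml.p3.2xlarge", "InstanceCount": 1, "VolumeSizeInGB": 50},
    # Managed spot training with checkpoints in S3, as mentioned above.
    EnableManagedSpotTraining=True,
    CheckpointConfig={"S3Uri": "s3://my-training-bucket/catdog/checkpoints/"},
    StoppingCondition={"MaxRuntimeInSeconds": 3600, "MaxWaitTimeInSeconds": 7200},
)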
• 73. SageMaker - Batch / Real-time
(Diagram: an application invoking a SageMaker endpoint backed by a Docker container from ECR and model artefacts in S3; a batch transform job reading input from and writing output back to S3.)
Real Time
  ◦ It is possible to do real-time inference by having the application invoke the SageMaker endpoint, which then calls on the model
Batch
  ◦ Batch Transform jobs
  ◦ We point them at the data in S3 that we want to get inferences from
  ◦ We could then push that through our classification model, for example to understand whether we have a high-value customer (a sketch of both approaches follows below)
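A hedged boto3 sketch of both inference paths; the endpoint name, model name, buckets and input file are placeholder values:

import boto3

# Real-time: invoke a deployed endpoint with a single payload.
runtime = boto3.client("sagemaker-runtime")
with open("cat.png", "rb") as f:
    response = runtime.invoke_endpoint(
        EndpointName="catdog",
        ContentType="application/x-image",
        Body=f.read(),
    )
print(response["Body"].read())

# Batch: run inference over a whole S3 prefix with a Batch Transform job.
sagemaker = boto3.client("sagemaker")
sagemaker.create_transform_job(
    TransformJobName="catdog-batch-001",
    ModelName="catdog-model",   # placeholder model name
    TransformInput={"DataSource": {"S3DataSource": {"S3DataType": "S3Prefix",
                                                    "S3Uri": "s3://my-inference-bucket/incoming/"}}},
    TransformOutput={"S3OutputPath": "s3://my-inference-bucket/predictions/"},
    TransformResources={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
)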
• 74. SageMaker - Deploy
SageMaker --> Models
  ◦ Once we have created our model artefacts, we set a container to host the model
SageMaker --> Endpoint Configuration
  ◦ We then add our model to an endpoint configuration
  ◦ Within this we specify the model and the instance type
SageMaker --> Endpoints
  ◦ We create a new endpoint here, using an existing endpoint configuration that was created earlier
• At this point you could run a command, give it a new file and use the model to create an inference:
aws sagemaker-runtime invoke-endpoint --endpoint-name catdog --body fileb://cat.png --profile sandbox ./output.json
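The same three steps (model → endpoint configuration → endpoint) expressed as a boto3 sketch; the container image, model artefact path, role ARN and resource names are placeholder values:

import boto3

sagemaker = boto3.client("sagemaker")

# 1. Model: point SageMaker at the inference container and the trained model artefacts.
sagemaker.create_model(
    ModelName="catdog-model",
    PrimaryContainer={
        "Image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/image-classification:1",  # placeholder
        "ModelDataUrl": "s3://my-training-bucket/catdog/output/model.tar.gz",
    },
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
)

# 2. Endpoint configuration: which model, on what instance type, how many instances.
sagemaker.create_endpoint_config(
    EndpointConfigName="catdog-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "catdog-model",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
    }],
)

# 3. Endpoint: deploy the configuration so it can be invoked in real time.
sagemaker.create_endpoint(EndpointName="catdog", EndpointConfigName="catdog-config")

Once the endpoint shows as InService, the invoke-endpoint CLI command above (or the invoke_endpoint call on the previous slide) can be used against it.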