Machine Learning Guide to AI

Learning from Data
A fast-paced guide to machine learning and artificial intelligence
by Thomas Holloway
Co-Founder/Software Engineer @ Nuvi (http://www.nuviapp.com)

“Intelligence is the art of good guesswork”
– H.B. Barlow

General Intelligence Goals
• Deduction, Reasoning, Problem Solving
• Knowledge Representation
• Planning
• Learning
• Natural Language Processing
• Motion and Manipulation
• Perception
• Social Intelligence
• Creativity
• Early AI research began with the study of logic itself, leading to algorithms that imitate the step-by-step reasoning used to solve puzzles and problems (heuristics).
• In contrast, methods drawn from economics and probability in the late 1980s and 1990s led to very successful approaches for dealing with uncertainty and incomplete information.
• Statistical approaches, neural networks (modeling the probabilistic way humans guess)

Knowledge Representation
• Represent concepts about objects, places, situations, events, times, and language
• What they look like
• Categorical features
• Properties
• Relationships between each other
• Meta-knowledge (knowledge of what other people know)
• Causes, effects, and many other less well-known research fields
• “What exists” = Ontology

Knowledge Representation (continued)
• Difficult Problems
• Working assumptions, default reasoning, the qualification problem
• Commonsense Knowledge
• A major goal is to acquire this automatically, largely through unsupervised learning
• Ontology Engineering
• Subsymbolic Form of Commonsense Knowledge
• Not all knowledge can be represented as facts or statements (for example, the intuition to avoid a move in a chess match because it “feels too exposed”)

Planning
• Set goals and achieve them
• (Visualize a representation of the world, predict how actions will change it, and make choices that maximize utility)
• Requires reasoning under uncertainty: check whether the world/environment matches the predictions, and correct errors when it does not
• Example: I move a chess piece here, the other player responds and puts me in a seemingly poor position, and I act accordingly
Learning
• Machine learning is the study of algorithms that improve automatically through experience.
• It probably plays the most central role in artificial intelligence.
• Unsupervised Learning - finding patterns
• Supervised Learning - classifying which category something belongs to, or producing a function that maps input -> output
• Reinforcement Learning - rewards
• Developmental Learning - self-exploration, active learning, imitation, guidance, entropy

Natural Language Processing
• Read and understand text
• Listen and understand speech
• Information Retrieval
• Machine Translation
• Sentiment Analysis
• Category Theory (Quantum Logic in Information Flow Theory)
• Common techniques include semantic indexing, parse trees, and syntactic and semantic analysis
• A major goal is to build an ontology automatically (for knowledge representation) by scanning books, Wikipedia, dictionaries, etc.
• We recently used Wiktionary and Wikipedia to automatically build a part-of-speech tagger and sentiment analysis engine for multiple languages. http://www.nuviapp.com/ <- PLUG

Statistical Machine Learning is the art of taking lots of data and turning it into statistically known probabilities.
• Entropic Force (Alex Wissner-Gross's argument for intelligence)
• Language Discovery
• Automated Trading Systems
• Machine Translation
• Spam Detection
• Self-Driving Cars
• Facial Recognition
• Gesture Recognition
• Speech Recognition
• Nest
• Shazam
• Spotify
• Netflix, Amazon Recommendations
• Duolingo
• Robot Movement
• Fraud Detection
• Intrusion Detection / State Anomaly
• DNA Sequence Alignment
• Siri, Google Voice, Google Now, Kinect
• Sentiment Analysis
• Text/Character Recognition (Scanning books)
• Health Monitoring (Healthcare)
• Pandora, iTunes / iGenius

Types of Machine Learning
• Supervised Learning
• Unsupervised Learning
• Recommendation Systems
• Reinforcement Learning
• (rewards for good responses, punished for bad ones)
• Developmental Learning
• (self-exploration, entropic force, cumulative acquisition of novel skills typical of robot movement - autonomous interaction with environment and “teachers”, imitation, maturation)

Supervised Learning
• Two types that we will discuss within supervised learning:
• Regression analysis (single-valued real output)
• Classification

Optimization Objectives
• Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
• Parameters: $\theta_0, \theta_1$
• Cost Function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
• Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$
$m$ = number of samples
$x^{(i)}$ = x at sample i
$y^{(i)}$ = y at sample i
Our cost function takes the squared error between each prediction from our hypothesis and the actual value y, then sums those errors into a total “cost”.
The goal is to minimize the error produced by the cost function by adjusting the parameters theta.
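
As a minimal sketch of the squared-error cost described above: it assumes the univariate hypothesis h(x) = theta0 + theta1 * x and the conventional 1/(2m) scaling, and the function names and toy data are illustrative only.

```python
def hypothesis(theta0, theta1, x):
    """Univariate linear hypothesis h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Squared-error cost J = 1/(2m) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    total = sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)


# Toy data lying exactly on y = 2x, so theta = (0, 2) has zero cost.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
perfect_cost = cost(0.0, 2.0, xs, ys)
worse_cost = cost(0.0, 1.0, xs, ys)  # a poorer fit gives a larger cost
```

Comparing `perfect_cost` with `worse_cost` shows the idea: the further the predictions drift from the actual values, the larger the total cost.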

Gradient Descent
• First-Order Optimization Algorithm
• Finds a local minimum of a function by taking steps proportional to the negative of the gradient of the function at the current point.
• Popular for large-scale optimization problems
• easy to implement
• works on just about any black-box function whose gradient can be computed or estimated
• each iteration is relatively cheap

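Gradient Descent
repeat until convergence {
  $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$   (for j = 0 and j = 1)
}
Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
Cost Function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

Gradient Descent
Substituting the cost function into the update gives:
repeat until convergence {
  $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
  $\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$
}
($\alpha$ is the learning rate; both parameters are updated simultaneously.)
* note: sometimes referred to as batch gradient descent (given that we iterate over all training examples to perform a single update on our parameters)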
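
The loop above can be sketched in a few lines of Python. This is only an illustration under assumptions: the learning rate, iteration count, and toy data are made up, and a fixed number of iterations stands in for a real convergence check.

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Every update uses all m training examples (hence "batch").
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed from the old parameters.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated from y = 1 + 2x
theta0, theta1 = gradient_descent(xs, ys)
```

On this toy data the parameters approach theta0 = 1 and theta1 = 2. A learning rate that is too large would make the updates diverge instead.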

Multivariate Linear Regression
TV Budget   Online Ads   Billboards   Sales
230.1       37.8         63.1         22.1
44.5        39.9         45.1         10.4
17.2        45.8         69.3         9.3
180.8       41.3         58.5         18.5

• Hypothesis: $h_\theta(x) = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n$
• Think of x as one example whose features form a vector of up to n features, with $x_0 = 1$

Gradient Descent
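repeat until convergence {
  $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$   (for j = 0…n)
}
Hypothesis: $h_\theta(x) = \theta^T x$
Cost Function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$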
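
A rough sketch of the multivariate update for j = 0…n, assuming each example carries a leading constant feature x0 = 1. The helper names, learning rate, and toy data are hypothetical and chosen only so the result is easy to check.

```python
def predict(theta, x):
    """h(x) = theta^T x, where x[0] is the constant feature 1."""
    return sum(t * xj for t, xj in zip(theta, x))


def gradient_step(theta, X, y, alpha):
    """One batch update of every theta_j, computed simultaneously."""
    m = len(X)
    errors = [predict(theta, row) - target for row, target in zip(X, y)]
    return [
        theta_j - alpha * sum(e * row[j] for e, row in zip(errors, X)) / m
        for j, theta_j in enumerate(theta)
    ]


# Hypothetical toy data generated from y = 1 + 2*x1 + 3*x2 (x0 = 1 prepended).
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
y = [1, 3, 4, 6, 8]
theta = [0.0, 0.0, 0.0]
for _ in range(5000):
    theta = gradient_step(theta, X, y, alpha=0.1)
```

The features here are already on similar scales. With raw columns of very different ranges, the same loop would need a much smaller learning rate or normalized inputs.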

Techniques in managing input
• Mean normalization (make sure all your inputs have similar ranges)
• FFT for audio
• Mean / Average / Range
• Graph your cost function over the number of iterations (make sure it is decreasing)
• Separate data sets (training, cross validation, test)
• Train on a given set of data, adjust regularization / extra features, etc., and graph your cost function against the cross validation set
• Finally, test against unseen data from your test set
• Typically the split is 60-30-10, or even 70-20-10, depending on how you wish to split things up.
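
Two of the techniques above can be sketched as follows. This assumes mean normalization means (x - mean) / (max - min), which is one common definition, and the function names are illustrative. In practice the samples would usually be shuffled before splitting.

```python
def mean_normalize(values):
    """Rescale a feature to (x - mean) / (max - min)."""
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    return [(v - mean) / spread for v in values]


def split(samples, train=0.6, validation=0.3):
    """Split samples into train / cross-validation / test, in order."""
    n_train = round(len(samples) * train)
    n_val = round(len(samples) * validation)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])


tv_budget = [230.1, 44.5, 17.2, 180.8]  # TV Budget column from the sales table
scaled = mean_normalize(tv_budget)
train_set, cv_set, test_set = split(list(range(10)))  # 60-30-10 split
```

After normalization the feature is centered on zero and spans a range of exactly one, which keeps gradient descent from being dominated by the largest column.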

Normal Equation
• Solves for the parameters analytically: $\theta = (X^T X)^{-1} X^T y$
• Useful when n is relatively small (number of features < 5000 or so)
• Uses the entire matrix of input, X
• Each sample = vector of features (one row of X)
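
A small pure-Python sketch of the normal equation. Rather than forming the inverse explicitly, it solves the equivalent linear system (X^T X) theta = X^T y by Gauss-Jordan elimination. That is a swapped-in technique for illustration, and the toy data is hypothetical. It assumes X^T X is invertible.

```python
def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y by Gauss-Jordan elimination."""
    n = len(X[0])
    # Build the augmented matrix [X^T X | X^T y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)]
         + [sum(row[i] * t for row, t in zip(X, y))]
         for i in range(n)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * p for a, p in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


# Each row is one sample: [x0 = 1, x1, x2]; targets follow y = 1 + 2*x1 + 3*x2.
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
y = [1, 3, 4, 6, 8]
theta = normal_equation(X, y)
```

There is no learning rate and no iteration, but the cost of solving the system grows quickly with the number of features, which is why it suits small n.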

Supervised Learning - Classification
• Spam/Not Spam
• Benign/Malignant
• Biometric Identification
• Speech Recognition
• Fraudulent Transactions
• Pattern Recognition
• 0 = Negative Class
• 1 = Positive Class

Logistic Regression / Classification
• What we want is a function that produces a value between 0 and 1 for any weighted input we provide.
• Sigmoid Activation Unit: $g(z) = \frac{1}{1 + e^{-z}}$
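
A minimal sketch of the sigmoid unit, assuming the standard logistic function and a weighted input z = theta^T x. The helper name `predict_probability` is hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)); output lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(theta, x):
    """Squash the weighted input theta^T x into the range (0, 1)."""
    z = sum(t * xj for t, xj in zip(theta, x))
    return sigmoid(z)


midpoint = sigmoid(0.0)   # exactly 0.5
high = sigmoid(10.0)      # close to 1
low = sigmoid(-10.0)      # close to 0
```

Note that `math.exp` overflows for very large negative z (below roughly -709), so a production version would need to guard that case.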

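Logistic Regression Cost Function
Linear Regression Cost Function (for comparison)
• Hypothesis: $h_\theta(x) = \theta^T x$
• Cost Function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

Logistic Regression
• Hypothesis: $h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$
• Cost Function: $\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x))$ if $y = 1$, and $-\log(1 - h_\theta(x))$ if $y = 0$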

Logistic Regression Cost Function Intuition

When y = 1: if we predicted 0 when we should have predicted 1, we return a very large cost.

When y = 0: if we predicted 1 when we should have predicted 0, we return a very large cost.

• This is the “simplified” formula: $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$
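
To make the intuition concrete, here is a small sketch of the logistic (cross-entropy) cost, assuming predictions are probabilities strictly between 0 and 1. The function name and the sample predictions are illustrative.

```python
import math


def logistic_cost(predictions, labels):
    """Average of -y*log(h) - (1 - y)*log(1 - h) over all samples."""
    m = len(labels)
    total = 0.0
    for h, y in zip(predictions, labels):
        total += -y * math.log(h) - (1 - y) * math.log(1 - h)
    return total / m


# Confident and correct: small cost. Confident and wrong: very large cost.
confident_right = logistic_cost([0.99, 0.01], [1, 0])
confident_wrong = logistic_cost([0.01, 0.99], [1, 0])
```

The wrong-but-confident case costs several hundred times more than the right-and-confident one, which is exactly the penalty the intuition above describes.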

Logistic Regression Decision Boundaries
• The decision boundary is the threshold or line at which the input data starts favoring one class over another. This is usually the point where our sigmoid function crosses the 0.5 mark.

Multi-class Classification
• (Sunny, Rainy, Cloudy, etc.)
• Multi-class classification deals with multiple categories. It is typically done as one-vs-all classification, where each class gets its own classifier (1 = positive for the given class, 0 for everything else).
• To predict, find the class with the maximum probability among all the classifiers.
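
A sketch of the one-vs-all prediction step, assuming one already-trained logistic classifier per class. The parameter values and the two features (humidity, cloud cover) are hand-picked and hypothetical, purely to show the max-probability selection.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(classifiers, x):
    """Return the label whose classifier gives x the highest probability."""
    scores = {
        label: sigmoid(sum(t * xj for t, xj in zip(theta, x)))
        for label, theta in classifiers.items()
    }
    return max(scores, key=scores.get)


# Hypothetical parameters; x = [1 (bias), humidity, cloud cover].
classifiers = {
    "Sunny": [2.0, -3.0, -3.0],
    "Rainy": [-3.0, 4.0, 1.0],
    "Cloudy": [-2.0, 0.0, 4.0],
}
label = predict_class(classifiers, [1.0, 0.9, 0.8])
```

Each classifier only answers "this class or not", and the final label is whichever classifier is most confident.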

Overfitting and Regularization

Regularization
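• Regularized Logistic Regression: $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$
• Regularized Gradient Descent: $\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]$ (for j = 1…n; $\theta_0$ is not regularized)
• Regularization Parameter: $\lambda$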
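
As a hedged sketch of a regularized gradient descent step for logistic regression: it assumes an L2 penalty scaled by lambda/m in the gradient and an unpenalized bias term, which is a common convention rather than something the surviving text confirms. The names and data are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def regularized_step(theta, X, y, alpha, lam):
    """One gradient descent update with an L2 penalty on theta_1..theta_n."""
    m = len(X)
    errors = [sigmoid(sum(t * xj for t, xj in zip(theta, row))) - label
              for row, label in zip(X, y)]
    new_theta = []
    for j, theta_j in enumerate(theta):
        grad = sum(e * row[j] for e, row in zip(errors, X)) / m
        if j > 0:  # the bias term theta_0 is not penalized
            grad += lam / m * theta_j
        new_theta.append(theta_j - alpha * grad)
    return new_theta


X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]
theta = [0.5, 2.0]
plain = regularized_step(theta, X, y, alpha=0.1, lam=0.0)
shrunk = regularized_step(theta, X, y, alpha=0.1, lam=4.0)
```

Comparing `plain` with `shrunk` shows the effect of the regularization parameter: the bias is untouched, while the feature weight is pulled further toward zero.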

Sophisticated Neural Networks can do some really amazing things
Multi-layered (deep) neural networks can be built to identify extremely complex things, with potentially millions of features to train on. Neural networks can auto-encode (learn from the input itself), classify into many categories at once, and be trained to output real values. They can even be built to retain memory or long-term state (much as hidden Markov models or finite-state automata do).

Types of Neural Networks
• Feedforward
• Recurrent
• Echo-State
• Long Short-Term Memory
• Stochastic
• Bidirectional (propagates in both directions)

Feed Forward Network
[Figure: feed-forward network with an input layer of input features, an output, and +1 bias units]
• What is the value (output) of a given unit?
• Answer: the sigmoid activation of the sum of its weighted inputs
• Computing this for every unit, layer by layer from input to output, is feed forward propagation.
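
Feed forward propagation can be sketched as below, assuming sigmoid units and a +1 bias unit prepended to each layer's inputs. The network shape (2 inputs, 2 hidden units, 1 output) and all weights are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(weights, activations):
    """weights[k] = [bias weight, w_1, ..., w_n] for unit k of the next layer."""
    inputs = [1.0] + activations  # prepend the +1 bias unit
    return [sigmoid(sum(w * a for w, a in zip(unit, inputs))) for unit in weights]


def forward(network, features):
    """Propagate the input features through every layer in turn."""
    activations = features
    for weights in network:
        activations = layer_forward(weights, activations)
    return activations


# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output.
network = [
    [[0.0, 1.0, -1.0], [0.0, -1.0, 1.0]],
    [[0.0, 1.0, 1.0]],
]
output = forward(network, [0.0, 0.0])
```

With all-zero inputs both hidden units output 0.5, so the output unit sees a weighted sum of 1.0 and returns sigmoid(1.0).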

Backpropagation
• The gradient is computed from the error between our expected output and our actual output; that error is then propagated backwards through the network.
• Calculate:

Backpropagation
[Figure: worked backpropagation example on a network with +1 bias units; let y = 1 for this sample]
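
A rough sketch of backpropagation for a single sample with y = 1. It assumes sigmoid units, a cross-entropy cost (so the output error is simply output - y), and one hidden layer. The weights and inputs are hypothetical and not taken from the slide's figure.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w_hidden, w_out, x):
    """Return the hidden activations and the single output activation."""
    hidden = [sigmoid(sum(w * a for w, a in zip(unit, [1.0] + x))) for unit in w_hidden]
    output = sigmoid(sum(w * a for w, a in zip(w_out, [1.0] + hidden)))
    return hidden, output


def backprop(w_hidden, w_out, x, y):
    """Gradients of the cross-entropy cost with respect to every weight."""
    hidden, output = forward(w_hidden, w_out, x)
    delta_out = output - y  # error at the output unit
    # Push the error back through the output weights (index 0 is the bias weight).
    delta_hidden = [w_out[k + 1] * delta_out * h * (1 - h) for k, h in enumerate(hidden)]
    grad_out = [delta_out * a for a in [1.0] + hidden]
    grad_hidden = [[d * a for a in [1.0] + x] for d in delta_hidden]
    return grad_hidden, grad_out


w_hidden = [[0.1, 0.2, -0.3], [-0.1, 0.4, 0.5]]
w_out = [0.2, -0.4, 0.6]
x = [1.0, 2.0]
grad_hidden, grad_out = backprop(w_hidden, w_out, x, y=1)
```

Each gradient is the unit's error multiplied by the activation feeding that weight. A quick sanity check is to compare a few entries against a finite-difference estimate of the cost.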

Recurrent Neural Networks
• Connections between units form a directed cycle
• Allows the network to exhibit dynamic temporal behavior
• Useful for maintaining internal memory or state over time
• Ex: unsegmented handwriting recognition
• At any given time step, each non-input unit computes its current activation as a nonlinear function of the weighted sum of the activations of all units from which it receives connections.
• Training is done with backpropagation through time
• Prone to the vanishing gradient problem (addressed by LSTM networks)
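
The per-time-step computation can be sketched as follows. The choice of tanh as the nonlinear function, the weight layout, and the values are assumptions for illustration only.

```python
import math


def rnn_step(w_in, w_rec, x, state):
    """New activation of each unit: tanh of its weighted inputs plus weighted previous state."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(w_in[k], x))
                  + sum(u * s for u, s in zip(w_rec[k], state)))
        for k in range(len(state))
    ]


# Two recurrent units connected to each other (a directed cycle), one input feature.
w_in = [[0.5], [-0.5]]
w_rec = [[0.0, 0.9], [0.9, 0.0]]
state = [0.0, 0.0]
history = []
for x in [[1.0], [0.0], [0.0]]:  # a single pulse followed by silence
    state = rnn_step(w_in, w_rec, x, state)
    history.append(state)
```

The input is non-zero only at the first step, yet the state is still non-zero two steps later: the cycle carries a fading memory of the pulse.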

Recurrent Neural Networks
http://www.manoonpong.com/AMOSWD08.html

LSTM Recurrent Neural Network
• Long Short-Term Memory
• Well suited for classifying, predicting, and processing time series data with very long-range dependencies.
• Achieves the best known results in unsegmented handwriting recognition
• Traps error within a memory block (often referred to as an error carousel)
• Amazing applications in rhythm learning, grammar learning, music composition, robot control, etc.

Other classification techniques
• SVM (support vector machines)
• Constructs a hyperplane in a high- or infinite-dimensional space, used for classification, regression, etc.
• By defining a kernel function (some function that tells us how similar two inputs are), an SVM lets us work with high-dimensional features using simple dot products
• Large margin (the decision boundary has good separation from the training points), which benefits generalization
• Naive Bayes

Unsupervised Learning
• Categorization
• Clustering (density estimation)
• k-means: select k initial centroids, assign each data point to the cluster of its nearest centroid, update each centroid to the average of its cluster, and iterate
• Blind Signal Separation
• Feature Extraction for Dimensionality Reduction
• Hidden Markov Models
• Non-normal & normal distribution analysis (finding the distributions of data)
• Self-Organizing Maps
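
The k-means loop from the clustering bullet can be sketched like this. It assumes the initial centroids are supplied by the caller and uses squared Euclidean distance; both are illustrative choices, as are the toy points.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 8.0], [8.0, 9.0]]
centroids, clusters = kmeans(points, [[0.0, 0.0], [10.0, 10.0]])
```

On this toy data the two centroids settle on the averages of the lower-left and upper-right groups after the first pass. Real data usually needs several random initializations.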

Autoencoders
Unsupervised Learning from Neural Networks

Knowing what to do next
• Build your algorithm quick and dirty; don’t spend a lot of time on it until you have something to use
• Split up your training, cross validation and test sets (don’t test on your training data!)
• Move on to PCA or unsupervised pre-training for your supervised algorithms to help improve performance afterwards
• Don’t just try to get a lot of data to train on: implement your algorithm quick and dirty, use smaller data sets initially, and determine bias/variance
• High variance: get more training data
• High variance: try fewer features
• High bias: add additional features
• High bias: add polynomial features
• High bias: decrease regularization
• High variance: increase regularization

Follow me @nyxtom
Thank you!
Questions?
http://ml-class.org/
https://www.coursera.org/course/bluebrain
https://www.coursera.org/course/neuralnets
