The document provides an overview of machine learning and artificial intelligence goals including deduction, reasoning, problem solving, knowledge representation, planning, learning, natural language processing, motion and manipulation, perception, social intelligence and creativity. It discusses different machine learning techniques like supervised learning, unsupervised learning, reinforcement learning and developmental learning. It also covers topics like linear regression, logistic regression, neural networks, overfitting, regularization and more.
1. Learning from Data
A fast-paced guide to machine learning and artificial intelligence
by Thomas Holloway
Co-Founder/Software Engineer @ Nuvi (http://www.nuviapp.com)
2. Thanks to our Sponsors!
To connect to wireless
1. Choose Uguest in the wireless list
2. Open a browser. This will open a U of U website
3. Choose Login
4. General Intelligence Goals
• Deduction, Reasoning, Problem Solving
• Knowledge Representation
• Planning
• Learning
• Natural Language Processing
• Motion and Manipulation
• Perception
• Social Intelligence
• Creativity
• Early AI research began with the study of logic itself, leading to algorithms that imitate the step-by-step reasoning used to solve puzzles and problems (heuristics).
• In contrast, methods drawn from economics and probability in the late 1980s and 1990s led to very successful approaches for dealing with uncertainty or incompleteness.
• Statistical Approaches, Neural Networks (Probabilistic Nature of Humans to Guess)
5. General Intelligence Goals
• Represent concepts about objects, places, situations, events, times, language
• What they look like
• Categorical features
• Properties
• Relationships between each other
• Meta-knowledge (knowledge of what other
people know)
• Causes, effects and lots of other less
known research fields
• “what exists” = Ontology
6. General Intelligence Goals
• Difficult Problems
• Working assumptions, default reasoning,
qualification problem
• Commonsense Knowledge
• Major goal is to automatically acquire this largely
through unsupervised learning
• Ontology Engineering
• Subsymbolic Form of Commonsense Knowledge
• Not all knowledge can be represented as facts or statements (for example, the intuition to avoid a move because it “feels too exposed” in a chess match)
7. General Intelligence Goals
• Set goals and achieve them
• (visualize a representation of the world, predict how actions will change it, make choices to maximize utility)
• Requires reasoning under uncertainty (checking whether the world/environment matches its predictions) -> error correction
• Example: move a chess piece here, the other player responds and puts me in a seemingly poor position, act accordingly
8. General Intelligence Goals
• Machine Learning is the study of algorithms that
automatically improve through experience.
• Probably plays the most central role in Artificial Intelligence.
• Unsupervised Learning - finding patterns
• Supervised Learning - classifying what category something belongs to, or producing a function that maps input -> output
• Reinforcement Learning - rewards
• Developmental Learning - self-exploration, active
learning, imitation, guidance, entropy
9. General Intelligence Goals
• Read and understand text
• Listen and understand speech
• Information Retrieval
• Machine Translation
• Sentiment Analysis
• Category Theory (Quantum Logic in Information Flow Theory)
• Common techniques in semantic indexing, parse trees, syntactic
and semantic analysis
• Major goal: automatically build an ontology (for knowledge representation) by scanning books, Wikipedia, dictionaries, etc.
• Recently used Wiktionary and Wikipedia to automatically build a part-of-speech tagger and sentiment analysis engine for multiple languages. *http://www.nuviapp.com/* (plug)
10. • Entropic Force (Alex Wissner-Gross argument for
intelligence)
• Language Discovery
• Automated Trading Systems
• Machine Translation
• Spam Detection
• Self-Driving Cars
• Facial Recognition
• Gesture Recognition
• Speech Recognition
• Nest
• Shazam
Statistical Machine Learning is the art of taking lots of data and turning it
into statistically known probabilities.
• Spotify
• Netflix, Amazon Recommendations
• Duolingo
• Robot Movement
• Fraud Detection
• Intrusion Detection / State Anomaly
• DNA Sequence Alignment
• Siri, Google Voice, Google Now, Kinect
• Sentiment Analysis
• Text/Character Recognition (Scanning books)
• Health Monitoring (Healthcare)
• Pandora, iTunes / iGenius
11. Types of Machine Learning
• Supervised Learning
• Unsupervised Learning
• Recommendation Systems
• Reinforcement Learning
• (rewards for good responses, punished for bad ones)
• Developmental Learning
• (self-exploration, entropic force, cumulative acquisition of novel skills typical of robot
movement - autonomous interaction with environment and “teachers”, imitation,
maturation)
12. Supervised Learning
• Two types that we will discuss within supervised learning:
• Regression analysis (single-valued real output)
• Classification
14. Optimization Objectives
• Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
• Parameters: $\theta_0, \theta_1$
• Cost Function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
• Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$
m = number of samples
$x^{(i)}$ = x at sample i
$y^{(i)}$ = y at sample i
Our cost function takes the squared error between each prediction from our hypothesis and the actual value y, then sums those errors up into a total “cost”.
Minimize the error produced by the cost function by adjusting the parameters theta.
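As a rough illustration of the cost described on this slide, here is a minimal sketch in plain Python. It assumes the common squared-error form with a 1/(2m) scaling factor (the notes later mention a 1/2 inside the summation); the function names and the toy data are hypothetical.

```python
def hypothesis(theta0, theta1, x):
    """Single-variable linear hypothesis: theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Squared-error cost summed over all samples, scaled by 1 / (2m)."""
    m = len(xs)
    total = sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)


# Toy data lying exactly on the line y = x (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 2.0, 3.0]

perfect_fit = cost(0.0, 1.0, xs, ys)  # the line y = x has no error
flat_line = cost(0.0, 0.0, xs, ys)    # the line y = 0 misses most points
```

A lower cost means the parameters describe the data better, which is why the goal is stated as minimizing it.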
16. Gradient Descent
• First-Order Optimization Algorithm
• Finds Local Minimum of a function by taking steps proportional to the negative of the gradient of the
function at the current point.
• Popular for large-scale optimization problems
• easy to implement
• works on just about any black-box function
• each iteration is relatively cheap
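20. Gradient Descent
repeat until convergence {
$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot 1$
$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}$
}
Hypothesis: $h_\theta(x) = \theta_0 \cdot 1 + \theta_1 x$
* note: sometimes referred to as batch gradient descent (given that we iterate over all training examples to perform a single update on our parameters)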
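The batch update described here can be sketched as follows. This is only an illustration under stated assumptions: a squared-error cost scaled by 1/(2m), a fixed iteration count instead of a real convergence check, and made-up data and hyperparameters (`alpha`, `iterations`).

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Batch gradient descent for the hypothesis theta0 + theta1 * x."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Every training example contributes to a single parameter update.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                             # hidden feature x0 = 1
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed from the old values.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Toy data generated from y = 1 + 2x (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = gradient_descent(xs, ys)
```

On this toy data the parameters approach theta0 = 1 and theta1 = 2; a larger `alpha` would take bigger steps and could diverge.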
22. Multivariate Linear Regression
• Hypothesis: $h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n = \theta^T x$
• Think of x as our example, a vector of up to n features with $x_0 = 1$
23. Optimization Objectives
• Hypothesis: $h_\theta(x) = \theta^T x$
• Parameters: $\theta = (\theta_0, \theta_1, \dots, \theta_n)$
• Cost Function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
• Goal: $\min_{\theta} J(\theta)$
m = number of samples
$x^{(i)}$ = x at sample i
$y^{(i)}$ = y at sample i
Our cost function takes the squared error between each prediction from our hypothesis and the actual value y, then sums those errors up into a total “cost”.
Minimize the error produced by the cost function by adjusting the parameters theta.
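A vectorized version of the same objective might look like the sketch below. It assumes NumPy, a design matrix `X` whose first column is the constant feature x0 = 1, and the 1/(2m) squared-error cost; the data, `alpha` and iteration count are made up for illustration.

```python
import numpy as np


def cost(theta, X, y):
    """Squared-error cost over all samples, computed with one matrix product."""
    m = len(y)
    errors = X @ theta - y
    return float(errors @ errors) / (2 * m)


def gradient_step(theta, X, y, alpha):
    """One batch gradient descent step over every parameter at once."""
    m = len(y)
    gradient = X.T @ (X @ theta - y) / m
    return theta - alpha * gradient


# Four samples with two features each (made up), plus the x0 = 1 column.
X_raw = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [0.0, 3.0]])
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
true_theta = np.array([1.0, 2.0, -1.0])
y = X @ true_theta

theta = np.zeros(3)
costs = []
for _ in range(3000):
    costs.append(cost(theta, X, y))  # track the cost so it can be graphed
    theta = gradient_step(theta, X, y, alpha=0.2)
```

Recording `costs` per iteration is what makes it possible to graph the cost and confirm it keeps decreasing.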
27. Techniques in managing input
• Mean-Normalization (make sure all your inputs have similar ranges)
• FFT for audio
• Mean / Average / Range
• Graph your Cost Function over the number of iterations (make sure it is decreasing)
• Separate data sets (cross validation, test set)
• Train on a given set of data, adjust regularization, extra features, etc. and graph your cost function against the cross validation set
• Finally, test against unseen data using your test set
• Typically this is 60-30-10, or even 70-20-10, depending on how you wish to split things up.
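The normalization and data-splitting advice above can be sketched like this. It is a minimal illustration, assuming NumPy, a 60-30-10 split, and normalization statistics taken from the training set only; the helper names and the toy data are hypothetical.

```python
import numpy as np


def split(X, y, fractions=(0.6, 0.3, 0.1), seed=0):
    """Shuffle the samples and split them into train / cross validation / test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_train = int(round(fractions[0] * len(X)))
    n_cv = int(round(fractions[1] * len(X)))
    train, cv, test = np.split(order, [n_train, n_train + n_cv])
    return (X[train], y[train]), (X[cv], y[cv]), (X[test], y[test])


def fit_normalizer(X_train):
    """Compute per-feature mean and range from the training data only."""
    mean = X_train.mean(axis=0)
    value_range = X_train.max(axis=0) - X_train.min(axis=0)
    value_range[value_range == 0] = 1.0  # avoid dividing by zero for constant features
    return mean, value_range


def normalize(X, mean, value_range):
    """Mean normalization: subtract the mean, divide by the range."""
    return (X - mean) / value_range


# Two features on very different scales (made up for illustration).
X = np.arange(40, dtype=float).reshape(20, 2) * np.array([1.0, 1000.0])
y = np.arange(20, dtype=float)

(train_X, train_y), (cv_X, cv_y), (test_X, test_y) = split(X, y)
mean, value_range = fit_normalizer(train_X)
train_X_n = normalize(train_X, mean, value_range)
cv_X_n = normalize(cv_X, mean, value_range)      # reuse the training statistics
test_X_n = normalize(test_X, mean, value_range)
```

Reusing the training-set statistics on the other two sets is a design choice that keeps information about unseen data out of training.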
28. Normal Equation
• Analytically solves the parameters
• Useful when n is relatively small (n of features < 5000 or so)
• Uses the entire matrix of input
• Each sample = vector of features
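A minimal sketch of the normal equation, assuming the standard closed form theta = (XᵀX)⁻¹ Xᵀ y and NumPy. The pseudo-inverse (`np.linalg.pinv`) is swapped in here as an assumption so the sketch still returns an answer when XᵀX has no inverse; the data is made up.

```python
import numpy as np


def normal_equation(X, y):
    """Solve for theta analytically, with no learning rate and no iterations."""
    # pinv is used instead of a plain inverse so a singular X^T X does not fail.
    return np.linalg.pinv(X.T @ X) @ X.T @ y


# Each row is one sample: [x0 = 1, x1]. Outputs follow y = 1 + 2 * x1.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = normal_equation(X, y)
```

The matrix being inverted is n-by-n in the number of features, which is why this approach gets slow as n grows.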
30. Logistic Regression / Classification
• What we want is a function that will produce a value between 0 and 1
for all weighted input we provide.
• Sigmoid Activation Unit
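For illustration, a sigmoid unit applied to a weighted sum of inputs could be written as below. The branch on the sign of `z` is an implementation choice to avoid overflow for large negative inputs; the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # for negative z this form avoids overflow in exp
    return e / (1.0 + e)


def hypothesis(theta, x):
    """Logistic hypothesis: the sigmoid of the sum of the weighted inputs."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)


undecided = hypothesis([0.0, 0.0], [1.0, 5.0])  # weighted sum of 0 gives 0.5
```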
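32. Logistic Regression Cost Function
• Hypothesis: $h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$
• Cost Function: $J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\left( h_\theta(x^{(i)}), y^{(i)} \right)$
Linear Regression Cost Function: $J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$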
35. Logistic Regression Cost Function Intuition
In other words, if we predicted 0 when we should have predicted 1, we are going to return a very large cost.
37. Logistic Regression Cost Function Intuition
In other words, if we predicted 1 when we should have predicted 0, we are going to return a very large cost.
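The intuition on these two slides can be checked numerically. The sketch below assumes the usual logarithmic cost for logistic regression, -log(h) when y = 1 and -log(1 - h) when y = 0; the function names are hypothetical.

```python
import math


def example_cost(h, y):
    """Cost of one prediction h (strictly between 0 and 1) for a label y of 0 or 1."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)


def total_cost(predictions, labels):
    """Average cost over all examples, with both cases combined in one formula."""
    m = len(labels)
    return sum(
        -y * math.log(h) - (1 - y) * math.log(1.0 - h)
        for h, y in zip(predictions, labels)
    ) / m


confident_and_right = example_cost(0.99, 1)  # small cost
confident_and_wrong = example_cost(0.01, 1)  # very large cost
```

The cost grows without bound as a confident prediction approaches the wrong label, which is the "very large cost" the slides refer to.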
40. Logistic Regression Decision Boundaries
• The threshold or line at which input data is favoring one class or
another. This is usually the same point where we see our sigmoid
function cross the 0.5 mark.
42. Logistic Regression Decision Boundaries
• Multi-class classification deals with multiple categories (Sunny, Rainy, Cloudy, etc.).
• Typically done as a one-vs-all classification, where each class is trained separately (1 = positive for the given class, 0 for everything else).
• To predict, find the max probability across all the classes tested against.
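The prediction step of one-vs-all can be sketched as follows, assuming NumPy and one row of parameters per class. The parameter values and the two features (humidity, cloud cover) are hand-picked, hypothetical numbers rather than trained ones.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def predict_one_vs_all(all_theta, x):
    """Score x against every class and return the index of the most probable one."""
    probabilities = sigmoid(all_theta @ x)
    return int(np.argmax(probabilities)), probabilities


classes = ["Sunny", "Rainy", "Cloudy"]

# One parameter row per class; inputs are [x0 = 1, humidity, cloud cover].
all_theta = np.array([
    [2.0, -3.0, -3.0],   # Sunny vs everything else
    [-4.0, 5.0, 1.0],    # Rainy vs everything else
    [-3.0, 0.0, 5.0],    # Cloudy vs everything else
])

humid_day = np.array([1.0, 0.9, 0.8])
clear_day = np.array([1.0, 0.1, 0.1])
humid_best, humid_probs = predict_one_vs_all(all_theta, humid_day)
clear_best, clear_probs = predict_one_vs_all(all_theta, clear_day)
```

Each row acts as its own logistic regression classifier; the per-class probabilities need not sum to 1, only the largest one matters.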
46. Sophisticated Neural Networks can do some really amazing things
Multi-layered (deep) neural networks can be built
to identify extremely complex things with
potentially millions of features to train on.
Neural Networks can auto-encode (learn from the
input itself/self-learn), classify into many
categories at once, can be trained to output real-
values, they can even be built to retain memory or
long-term state (such as in the case of hidden
markov models or finite state automatons)
47. Types of Neural Networks
• Feedforward
• Recurrent
• Echo-State
• Long Short-Term Memory
• Stochastic
• Bidirectional (propagates in both directions)
60. Backpropagation
• Gradient computation is done by computing the error between our expected output and our actual output and propagating that error backwards through the network.
• Calculate:
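As a rough sketch of what such a calculation can look like, here is backpropagation for a single training example through one hidden layer. It assumes NumPy, sigmoid units everywhere, and a logistic (cross-entropy) cost; the layer sizes, weights and names are made up for illustration and are not taken from the slides.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(theta1, theta2, x):
    """Forward pass: input -> hidden layer -> output, adding a bias unit each time."""
    a1 = np.concatenate(([1.0], x))
    a2 = np.concatenate(([1.0], sigmoid(theta1 @ a1)))
    a3 = sigmoid(theta2 @ a2)
    return a1, a2, a3


def cost(theta1, theta2, x, y):
    """Logistic cost of the network output for one example."""
    a3 = forward(theta1, theta2, x)[2]
    return float(-np.sum(y * np.log(a3) + (1 - y) * np.log(1 - a3)))


def backprop(theta1, theta2, x, y):
    """Gradients of the cost with respect to both weight matrices."""
    a1, a2, a3 = forward(theta1, theta2, x)
    delta3 = a3 - y  # error at the output layer
    # Push the error back through theta2, drop the bias entry, and scale by
    # the slope of the sigmoid at each hidden unit.
    delta2 = (theta2.T @ delta3)[1:] * a2[1:] * (1 - a2[1:])
    grad1 = np.outer(delta2, a1)
    grad2 = np.outer(delta3, a2)
    return grad1, grad2


rng = np.random.default_rng(0)
theta1 = rng.normal(size=(3, 3))  # 2 inputs + bias -> 3 hidden units
theta2 = rng.normal(size=(2, 4))  # 3 hidden units + bias -> 2 outputs
x = np.array([0.5, -1.0])
y = np.array([1.0, 0.0])
grad1, grad2 = backprop(theta1, theta2, x, y)
```

A common sanity check is to compare these gradients against numerical estimates obtained by nudging each weight slightly and re-evaluating the cost.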
64. Recurrent Neural Networks
• Connections between units form a directed cycle
• Allows the network to exhibit dynamic temporal behavior
• Useful for maintaining internal memory or state over time
• Ex: unsegmented handwriting recognition
• At any given time step, each non-input unit computes its current activation as a nonlinear
function of the weighted sum of the activations of all units from which it receives
connections.
• Training is done with backpropagation through time
• vanishing gradient problem (addressed by LSTM networks)
67. LSTM Recurrent Neural Network
• Long Short Term Memory
• Well suited for classifying, predicting
and processing time series data with
very long range dependencies.
• Achieves best known results in
unsegmented handwriting recognition
• Traps error within a memory block (often
referred to as an error carousel)
• Amazing applications in rhythm
learning, grammar learning, music
composition, robot control…etc
68. Other classification techniques
• SVM (support vector machines)
• constructs a hyperplane in a high- or infinite-dimensional space, used for classification, regression, etc.
• by defining a kernel function (some function that tells us how similar two examples are), an SVM lets us work with simple dot products between high-dimensional features
• large-margin (the decision boundary has good separation from the training points), which benefits generalization
• Naive Bayes
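To make the kernel idea from the SVM bullets concrete, here is a small sketch of one widely used similarity function, the Gaussian (RBF) kernel. The choice of kernel, the `sigma` parameter and the sample points are assumptions for illustration; the slides do not name a specific kernel.

```python
import math


def gaussian_kernel(a, b, sigma=1.0):
    """Similarity between two feature vectors: 1 when identical, near 0 when far apart."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-squared_distance / (2 * sigma ** 2))


same = gaussian_kernel([1.0, 2.0], [1.0, 2.0])
near = gaussian_kernel([1.0, 2.0], [1.5, 2.0])
far = gaussian_kernel([1.0, 2.0], [5.0, 9.0])
```

A larger `sigma` makes the similarity fall off more slowly with distance.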
69. Unsupervised Learning
• Categorization
• Clustering (density estimation)
• k-means: select k initial centroids, assign each data point to its nearest centroid, update each centroid to the average of its cluster, and iterate
• Blind Signal Separation
• Feature Extraction for Dimensionality Reduction
• Hidden Markov Models
• Non-normal & normal distribution analysis (finding the distributions of data)
• Self-Organizing Maps
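The k-means loop mentioned in the clustering bullet can be sketched as below. This is a minimal version assuming NumPy, randomly chosen data points as the initial centroids, and a fixed number of iterations; the toy points are made up.

```python
import numpy as np


def k_means(points, k, iterations=20, seed=0):
    """Alternate between assigning points to centroids and moving the centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points (fancy indexing returns a copy).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Distance from every point to every centroid, shape (n_points, k).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for c in range(k):
            members = points[labels == c]
            if len(members) > 0:  # leave an empty cluster's centroid where it is
                centroids[c] = members.mean(axis=0)
    return centroids, labels


# Two well-separated groups of points (made up for illustration).
points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])
centroids, labels = k_means(points, k=2)
```

Results depend on the initial centroids, so in practice it is common to run the loop several times with different seeds.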
73. Knowing what to do next
• Build your algorithm quick and dirty, don’t spend a lot of time on it until you have something to use
• Split up your training, cross validation and test sets (don’t test on your training data!)
• Move on to PCA or unsupervised pre-training for your supervised algorithms to help improve performance afterwards
• Don’t just try and get a lot of data to train on, implement your algorithm quick and dirty, use smaller data sets initially and
determine bias/variance
• High variance: get more training data
• High variance: try fewer features
• High bias: add additional features
• High bias: add polynomial features
• High bias: decrease regularization
• High variance: increase regularization
Today I’m going to talk to you about a branch of artificial intelligence known as machine learning. We’re going to go through lots of topics, techniques and tools used in this field to help us understand, manipulate, model, classify and predict on all sorts of datasets. This talk is meant to be a very quick overview, so I won’t go into too much detail on any one topic, but I will definitely leave time for questions after. Neural networks are not the entire focus of this talk, but they play a pivotal role because they highlight an area of renewed interest that has popped up all over the place, what with robotics companies getting acquired and business intelligence and analytics becoming the norm around big data.
So my goal is to cover how you apply machine learning itself, and one of the topics is neural networks. To get there, we need to build up some intuition first, and along the way I hope to show you some techniques to make sure you don’t overthink anything. This is pretty much the way I learned it, so I hope you enjoy :)
General Intelligence refers to the goal of simulating or creating intelligence by solving many sub-problems.
The first pillar is in deduction, reasoning and problem solving itself. Much of the early research began with the study of logic itself and attempting to simulate step-by-step instructions for algorithms that deal with certain types of problems. This is somewhat on the heuristics side of simulated intelligence where rules, instructions and applied algorithms give the appearance of deduction.
Next, you have knowledge representation and knowledge engineering. This includes knowing how to represent concepts about objects, places, situations, events and even language: what they look like, what categories they fall into and the relationships between them. It also covers causes, effects, knowledge about knowledge (what we know about what other people know), and lots of other lesser-known research areas.
Planning is all about taking your representation of the world and environment, setting goals and attempting to achieve them. This means making lots of predictions about how the environment will respond to your actions, and working out which of the useful, available actions you can take to maximize utility. When the environment changes, you need to determine whether your predictions matched, and if not, perform some error analysis to either get better for the future or simply make new decisions as things change.
In machine learning we are mainly dealing with the ability to improve against some optimization objective with each experience or example we receive. This is true of any type of problem in machine learning, including unsupervised learning, where the problem is how to find patterns in a given set of experiences. Common tools and techniques in unsupervised learning include reducing noise, handling outliers and representing data with fewer dimensions.
We’ll talk a bit about autoencoders, a type of artificial neural network that, when given an input stream of data, deliberately messes with or hides away bits of information (through one of several techniques) and then tries to recreate the original input as best it can. This reconstruction process is an error-correction step and an optimization objective that is central to discovering patterns and relationships between important bits of information.
NLP is the part of the problem where reading and understanding text and/or speech is central to any intelligent system. This includes the ability to take arbitrary strings of characters and tag them with parts of speech, semantically index them by their utility and function, and apply aspects of category theory and information flow (how sentences that seem to trail on and on can still convey a positive emotion, counterarguments and yes, even #sarcasm).
But that’s not all. We’ve got the grandest of them all, machine translation, which in its general form is what is called an AI-complete problem: to solve it you must solve all the other problems (intention, NLP, knowledge representation, deductive reasoning, social intelligence, perception, etc.).
Finally, like other pillars, one of the major goals to help build ontologies and represent knowledge better is in the ability to scour the internet, books and information online to automatically read, index and represent knowledge in a very dense manner.
One of the recent projects I developed at Nuvi was tailored towards using some of these automated techniques by parsing dictionary sources, wikipedia, thesaurus databases to build several important layers from part of speech tagging to sentiment analysis - with the central goal that it will be applied to multiple languages. - Lucky for me, this technique has worked wonders and I’m now working on adding more languages to our application.
Let’s start off with supervised learning
- There are two sub-categories within supervised learning: regression analysis and classification.
I’ll describe more formally what supervised learning entails once I’ve given a concrete example to kick us off:
Let’s say you have a graph like the one shown on the right.
- This is showing us an example of two variables, some x and some y.
- Let’s say x represents number of people employed, and the y represents GDP.
- For now we’re only using one variable, but we’ll get to more in a bit
- We know that when no one works, the GDP is obviously low, while when everyone works, it seems to be the case that our GDP is up as well.
- perhaps the other points are due to other fluctuating variables we haven’t accounted for.
- Now I can just draw a line through the middle here to get what I think is a general trend line that seems to “best fit” the dataset.
- Doing that means that I’ve drawn a conclusion about the relative error between all the various points on the graph and the line I’ve drawn is the one that should maintain the least amount of error between all the points.
Now statistical inference differs from machine learning in that machine learning tries to make as few assumptions about the model as possible. While I could easily apply least squares, that would assume that I don’t care about outliers or noise. We’ll get to noise later in the presentation, along with how other machine learning models have a threshold or tolerance for outliers or anomalies (and we’ll also touch on anomaly detection).
Nevertheless, with machine learning, particularly in the application of linear regression, our goal is to produce a function that best fits the dataset.
Let’s say that function is y = x, that means that for all points of x, y is equal to x.
Our goal is to generate a function that can reduce the amount of error between the actual results of y and the predicted results of y.
This is done through a technique called parameterization. In this case we have exactly one feature, x, and our hypothesis is parameterized by theta0 and theta1. The graph on the left demonstrates various versions of our hypothesis where the parameters take different values. Each function produced has its own total error across the points in our dataset. The total error is what we want to reduce.
All machine learning algorithms have some kind of optimization objective. Typically this objective is either related to maximizing something such as entropy or a probabilistic outcome (often called hill climbing) or to descending and minimizing error (finding the lowest point). Most often, algorithms start from some initial parameters and adjust them slightly at each step, with the size of the adjustment controlled by a learning rate.
For our linear regression problem, our optimization objective is shown above, which is to minimize J w.r.t theta0,theta1. The cost function is shown here as a summation of the square error difference between all the sample points as predicted by our hypothesis and the actual values y.
Issues often arise when searching for a minimum because you can get stuck in a local optimum. It is usually a good idea to randomly initialize your parameters several times and compare the results. Later on, unsupervised learning can help us pre-train our parameters before running them through supervised learning algorithms.
The algorithm shown above is a simultaneous update. This means that at each iteration the new value of every parameter is computed from the current values of all the parameters, and only then are they all replaced.
An alternative way to write this update is shown here. If you know a bit of calculus, you’ll recognize that the derivative is the slope of the function at a given point.
Alpha controls our learning rate. If the learning rate is too high, each step is too drastic and can overshoot the minimum; if it’s too small, it takes a long time to converge. The idea is that at each step of gradient descent we calculate the slope and make a slight adjustment based on the derivative of our cost function w.r.t. the given parameter. As we get closer and closer to the local minimum, the steps get smaller and smaller (even when the learning rate stays the same), because the slope itself shrinks.
Calculating the derivative we will get a function that looks like this:
You’ll notice that the update for theta0 has no x(i) while the update for theta1 does. To see why, look at our hypothesis again:
Well, really, what theta0 is multiplied by is a value of 1. In terms of our features, we’ve simply stated that there is a hidden feature x0 that is always equal to 1, and the x shown in our hypothesis is really the single variable x1. Sometimes that extra feature x0 = 1 is called our bias.
This simple difference in defining a value attached to each of our parameters is nothing more than defining a set of features used to calculate a broader function. More formally, our hypothesis is the sum of our weighted input. We use this formal definition all the way through neural networks. The sum of our weighted input.
Before I go into multivariable linear regression, are there any questions?
No? OK, awesome, I thought you guys were perfect.
In multivariate (multiple) linear regression we are dealing with multiple variables. Our previous hypothesis in single-variable linear regression only had two parameters, theta0 and theta1, corresponding to a single variable or feature. Here we have several.
Building on our previous hypothesis, we’ve just simply expanded it out to include the other features and weight them with their corresponding parameters.
If we think of the set of features for a particular example as one vector, and our parameters as another vector, then we can simplify our sum of weighted inputs to a single linear algebra operation.
The process of simplifying our formula in this way is called vectorization, and it’s quite common to do this when it is advantageous. Many numerical libraries, especially ones that can run on the GPU, take advantage of multiple cores to run lots of linear algebra operations in parallel, so a simple transpose multiplication of vectors can be done much more efficiently.
Now that we’ve got our hypothesis, we can see that the cost function simplifies to be over all of the parameters theta, and our new goal is just to minimize the cost w.r.t. the vector theta.
As before, the algorithm shown above is a simultaneous update: the new value of every parameter in theta is computed from the current values before any of them is replaced.
The partial derivative is the same as it was before, however I’ve just simplified the step as instead of having a separate update at j = 0, we are just stating that for all samples feature 0 will be equal to 1. This falls in line with our hypothesis in that again, x_0 = 1.
The only difference now is that you won’t be able to visualize your data, but a common technique for ensuring that your algorithm is working correctly is to visualize your cost after each iteration. With a good alpha learning rate your cost should always be decreasing after every iteration.
Some things to keep in mind: graph your cost function, make sure your inputs have similar ranges, and use separate cross validation and test data sets to compare against. This keeps you from testing against the same data you train on.
There is a direct way to solve for theta, called the normal equation. With it we plug the entire input matrix and all outputs y into the equation shown. We don’t need to run gradient descent, pick an alpha or anything. The only issue is that it can run very slowly when we have a large number of features. Also, some matrices have no inverse.
Summarize for parameters (work backwards to correct for large error/cost until you have fine tuned those parameters for better predictions)
Moving on from supervised regression analysis we have classification problems which deal with being able to detect any specific type of pattern or group. Other than the ones listed above here I can say that one such example in NLP that I’ve recently dealt with is the ability to accurately detect when someone is asking questions. A seemingly trivial task but when you work with Twitter data, people stop using question marks, you have words out of order, prefixed terms and most common question terms can also function in other uses in language.
Our goal in classification is to have a function that will always produce some value between 0 and 1. A possible function to use for this is the sigmoid activation function. It maps any real input value into that range, and its graph is an S-shaped curve.
Now we can simply pass in our weighted values just like we produced from multivariate linear regression into the sigmoid function and get our output in the range we are looking for. Now we know that the sum of our weighted inputs is passed into this function to get an output between 0 and 1. Awesome!
We’ve got our hypothesis for logistic regression, so what about the cost function? Remember that the linear regression cost looks something like this (I’ve just pulled the 1/2 inside the summation here). The inside is basically the cost of a single example, so we can rewrite the whole thing as the sum over all our training examples of that per-example cost.
Our cost function has two separate possible values depending on the class value of y. If y is 1, then we use the first one, if y = 0 then we use the second logarithm. The intuition behind this is that if we are wrong about something that should be 1, then we need to penalize the parameters by saying we are wayyy off, whereas when y = 0 and we get that wrong, then we need to penalize the parameters in a different way. If you graph these two logarithms you will get something like this:
This is the “simplified” version of the formula. All it does is combine the two conditions into one cohesive formula that can be used to perform… you guessed it gradient descent on.
Hey look at that, it looks the exact same as multivariable linear regression. How bout that :D
You can see in multi-class classification, where you are needing to classify one of many possible states, there are multiple decision boundaries. Using the 1-vs-all approach you can easily determine the most probable class just by finding the max of all predictions - meaning you train a set of parameters (theta) for each possible state then use those specific parameters to determine the probability of it being that particular state. The highest probability wins.
Very quickly, there is the possibility of overfitting: with too many features, the learned hypothesis can fit the training data too well and fail to generalize to new examples. Regularization is an addition that helps ensure your hypothesis doesn’t overfit your data. But also remember that with too few features or too much regularization you can underfit the data as well. Plotting your cost over time, and tuning your regularization parameter against a cross validation set, will help you determine whether you need more or less regularization, more or fewer features, or just more training data.
The regularization term is shown here as an addition to the original cost function. We can apply the same regularization term to the linear regression cost function as well (it is the same cost plus that term).
Applying the same regularization in gradient descent, each update also shrinks every parameter (except parameter 0) slightly, in proportion to lambda.
Figuring out what to set lambda (the regularization parameter) to is typically done through cross validation.
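Here is a minimal sketch of a regularized gradient descent step for linear regression. It assumes the common L2 penalty, lambda / (2m) times the sum of squared parameters with theta0 left out, and NumPy; the data and the values of `alpha` and `lam` are made up.

```python
import numpy as np


def regularized_step(theta, X, y, alpha, lam):
    """One gradient descent step with an L2 penalty on every parameter except theta0."""
    m = len(y)
    gradient = X.T @ (X @ theta - y) / m
    gradient[1:] += (lam / m) * theta[1:]  # the bias parameter is not penalized
    return theta - alpha * gradient


def train(X, y, alpha, lam, iterations=2000):
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        theta = regularized_step(theta, X, y, alpha, lam)
    return theta


# Each row is [x0 = 1, x1]; outputs follow y = 1 + 2 * x1.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta_plain = train(X, y, alpha=0.1, lam=0.0)
theta_regularized = train(X, y, alpha=0.1, lam=10.0)
```

With `lam=10.0` the slope is pulled well below the unregularized value of 2, which on real data is the kind of trade-off a cross validation set is used to tune.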
Neural networks are artificial learning devices that were primarily inspired by the brain and how it learns. A sort of revolution in the 1980s showed that you could build these networks out of lots of non-linear feature detectors. The only downside was that computers in the 1980s were a bit slow, so results for things like speech recognition, object recognition, image retrieval and recommendations were not that great. Some early results did demonstrate self-driving cars as far back as the 1980s despite the slow computers (due to those constraints, the self-driving vehicle consisted of training a network to follow the path of a road, stop at an intersection, detect when traffic is coming and cross appropriately). Pretty impressive for back then!
Since then, computing has gotten a lot faster and numerical analysis methods have gotten a lot more sophisticated, allowing better algorithms to be implemented.
And given the boom of data we have with the internet, training on lots of data is a piece of cake now :) Things like Siri or Google Voice simply wouldn’t be possible without the vast improvements made over the years, and neural networks have become widely popular due in large part to their use in “deep learning” scenarios.
I’m going to talk to you guys about the simplest neural network that was designed in the 80’s and briefly cover the types of networks that are seen in the wild today.
Think of every input feature that we saw in previous systems as a distinct unit. All our features are units. Now remember that previously we were talking about having the sum of our weighted inputs define our prediction for some hypothesis. In linear regression the sum of the weighted inputs gives us the predicted real value, while in logistic regression putting that sum through the sigmoid activation function gives us a probability value (between 0 and 1).
Now take a look at this network diagram. Layer 1 is going to signify our inputs.
Remember that our features always contain one extra unit (x0 = 1). This is called our bias unit, and there is one in every layer after as well.
A neural network is something that encompasses lots of logistic units. In this diagram, each of those yellow circles is a logistic unit. Every unit past layer 1 takes the sum of the weighted outputs of the previous layer as its input and activates it through some logistic function (here, the sigmoid activation function).
Essentially, think of a neural network like a bunch of logistic regression units connected to each other. The point of the network is to learn lots of individual feature detectors and eventually come to a conclusion in the end. Our parameters get a lot more complex, as there is a theta weight assigned to each connection between one layer and the previous one.
Now a network can have many layers, and each layer in between can have fewer units than the input layer or more. All the layers between our input layer and our output layer are called hidden layers. These are the layers that ultimately model lots of complex hidden feature detections that aren’t exactly something you can point to and say - yes that layer does this. It’s really about training the network to identify patterns between each layer until it finds some complex series of patterns that leads to the final conclusion.
So we’ve got all these units. What would be the value of our first unit in the second layer?
Answer: the activation of the sum of its weighted inputs
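In code, that answer for a single unit could be sketched like this. The weights and inputs are hypothetical numbers picked so the arithmetic is easy to follow.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def unit_activation(weights, inputs):
    """Activation of one logistic unit; inputs[0] is the bias unit (always 1)."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(weighted_sum)

inputs = [1.0, 0.0, 1.0]     # x0 (bias), x1, x2
weights = [-1.0, 2.0, 1.0]   # hypothetical weights into the first unit of layer 2
a1 = unit_activation(weights, inputs)   # sigmoid(-1 + 0 + 1) = sigmoid(0)
```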
We denote Theta as a matrix of weights for each layer, with an entry for every connection into every unit of the next layer. We have weights that go from all the inputs to the first unit in layer 2 (the first row of the layer 1 matrix, starting with Theta10(1)). We have all the weights/parameters from all the inputs to the second unit in layer 2, etc. As well, we have all the weights/parameters from the weighted outputs of all the units in layer 2 to the first unit in layer 3, and so on.
So we aren’t just initializing one vector of parameters to learn. We’re initializing a full matrix of parameters in every layer.
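To make the “matrix per layer” idea concrete, here is a small sketch that builds one weight matrix for each pair of adjacent layers. The small random range (epsilon = 0.12), the fixed seed and the layer sizes are assumptions made for the example.

```python
import random

def init_weights(layer_sizes, epsilon=0.12, seed=0):
    """Create one weight matrix per pair of adjacent layers.

    Each matrix has one row per unit in the next layer and one column per
    unit in the current layer, plus one extra column for the bias unit.
    """
    rng = random.Random(seed)
    thetas = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        thetas.append([[rng.uniform(-epsilon, epsilon) for _ in range(n_in + 1)]
                       for _ in range(n_out)])
    return thetas

# A hypothetical network: 3 inputs, 4 hidden units, 2 outputs.
thetas = init_weights([3, 4, 2])
for theta in thetas:
    print(len(theta), "x", len(theta[0]))
```

For this 3-4-2 network the first matrix is 4 x 4 (three inputs plus bias feeding four hidden units) and the second is 2 x 5.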
This movement of weighted values as they are provided and activated and weighted at each unit in the network is what allows us to eventually get to our final prediction hTheta(x). The process of doing this is aptly called forward propagation, in that we are propagating the weighted values and their associated activations as they move through the network all the way to the end.
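Forward propagation can be sketched in a few lines. The weights below are hand-picked for illustration (they are not from the slides) so that a tiny 2-2-1 network outputs roughly 1 when its two binary inputs are equal and roughly 0 otherwise.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_propagate(thetas, x):
    """Return the output activations for input x (given without the bias unit)."""
    activation = list(x)
    for theta in thetas:
        with_bias = [1.0] + activation   # prepend the bias unit for this layer
        activation = [sigmoid(sum(w * a for w, a in zip(row, with_bias)))
                      for row in theta]
    return activation

# Hand-picked weights (an assumption for illustration only).
theta1 = [[-30.0, 20.0, 20.0],    # hidden unit 1: roughly x1 AND x2
          [10.0, -20.0, -20.0]]   # hidden unit 2: roughly (NOT x1) AND (NOT x2)
theta2 = [[-10.0, 20.0, 20.0]]    # output unit: roughly a1 OR a2

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, round(forward_propagate([theta1, theta2], x)[0], 3))
```

Each pass through the loop is one layer: add the bias, take the weighted sums, activate, and hand the result to the next layer.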
Similar to multi-class logistic regression, we can just have many output nodes that symbolize the various states that we wish to identify. The collective output is a vector, and we can use that output to directly tell us which class is the most probable. When we are training, we set the target output to be a one-vs-all style vector where one unit should have an activation near 1 while all the others should have an activation near 0.
Now remember that in logistic regression our cost function essentially penalized misclassifications and came out close to 0 when the prediction was correct. That behavior is inherent in the nature of the logarithm.
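A minimal sketch of both directions, building the training target vector and reading a prediction off the output layer, could look like this. The helper names and the example activations are made up.

```python
def one_hot(label, num_classes):
    """Training target: 1 for the true class, 0 for every other output unit."""
    return [1 if k == label else 0 for k in range(num_classes)]

def predict_class(output_activations):
    """Index of the output unit with the highest activation."""
    return max(range(len(output_activations)), key=lambda k: output_activations[k])

target = one_hot(2, 4)                                # class 2 out of 4 classes
predicted = predict_class([0.05, 0.10, 0.92, 0.20])   # hypothetical network output
```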
Now in neural networks, as I’ve explained, every unit in the network is just a logistic unit (or a unit that essentially mimics a logistic regression function). Not surprisingly, we’ve just copied the same equation over but we are simply accounting for the multi-class classification scenario with K = number of output classes. Additionally, we are adding a regularization term that attempts to regularize over all weights/parameters across all layers.
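As a sketch of that cost, the function below sums the logistic cost over all K output units and all examples, then adds a penalty over every non-bias weight in every layer. The lambda / 2m scaling and the choice to skip the bias column are assumed conventions, and the numbers are toy values.

```python
import math

def network_cost(outputs, targets, thetas, lam):
    """Multi-class logistic cost over K output units plus an L2 penalty on all weights."""
    m = len(targets)
    data_cost = 0.0
    for h, y in zip(outputs, targets):       # one example at a time
        for h_k, y_k in zip(h, y):           # one output class at a time
            data_cost += -y_k * math.log(h_k) - (1 - y_k) * math.log(1 - h_k)
    # Column 0 of each row holds the bias weight, which is left unregularized here.
    penalty = sum(w * w for theta in thetas for row in theta for w in row[1:])
    return data_cost / m + (lam / (2.0 * m)) * penalty

# Toy values (hypothetical): 2 examples, K = 2 classes, a 2-1-2 network.
outputs = [[0.9, 0.1], [0.2, 0.8]]   # network outputs from forward propagation
targets = [[1, 0], [0, 1]]
thetas = [[[0.5, 1.0, -1.0]], [[0.1, 2.0], [0.3, -2.0]]]
cost_plain = network_cost(outputs, targets, thetas, lam=0.0)
cost_regularized = network_cost(outputs, targets, thetas, lam=1.0)
```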
Back propagation works by taking the delta error from the later layer and multiplying it by the parameters theta on the connections between the two layers and their various nodes. Essentially this weights the error backwards across the same paths, and then scales it by the sigmoid gradient at each unit.
Think of it like a network of ropes and strings. Some ropes have greater weight associated with them (stronger, sturdier). We measure how far off we are from the expected values (calculating our error) and then apply the weight of each connection to those errors. Once we’ve done that, we carry the sigmoid gradient (the slope of our activation) with us, so that the activated sum of our original weighted inputs gives us additional guidance on direction and on how far off we are on the hill.
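Here is a rough sketch of that backward step for one hidden layer. It assumes the output error is simply activation minus target (which is what the logistic cost gives with sigmoid outputs) and that the sigmoid gradient is a * (1 - a); the function names and numbers are hypothetical.

```python
def output_delta(output_activation, target):
    """Error at the output layer: how far each output is from its target."""
    return [a - y for a, y in zip(output_activation, target)]

def hidden_delta(theta, delta_next, hidden_activation):
    """Push the next layer's error back across the weights into a hidden layer.

    theta has one row per unit in the next layer and its column 0 is the bias
    weight; hidden_activation lists the hidden units without the bias unit.
    """
    deltas = []
    for j, a in enumerate(hidden_activation):
        # Weight the errors by the same connections used in forward propagation.
        weighted_error = sum(theta[k][j + 1] * delta_next[k]
                             for k in range(len(delta_next)))
        # Scale by the sigmoid gradient at this unit.
        deltas.append(weighted_error * a * (1.0 - a))
    return deltas

theta2 = [[0.1, 2.0, -1.0]]               # hidden layer (bias + 2 units) -> 1 output
delta3 = output_delta([0.75], [1])        # output was 0.75, target was 1
delta2 = hidden_delta(theta2, delta3, [0.5, 0.5])
```

The hidden unit with the stronger connection (weight 2.0) receives the larger share of the error, which is the “sturdier rope” in the analogy.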
This is a modular neural controller of a walking machine using a recurrent neural network model. The controller essentially generates omnidirectional walking and drives reflex behavior.
So autoencoders are pretty cool. The idea is to generate a set of parameters that best represent your input. Initially you feed the network the input features with the parameter weights initialized basically at random. The network then constrains itself through a narrower middle layer, which forces it to compress or hide away features by mapping the input to a lower dimension.
There’s a related technique for this called Principal Component Analysis, which is the equivalent of projecting data from a higher dimensional space onto a lower dimensional hyperplane.
In PCA, we are effectively attempting to take the general distribution of points, wherever it sits, and map it to a lower dimension. So in the above example we are taking some 3d space, let’s say it’s a piece of paper with some outliers and noise floating around it, and we can effectively say that this piece of paper can be mapped to a 2d space without hurting anything too much since it’s relatively flat. Now apply that to n number of dimensions and you’ve got PCA.
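To keep a sketch of this short, the example below reduces 2d points to 1d instead of 3d to 2d. It centers the points, finds the direction of greatest variance from the 2 x 2 covariance matrix (using the closed-form angle that exists in the 2d case), and projects each point onto that direction. The data and names are made up, and a real implementation for n dimensions would use an eigendecomposition or SVD instead.

```python
import math

def pca_2d_to_1d(points):
    """Project 2d points onto their first principal component."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(p[0] - mean_x, p[1] - mean_y) for p in points]
    # Entries of the 2 x 2 covariance matrix.
    cxx = sum(x * x for x, _ in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    # In 2d the axis of greatest variance has this closed-form angle.
    angle = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    direction = (math.cos(angle), math.sin(angle))
    projected = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, projected

# Hypothetical points lying close to a straight line.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
direction, projected = pca_2d_to_1d(points)
```

Each point is now described by a single number (its position along `direction`) instead of two.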
The idea is that some features probably aren’t all that important whereas other features might actually provide some good similarity between examples. Things become so similar that it’s completely unnecessary to have that many features to begin with.
With autoencoders, it’s basically the same thing. The added benefit, however, is that the goal at the end of the network is to reproduce the input, so once the network has effectively mapped itself it knows how to represent the input in fewer dimensions - thus creating parameters that more accurately represent the patterns it comes across.
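As a very stripped-down sketch of the idea, the example below trains a 2-1-2 autoencoder by gradient descent on its reconstruction error. To keep it tiny it is linear (no sigmoid) and reuses the same two weights for encoding and decoding, which is a simplification and not the network described in the talk; the data, learning rate and step count are assumptions. A linear autoencoder like this ends up finding the same direction PCA would.

```python
def reconstruction_error(data, w):
    """Mean squared distance between each input and its reconstruction."""
    total = 0.0
    for x in data:
        code = w[0] * x[0] + w[1] * x[1]   # encode: 2 numbers -> 1
        total += (x[0] - w[0] * code) ** 2 + (x[1] - w[1] * code) ** 2
    return total / len(data)

def train_tied_linear_autoencoder(data, w, learning_rate=0.01, steps=500):
    """Gradient descent on the mean of ||x - w * (w . x)||^2."""
    w = list(w)
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x in data:
            code = w[0] * x[0] + w[1] * x[1]
            # Residual: the input minus its decoded reconstruction.
            residual = [x[0] - w[0] * code, x[1] - w[1] * code]
            w_dot_r = w[0] * residual[0] + w[1] * residual[1]
            for i in range(2):
                grad[i] += -2.0 * (code * residual[i] + w_dot_r * x[i]) / len(data)
        w = [w[i] - learning_rate * grad[i] for i in range(2)]
    return w

# Hypothetical zero-mean data lying on a line, so one number per point is enough.
data = [(1.0, 1.0), (2.0, 2.0), (-1.0, -1.0), (-2.0, -2.0)]
initial_w = [0.5, 0.1]
trained_w = train_tied_linear_autoencoder(data, initial_w)
print(reconstruction_error(data, initial_w), reconstruction_error(data, trained_w))
```

Because the target of training is the input itself, no labels are needed: the middle value `code` is the lower dimensional representation the network has learned.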