The data deluge we are currently witnessing presents both opportunities and challenges. Never before have so many aspects of our world been so thoroughly quantified, and never before has data been so plentiful. On the other hand, the complexity of the analyses required to extract useful information from these piles of data is also rapidly increasing, rendering more traditional and simpler approaches unfeasible or unable to provide new insights.
In this tutorial we provide a practical introduction to some of the most important machine learning algorithms that are relevant to the field of Complex Networks in general, with a particular emphasis on the analysis and modeling of empirical data. The goal is to provide the fundamental concepts necessary to make sense of the more sophisticated data analysis approaches that are currently appearing in the literature, and to provide a field guide to the advantages and disadvantages of each algorithm.
In particular, we will cover unsupervised learning algorithms such as K-means and Expectation-Maximization, as well as supervised ones like Support Vector Machines, Neural Networks and Deep Learning. Participants are expected to have a basic understanding of calculus and linear algebra, as well as working proficiency with the Python programming language.
Big Data
[Slide: first page of "The Unreasonable Effectiveness of Data" by Alon Halevy, Peter Norvig, and Fernando Pereira (Google), IEEE Intelligent Systems. Available at: http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35179.pdf]
From Data To Information
Statistics are numbers that summarize raw facts and figures in some
meaningful way. They present key ideas that may not be immediately
apparent by just looking at the raw data, and by data, we mean facts or figures
from which we can draw conclusions. As an example, you don’t have to
wade through lots of football scores when all you want to know is the league
position of your favorite team. You need a statistic to quickly give you the
information you need.
The study of statistics covers where statistics come from, how to calculate them,
and how you can use them effectively.
• Gather data: At the root of statistics is data. Data can be gathered by looking through existing sources, conducting experiments, or conducting surveys.
• Analyze: Once you have data, you can analyze it and generate statistics. You can calculate probabilities to see how likely certain events are, test ideas, and indicate how confident you are about your results.
• Draw conclusions: When you've analyzed your data, you make decisions and predictions.
Central Limit Theorem
• As $n \to \infty$, the random variables $S_n = \frac{1}{n} \sum_i x_i$
• with each $x_i$ drawn independently with mean $\mu$ and variance $\sigma^2$,
• converge to a normal distribution: $\sqrt{n} \left( S_n - \mu \right) \to \mathcal{N}\left(0, \sigma^2\right)$
• after some manipulations, we find: $S_n \sim \mu + \frac{\mathcal{N}\left(0, \sigma^2\right)}{\sqrt{n}}$
The estimation of the mean converges to the true mean with the square root of the number of samples: $\mathrm{SE} = \frac{\sigma}{\sqrt{n}}$
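A quick numerical check of this scaling (a minimal sketch, not from the original slides; the uniform distribution and sample sizes are illustrative choices):

import numpy as np

# Empirically verify that the standard error of the sample mean
# shrinks as sigma / sqrt(n), as predicted by the CLT
sigma = np.sqrt(1.0 / 12.0)  # standard deviation of Uniform(0, 1)
for n in [10, 100, 1000]:
    means = np.random.rand(10000, n).mean(axis=1)  # 10000 sample means of size n
    print(n, means.std(), sigma / np.sqrt(n))      # empirical vs. predicted SE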
Experimental Measurements
• Experimental errors are commonly assumed to be Gaussian distributed
• Many experimental measurements are actually averages:
• Instruments have a finite response time while the quantity of interest varies quickly over time
• Stochastic environmental factors
• Etc.
MLE - Fitting a theoretical function to experimental data (Least Squares Fitting)
• In an experimental measurement, we expect (CLT) the experimental values to be normally distributed around the theoretical value with a certain variance. Mathematically, this means:
$P\left(y \mid x\right) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left[ -\frac{\left(y - f\left(x\right)\right)^2}{2 \sigma^2} \right]$
• where $y$ are the experimental values and $f\left(x\right)$ the theoretical ones. The log-likelihood is then:
$\mathcal{L} = -\frac{N}{2} \log\left[ 2 \pi \sigma^2 \right] - \sum_i \frac{\left(y_i - f\left(x_i\right)\right)^2}{2 \sigma^2}$
• where we see that to maximize the likelihood we must minimize the sum of squares
MLE - Linear Regression
• Let's say we want to fit a straight line to a set of points: $y = w \cdot x + b$
• The log-likelihood function then becomes:
$\mathcal{L} = -\frac{N}{2} \log\left[ 2 \pi \sigma^2 \right] - \sum_i \frac{\left(y_i - w \cdot x_i - b\right)^2}{2 \sigma^2}$
• With partial derivatives (dropping constant factors):
$\frac{\partial \mathcal{L}}{\partial w} \propto \sum_i \left[ 2 x_i \left( y_i - w \cdot x_i - b \right) \right] \qquad \frac{\partial \mathcal{L}}{\partial b} \propto \sum_i \left[ y_i - w \cdot x_i - b \right]$
• Setting to zero and solving for $\hat{w}$ and $\hat{b}$:
$\hat{w} = \frac{\sum_i \left( x_i - \langle x \rangle \right) \left( y_i - \langle y \rangle \right)}{\sum_i \left( x_i - \langle x \rangle \right)^2} \qquad \hat{b} = \langle y \rangle - \hat{w} \langle x \rangle$
MLE for Linear Regression
from __future__ import print_function
import sys
import numpy as np
from scipy import optimize

# Load a two-column dataset: x in the first column, y in the second
data = np.loadtxt(sys.argv[1])
x = data.T[0]
y = data.T[1]

# Closed-form MLE solution for the slope and intercept
meanx = np.mean(x)
meany = np.mean(y)
w = np.sum((x - meanx) * (y - meany)) / np.sum((x - meanx) ** 2)
b = meany - w * meanx
print(w, b)

# We can also optimize the (negative log-)likelihood expression directly
def likelihood(params):
    sigma = 1.0
    w, b = params
    return np.sum((y - w * x - b) ** 2) / (2 * sigma ** 2)

w, b = optimize.fmin_bfgs(likelihood, [1.0, 1.0])
print(w, b)
MLElinear.py
3 Types of Machine Learning
• Unsupervised Learning
• Autonomously learn a good representation of the dataset
• Find clusters in input
• Supervised Learning
• Predict output given input
• Training set of known inputs and outputs is provided
• Reinforcement Learning
• Learn sequence of actions to maximize payoff
• Discount factor for delayed rewards
K-Means
• Choose k random points to be the initial centroids of the clusters
• Assign each point to the cluster whose centroid is closest
• Recompute the centroid positions (mean cluster position)
• Repeat until convergence (see the sketch after this list)
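A minimal NumPy sketch of these steps (illustrative, not from the original slides; the tolerance and seed are arbitrary choices, and empty clusters are not handled):

import numpy as np

def kmeans(data, k, tol=1e-4, seed=42):
    rng = np.random.RandomState(seed)
    # Choose k random points as the initial centroids
    centroids = data[rng.choice(len(data), k, replace=False)]
    while True:
        # Assign each point to the cluster whose centroid is closest
        dists = np.linalg.norm(data[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute the centroid positions (mean cluster position)
        new_centroids = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        # Repeat until the centroids stop moving
        if np.linalg.norm(new_centroids - centroids) < tol:
            return labels, new_centroids
        centroids = new_centroids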
K-Means: Convergence
• How to quantify the "quality" of the solution found at each iteration, $n$?
• Measure the "inertia", the squared intra-cluster distance:
$I_n = \sum_{i=0}^{N} \lVert x_i - \mu_i \rVert^2$
where $\mu_i$ are the coordinates of the centroid of the cluster to which $x_i$ is assigned.
• Smaller values are better
• Can stop when the relative variation is smaller than some value:
$\left| \frac{I_{n+1} - I_n}{I_n} \right| < \mathrm{tol}$
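A short sketch of this quantity in NumPy (assuming the data, labels, and centroids arrays from the K-means sketch above):

import numpy as np

def inertia(data, labels, centroids):
    # Squared distance of each point to the centroid of its assigned cluster
    return np.sum(np.linalg.norm(data - centroids[labels], axis=1) ** 2)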
Silhouettes
• For each point $x_i$ define $a_c\left(x_i\right)$ as the average distance between point $x_i$ and every other point within cluster $c$:
$a_c\left(x_i\right) = \frac{1}{N_c} \sum_{j \in c} \lVert x_i - x_j \rVert$
• Let $b\left(x_i\right)$ be the minimum value of $a_c\left(x_i\right)$ excluding $c_i$, the cluster to which $x_i$ is assigned:
$b\left(x_i\right) = \min_{c \neq c_i} a_c\left(x_i\right)$
• The silhouette of $x_i$ is then:
$s\left(x_i\right) = \frac{b\left(x_i\right) - a_{c_i}\left(x_i\right)}{\max\left\{ b\left(x_i\right),\, a_{c_i}\left(x_i\right) \right\}}$
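scikit-learn exposes this measure directly; a hedged sketch comparing different choices of k (the random data here is just a placeholder):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

data = np.random.rand(200, 2)  # placeholder data for illustration
for k in range(2, 6):
    # Mean silhouette over all points; higher is better
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(data)
    print(k, silhouette_score(data, labels))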
Principal Component Analysis
• The principal component projection, $T$, of a matrix $A$ is defined as: $T = AW$
• where $W$ is the eigenvector matrix of $A^T A$
• and corresponds to the right singular vectors of $A$ obtained by Singular Value Decomposition (SVD): $A = U \Sigma W^T$
(SVD is the generalization of the eigenvalue/eigenvector decomposition for non-square matrices, with $\Sigma$ the diagonal matrix of singular values.)
• So we can write: $T = AW = U \Sigma W^T W \equiv U \Sigma$
• Showing that the principal component projection corresponds to the left singular vectors of $A$ scaled by the respective singular values
• Columns of $T$ are ordered in order of decreasing variance.
Principal Component Analysis
from __future__ import print_function
import sys
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt

# Load a two-column dataset: x in the first column, y in the second
data = np.loadtxt(sys.argv[1])
x = data.T[0]
y = data.T[1]

# Fit the PCA model; components_ holds the principal directions
pca = PCA()
pca.fit(data)

meanX = np.mean(x)
meanY = np.mean(y)

# Plot the data along with each principal direction, drawn from the
# mean and scaled by the corresponding explained variance
plt.style.use('ggplot')
plt.plot(x, y, 'r*')
plt.plot([meanX, meanX + pca.components_[0][0] * pca.explained_variance_[0]],
         [meanY, meanY + pca.components_[0][1] * pca.explained_variance_[0]], 'b-')
plt.plot([meanX, meanX + pca.components_[1][0] * pca.explained_variance_[1]],
         [meanY, meanY + pca.components_[1][1] * pca.explained_variance_[1]], 'g-')
plt.title('PCA Visualization')
plt.legend(['data', 'PCA1', 'PCA2'], loc=2)
plt.savefig('PCA.png')
PCA.py
Supervised Learning - Classification
[Figure: dataset as an N x M matrix, rows Sample 1 … Sample N, columns Feature1 … FeatureM, plus a Label column]
• Dataset formatted as an NxM matrix of N samples and M features
• Each sample belongs to a specific class or has a specific label.
• The goal of classification is to predict the class to which a previously unseen sample belongs by learning the defining regularities of each class, using algorithms such as:
• K-Nearest Neighbors
• Support Vector Machines
• Neural Networks
• Two fundamental types of problems:
• Classification
• Regression (see Linear Regression above)
Supervised Learning - Overfitting
[Figure: the same N x M data matrix, split row-wise into Training and Testing subsets]
• "Learning the noise"
• "Memorization" instead of "generalization"
• How can we prevent it?
• Split the dataset into two subsets: Training and Testing
• Train the model using only the Training dataset and evaluate the results on the previously unseen Testing dataset.
• Different heuristics on how to split (see the sketch after this list):
• Single split
• k-fold cross validation: split the dataset into k parts, train on k-1 and evaluate on the remaining one; repeat k times and average the results.
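A minimal sketch of k-fold cross validation with scikit-learn (the features, labels, and choice of classifier here are placeholders):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(100, 4)        # placeholder features
y = np.random.randint(0, 2, 100)  # placeholder labels

scores = []
kf = KFold(n_splits=5, shuffle=True)
for train_idx, test_idx in kf.split(X):
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X[train_idx], y[train_idx])                  # train on k-1 parts
    scores.append(model.score(X[test_idx], y[test_idx]))   # evaluate on the held-out part
print(np.mean(scores))                                     # average over the k folds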
K-nearest neighbors
• Perhaps the simplest of supervised learning algorithms
• Effectively memorizes all previously seen data
• Intuitively takes advantage of natural data clustering
• Defines the class of any datapoint to be given by the plurality of its k nearest neighbors (see the sketch below)
• It's not obvious how to find the right value of k
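A minimal NumPy sketch of this voting rule (illustrative; X_train, y_train, and x are placeholder names):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Distances from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Labels of the k closest training points
    nearest = y_train[np.argsort(dists)[:k]]
    # Plurality vote decides the class
    return Counter(nearest).most_common(1)[0][0]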
How the Brain “Works” (Cartoon version)
• Each neuron receives input from other neurons
• ~10^11 neurons, each with ~10^4 weights
• Weights can be positive or negative
• Weights adapt during the learning process
• “neurons that fire together wire together” (Hebb)
• Different areas perform different functions using same structure (Modularity)
Forward Propagation
• The output of a perceptron is determined by a sequence of steps:
• obtain the inputs
• multiply the inputs by the respective weights
• calculate output using the activation function
• To create a multi-layer perceptron, you can simply use the output of
one layer as the input to the next one.
• But how can we propagate back the errors and update the weights?
[Diagram: inputs $x_1, x_2, \dots, x_N$ (plus a constant 1 for the bias) feed neuron $j$ through weights $w_{0j}, w_{1j}, \dots, w_{Nj}$, producing the activation $a_j = f\left(w^T x\right)$; the activations $a_1, \dots, a_N$ of one layer then feed neuron $k$ of the next layer through weights $w_{0k}, \dots, w_{Nk}$, producing $a_k = f\left(w^T a\right)$]
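As an illustration (not from the original slides), a minimal NumPy forward pass for two layers, assuming a sigmoid activation and placeholder weight matrices W1 and W2 with the bias folded in:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    a1 = sigmoid(W1 @ x)   # first layer: weighted inputs through the activation
    a2 = sigmoid(W2 @ a1)  # the output of one layer is the input to the next
    return a2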
Backward Propagation of Errors (BackProp)
• BackProp operates in two phases:
• Forward propagate the inputs and calculate the deltas
• Update the weights
• The error at the output is the squared difference between the predicted output and the observed one:
$E = \left( t - y \right)^2$
• where $t$ is the real output and $y$ is the predicted one.
• For inner layers there is no "real output"!
Chain-rule
• From the forward propagation described above, we know that the output of a neuron is: $y_j = w^T x$
• But how can we calculate how to modify the weights $w_{ij}$?
• We take the derivative of the error with respect to the weights!
• Using the chain rule:
$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial w_{ij}}$
• And finally we can update each weight in the previous layer:
$w_{ij} \leftarrow w_{ij} - \alpha \frac{\partial E}{\partial w_{ij}}$
• where $\alpha$ is the learning rate
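A minimal sketch of this update for a single linear neuron (illustrative; the learning rate and inputs are placeholders):

import numpy as np

def sgd_step(w, x, t, alpha=0.01):
    y = w @ x                 # forward pass: neuron output y = w.x
    dE_dy = -2.0 * (t - y)    # dE/dy for E = (t - y)^2
    grad = dE_dy * x          # chain rule: dE/dw = dE/dy * dy/dw
    return w - alpha * grad   # update with learning rate alpha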
Backprop
• Back propagation solved the fundamental problem underlying neural networks
• Unfortunately, computers were still too slow for large networks to be trained successfully
• Also, in many cases, there wasn’t enough available data
Support Vector Machines
• The decision plane has the form: $w^T x = 0$
• We want $w^T x \geq b$ for points in the "positive" class and $w^T x \leq -b$ for points in the negative class, where the "margin" $2b$ is as large as possible.
• Normalize such that $b = 1$ and solve the optimization problem:
$\min_w \lVert w \rVert^2$ subject to: $y_i\, w^T x_i > 1$
• The margin is: $\frac{2}{\lVert w \rVert}$
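A hedged sketch of fitting such a maximum-margin plane with scikit-learn's SVC (the data here is a random placeholder, and C is an illustrative soft-margin parameter):

import numpy as np
from sklearn.svm import SVC

X = np.random.rand(100, 2)                # placeholder features
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # placeholder labels

clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)
print(clf.coef_, clf.intercept_)          # w and b of the decision plane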
Kernel “trick”
• The SVM procedure uses only the dot products of vectors, never the vectors themselves.
• We can therefore redefine the dot product in any way we wish.
• In effect, we are mapping from a non-linear input space to a linear feature space (see the sketch below).
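As a sketch (illustrative; the placeholder data and gamma value are assumptions), swapping the linear kernel for an RBF kernel in scikit-learn:

import numpy as np
from sklearn.svm import SVC

X = np.random.rand(100, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 0.5).astype(int)  # non-linearly separable placeholder

# The RBF kernel replaces the plain dot product, implicitly mapping the
# inputs to a feature space where the classes can become linearly separable
clf = SVC(kernel='rbf', gamma=1.0)
clf.fit(X, y)
print(clf.score(X, y))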