This document discusses Python for scientific computing. It provides notes on NumPy, the fundamental package for scientific computing in Python. NumPy allows vectorized mathematical operations on multidimensional arrays in a simple and efficient manner. The notes cover common NumPy operations and syntax as compared to MATLAB and R. Pandas is also introduced as a package for data manipulation and analysis based on the concept of data frames from R. Examples are given of generating fake data to demonstrate modeling capabilities in Python.
2. Introduction
    | a d g |   | 0 0 1 |   | ? ? ? |
    | b e h | × | 0 1 0 | = | ? ? ? |
    | c f i |   | 1 0 0 |   | ? ? ? |
What is the result of this operation?
    | a d g |   | 0 0 1 |   | g d a |
    | b e h | × | 0 1 0 | = | h e b |
    | c f i |   | 1 0 0 |   | i f c |
    from numpy import *
    cout = print

    A = random.random((3, 3));
    B = fliplr(eye(3));
    C = dot(A, B);
    cout(C);
What programming language is this?
It’s Python!
    #include <armadillo>
    using namespace arma;
    using namespace std;

    int main() {
        mat A(3, 3), B(3, 3);
        A.randu();
        B = fliplr(B.eye());
        mat C = A * B;
        cout << C << endl;
    }
What about this programming language?
Why use Python?
More important than the programming language is the ecosystem, and Python has a great scientific community
Python has good interoperability with other systems
The entire stack can be developed in Python: machine learning, Flask web services, etc.
Heavy computations do not run in the Python interpreter; the performance-critical parts are implemented in Fortran and C
12. Numpy Notes
Let A and B be matrices.
                           Python/NumPy    MATLAB    R
    Matrix product         A.dot(B)        A * B     A %*% B
    Elementwise product    A * B           A .* B    A * B
Operations are elementwise by default (as in R).
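The difference between the two products is easy to check on a small case. The sketch below uses two made-up 2×2 arrays (the values are illustrative, not from the slides) and compares `A.dot(B)` with `A * B`.

```python
import numpy as np

# Illustrative values, chosen only to make the two products easy to tell apart
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])

matrix_product = A.dot(B)   # rows of A times columns of B (same as A @ B)
elementwise = A * B         # entries multiplied position by position

print(matrix_product)       # [[2 1], [4 3]]: B swaps the columns of A
print(elementwise)          # [[0 2], [3 0]]
```

The `@` operator is an equivalent spelling of `A.dot(B)` for 2-D arrays.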
                       Python/NumPy               MATLAB            R
    Dimensions         A.shape                    size(A)           dim(A) (also length, nrow, ncol)
    First four rows    A[0:4,:], A[0:4], A[:4]    A(1:4,:)          A[1:4,]
    Every other row    A[0:10:2]                  A(1:2:10,:)       A[seq(1, 9, 2),]
    Last four rows     A[-4:]                     A(end-3:end,:)    A[(nrow(A)-3):nrow(A),]
    Transpose          A.T                        A.'               t(A)
NumPy generally allows more succinct code.
Furthermore:
Indexing starts at zero.
Intervals are half-open, [i, j): the start index is included and the end index is excluded.
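A short sketch of the slicing rules above, using a made-up 12×2 array so that each row is easy to identify. The array contents are an assumption for illustration only.

```python
import numpy as np

# Illustrative array: 12 rows, 2 columns; row k holds [2k, 2k + 1]
A = np.arange(24).reshape(12, 2)

first_four = A[:4]        # rows 0, 1, 2, 3 (row 4 is excluded)
every_other = A[0:10:2]   # rows 0, 2, 4, 6, 8
last_four = A[-4:]        # rows 8, 9, 10, 11

print(A.shape)            # (12, 2)
print(first_four[:, 0])   # [0 2 4 6]
```

Because the end index is excluded, `A[:4]` and `A[4:]` split the array without overlapping.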
This is further aided by the fact that NumPy supports arithmetic broadcasting. (R only recycles plain vectors, and MATLAB gained the equivalent implicit expansion only in R2016b.)
That is, you can do the elementwise multiplication (6,3) * (6,1) directly: the (6,1) column is automatically repeated across the three columns of the other array. In older MATLAB releases, you would have to use bsxfun(@times,r,A) or first use repmat().
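A minimal sketch of the (6,3) * (6,1) case described above. The names `A` and `r` follow the MATLAB snippet; the values are made up for illustration. The explicit `np.tile` version is shown only to make clear what broadcasting does implicitly.

```python
import numpy as np

A = np.ones((6, 3))
r = np.arange(6).reshape(6, 1)    # one scale factor per row, shape (6, 1)

scaled = A * r                    # r is stretched across the 3 columns
print(scaled.shape)               # (6, 3)

# Explicit equivalent, similar in spirit to repmat() in MATLAB
tiled = A * np.tile(r, (1, 3))
```

Broadcasting avoids materializing the repeated copy of `r`, which matters for large arrays.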
Something like the following is valid in NumPy...
    import skimage.data
    import matplotlib.pyplot as plt

    img1 = skimage.data.astronaut()
    img2 = skimage.data.moon()
    print(img1.shape)  # (512, 512, 3)
    print(img2.shape)  # (512, 512)

    plt.subplot(1, 2, 1)
    plt.imshow(img1)
    plt.subplot(1, 2, 2)
    plt.imshow(img2, cmap='gray')
    plt.show()
Arithmetic mean
    import numpy as np

    img2 = img2[:, :, np.newaxis]   # (512, 512, 1), broadcasts against (512, 512, 3)
    img1 = img1.astype(np.uint32)   # widen so the sum cannot overflow uint8
    img2 = img2.astype(np.uint32)
    img3 = (img1 + img2) // 2
    img3 = img3.astype(np.uint8)
    plt.imshow(img3)
    plt.show()
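The cast to uint32 in the code above matters because uint8 arithmetic wraps around at 256. A small sketch with made-up pixel values (not taken from the images) shows the difference:

```python
import numpy as np

# Illustrative pixel values; the first pair sums to 300, which does not fit in uint8
a = np.array([200, 100], dtype=np.uint8)
b = np.array([100, 50], dtype=np.uint8)

wrapped = (a + b) // 2   # 300 wraps around to 44 before the division
safe = ((a.astype(np.uint32) + b.astype(np.uint32)) // 2).astype(np.uint8)

print(wrapped)           # [22 75]
print(safe)              # [150 75]
```

Only the pair whose sum exceeds 255 is affected, which is why such bugs can go unnoticed on dark images.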
21. Pandas and Data Visualization: Python for Scientific Computing
João Machado • Ricardo Cruz
22. Pandas
What is Pandas?
A package for data manipulation and analysis, based on the data frame concept from the R language
Optimized for performance, with critical code paths written in C
Originally developed by Wes McKinney while working at AQR Capital (a quantitative finance firm)
Given the previous point, it makes sense to demonstrate some of the functionality of Pandas with a dataset of financial stocks :)
26. Models
Let us produce fake data...
$y(x) = 2x + 10 + \varepsilon_1 + \varepsilon_2$
$\varepsilon_1 \sim N(0, 2)$
$\varepsilon_2 \sim |N(0, 25)|$ with probability $p = 0.1$, and $\varepsilon_2 = 0$ otherwise.
Equivalently, with an explicit indicator b that switches the spike on or off:
$y(x) = 2x + 10 + \varepsilon_1 + b\,\varepsilon_2$
$\varepsilon_1 \sim N(0, 2)$
$b \sim B(1, 0.1)$
$\varepsilon_2 \sim |N(0, 25)|$
Here the second parameter of N is the standard deviation.
Translation to numpy:
    import numpy as np

    N = 50
    x = np.linspace(0, 25, N)
    y = 2*x + 10
    y += np.random.randn(N)*2    # ε1: Gaussian noise with standard deviation 2
    # b·ε2: with probability 0.1, add a positive spike
    y += np.random.binomial(1, 0.10, N)*np.abs(np.random.randn(N)*25)
    import matplotlib.pyplot as plt

    plt.plot(x, y)
    plt.title('Data')
    plt.show()
What model could we create to explain this
data?
Linear Regression
Model: $\hat{y} = \beta_0 + \beta_1 x$
Minimize: $\sum_i (y_i - \hat{y}_i)^2$
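To make the objective above concrete, the least-squares coefficients can also be computed directly with NumPy. This is a minimal sketch on a made-up noiseless line (so the expected answer is known); it is not the code used in the slides.

```python
import numpy as np

# Illustrative data: an exact line, so the fit should recover 10 and 2
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 10

# Design matrix with a column of ones for the intercept beta_0
X = np.column_stack([np.ones_like(x), x])

# lstsq returns the beta that minimizes the sum of squared residuals
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1 = beta
print(b0, b1)
```

The column of ones plays the role of the intercept, which library estimators usually add for you.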
    from sklearn.linear_model import LinearRegression

    m = LinearRegression()
    m.fit(x[:, np.newaxis], y)          # scikit-learn expects a 2-D feature matrix
    yp = m.predict(x[:, np.newaxis])

    plt.plot(x, y)
    plt.plot(x, yp)
    plt.title('Linear regression')
    plt.text(0, 70, 'm=%.1f b=%.1f' % (m.coef_[0], m.intercept_))
    plt.show()
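True model: $y(x) = 2x + 10 + \varepsilon_1 + b\,\varepsilon_2$
Fitted model: $\hat{y}(x) = 2x + 18$
The spikes are always positive, so they pull the fitted intercept up from 10 to 18.
What if I want to explain only the trend? How can I avoid the impact of these spikes?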
What would a statistician do?
    res = yp - y        # residuals of the least-squares fit
    plt.boxplot(res)
    plt.show()
    q1 = np.percentile(res, 25)
    q3 = np.percentile(res, 75)
    t = np.logical_and(res > q1, res < q3)   # keep only residuals between the quartiles
    x2 = x[t]
    y2 = y[t]

    m = LinearRegression()
    m.fit(x2[:, np.newaxis], y2)             # refit on the filtered points
    yp = m.predict(x[:, np.newaxis])
Approach #2: What would a statistician
with some computer science knowledge do?
Model: $\hat{y} = \beta_0 + \beta_1 x$
Minimize: $\sum_i |y_i - \hat{y}_i|$
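Why the absolute-value loss is less sensitive to spikes can be seen on a simplified, intercept-only model $\hat{y} = c$ (an assumption made here to keep the example tiny). The sketch below scans candidate constants for made-up data with one spike: the squared loss is minimized near the mean, the absolute loss near the median.

```python
import numpy as np

# Illustrative data with one large spike
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
candidates = np.linspace(0, 100, 10001)      # candidate constants c, step 0.01

# Loss of each candidate, computed for all candidates at once via broadcasting
sq_loss = ((y[:, None] - candidates) ** 2).sum(axis=0)
abs_loss = np.abs(y[:, None] - candidates).sum(axis=0)

best_sq = candidates[sq_loss.argmin()]    # close to the mean, 22
best_abs = candidates[abs_loss.argmin()]  # close to the median, 3
print(best_sq, best_abs)
```

The single spike moves the squared-loss solution from about 3 to 22, while the absolute-loss solution stays at the median.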
    from statsmodels.regression.quantile_regression import QuantReg

    m = QuantReg(y, np.c_[np.ones(N), x])   # design matrix: intercept column, then x
    m = m.fit(q=0.5)                        # q=0.5 is median (least absolute deviation) regression
    yp = m.predict()
Approach #3: What would a crazy computer scientist do?
    m = LinearRegression()
    plt.plot(x, y)
    for it in range(10):
        # fit on a random 10% subset of the points
        t = np.random.choice(N, N//10, replace=False)
        x2 = x[t]
        y2 = y[t]
        m.fit(x2[:, np.newaxis], y2)
        yp = m.predict(x[:, np.newaxis])
        plt.plot(x, yp, color='black', alpha=0.4)
    plt.show()
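The loop above only draws the candidate fits. One way to pick among them, in the spirit of RANSAC, is to keep the fit that agrees with the most points. The sketch below is a simplified illustration under stated assumptions: the data are made up (an exact line with five artificial spikes), each candidate is fitted on two points with `np.polyfit`, and the inlier threshold of 1.0 is arbitrary. It is not the slides' code nor scikit-learn's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: an exact line with five artificial spikes
N = 50
x = np.linspace(0, 25, N)
y = 2 * x + 10
y[::10] += 40

best_inliers, best_coef = -1, None
for _ in range(100):
    idx = rng.choice(N, size=2, replace=False)        # minimal sample for a line
    slope, intercept = np.polyfit(x[idx], y[idx], 1)
    residuals = np.abs(y - (slope * x + intercept))
    inliers = int((residuals < 1.0).sum())            # threshold is an assumption
    if inliers > best_inliers:                        # keep the fit most points agree with
        best_inliers, best_coef = inliers, (slope, intercept)

print(best_inliers, best_coef)
```

A candidate fitted on two clean points explains all 45 clean points, whereas any candidate that touches a spike explains far fewer, so the spikes end up ignored.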
Sklearn already comes with this crazy model too:
    from sklearn.linear_model import RANSACRegressor

    m = RANSACRegressor()
    m.fit(x[:, np.newaxis], y)

    plt.plot(x, y)
    plt.plot(x, m.predict(x[:, np.newaxis]))
    plt.title('RANSAC')
    plt.show()
47. What kind of things can we use data mining /
machine learning for?
48. Data Mining Problems
Regression: predict a continuous variable
e.g. House Price = 100 + 20 × Land Size
In scikit-learn: LinearRegression, GradientBoostingRegressor, etc. (:: RegressorMixin)
    .fit(X, y)
    .predict(X) -> yp
Classification: predict a discrete variable
e.g. House Price = Expensive if in the city center, Cheap if outside the city
In scikit-learn: LogisticRegression, GradientBoostingClassifier, etc. (:: ClassifierMixin)
    .fit(X, y)
    .predict(X) -> yp
Clustering: do not predict, aggregate
In scikit-learn: KMeans, LatentDirichletAllocation, etc. (:: ClusterMixin or TransformerMixin)
    .fit(X)
    .transform(X) -> X'
    .fit_transform(X) -> X'
Reinforcement learning: predict the best move
54. Text Mining w/ Twitter
Packages:
tweepy
numpy
matplotlib
scikit-learn
55. Text Mining
    import tweepy

    # api_key, api_secret, access_token and access_secret are your own Twitter API credentials
    auth = tweepy.OAuthHandler(api_key, api_secret)
    auth.set_access_token(access_token, access_secret)
    api = tweepy.API(auth)

    timeline = api.user_timeline(screen_name='realDonaldTrump', count=100)
    texts = [tweet.text for tweet in timeline]
    from sklearn.feature_extraction.text import CountVectorizer
    import pandas as pd

    # keep words that appear in at least 5 and at most 16 tweets
    m = CountVectorizer(stop_words='english', min_df=5, max_df=16)
    X = m.fit_transform(texts)
    words = sorted(m.vocabulary_, key=m.vocabulary_.get)   # words in column order

    # show the first rows and columns of the document-term matrix
    print(pd.DataFrame(X.toarray(), columns=words).iloc[:5, :5].to_latex())

         america  big  comey  day  dems
    0    0        0    0      0    0
    1    0        1    0      0    0
    2    1        0    0      0    0
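The table above is a document-term matrix: one row per tweet, one column per word, each cell a count. A rough pure-Python sketch of the same idea follows, on made-up sentences (not real tweets). Unlike CountVectorizer as configured above, it does no stop-word removal or document-frequency filtering.

```python
from collections import Counter

# Made-up example texts, for illustration only
texts = ["big day for america", "a big big win", "america first"]

# One column per distinct word, in alphabetical order
vocabulary = sorted({word for text in texts for word in text.split()})

# One row per text: how many times each vocabulary word occurs in it
counts = [Counter(text.split()) for text in texts]
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Most cells are zero, which is why the vectorizer's output is a sparse matrix.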
    import numpy as np
    import matplotlib.pyplot as plt

    counts = np.asarray(X.sum(0))[0]          # total occurrences of each word
    plt.barh(range(len(counts)), counts)
    plt.xticks(range(0, 14, 2))
    plt.yticks(range(len(counts)), words)
    plt.show()
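    from sklearn.decomposition import LatentDirichletAllocation

    lda = LatentDirichletAllocation(n_components=2, learning_method='online')
    lda.fit(X)
    topics = lda.components_      # one row of word weights per topic
Each topic is a weighted combination of the original words:
$\text{newword}_1 = \beta_{11}\,\text{word}_1 + \beta_{12}\,\text{word}_2 + \dots$
$\text{newword}_2 = \beta_{21}\,\text{word}_1 + \beta_{22}\,\text{word}_2 + \dots$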
    topics = topics / topics.max(1)[:, np.newaxis]      # scale each topic to a maximum of 1
    topics += np.random.randn(*topics.shape)*0.02       # small jitter so labels overlap less
    for i, word in enumerate(words):
        plt.text(topics[0, i], topics[1, i], word, ha='center')
    plt.show()
The same analysis can be repeated for another account:
    timeline = api.user_timeline(screen_name='marcelorebelo_', count=100)
63. Traditional Learning vs Deep Learning
Traditionally, hand-crafted features would be extracted from the dataset and learning
would happen on top of those features. Deep learning learns from the raw data.
Packages:
scikit-image
numpy
keras
64. Traditional Learning
Cats vs Dogs, Kaggle Competition: https://www.kaggle.com/c/dogs-vs-cats
25,000 images of cats and dogs
Feature #1: Extract histogram of colors
    import os
    import numpy as np
    from skimage.io import imread
    from skimage.color import rgb2gray

    for filename in os.listdir('train'):
        im = imread(os.path.join('train', filename))
        im = rgb2gray(im)
        f1 = np.histogram(im.flatten(), 10)[0]   # 10-bin histogram of gray levels
        f1 = (f1/f1.sum()).cumsum()              # normalized cumulative histogram
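To see what this feature looks like without downloading the dataset, here is a small sketch on a synthetic image. The random array stands in for a grayscale picture, and the fixed `range=(0, 1)` is an assumption added here so that bins are comparable across images; the code above lets NumPy pick the range from each image.

```python
import numpy as np

rng = np.random.default_rng(0)
im = rng.random((32, 32))     # stand-in for a grayscale image with values in [0, 1)

# 10-bin histogram of gray levels over a fixed range
hist = np.histogram(im.ravel(), bins=10, range=(0, 1))[0]

# Normalize to proportions, then accumulate: a 10-number summary of brightness
f1 = (hist / hist.sum()).cumsum()
print(f1)
```

The result is a non-decreasing vector ending at 1, regardless of the image size.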
Feature #2: Histogram of Oriented Gradients
    from skimage.transform import resize
    from skimage.feature import hog

    im2 = resize(im, (32, 32), mode='reflect')
    im2 = np.sqrt(im2)
    f2 = hog(im2, block_norm='L2-Hys')
    from sklearn.tree import DecisionTreeClassifier, export_graphviz

    # X is assumed to stack the extracted features, one row per image; y holds the cat/dog labels
    m = DecisionTreeClassifier(max_depth=3)
    m.fit(X, y)
    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import RandomForestClassifier

    print(cross_val_score(RandomForestClassifier(n_estimators=100), X, y))
Output (accuracy per fold):
    [ 0.69642429 0.70086393 0.69851176]
76. Conclusions
Packages to know:
NumPy: basic linear algebra
SciPy: extensions to NumPy (sparse matrices, pdfs, hypothesis tests)
Statsmodels: several statistical models, incl. time series
Pandas: extension to NumPy for data frame support
Matplotlib, seaborn: drawing graphics
scikit-learn: complete machine learning toolkit
xgboost: famous gradient boosting model
Keras: deep learning (and TensorFlow, Theano, Lasagne)
OpenCV, scikit-image: image processing
NLTK: natural language toolkit
Gensim: natural language models
78. Final remarks
Python is a “jack of all trades” type of language;
Its speed and ease of development make it well suited to scientific computing;
It is increasingly adopted by scientists and engineers, thanks to the third-party scientific libraries contributed by a large community;
It has become a de facto language for advances in some fields, such as Deep Learning.
79. About us
João Machado
machadojpf@gmail.com
Fraunhofer Portugal research engineer
Masters in Electrical and Computer Engineering
http://www.linkedin.com/in/machadojpf
Ricardo Cruz
rpcruz@inesctec.pt
INESC TEC researcher
Computer Science & Applied Mathematics graduate
https://rpmcruz.github.io/
Subscribe to workshops:
http://tinyurl.com/cruz-workshops