Installation Vectors Data Frames Analysis Visualization I/O Conclusion
Python for R developers and data scientists
Artur Matos
http://www.lambdatree.com
June 8, 2016
Outline
1 Getting Up and Running
2 Vectors
3 Data Frames
4 Analysis
5 Visualization
6 I/O
7 Conclusion
Section 1
Getting Up and Running
Which Python?
Python runtimes
Several available: CPython, PyPy, Jython. . .
CPython is the official runtime written in C.
PyPy is a JIT-based runtime that runs significantly faster than CPython.
For scientific computing, CPython is effectively the only choice (most packages target it).
Python 2 vs Python 3
Python 3 is not backwards compatible with Python 2.
The answer today is Python 3 (I might have answered differently last year).
Unless you have other teams still using Python 2. . .
But all major packages already support Python 3.
Installation
Several options, ranging from simple to complex.
We will use Anaconda here, which will get you up and running quickly.
On Linux and Mac you can also install Python with your package manager.
Use virtualenv to isolate Python environments (not covered here).
Installing Anaconda
https://www.continuum.io/downloads
Installing Anaconda (2)
Jupyter
Python syntax in 2 minutes
The function below:
Converts input to uppercase;
Cuts anything longer than 10 characters;
Pads with extra spaces if shorter than 10 characters;
Adds single quotes.
>>> quote_pad_string("This is rather long")
'THIS IS RA'
>>> quote_pad_string("Short")
'SHORT     '
Python syntax in 2 minutes (2)
Function:
# WARNING: This is purely to cover some basic Python syntax;
# there are better ways to do this in Python.
def quote_pad_string(a_string):
    maximum_length = 10
    num_missing_characters = maximum_length - len(a_string)
    if num_missing_characters < 0:
        num_missing_characters = 0
    if num_missing_characters:
        for i in range(num_missing_characters):
            a_string = a_string + " "
    else:
        a_string = a_string[:maximum_length]
    return "'" + a_string.upper() + "'"
Things to note:
def defines the body of a function. Python is dynamically typed.
Python uses indentation for code blocks instead of curly braces.
'=' is used for assignment.
'if' and 'for' statements control the flow.
len(a_string) is a function call; a_string.upper() is a method invocation.
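The warning comment is apt: as a sketch, the same behavior fits in one idiomatic line using slicing and str.ljust:

```python
def quote_pad_string(a_string, maximum_length=10):
    """Truncate to maximum_length, pad with spaces, uppercase, and quote."""
    return "'" + a_string[:maximum_length].ljust(maximum_length).upper() + "'"

quote_pad_string("This is rather long")  # 'THIS IS RA' (with quotes)
```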
Section 2
Vectors
Scalars - R
In R there are no real scalar types. They are just vectors of length 1:
> a <- 5 # Equivalent to a <- c(5)
> a
[1] 5
> length(a)
[1] 1
Scalars - Python
In Python scalars and vectors are not the same thing:
>>> a = 5 # Scalar
>>> a
5
>>> b = np.array([5]) # Array with one element
>>> b
array([5])
>>> len(b) # Equivalent to 'length' in R
1
This won’t work:
>>> len(a)
TypeError: object of type 'int' has no len()
Vectors, matrices and arrays - R
In R, there's 'c' for 1d vectors, 'matrix' for 2 dimensions, and 'array' for higher
dimensions:
> c(1,2,3,4)
[1] 1 2 3 4
> matrix(1:4, nrow=2,ncol=2)
[,1] [,2]
[1,] 1 3
[2,] 2 4
> array(1:3, c(2,4,6))
...
Strangely enough, a 1d array is not the same as a vector:
> a <- as.array(1:3)
> a
[1] 1 2 3
> is.vector(a)
[1] FALSE
Vectors, matrices and arrays - Python
Python has no builtin vector or matrix type. You will need numpy:
>>> import numpy as np # Equivalent to 'library(numpy)' in R.
>>> np.array([1, 2, 3, 4]) # 1d vector
array([1, 2, 3, 4])
>>> np.array([[1, 2], [3, 4]]) # matrix
array([[1, 2],
[3, 4]])
np.array works with any number of dimensions, and there is a single type (ndarray) for all of them.
(There's also a matrix type specifically for two dimensions, but it should be avoided;
always use ndarray.)
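A few ndarray attributes (analogous to dim() and class() in R) are handy for inspecting arrays; a minimal sketch:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
print(a.ndim)   # number of dimensions: 2
print(a.shape)  # size per dimension: (2, 2)
print(a.dtype)  # element type (platform dependent, e.g. int64)
```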
Generating regular sequences
R
> 5:10 # Shorthand for seq
[1] 5 6 7 8 9 10
> seq(0, 1, length.out = 11)
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Python
>>> np.arange(3.0)
array([ 0., 1., 2.])
>>> np.arange(3, 7)
array([3, 4, 5, 6])
>>> np.arange(3, 7, 2)
array([3, 5])
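For R's seq(..., length.out = n) the closer numpy counterpart is np.linspace, which takes a point count rather than a step:

```python
import numpy as np

# 11 evenly spaced points from 0 to 1, like seq(0, 1, length.out = 11)
pts = np.linspace(0.0, 1.0, 11)
```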
Vector operations - Python
For the most part, vector operations in Python work just like in R:
>>> a = np.arange(5.0)
array([ 0., 1., 2., 3., 4.])
>>> 1.0 + a # Adding
array([ 1., 2., 3., 4., 5.])
>>> a * a # Multiplying element wise
array([ 0., 1., 4., 9., 16.])
>>> a ** 3 # to the power of 3
array([ 0., 1., 8., 27., 64.])
Vector operations - Python (2)
For matrix multiplication use the '@' operator:
>>> a = np.array([[1, 0], [0, 1]])   # identity matrix
>>> b = np.array([[4, 1], [2, 2]])
>>> a @ b
array([[4, 1],
       [2, 2]])
(Requires Python 3.5 or later; in older versions use np.dot(a, b).)
Vector operations - Python (3)
Numpy also has the usual mathematical operations that work on vectors:
>>> a = np.arange(5.0)   # array([ 0., 1., 2., 3., 4.])
>>> np.sin(a)
array([ 0., 0.84147098, 0.90929743, 0.14112001, -0.7568025 ])
Full reference here:
http://docs.scipy.org/doc/numpy/reference/routines.math.html
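Reductions like R's sum, mean and sd are available as array methods. One caveat worth a sketch: np.std defaults to the population standard deviation (ddof=0), so pass ddof=1 to match R's sd:

```python
import numpy as np

a = np.arange(5.0)
print(a.sum())        # 10.0, like sum(a)
print(a.mean())       # 2.0, like mean(a)
print(a.std(ddof=1))  # sample standard deviation, like sd(a) in R
```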
Recycling - R
When doing vector operations, R automatically extends the smallest element to be as
large as the other:
> c(1,2) + c(1,2,3,4) # Equivalent to c(1,2,1,2) + c(1,2,3,4)
[1] 2 4 4 6
You can do this even if the lengths aren’t multiples of one another, albeit with a
warning:
> c(1,2) + c(1,2,3,4,5)
[1] 2 4 4 6 6
Warning message:
In c(1, 2) + c(1,2,3,4,5) :
longer object length is not a multiple of shorter object length
Recycling - Python
This won’t work in Python however:
>>> np.arange(2.0) + np.arange(4.0)
----------------------------------------
ValueError: operands could not be broadcast together with shapes
(2,) (4,)
Numpy has much stricter recycling (aka broadcasting) rules.
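If you do want R-style recycling, you have to spell it out yourself, e.g. with np.tile (a minimal sketch):

```python
import numpy as np

a = np.arange(2.0)  # array([0., 1.])
b = np.arange(4.0)  # array([0., 1., 2., 3.])

# Recycle a to length 4 explicitly, then add: array([0., 2., 2., 4.])
result = np.tile(a, 2) + b
```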
Broadcasting Rules - Python
2x3 and 2x1:
0 1 2
3 4 5
+
0
1
=
0 1 2
3 4 5
+
0 0 0
1 1 1
2x3 and 1x3:
0 1 2
3 4 5
+ 0 1 2 =
0 1 2
3 4 5
+
0 1 2
0 1 2
Adding a single element array or a scalar always works:
0 1 2
3 4 5
+ 0 =
0 1 2
3 4 5
+
0 0 0
0 0 0
This won’t work (the dimensions need to match exactly or be 1):
0 1 2 3 + 0 1
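The diagrams above translate directly into code: trailing dimensions must match or be 1, and size-1 dimensions are stretched:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)  # 2x3: [[0, 1, 2], [3, 4, 5]]
col = np.array([[0], [1]])      # 2x1, stretched across columns
row = np.array([0, 1, 2])       # shape (3,), treated as 1x3

print(a + col)  # [[0, 1, 2], [4, 5, 6]]
print(a + row)  # [[0, 2, 4], [3, 5, 7]]
print(a + 10)   # a scalar always broadcasts
```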
Indexing - i:j:k syntax
a = np.arange(10)
a = 0 1 2 3 4 5 6 7 8 9
Indexing in Python starts from 0 (not 1):
>>> a[0]
0
Indexing with a single value returns a scalar (not an array!)
Use 'i:j' to index from position i to j-1:
>>> a[1:3]
array([ 1, 2])
An optional 'k' element defines the step:
>>> a[1:7:2]
array([1, 3, 5])
i and j can be negative, which means they count backwards from the end:
>>> a[1:-3]
array([1, 2, 3, 4, 5, 6])
>>> a[-3:-1]
array([7, 8])
While a negative k will go in the opposite direction:
>>> a[-3:-9:-1]
array([7, 6, 5, 4, 3, 2])
Not all need to be included:
>>> a[::-1]
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
Indexing - multiple dimensions
Use ',' for additional dimensions:
x =
1 2 3
4 5 6
>>> x[0:2, 0:1]
array([[1],
[4]])
Indexing using conditions
Operators like '>' or '<=' operate element-wise and return a boolean vector:
>>> a > 4
array([False, False, False, False, False, True, True,
True, True, True], dtype=bool)
These can be combined into more complex expressions; use '&' and '|' (not 'and'/'or') and parenthesize each comparison:
>>> (a > 2) & (b ** 2 <= a)
...
And they can be used for indexing too:
>>> a[a > 4]
array([5, 6, 7, 8, 9])
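Closely related to boolean indexing, np.where is the counterpart of R's ifelse:

```python
import numpy as np

a = np.arange(10)
# like ifelse(a > 4, 1, 0) in R: array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
flags = np.where(a > 4, 1, 0)
```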
Assignment
Any index can be used together with '=' for assignment:
>>> a[0] = 10
>>> a
array([10, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Conditions work as well:
>>> a[a > 4] = 99
>>> a
array([99, 1, 2, 3, 4, 99, 99, 99, 99, 99])
Section 3
Data Frames
Pandas
Data frames in Python aren't built in. You will need pandas:
import pandas as pd
Loading the iris dataset:
>>> url = ("https://raw.githubusercontent.com/pydata/"
...        "pandas/master/pandas/tests/data/iris.csv")
>>> iris = pd.read_csv(url)
>>> iris.head()
SepalLength SepalWidth PetalLength PetalWidth Name
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
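Without a CSV at hand, a data frame can also be built directly from a dict of columns, much like data.frame() in R:

```python
import pandas as pd

# like data.frame(x = 1:3, y = c("a", "b", "c"))
df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
print(df.shape)          # (3, 2)
print(list(df.columns))  # ['x', 'y']
```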
Selection
SepalLength SepalWidth PetalLength PetalWidth Name
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
. . . . . . . . . . . . . . . . . .
'.<columnName>' returns only that column:
>>> iris.SepalLength
0 5.1
1 4.9
2 4.7
...
This also works:
>>> iris["SepalLength"]
...
You can select multiple columns by passing a list:
>>> iris[["SepalWidth", "SepalLength"]]
SepalWidth SepalLength
0 3.5 5.1
1 3.0 4.9
...
Or if you pass a slice you can select rows:
>>> iris[1:3]
SepalLength SepalWidth PetalLength PetalWidth Name
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
With loc you can slice both rows and columns:
>>> iris.loc[1:3, ["SepalLength", "SepalWidth"]]
SepalLength SepalWidth
1 4.9 3.0
2 4.7 3.2
3 4.6 3.1
loc is inclusive at the end.
As well as only one single row:
>>> iris.loc[3]
SepalLength 4.6
SepalWidth 3.1
PetalLength 1.5
PetalWidth 0.2
Name Iris-setosa
Name: 3, dtype: object
iloc works with integer indices, similar to numpy arrays:
>>> iris.iloc[0:2, 0:3]
SepalLength SepalWidth PetalLength
0 5.1 3.5 1.4
1 4.9 3.0 1.4
':' selects all the rows (or all the columns):
>>> iris.iloc[0:2, :]
SepalLength SepalWidth PetalLength PetalWidth Name
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
Or you can use conditions, as in numpy:
>>> iris[iris.SepalLength < 5]
SepalLength SepalWidth PetalLength PetalWidth Name
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
...
Picking a single value returns a scalar:
>>> iris.iloc[0,0]
5.0999999999999996
For scalar access it's normally better to use at or iat (faster).
Assignment
Assignment works as expected:
>>> iris.loc[iris.SepalLength > 7.6, "Name"] = "Iris-orlando"
Beware that this doesn't work (it assigns to a copy, not the original):
>>> iris[iris.SepalLength > 7.6].Name = "Iris-orlando"
SettingWithCopyWarning: A value is trying to be set on a copy
of a slice from a DataFrame.
Operations
Operators work as expected:
>>> iris["SepalD"] = iris["SepalLength"] * iris["SepalWidth"]
There's also apply:
>>> iris[["SepalLength", "SepalWidth"]].apply(np.sqrt)
SepalLength SepalWidth
0 2.258318 1.870829
1 2.213594 1.732051
...
Use axis=1 to apply the function to each row.
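For example, a row-wise computation with axis=1 (a small sketch on made-up data):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
# One value per row: 4, 6
sums = df.apply(lambda row: row["a"] + row["b"], axis=1)
```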
SQL-like operations - group by
>>> iris.groupby("Name").mean()
SepalLength SepalWidth PetalLength PetalWidth
Name
Iris-setosa 5.006 3.418 1.464 0.244
Iris-versicolor 5.936 2.770 4.260 1.326
Iris-virginica 6.588 2.974 5.552 2.026
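groupby can also compute several aggregates at once via agg; a sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({"g": ["x", "x", "y"], "v": [1.0, 3.0, 5.0]})
# One row per group, one column per aggregate
out = df.groupby("g")["v"].agg(["mean", "max"])
```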
SQL-like operations (2) - join
>>> left
key lval
0 foo 1
1 foo 2
>>> right
key rval
0 foo 4
1 foo 5
>>> pd.merge(left, right, on='key')
key lval rval
0 foo 1 4
1 foo 1 5
...
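The same example, reconstructed so it runs end to end (merge defaults to an inner join, and every matching key pair combines):

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})

# Every 'foo' on the left pairs with every 'foo' on the right: 2 x 2 = 4 rows
merged = pd.merge(left, right, on="key")
```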
Time Series
Pandas data frames can also work as time series, replacing R’s ts, xts or zoo.
Downloading some stock data from Google Finance:
>>> import pandas.io.data as web
>>> import datetime
>>> aapl = web.DataReader("AAPL", 'google',
...                       datetime.datetime(2013, 1, 1),
...                       datetime.datetime(2014, 1, 1))
(In newer pandas versions this functionality lives in the separate pandas-datareader package.)
Open High Low Close Volume
Date
2013-01-02 79.12 79.29 77.38 78.43 140124866
2013-01-03 78.27 78.52 77.29 77.44 88240950
...
Time series are just regular pandas data frames but with time stamps as indices.
Time Series (2)
Use loc to select based on dates:
>>> aapl.loc['20130131':'20130217']
Open High Low Close Volume
Date
2013-01-31 65.28 65.61 65.00 65.07 79833215
2013-02-01 65.59 65.64 64.05 64.80 134867089
...
Use iloc as before for selecting based on numerical indices:
>>> aapl.iloc[1:3]
Open High Low Close Volume
Date
2013-01-03 78.27 78.52 77.29 77.44 88240950
2013-01-04 76.71 76.95 75.12 75.29 148581860
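With a DatetimeIndex you also get resampling, covering much of what xts's period.apply does; a sketch on synthetic data:

```python
import pandas as pd

idx = pd.date_range("2013-01-01", periods=6, freq="D")
ts = pd.DataFrame({"Close": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=idx)

# Mean over two-day buckets: 1.5, 3.5, 5.5
means = ts.resample("2D").mean()
```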
Section 4
Analysis
Statistical tests
Use scipy.stats for common statistical tests:
>>> from scipy import stats
>>> iris_virginica = iris[iris.Name == 'Iris-virginica'].SepalLength.values
>>> iris_setosa = iris[iris.Name == 'Iris-setosa'].SepalLength.values
>>> t_test = stats.ttest_ind(iris_virginica, iris_setosa)
>>> t_test.pvalue
6.8925460606740589e-28
Use scikits.bootstrap for bootstrapped confidence intervals.
Ordinary Least Squares
Use statsmodels:
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
The formula API is very similar to R:
>>> results = smf.ols("PetalWidth ~ Name + PetalLength", data=iris).fit()
It automatically includes an intercept (just like R).
Use smf.glm for generalized linear models.
Formula API
Very similar to R:
You can include arbitrary transformations, e.g. "np.log(PetalWidth)".
To remove the intercept, add "- 1" or "0 +".
Use "C(a)" to coerce a number to a factor.
Use "a:b" to model interactions between a and b.
"a*b" means "a + b + a:b".
Strings are automatically coerced to factors (more on this later).
Decision trees
Use scikit-learn:
>>> from sklearn import tree
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(sk_iris.data, sk_iris.target)
After being fitted, the model can then be used to predict the class of samples:
>>> clf.predict([[5.1, 3.5, 1.4, 0.2]])
array([0])
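The sk_iris object in these examples isn't defined on the slides; presumably it comes from scikit-learn's bundled datasets, along these lines:

```python
from sklearn import datasets, tree

sk_iris = datasets.load_iris()  # data: 150x4 feature matrix, target: labels 0..2
clf = tree.DecisionTreeClassifier()
clf = clf.fit(sk_iris.data, sk_iris.target)

# [5.1, 3.5, 1.4, 0.2] is the first training sample (a setosa), so class 0
clf.predict([[5.1, 3.5, 1.4, 0.2]])
```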
Support Vector Machines
scikit-learn has a very regular API. Here's the same example using an SVM:
>>> from sklearn import svm
>>> clf_svm = svm.SVC()
>>> clf_svm = clf_svm.fit(sk_iris.data, sk_iris.target)
>>> clf_svm.predict([[5.1, 3.5, 1.4, 0.2]])
array([0])
K-Means clustering
Clustering follows the same pattern:
>>> from sklearn import cluster
>>> k_means = cluster.KMeans(n_clusters=3)
>>> k_means.fit(sk_iris.data)
KMeans(copy_x=True, init='k-means++', ...
labels_ contains the assigned categories, following the same order as the data:
>>> k_means.labels_
array([1, 1, 1, 1, 1...
predict works the same as for the other models, and returns the predicted category.
Principal Component Analysis
from sklearn import decomposition
pca = decomposition.PCA(n_components=3)
pca = pca.fit(sk_iris.data)
explained_variance_ratio_ and components_ will include the explained variance
and the PCA components respectively:
>>> pca.explained_variance_ratio_
array([ 0.92461621, 0.05301557, 0.01718514])
>>> pca.components_
array([[ 0.36158968, -0.08226889, 0.85657211, 0.35884393],
[-0.65653988, -0.72971237, 0.1757674 , 0.07470647],
[ 0.58099728, -0.59641809, -0.07252408, -0.54906091]])
Cross validation
scikit-learn also includes extensive support for cross-validation. Here’s a simple split
into training and out-of-sample:
>>> from sklearn import cross_validation
>>> X_train, X_test, y_train, y_test = cross_validation.train_test_split(
... sk_iris.data, sk_iris.target, test_size=0.4, random_state=0)
>>> X_train.shape, y_train.shape
((90, 4), (90,))
>>> X_test.shape, y_test.shape
((60, 4), (60,))
It also supports K-fold, stratified K-fold, shuffling, etc. (In scikit-learn 0.18+ this module was renamed to model_selection.)
NAs
There's no builtin NA in Python; you normally use NaN for NAs. numpy has a number
of builtin functions that ignore NaNs:
>>> a = np.array([1.0, 3.0, np.NaN, 5.0])
>>> a.sum()
nan
>>> np.nansum(a)
9.0
Pandas usually skips NaNs when computing sums, means, etc., and propagates them
in element-wise operations.
scikit-learn assumes there is no missing data, so be sure to pre-process it, e.g.
remove missing values or set them to 0. Look at sklearn.preprocessing.Imputer.
statsmodels also uses NaNs for missing data, but only has basic support for
handling them (it can drop them or raise an error). See the missing
argument in the model classes.
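In pandas this looks as follows; dropna and fillna are the usual tools for cleaning missing values before handing data to scikit-learn:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
print(s.sum())                 # 4.0 -- pandas skips the NaN
print(s.dropna().tolist())     # [1.0, 3.0] -- drop missing values
print(s.fillna(0.0).tolist())  # [1.0, 0.0, 3.0] -- or impute them
```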
Factors
Similarly to NAs, Python has no builtin factor data type.
Different packages handle them differently:
numpy has no support for factors. Use integers.
Pandas has categoricals, which work fairly similarly to factors.
statsmodels converts strings to its own internal factor type, very similar to R's.
There's also the 'C' operator.
scikit-learn doesn’t support factors internally, but has some tools to convert strings
into dummy variables, e.g. DictVectorizer
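Pandas categoricals in a nutshell; cat.codes plays the role of R's as.integer(factor(...)), except codes start at 0:

```python
import pandas as pd

s = pd.Series(["a", "b", "a"], dtype="category")  # like factor(c("a", "b", "a"))
print(list(s.cat.categories))  # ['a', 'b'], like levels()
print(s.cat.codes.tolist())    # [0, 1, 0] -- note codes start at 0, not 1
```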
Notorious Omissions
Bayesian modelling
Time series analysis
Econometrics
Signal processing, e.g. filter design
Natural language processing
. . .
Section 5
Visualization
Section 6
I/O
Pandas
read_csv reads data from CSV files:
>>> pd.read_csv('foo.csv')
Unnamed: 0 A B C D
0 2000-01-01 0.266457 -0.399641 -0.219582 1.186860
1 2000-01-02 -1.170732 -0.345873 1.653061 -0.282953
...
Conversely, there is to_csv to write CSV files:
>>> df.to_csv('foo.csv')
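A round-trip sketch; index=False stops pandas from writing the row index as the unnamed first column seen above (io.StringIO stands in for a real file):

```python
import io
import pandas as pd

df = pd.DataFrame({"A": [0.27, -1.17], "B": [-0.40, -0.35]})
buf = io.StringIO()
df.to_csv(buf, index=False)  # index=False drops the row-index column
buf.seek(0)
df2 = pd.read_csv(buf)
```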
Other options
For data frames:
HDF5: read_hdf, to_hdf
Excel: read_excel, to_excel
SQL: read_sql, to_sql
Stata: read_stata, to_stata
SAS: read_sas (read only; there is no to_sas)
REST APIs: read_json or alternatively use requests
For numpy arrays:
You can use np.load and np.save for the .npy format
Normally I prefer to use HDF5 with the h5py library
h5py - datasets
Creating a data set:
>>> import h5py
>>> import numpy as np
>>>
>>> f = h5py.File("mytestfile.hdf5", "w")
>>> dset = f.create_dataset("mydataset", (100,), dtype='i')
Datasets work similarly to numpy arrays:
>>> dset[...] = np.arange(100)
>>> dset[0]
0
>>> dset[10]
10
>>> dset[0:100:10]
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])
Other options
pickle - Python's standard serialization format. See also shelve.
tinydb - local document-oriented database (good for NLP tasks)
sqlalchemy - heavy-duty SQL toolkit and object-relational mapper.
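A minimal pickle round trip; most Python objects, including numpy arrays and data frames, serialize this way:

```python
import pickle

data = {"weights": [0.1, 0.2, 0.3], "label": "model-v1"}
blob = pickle.dumps(data)      # bytes; pickle.dump(obj, f) writes to a file
restored = pickle.loads(blob)  # pickle.load(f) reads back from a file
```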
Section 7
Conclusion
Things I haven’t covered:
Python data structures: dicts, lists
Python - R interoperability: RPy
Parallel computing: IPython.parallel, pyspark
Optimizing Python code: Cython, numba, numexpr
Hope you've enjoyed it. Feel free to get in touch: amatos@lambdatree.com

 
Python于Web 2.0网站的应用 - QCon Beijing 2010
Python于Web 2.0网站的应用 - QCon Beijing 2010Python于Web 2.0网站的应用 - QCon Beijing 2010
Python于Web 2.0网站的应用 - QCon Beijing 2010
 
Multiple file programs, inheritance, templates
Multiple file programs, inheritance, templatesMultiple file programs, inheritance, templates
Multiple file programs, inheritance, templates
 
Numpy Talk at SIAM
Numpy Talk at SIAMNumpy Talk at SIAM
Numpy Talk at SIAM
 
Monad presentation scala as a category
Monad presentation   scala as a categoryMonad presentation   scala as a category
Monad presentation scala as a category
 
Python3 cheatsheet
Python3 cheatsheetPython3 cheatsheet
Python3 cheatsheet
 

Ähnlich wie Python for R developers and data scientists

Effective Numerical Computation in NumPy and SciPy
Effective Numerical Computation in NumPy and SciPyEffective Numerical Computation in NumPy and SciPy
Effective Numerical Computation in NumPy and SciPyKimikazu Kato
 
Advanced Web Technology ass.pdf
Advanced Web Technology ass.pdfAdvanced Web Technology ass.pdf
Advanced Web Technology ass.pdfsimenehanmut
 
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learnNumerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learnArnaud Joly
 
Workshop "Can my .NET application use less CPU / RAM?", Yevhen Tatarynov
Workshop "Can my .NET application use less CPU / RAM?", Yevhen TatarynovWorkshop "Can my .NET application use less CPU / RAM?", Yevhen Tatarynov
Workshop "Can my .NET application use less CPU / RAM?", Yevhen TatarynovFwdays
 
An Overview Of Python With Functional Programming
An Overview Of Python With Functional ProgrammingAn Overview Of Python With Functional Programming
An Overview Of Python With Functional ProgrammingAdam Getchell
 
UNIT III_Python Programming_aditya COllege
UNIT III_Python Programming_aditya COllegeUNIT III_Python Programming_aditya COllege
UNIT III_Python Programming_aditya COllegeRamanamurthy Banda
 
UNIT III_Python Programming_aditya COllege
UNIT III_Python Programming_aditya COllegeUNIT III_Python Programming_aditya COllege
UNIT III_Python Programming_aditya COllegeRamanamurthy Banda
 
Programming python quick intro for schools
Programming python quick intro for schoolsProgramming python quick intro for schools
Programming python quick intro for schoolsDan Bowen
 
Data Analysis with R (combined slides)
Data Analysis with R (combined slides)Data Analysis with R (combined slides)
Data Analysis with R (combined slides)Guy Lebanon
 
Benchy: Lightweight framework for Performance Benchmarks
Benchy: Lightweight framework for Performance Benchmarks Benchy: Lightweight framework for Performance Benchmarks
Benchy: Lightweight framework for Performance Benchmarks Marcel Caraciolo
 
PPT on Data Science Using Python
PPT on Data Science Using PythonPPT on Data Science Using Python
PPT on Data Science Using PythonNishantKumar1179
 
Datatypes in python
Datatypes in pythonDatatypes in python
Datatypes in pythoneShikshak
 
DataCamp Cheat Sheets 4 Python Users (2020)
DataCamp Cheat Sheets 4 Python Users (2020)DataCamp Cheat Sheets 4 Python Users (2020)
DataCamp Cheat Sheets 4 Python Users (2020)EMRE AKCAOGLU
 
Python For Scientists
Python For ScientistsPython For Scientists
Python For Scientistsaeberspaecher
 
An overview of Python 2.7
An overview of Python 2.7An overview of Python 2.7
An overview of Python 2.7decoupled
 

Ähnlich wie Python for R developers and data scientists (20)

Effective Numerical Computation in NumPy and SciPy
Effective Numerical Computation in NumPy and SciPyEffective Numerical Computation in NumPy and SciPy
Effective Numerical Computation in NumPy and SciPy
 
Advanced Web Technology ass.pdf
Advanced Web Technology ass.pdfAdvanced Web Technology ass.pdf
Advanced Web Technology ass.pdf
 
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learnNumerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
 
Workshop "Can my .NET application use less CPU / RAM?", Yevhen Tatarynov
Workshop "Can my .NET application use less CPU / RAM?", Yevhen TatarynovWorkshop "Can my .NET application use less CPU / RAM?", Yevhen Tatarynov
Workshop "Can my .NET application use less CPU / RAM?", Yevhen Tatarynov
 
An Overview Of Python With Functional Programming
An Overview Of Python With Functional ProgrammingAn Overview Of Python With Functional Programming
An Overview Of Python With Functional Programming
 
UNIT III_Python Programming_aditya COllege
UNIT III_Python Programming_aditya COllegeUNIT III_Python Programming_aditya COllege
UNIT III_Python Programming_aditya COllege
 
UNIT III_Python Programming_aditya COllege
UNIT III_Python Programming_aditya COllegeUNIT III_Python Programming_aditya COllege
UNIT III_Python Programming_aditya COllege
 
Programming python quick intro for schools
Programming python quick intro for schoolsProgramming python quick intro for schools
Programming python quick intro for schools
 
Data Analysis with R (combined slides)
Data Analysis with R (combined slides)Data Analysis with R (combined slides)
Data Analysis with R (combined slides)
 
Unit2 input output
Unit2 input outputUnit2 input output
Unit2 input output
 
07. Arrays
07. Arrays07. Arrays
07. Arrays
 
Python 3.pptx
Python 3.pptxPython 3.pptx
Python 3.pptx
 
Benchy: Lightweight framework for Performance Benchmarks
Benchy: Lightweight framework for Performance Benchmarks Benchy: Lightweight framework for Performance Benchmarks
Benchy: Lightweight framework for Performance Benchmarks
 
PPT on Data Science Using Python
PPT on Data Science Using PythonPPT on Data Science Using Python
PPT on Data Science Using Python
 
Datatypes in python
Datatypes in pythonDatatypes in python
Datatypes in python
 
DataCamp Cheat Sheets 4 Python Users (2020)
DataCamp Cheat Sheets 4 Python Users (2020)DataCamp Cheat Sheets 4 Python Users (2020)
DataCamp Cheat Sheets 4 Python Users (2020)
 
Python For Scientists
Python For ScientistsPython For Scientists
Python For Scientists
 
An overview of Python 2.7
An overview of Python 2.7An overview of Python 2.7
An overview of Python 2.7
 
A tour of Python
A tour of PythonA tour of Python
A tour of Python
 
Feature Engineering in NLP.pdf
Feature Engineering in NLP.pdfFeature Engineering in NLP.pdf
Feature Engineering in NLP.pdf
 

Kürzlich hochgeladen

Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 

Kürzlich hochgeladen (20)

Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 

Python for R developers and data scientists

  • 9. Python syntax in 2 minutes
    quote_pad_string:
    - converts input to uppercase;
    - cuts anything longer than 10 characters;
    - adds extra spaces if shorter than 10 characters;
    - adds single quotes.
    >>> quote_pad_string("This is rather long")
    'THIS IS RA'
    >>> quote_pad_string("Short")
    'SHORT     '
  • 10-16. Python syntax in 2 minutes (2)
    A single function, walked through feature by feature:
    - def defines the body of a function; Python is dynamically typed.
    - Python uses indentation for code blocks instead of curly braces.
    - '=' for assignment.
    - 'if' statement; 'for' statement.
    - len(a_string) is a function call, a_string.upper() is a method invocation.

    # WARNING: This is purely to cover some basic Python syntax;
    # there are better ways to do this in Python.
    def quote_pad_string(a_string):
        maximum_length = 10
        num_missing_characters = maximum_length - len(a_string)
        if num_missing_characters < 0:
            num_missing_characters = 0
        if num_missing_characters:
            for i in range(num_missing_characters):
                a_string = a_string + " "
        else:
            a_string = a_string[:maximum_length]
        return "'" + a_string.upper() + "'"
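As the warning comment says, there are better ways to write this in Python. A minimal sketch of the same behavior using built-in string methods (the function name is kept from the slides):

```python
def quote_pad_string(a_string):
    # Truncate to 10 characters, right-pad with spaces to width 10,
    # uppercase, and wrap in single quotes -- all via str methods.
    return "'" + a_string[:10].ljust(10).upper() + "'"

print(quote_pad_string("This is rather long"))  # 'THIS IS RA'
print(quote_pad_string("Short"))                # 'SHORT     '
```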
  • 17. Section 2: Vectors
  • 18. Scalars - R
    In R there are no real scalar types. They are just vectors of length 1:
    > a <- 5  # Equivalent to a <- c(5)
    > a
    [1] 5
    > length(a)
    [1] 1
  • 19. Scalars - Python
    In Python scalars and vectors are not the same thing:
    >>> a = 5              # scalar
    >>> b = np.array([5])  # array with one element
    >>> len(b)             # equivalent to length() in R
    1
    This won't work:
    >>> len(a)
    TypeError: object of type 'int' has no len()
  • 20. Vectors, matrices and arrays - R
    In R, there's c() for 1-d vectors, matrix() for 2 dimensions, and array() for higher dimensions:
    > c(1,2,3,4)
    [1] 1 2 3 4
    > matrix(1:4, nrow=2, ncol=2)
         [,1] [,2]
    [1,]    1    3
    [2,]    2    4
    > array(1:3, c(2,4,6))
    ...
    Strangely enough, a 1-d array is not the same as a vector:
    > a <- as.array(1:3)
    > a
    [1] 1 2 3
    > is.vector(a)
    [1] FALSE
  • 21. Vectors, matrices and arrays - Python
    Python has no builtin vector or matrix type. You will need numpy:
    >>> import numpy as np          # equivalent to library(numpy) in R
    >>> np.array([1, 2, 3, 4])      # 1-d vector
    array([1, 2, 3, 4])
    >>> np.array([[1, 2], [3, 4]])  # matrix
    array([[1, 2],
           [3, 4]])
    np.array works with any dimension and is a single type (ndarray).
    (There is also a matrix type specifically for two dimensions, but it should be avoided. Always use ndarray.)
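Besides np.array, numpy has constructors that roughly map to R's matrix fillers; a sketch using the standard numpy API (note that R's matrix(1:4, nrow=2) fills column-wise, while reshape fills row-wise, so the reshape below matches matrix(1:4, nrow=2, byrow=TRUE)):

```python
import numpy as np

z = np.zeros((2, 3))  # like matrix(0, nrow=2, ncol=3): shape comes as a tuple
o = np.ones((2, 2))   # like matrix(1, nrow=2, ncol=2)
m = np.arange(1, 5).reshape(2, 2)  # like matrix(1:4, nrow=2, byrow=TRUE)

print(z.shape)  # (2, 3)
print(m)        # [[1 2] [3 4]]
```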
  • 22. Generating regular sequences
    R:
    > 5:10  # shorthand for seq
    [1] 5 6 7 8 9 10
    > seq(0, 1, length.out = 11)
    [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
    Python:
    >>> np.arange(3.0)
    array([ 0., 1., 2.])
    >>> np.arange(3, 7)
    array([3, 4, 5, 6])
    >>> np.arange(3, 7, 2)
    array([3, 5])
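R's seq(0, 1, length.out = 11) has a closer numpy analogue than arange; a sketch with np.linspace, which takes a fixed number of points and includes the endpoint:

```python
import numpy as np

# Like seq(0, 1, length.out = 11) in R: 11 evenly spaced points, 1.0 included
x = np.linspace(0.0, 1.0, 11)
print(x)  # [0.  0.1 0.2 ... 1. ]
```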
  • 23. Vector operations - Python
    For the most part, vector operations in Python work just like in R:
    >>> a = np.arange(5.0)
    >>> a
    array([ 0., 1., 2., 3., 4.])
    >>> 1.0 + a   # adding
    array([ 1., 2., 3., 4., 5.])
    >>> a * a     # multiplying element-wise
    array([ 0., 1., 4., 9., 16.])
    >>> a ** 3    # to the power of 3
    array([ 0., 1., 8., 27., 64.])
  • 24. Vector operations - Python (2)
    For matrix multiplication use the @ operator (Python 3.5 and later):
    >>> a = np.array([[1, 0], [0, 1]])
    >>> b = np.array([[4, 1], [2, 2]])
    >>> a @ b
    array([[4, 1],
           [2, 2]])
    (In Python 2 use np.dot(a, b).)
  • 25. Vector operations - Python (3)
    Numpy also has the usual mathematical operations that work on vectors:
    >>> a = np.arange(5.0)
    >>> np.sin(a)
    array([ 0., 0.84147098, 0.90929743, 0.14112001, -0.7568025 ])
    Full reference here:
    http://docs.scipy.org/doc/numpy/reference/routines.math.html
  • 26. Recycling - R
    When doing vector operations, R automatically extends the smaller operand to be as long as the other:
    > c(1,2) + c(1,2,3,4)  # Equivalent to c(1,2,1,2) + c(1,2,3,4)
    [1] 2 4 4 6
    This even works when the lengths aren't multiples of one another, albeit with a warning:
    > c(1,2) + c(1,2,3,4,5)
    [1] 2 4 4 6 6
    Warning message:
    In c(1, 2) + c(1, 2, 3, 4, 5) :
      longer object length is not a multiple of shorter object length
  • 27. Recycling - Python
    This won't work in Python, however:
    >>> np.arange(2.0) + np.arange(4.0)
    ValueError: operands could not be broadcast together with shapes (2,) (4,)
    Numpy has much stricter recycling (aka broadcasting) rules.
  • 28-34. Broadcasting Rules - Python
    2x3 and 2x1 — the column is stretched across columns:
      [[0 1 2]    [[0]      [[0 1 2]    [[0 0 0]
       [3 4 5]] +  [1]]  =   [3 4 5]] +  [1 1 1]]
    2x3 and 1x3 — the row is stretched across rows:
      [[0 1 2]                [[0 1 2]    [[0 1 2]
       [3 4 5]] + [0 1 2]  =   [3 4 5]] +  [0 1 2]]
    Adding a single-element array or a scalar always works:
      [[0 1 2]          [[0 1 2]    [[0 0 0]
       [3 4 5]] + 0  =   [3 4 5]] +  [0 0 0]]
    This won't work (the dimensions need to match exactly or be 1):
      [0 1 2 3] + [0 1]
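The broadcasting rules above can be checked directly; a sketch reproducing the 2x3 cases with standard numpy:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]
col = np.array([[0], [1]])      # 2x1: stretched across columns
row = np.array([0, 1, 2])       # length 3: stretched across rows

print(a + col)  # [[0 1 2] [4 5 6]]
print(a + row)  # [[0 2 4] [3 5 7]]
print(a + 10)   # a scalar always broadcasts

# Mismatched 1-d lengths fail, unlike R's recycling:
try:
    np.arange(4.0) + np.arange(2.0)
except ValueError as exc:
    print("broadcast error:", exc)
```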
  • 35-42. Indexing - i:j:k syntax
    a = np.arange(10)   # a = 0 1 2 3 4 5 6 7 8 9
    - Indexing in Python starts from 0 (not 1). Indexing on a single value returns a scalar (not an array!):
      >>> a[0]
      0
    - Use i:j to index from position i to j-1:
      >>> a[1:3]
      array([1, 2])
    - An optional k element defines the step:
      >>> a[1:7:2]
      array([1, 3, 5])
    - i and j can be negative, which means they count from the end:
      >>> a[1:-3]
      array([1, 2, 3, 4, 5, 6])
      >>> a[-3:-1]
      array([7, 8])
    - A negative k goes in the opposite direction:
      >>> a[-3:-9:-1]
      array([7, 6, 5, 4, 3, 2])
    - Not all of i, j, k need to be included:
      >>> a[::-1]
      array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
  • 43. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Indexing - multiple dimensions Use ‘,’ for additional dimensions: x = 1 2 3 4 5 6 >>> x[0:2, 0:1] array([[1], [4]])
  • 44. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Indexing using conditions Operators like ‘>’ or ‘<=’ operate element-wise and return a logical vector: >>> a > 4 array([False, False, False, False, False, True, True, True, True, True], dtype=bool) These can be combined into more complex expressions with ‘&’ and ‘|’ (Python’s ‘and’/‘or’ don’t work on arrays): >>> (a > 2) & (b ** 2 <= a) ... And used as indexing too: >>> a[a > 4] array([5, 6, 7, 8, 9])
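A quick sketch of combining boolean masks; note the parentheses around each comparison, which are required because ‘&’ binds tighter than ‘>’:

```python
import numpy as np

a = np.arange(10)

# Element-wise AND: each comparison needs its own parentheses
both = a[(a > 2) & (a < 7)]      # -> array([3, 4, 5, 6])

# Element-wise OR
either = a[(a < 2) | (a > 7)]    # -> array([0, 1, 8, 9])
```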
  • 45. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Assignment Any index can be used together with ‘=’ for assignment: >>> a[0] = 10 array([10, 1, 2, 3, 4, 5, 6, 7, 8, 9]) Conditions work as well: >>> a[a > 4] = 99 array([99, 1, 2, 3, 4, 99, 99, 99, 99, 99])
  • 46. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Section 3 Data Frames
  • 47. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Pandas Data frames in Python aren’t built in. You will need pandas: import pandas as pd Loading the iris dataset: >>> iris = pd.read_csv("https://raw.githubusercontent.com/pydata/pandas/master/pandas/tests/data/iris.csv") >>> iris.head() SepalLength SepalWidth PetalLength PetalWidth Name 0 5.1 3.5 1.4 0.2 Iris-setosa 1 4.9 3.0 1.4 0.2 Iris-setosa 2 4.7 3.2 1.3 0.2 Iris-setosa 3 4.6 3.1 1.5 0.2 Iris-setosa 4 5.0 3.6 1.4 0.2 Iris-setosa
  • 48. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Selection SepalLength SepalWidth PetalLength PetalWidth Name 0 5.1 3.5 1.4 0.2 Iris-setosa 1 4.9 3.0 1.4 0.2 Iris-setosa 2 4.7 3.2 1.3 0.2 Iris-setosa 3 4.6 3.1 1.5 0.2 Iris-setosa 4 5.0 3.6 1.4 0.2 Iris-setosa . . . . . . . . . . . . . . . . . .
  • 49. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Selection ‘.<columnName>’ returns only that column: >>> iris.SepalLength 0 5.1 1 4.9 2 4.7 ...
  • 50. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Selection This also works: >>> iris["SepalLength"] ...
  • 51. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Selection You can select multiple columns by passing a list: >>> iris[["SepalWidth", "SepalLength"]] SepalWidth SepalLength 0 3.5 5.1 1 3.0 4.9 ...
  • 52. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Selection Or if you pass a slice you can select rows: >>> iris[1:3] SepalLength SepalWidth PetalLength PetalWidth Name 1 4.9 3.0 1.4 0.2 Iris-setosa 2 4.7 3.2 1.3 0.2 Iris-setosa
  • 53. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Selection With loc you can slice both rows and columns: >>> iris.loc[1:3, ["SepalLength", "SepalWidth"]] SepalLength SepalWidth 1 4.9 3.0 2 4.7 3.2 3 4.6 3.1 loc is inclusive at the end.
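The inclusive endpoint of loc is worth a side-by-side check; a minimal sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30, 40]})

by_label = df.loc[1:3, "x"]     # loc is label-based and inclusive: rows 1, 2 and 3
by_position = df.iloc[1:3, 0]   # iloc follows numpy slicing: rows 1 and 2 only
```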
  • 54. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Selection As well as a single row: >>> iris.loc[3] SepalLength 4.6 SepalWidth 3.1 PetalLength 1.5 PetalWidth 0.2 Name Iris-setosa Name: 3, dtype: object
  • 55. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Selection iloc works with integer indices, similar to numpy arrays: >>> iris.iloc[0:2, 0:3] SepalLength SepalWidth PetalLength 0 5.1 3.5 1.4 1 4.9 3.0 1.4
  • 56. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Selection ‘:’ will include all the rows (or all the columns): >>> iris.iloc[0:2, :] SepalLength SepalWidth PetalLength PetalWidth Name 0 5.1 3.5 1.4 0.2 Iris-setosa 1 4.9 3.0 1.4 0.2 Iris-setosa
  • 57. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Selection Or you can use conditions like numpy: >>> iris[iris.SepalLength < 5] SepalLength SepalWidth PetalLength PetalWidth Name 1 4.9 3.0 1.4 0.2 Iris-setosa 2 4.7 3.2 1.3 0.2 Iris-setosa 3 4.6 3.1 1.5 0.2 Iris-setosa ...
  • 58. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Selection Picking a single value returns a scalar: >>> iris.iloc[0,0] 5.0999999999999996 Normally it’s better to use at or iat (faster).
  • 59. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Assignment Assignment works as expected: >>> iris.loc[iris.SepalLength > 7.6, "Name"] = "Iris-orlando" Beware that this doesn’t work: >>> iris[iris.SepalLength > 7.6].Name = "Iris-orlando" SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
  • 60. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Operations Operators work as expected: >>> iris["SepalD"] = iris["SepalLength"] * iris["SepalWidth"] There’s also apply: >>> iris[["SepalLength", "SepalWidth"]].apply(np.sqrt) SepalLength SepalWidth 0 2.258318 1.870829 1 2.213594 1.732051 ... Use axis=1 to apply function to each row.
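A sketch of the axis=1 form, applying a function across each row (the toy column names here are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# axis=1 passes each row (as a Series) to the function
row_sums = df.apply(np.sum, axis=1)   # -> 4.0, 6.0

# axis=0 (the default) works column by column instead
col_sums = df.apply(np.sum)           # -> a: 3.0, b: 7.0
```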
  • 61. Installation Vectors Data Frames Analysis Visualization I/O Conclusion SQL-like operations - group by >>> iris.groupby("Name").mean() SepalLength SepalWidth PetalLength PetalWidth Name Iris-setosa 5.006 3.418 1.464 0.244 Iris-versicolor 5.936 2.770 4.260 1.326 Iris-virginica 6.588 2.974 5.552 2.026
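groupby composes with other aggregations too, much like R's aggregate or dplyr's summarise; a sketch on toy data (column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "a", "b", "b"],
                   "val": [1.0, 3.0, 2.0, 6.0]})

means = df.groupby("name")["val"].mean()            # a -> 2.0, b -> 4.0

# agg applies several aggregations at once
stats = df.groupby("name")["val"].agg(["min", "max"])
```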
  • 62. Installation Vectors Data Frames Analysis Visualization I/O Conclusion SQL-like operations (2) - join >>> left key lval 0 foo 1 1 foo 2 >>> right key rval 0 foo 4 1 foo 5 >>> pd.merge(left, right, on='key') key lval rval 0 foo 1 4 1 foo 1 5 ...
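merge defaults to an inner join; the how= argument gives left, right or outer joins, roughly like R's merge(all=...). A small sketch:

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})

inner = pd.merge(left, right, on="key")               # only matching keys
outer = pd.merge(left, right, on="key", how="outer")  # all keys, NaN-filled
```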
  • 63. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Time Series Pandas data frames can also work as time series, replacing R’s ts, xts or zoo. Downloading some stock data from Google Finance: >>> import pandas.io.data as web >>> import datetime >>> aapl = web.DataReader("AAPL", 'google', datetime.datetime(2013, 1, 1), datetime.datetime(2014, 1, 1)) Open High Low Close Volume Date 2013-01-02 79.12 79.29 77.38 78.43 140124866 2013-01-03 78.27 78.52 77.29 77.44 88240950 ... Time series are just regular pandas data frames but with time stamps as indices.
  • 64. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Time Series (2) Use loc to select based on dates: >>> aapl.loc['20130131':'20130217'] Open High Low Close Volume Date 2013-01-31 65.28 65.61 65.00 65.07 79833215 2013-02-01 65.59 65.64 64.05 64.80 134867089 ... Use iloc as before for selecting based on numerical indices: >>> aapl.iloc[1:3] Open High Low Close Volume Date 2013-01-03 78.27 78.52 77.29 77.44 88240950 2013-01-04 76.71 76.95 75.12 75.29 148581860
  • 65. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Section 4 Analysis
  • 66. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Statistical tests Use scipy.stats for common statistical tests: >>> from scipy import stats >>> iris_virginica = iris[iris.Name == 'Iris-virginica'].SepalLength.values >>> iris_setosa = iris[iris.Name == 'Iris-setosa'].SepalLength.values >>> t_test = stats.ttest_ind(iris_virginica, iris_setosa) >>> t_test.pvalue 6.8925460606740589e-28 Use scikits.bootstrap for bootstrapped confidence intervals.
  • 67. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Ordinary Least Squares Use statsmodels: import numpy as np import statsmodels.api as sm import statsmodels.formula.api as smf The formula API is very similar to R: >>> results = smf.ols("PetalWidth ~ Name + PetalLength", data=iris).fit() It automatically includes an intercept (just like R). Use smf.glm for generalized linear models.
  • 68. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Formula API Very similar to R: You can include arbitrary transformations, e.g. “np.log(PetalWidth)”. To remove the intercept add a “- 1” or “0 +” Use “C(a)” to coerce a number to a factor Use “a:b” for modelling interactions between a and b. “a*b” means “a + b + a:b” Strings are automatically coerced to factors (more on this later)
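A minimal sketch of these formula features on toy data (assumes statsmodels is installed; the column names y, x and g are made up for illustration):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"y": [1.0, 2.1, 2.9, 4.2, 5.1, 5.9],
                   "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                   "g": ["a", "a", "a", "b", "b", "b"]})

# Intercept is implicit; the string column g is coerced to a factor,
# giving parameters: Intercept, g[T.b], x
fit = smf.ols("y ~ x + g", data=df).fit()

# "0 +" (or "- 1") drops the intercept
fit_no_intercept = smf.ols("y ~ 0 + x", data=df).fit()
```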
  • 69. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Decision trees Use scikit-learn: >>> from sklearn import tree >>> clf = tree.DecisionTreeClassifier() >>> clf = clf.fit(sk_iris.data, sk_iris.target) After being fitted, the model can then be used to predict the class of samples: >>> clf.predict([[5.1, 3.5, 1.4, 0.2]]) array([0])
  • 70. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Support Vector Machines scikit-learn has a very regular API. Here’s the same example using an SVM: >>> from sklearn import svm >>> clf_svm = svm.SVC() >>> clf_svm = clf_svm.fit(sk_iris.data, sk_iris.target) >>> clf_svm.predict([[5.1, 3.5, 1.4, 0.2]]) array([0])
  • 71. Installation Vectors Data Frames Analysis Visualization I/O Conclusion K-Means clustering Clustering follows the same pattern: >>> from sklearn import cluster >>> k_means = cluster.KMeans(n_clusters=3) >>> k_means.fit(sk_iris.data) KMeans(copy_x=True, init='k-means++', ... labels_ contains the assigned categories, following the same order as the data: >>> k_means.labels_ array([1, 1, 1, 1, 1... predict works the same as for the other models, and returns the predicted category.
  • 72. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Principal Component Analysis from sklearn import decomposition pca = decomposition.PCA(n_components=3) pca = pca.fit(sk_iris.data) explained_variance_ratio_ and components_ will include the explained variance and the PCA components respectively: >>> pca.explained_variance_ratio_ array([ 0.92461621, 0.05301557, 0.01718514]) >>> pca.components_ array([[ 0.36158968, -0.08226889, 0.85657211, 0.35884393], [-0.65653988, -0.72971237, 0.1757674 , 0.07470647], [ 0.58099728, -0.59641809, -0.07252408, -0.54906091]])
  • 73. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Cross validation scikit-learn also includes extensive support for cross-validation. Here’s a simple split into training and out-of-sample: >>> from sklearn import cross_validation >>> X_train, X_test, y_train, y_test = cross_validation.train_test_split( ... sk_iris.data, sk_iris.target, test_size=0.4, random_state=0) >>> X_train.shape, y_train.shape ((90, 4), (90,)) >>> X_test.shape, y_test.shape ((60, 4), (60,)) It also supports K-fold, stratified K-fold, shuffling, etc. . .
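A K-fold sketch; note that the sklearn.cross_validation module shown above was renamed to sklearn.model_selection in scikit-learn 0.18, so this sketch uses the newer name:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features

kf = KFold(n_splits=5)
folds = list(kf.split(X))          # 5 (train_indices, test_indices) pairs

# Every sample lands in exactly one test fold
all_test = np.concatenate([test for _, test in folds])
```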
  • 74. Installation Vectors Data Frames Analysis Visualization I/O Conclusion NAs There’s no builtin NA in Python. You normally use NaN for NAs. numpy has a bunch of builtin functions that ignore NaNs: >>> a = np.array([1.0, 3.0, np.NaN, 5.0]) >>> a.sum() nan >>> np.nansum(a) 9.0 Pandas skips NaNs by default when computing sums, means, etc., but propagates them through element-wise operations. scikit-learn assumes there’s no missing data, so be sure to pre-process it, e.g. remove NaNs or set them to 0. Look at sklearn.preprocessing.Imputer. statsmodels also uses NaNs for missing data, but only has basic support for handling them (it can only ignore them or raise an error). See the missing attribute in the model class.
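A quick sketch of the propagation and skipping behaviour described above, including the usual pandas pre-processing options:

```python
import numpy as np
import pandas as pd

a = np.array([1.0, 3.0, np.nan, 5.0])
s = pd.Series(a)

plain_sum = a.sum()       # numpy propagates: nan
nan_sum = np.nansum(a)    # NaN-ignoring variant: 9.0
pd_sum = s.sum()          # pandas skips NaN by default: 9.0

dropped = s.dropna()      # remove missing values entirely
filled = s.fillna(0.0)    # or replace them with a constant
```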
  • 75. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Factors Similarly to NAs, Python has no builtin factor data type. Different packages handle them differently: numpy has no support for factors. Use integers. Pandas has categoricals, which work fairly similarly to factors statsmodels converts strings to its own internal factor type, very similar to R. There’s also the ‘C’ operator. scikit-learn doesn’t support factors internally, but has some tools to convert strings into dummy variables, e.g. DictVectorizer
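A sketch of pandas categoricals and dummy coding, the rough equivalents of R's factor() and model.matrix():

```python
import pandas as pd

s = pd.Series(["low", "high", "low"], dtype="category")

# Integer codes, like unclass(factor(...)); categories sort alphabetically
codes = s.cat.codes                 # high -> 0, low -> 1

# One-hot / dummy encoding for libraries without factor support
dummies = pd.get_dummies(s)         # one column per category
```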
  • 76. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Notorious Omissions Bayesian modelling Time series analysis Econometrics Signal processing, e.g. filter design Natural language processing . . .
  • 77. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Section 5 Visualization
  • 78. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Visualization
  • 79. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Section 6 I/O
  • 80. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Pandas read_csv Reads data from CSV files >>> pd.read_csv('foo.csv') Unnamed: 0 A B C D 0 2000-01-01 0.266457 -0.399641 -0.219582 1.186860 1 2000-01-02 -1.170732 -0.345873 1.653061 -0.282953 ... Conversely there is to_csv to write CSV files: >>> df.to_csv('foo.csv')
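The "Unnamed: 0" column above is the written index read back as data; passing index=False to to_csv (or index_col=0 to read_csv) avoids it. A round-trip sketch using a temporary file:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3.0, 4.0]})

path = os.path.join(tempfile.mkdtemp(), "foo.csv")
df.to_csv(path, index=False)   # index=False: don't write the row index

back = pd.read_csv(path)       # columns come back as A, B only
```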
  • 81. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Other options For data frames: HDF5: read_hdf, to_hdf Excel: read_excel, to_excel SQL: read_sql, to_sql Stata: read_stata, to_stata SAS: read_sas (read only) REST APIs: read_json or alternatively use requests For numpy arrays: You can use load and save for saving into .npy format Normally I prefer to use HDF5 with the h5py library
  • 82. Installation Vectors Data Frames Analysis Visualization I/O Conclusion h5py - datasets Creating a data set: >>> import h5py >>> import numpy as np >>> >>> f = h5py.File("mytestfile.hdf5", "w") >>> dset = f.create_dataset("mydataset", (100,), dtype='i') Datasets work similarly to numpy arrays: >>> dset[...] = np.arange(100) >>> dset[0] 0 >>> dset[10] 10 >>> dset[0:100:10] array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])
  • 83. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Other options pickle - Python standard serialization format. See also shelve. tinydb - local document-oriented database (good for NLP tasks) sqlalchemy - Heavy-duty SQL to relational mapper.
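A minimal pickle sketch, roughly Python's counterpart to R's saveRDS/readRDS (the dictionary contents here are made up):

```python
import pickle

obj = {"weights": [0.1, 0.2], "label": "model-v1"}

blob = pickle.dumps(obj)        # serialize to bytes (pickle.dump writes to a file)
restored = pickle.loads(blob)   # deserialize back into an equal object
```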
  • 84. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Section 7 Conclusion
  • 85. Installation Vectors Data Frames Analysis Visualization I/O Conclusion Things I haven’t covered: Python data structures: dicts, lists Python - R interoperability: RPy Parallel computing: IPython.parallel, pyspark Optimizing Python code: Cython, numba, numexpr Hope you’ve enjoyed it. Feel free to get in touch: amatos@lambdatree.com