Python for R developers and data scientists

Installation Vectors Data Frames Analysis Visualization I/O Conclusion
Python for R developers and data scientists
Artur Matos
http://www.lambdatree.com
June 8, 2016

Outline
1 Getting Up and Running
2 Vectors
3 Data Frames
4 Analysis
5 Visualization
6 I/O
7 Conclusion

Section 1
Getting Up and Running

Which Python?
Python runtimes
Several available: CPython, PyPy, Jython. . .
CPython is the official runtime written in C.
PyPy is a JIT-based runtime that runs significantly faster than CPython.
For scientific computing, CPython is the only choice.
Python 2 vs Python 3
Python 3 is not backwards compatible
Answer today is Python 3 (might have answered differently last year)
Unless you have other teams using Python 2. . .
But all major packages support Python 3 already

Installation
Several options, ranging from simple to complex.
We will use Anaconda here, which will get you up and running quickly.
On Linux and Mac you can also install Python with your package manager.
Use virtualenv to isolate Python environments (not covered here).

Installing Anaconda
https://www.continuum.io/downloads

Installing Anaconda (2)

Jupyter

Python syntax in 2 minutes
Convert input into uppercase;
Cuts anything longer than 10 characters;
Adds extra spaces if shorter than 10 characters;
Add single quotes.
>>> quote_pad_string("This is rather long")
’THIS IS RA’
>>> quote_pad_string("Short")
’SHORT ’

Python syntax in 2 minutes (2)
Function:
# WARNING: This is purely to cover some basic python syntax
# there are better ways to do this in Python
def quote_pad_string(a_string):
maximum_length = 10
num_missing_characters = maximum_length - len(a_string)
if num_missing_characters < 0:
num_missing_characters = 0
if num_missing_characters:
for i in range(num_missing_characters):
a_string = a_string + " "
else:
a_string = a_string[:maximum_length]
return "’" + a_string.upper() + "’"

def deﬁnes the body of a function. Python is dynamically typed:
maximum_length = 10
else:

Python uses indentation for code blocks instead of curly braces:
maximum_length = 10
else:

‘=’ for assignment:
maximum_length = 10
else:

‘if’ statement:
maximum_length = 10
else:

‘for’ statement:
maximum_length = 10
else:

len(a_string) is a function call, a_string.upper() is a method invocation:
maximum_length = 10
else:

Section 2
Vectors

Scalars - R
In R there are no real scalar types. They are just vectors of length 1:
> a <- 5 # Equivalent to a <- c(5)
> a
[1] 5
> length(a)
[1] 1

Scalars - Python
In Python scalars and vectors are not the same thing:
>>> a = 5 # Scalar
5
>>> b = np.array([5]) # Array with one element
array([5])
>>> len(b) # Equivalent to ’length’ in R
1
This won’t work:
>>> len(a)
TypeError: object of type ’int’ has no len()

Vectors, matrices and arrays - R
In R, there’s ‘c’ for 1d vectors, ‘matrix’ for 2 dimensions, and ‘array’ for higher-order
dimensions:
> c(1,2,3,4)
[1] 1 2 3 4
> matrix(1:4, nrow=2,ncol=2)
[,1] [,2]
[1,] 1 3
[2,] 2 4
> array(1:3, c(2,4,6))
...
Strangely enough, a 1d array is not the same as a vector:
> a <- as.array(1:3)
[1] 1 2 3
> is.vector(a)
[1] FALSE

Vectors, matrices and arrays - Python
Python has no builtin vector or matrix type. You will need numpy:
>>> import numpy as np # Equivalent to ’library(numpy)’ in R.
>>> np.array([1, 2, 3, 4]) # 1d vector
array([1, 2, 3, 4])
>>> np.array([[1, 2], [3, 4]]) # matrix
array([[1, 2],
[3, 4]])
np.array works with any dimension and it’s a single type (ndarray).
(There’s also a matrix type speciﬁcally for two dimensions but it should be avoided.
Always use ndarray.)

Generating regular sequences
R
> 5:10 # Shortend for seq
[1] 5 6 7 8 9 10
> seq(0, 1, length.out = 11)
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Python
>>> np.arange(3.0)
array([ 0., 1., 2.])
>>> np.arange(3, 7)
array([3, 4, 5, 6])
>>> np.arange(3, 7, 2)
array([3, 5])

Vector operations - Python
For the most part, vector operations in Python work just like in R:
>>> a = np.arange(5.0)
array([ 0., 1., 2., 3., 4.])
>>> 1.0 + a # Adding
array([ 1., 2., 3., 4., 5.])
>>> a * a # Multiplying element wise
array([ 0., 1., 4., 9., 16.])
>>> a ** 3 # to the power of 3
array([ 0., 1., 8., 27., 64.])

Vector operations - Python (2)
For matrix multiplication use the ‘@’ operator:
>> a = np.array([[1, 0], [0, 1]])
array([[1, 0],
[0, 1]])
>> b = np.array([[4, 1], [2, 2]])
array([[4, 1],
[2, 2]])
>> a @ b
array([[4, 1],
[2, 2]])
(In Python 2 use np.dot(a,b).)

Vector operations - Python (3)
Numpy also has the usual mathematical operations that work on vectors:
>> a = np.arange(5.0)
array([ 0., 1., 2., 3., 4.])
>> np.sin(a)
array([ 0., 0.84147098, 0.90929743, 0.14112001, -0.7568025 ])
Full reference here:
http://docs.scipy.org/doc/numpy/reference/routines.math.html

Recycling - R
When doing vector operations, R automatically extends the smallest element to be as
large as the other:
> c(1,2) + c(1,2,3,4) # Equivalent to c(1,2,1,2) + c(1,2,3,4)
[1] 2 4 4 6
You can do this even if the lengths aren’t multiples of one another, albeit with a
warning:
> c(1,2) + c(1,2,3,4,5)
[1] 2 4 4 6 6
Warning message:
In c(1, 2) + c(1,2,3,4,5) :
longer object length is not a multiple of shorter object length

Recycling - Python
This won’t work in Python however:
>> np.arange(2.0) + np.arange(4.0)
----------------------------------------
ValueError: operands could not be broadcast together with shapes
(2,) (4,)
Numpy has much more strict recycling (aka broadcasting) rules.

Broadcasting Rules - Python
2x3 and 2x1:
0 1 2
3 4 5
+
0
1

2x3 and 2x1:
0 1 2
3 4 5
+
0
1
=
0 1 2
3 4 5
+
0 0 0
1 1 1

2x3 and 2x1:
0 1 2
3 4 5
+
0
1
=
0 1 2
3 4 5
+
0 0 0
1 1 1
2x3 and 1x3:
0 1 2
3 4 5
+ 0 1 2

2x3 and 2x1:
0 1 2
3 4 5
+
0
1
=
0 1 2
3 4 5
+
0 0 0
1 1 1
2x3 and 1x3:
0 1 2
3 4 5
+ 0 1 2 =
0 1 2
3 4 5
+
0 1 2
0 1 2

2x3 and 2x1:
0 1 2
3 4 5
+
0
1
=
0 1 2
3 4 5
+
0 0 0
1 1 1
2x3 and 1x3:
0 1 2
3 4 5
+ 0 1 2 =
0 1 2
3 4 5
+
0 1 2
0 1 2
Adding a single element array or a scalar always works:
0 1 2
3 4 5
+ 0

2x3 and 2x1:
0 1 2
3 4 5
+
0
1
=
0 1 2
3 4 5
+
0 0 0
1 1 1
2x3 and 1x3:
0 1 2
3 4 5
+ 0 1 2 =
0 1 2
3 4 5
+
0 1 2
0 1 2
0 1 2
3 4 5
+ 0 =
0 1 2
3 4 5
+
0 0 0
0 0 0

2x3 and 2x1:
0 1 2
3 4 5
+
0
1
=
0 1 2
3 4 5
+
0 0 0
1 1 1
2x3 and 1x3:
0 1 2
3 4 5
+ 0 1 2 =
0 1 2
3 4 5
+
0 1 2
0 1 2
0 1 2
3 4 5
+ 0 =
0 1 2
3 4 5
+
0 0 0
0 0 0
This won’t work (the dimensions need to match exactly or be 1):
0 1 2 3 + 0 1

Indexing - i:j:k syntax
a = np.arange(10)
a = 0 1 2 3 4 5 6 7 8 9

a = np.arange(10)
a = 0 1 2 3 4 5 6 7 8 9
Indexing in Python starts from 0 (not 1):
>>> a[0]
0.0
Indexing on a single value returns a scalar (not an array!)

a = np.arange(10)
a = 0 1 2 3 4 5 6 7 8 9
Use ‘i:j’ to index from position i to j-1:
>>> a[1:3]
array([ 1, 2])

a = np.arange(10)
a = 0 1 2 3 4 5 6 7 8 9
An optional ‘k’ element deﬁnes the step:
>>> a[1:7:2]
array([1, 3, 5])

a = np.arange(10)
a = 0 1 2 3 4 5 6 7 8 9
i and j can be negative, which means they will start counting from the last:
>>> a[1:-3]
array([1, 2, 3, 4, 5, 6])

a = np.arange(10)
a = 0 1 2 3 4 5 6 7 8 9
i and j can be negative, which means they will start counting from the last:
>>> a[-3:-1]
array([7, 8])

a = np.arange(10)
a = 0 1 2 3 4 5 6 7 8 9
While a negative k will go in the opposite direction:
>>> a[-3:-9:-1]
array([7, 6, 5, 4, 3, 2])

a = np.arange(10)
a = 0 1 2 3 4 5 6 7 8 9
Not all need to be included:
>>> a[::-1]
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

Indexing - multiple dimensions
Use ‘,’ for additional dimensions:
x =
1 2 3
4 5 6
>>> x[0:2, 0:1]
array([[1],
[4]])

Indexing using conditions
Operators like ‘>’ or ‘<=’ operate element-wise and return a logical vector:
>>> a > 4
array([False, False, False, False, False, True, True,
True, True, True], dtype=bool)
These can be combined into more complex expressions:
>> (a > 2) && (b ** 2 <= a)
...
And used as indexing too:
>> a[a > 4]
array([5, 6, 7, 8, 9])

Assignment
Any index can be used together with ‘=’ for assignment:
>>> a[0] = 10
array([10, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Conditions work as well:
>>> a[a > 4] = 99
array([99, 1, 2, 3, 4, 99, 99, 99, 99, 99])

Section 3
Data Frames

Pandas
Data frames in Python aren’t builtin. You will need pandas:
import pandas as pd
Loading the iris dataset:
>>> iris = pd.read_csv("""https://raw.githubusercontent.com/pydata/
pandas/master/pandas/tests/data/iris.csv""")
>>> iris.head()
SepalLength SepalWidth PetalLength PetalWidth Name
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

Selection
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
. . . . . . . . . . . . . . . . . .

Selection
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
. . . . . . . . . . . . . . . . . .
‘.<columnName>’ returns only that column:
>>> iris.SepalLength
0 5.1
1 4.9
2 4.7
...

Selection
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
. . . . . . . . . . . . . . . . . .
This also works:
>>> iris["SepalLength"]
...

Selection
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
. . . . . . . . . . . . . . . . . .
You can select multiple columns by passing a list:
>>> iris[["SepalWidth", "SepalLength"]]
SepalWidth SepalLength
0 3.5 5.1
1 3.0 4.9
...

Selection
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
. . . . . . . . . . . . . . . . . .
Or if you pass a slice you can select rows:
>>> iris[1:3]
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa

Selection
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
. . . . . . . . . . . . . . . . . .
With loc you can slice both rows and columns:
>>> iris.loc[1:3, ["SepalLength", "SepalWidth"]]
SepalLength SepalWidth
1 4.9 3.0
2 4.7 3.2
3 4.6 3.1
loc is inclusive at the end.

Selection
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
. . . . . . . . . . . . . . . . . .
As well as only one single row:
>>> iris.loc[3]
SepalLength 4.6
SepalWidth 3.1
PetalLength 1.5
PetalWidth 0.2
Name Iris-setosa
Name: 3, dtype: object

Selection
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
. . . . . . . . . . . . . . . . . .
iloc works with integer indices, similar to numpy arrays:
>>> iris.iloc[0:2, 0:3]
SepalLength SepalWidth PetalLength
0 5.1 3.5 1.4
1 4.9 3.0 1.4

Selection
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
. . . . . . . . . . . . . . . . . .
‘:’ will include all the rows (or all the columns):
>>> iris.iloc[0:2, :]
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa

Selection
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
. . . . . . . . . . . . . . . . . .
Or you can use conditions like numpy:
>>> iris[iris.SepalLength < 5]
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
...

Selection
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
. . . . . . . . . . . . . . . . . .
Picking a single value returns a scalar:
>>> iris.iloc[0,0]
5.0999999999999996
Normally it’s better to use at or iat (faster).

Assignment
Assignment works as expected:
>>> iris.loc[iris.SepalLength > 7.6, "Name"] = "Iris-orlando"
Beware that this doesn’t work:
>> iris[iris.SepalLength > 7.6].Name = "Iris-orlando"
SettingWithCopyWarning: A value is trying to be set on a copy
of a slice from a DataFrame.

Operations
Operators work as expected:
>>> iris["SepalD"] = iris["SepalLength"] * iris["SepalWidth"]
There’s also apply:
>>> iris[["SepalLength", "SepalWidth"]].apply(np.sqrt)
SepalLength SepalWidth
0 2.258318 1.870829
1 2.213594 1.732051
...
Use axis=1 to apply function to each row.

SQL-like operations - group by
>>> iris.groupby("Name").mean()
SepalLength SepalWidth PetalLength PetalWidth
Name
Iris-setosa 5.006 3.418 1.464 0.244
Iris-versicolor 5.936 2.770 4.260 1.326
Iris-virginica 6.588 2.974 5.552 2.026

SQL-like operations (2) - join
>> left
key lval
0 foo 1
1 foo 2
>> right
key rval
0 foo 4
1 foo 5
>> pd.merge(left, right, on=’key’)
key lval rval
0 foo 1 4
1 foo 1 5
...

Time Series
Pandas data frames can also work as time series, replacing R’s ts, xts or zoo.
Downloading some stock data from Google ﬁnance:
>>> import pandas.io.data as web
>>> import datetime
>>> aapl = web.DataReader("AAPL", ’google’,
datetime.datetime(2013, 1, 1),
datetime.datetime(2014, 1, 1))
Open High Low Close Volume
Date
2013-01-02 79.12 79.29 77.38 78.43 140124866
2013-01-03 78.27 78.52 77.29 77.44 88240950
...
Time series are just regular pandas data frames but with time stamps as indices.

Time Series (2)
Use loc to select based on dates:
>>> aapl.loc[’20130131’:’20130217’]
Date
2013-01-31 65.28 65.61 65.00 65.07 79833215
2013-02-01 65.59 65.64 64.05 64.80 134867089
...
Use iloc as before for selecting based on numerical indices:
>>> aapl.iloc[1:3]
Date
2013-01-03 78.27 78.52 77.29 77.44 88240950
2013-01-04 76.71 76.95 75.12 75.29 148581860

Section 4
Analysis

Statistical tests
Use scipy.stats for common statistical tests:
>>> from scipy import stats
>>> iris_virginica = iris[iris.Name == ’Iris-virginica’].SepalLength.values
>>> iris_setosa = iris[iris.Name == ’Iris-setosa’].SepalLength.values
>>> t_test = stats.ttest_ind(iris_virginica, iris_setosa)
>>> t_test.pvalue
6.8925460606740589e-28
Use scikits.bootstrap for bootstrapped conﬁdence intervals.

Ordinary Least Squares
Use statsmodels:
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
The formula API is very similar to R:
>>> results = smf.ols("PetalWidth ~ Name + PetalLength", data=iris).fit()
It automatically includes an intercept (just like R).
Use smf.glm for generalized linear models.

Formula API
Very similar to R:
You can include arbitrary transformations, e.g. “np.log(PetalWidth)”.
To remove the intercept add a “- 1” or “0 +”
Use “C(a)” to coerce a number to a factor
Use “a:b” for modelling interactions between a and b.
“a*b” means “a + b + a:b”
Strings are automatically coerced to factors (more on this later)

Decision trees
Use scikit-learn:
>>> from sklearn import tree
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(sk_iris.data, sk_iris.target)
After being ﬁtted, the model can then be used to predict the class of samples:
>>> clf.predict([[5.1, 3.5, 1.4, 0.2]])
array([0])

Support Vector Machines
scikit-learn has a very regular API. Here’s the same example using an SVM:
>>> from sklearn import svm
>>> clf_svm = svm.SVC()
>>> clf_svm = clf_svm.fit(sk_iris.data, sk_iris.target)
>>> clf_svm.predict([[5.1, 3.5, 1.4, 0.2]])
array([0])

K-Means clustering
Clustering follows the same pattern:
>>> from sklearn import cluster
>>> k_means = cluster.KMeans(n_clusters=3)
>>> k_means.fit(sk_iris.data)
KMeans(copy_x=True, init=’k-means++’, ...
labels_ contains the assigned categories, following the same order as the data:
>>> k_means.labels_
array([1, 1, 1, 1, 1...
predict works the same as for the other models, and returns the predicted category.

Principal Component Analysis
from sklearn import decomposition
pca = decomposition.PCA(n_components=3)
pca = pca.fit(sk_iris.data)
explained_variance_ratio_ and components_ will include the explained variance
and the PCA components respectively:
>>> pca.explained_variance_ratio_
array([ 0.92461621, 0.05301557, 0.01718514])
>>>pca.components_
array([[ 0.36158968, -0.08226889, 0.85657211, 0.35884393],
[-0.65653988, -0.72971237, 0.1757674 , 0.07470647],
[ 0.58099728, -0.59641809, -0.07252408, -0.54906091]])

Cross validation
scikit-learn also includes extensive support for cross-validation. Here’s a simple split
into training and out-of-sample:
>>> from sklearn import cross_validation
>>> X_train, X_test, y_train, y_test = cross_validation.train_test_split(
... sk_iris.data, sk_iris.target, test_size=0.4, random_state=0)
>>> X_train.shape, y_train.shape
((90, 4), (90,))
>>> X_test.shape, y_test.shape
((60, 4), (60,))
It also supports K-fold, stratiﬁed K-fold, shuﬄing, etc. . .

NAs
There’s no builtin NA in Python. You normally use NaN for NAs. numpy has a bunch
of builtin functions to ignore NaNs:
>>> a = np.array([1.0, 3.0, np.NaN, 5.0])
>>> a.sum()
nan
>>> np.nansum(a)
9.0
Pandas usually ignores NaNs when computing sums, means, etc.. but propagates
them accordingly.
scikit-learn assumes there’s no missing data so be sure to pre-process them, e.g.
remove them or set them to 0. Look at sklearn.preprocessing.Imputer
statsmodels also use NaNs for missing data, but only has basic support for
handling them (it can only ignore them or raise an error). See the missing
attribute in the model class.

Factors
Similarly to NAs, Python has no builtin factor data type.
Diﬀerent packages handle them diﬀerently:
numpy has no support for factors. Use integers.
Pandas has categoricals, which work fairly similar to factors
statsmodels convert strings to their own internal factor type, very similar to R.
There’s also the ‘C’ operator.
scikit-learn doesn’t support factors internally, but has some tools to convert strings
into dummy variables, e.g. DictVectorizer

Notorious Omissions
Bayesian modelling
Time series analysis
Econometrics
Signal processing, i.e. ﬁlter design
Natural language processing
. . .

Section 5
Visualization

Visualization

Section 6
I/O

Pandas
read_csv Reads data from CSV ﬁles
>>> pd.read_csv(’foo.csv’)
Unnamed: 0 A B C D
0 2000-01-01 0.266457 -0.399641 -0.219582 1.186860
1 2000-01-02 -1.170732 -0.345873 1.653061 -0.282953
...
Conversely there is to_csv to write CSV ﬁles:
In [136]: df.to_csv(’foo.csv’)

Other options
For data frames:
HDF5: read_hdf5, to_hdf5
Excel: read_excel, to_excel
SQL: read_sql, to_sql
Stata: read_stata, to_stata
SAS: read_sas, to_sas
REST APIs: read_json or alternatively use requests
For numpy arrays:
You can use load and save for saving into .npy format
Normally I prefer to use HDF5 with the h5py library

h5py - datasets
Creating a data set:
>>> import h5py
>>> import numpy as np
>>>
>>> f = h5py.File("mytestfile.hdf5", "w")
>>> dset = f.create_dataset("mydataset", (100,), dtype=’i’)
Datasets work similarly to numpy arrays:
>>> dset[...] = np.arange(100)
>>> dset[0]
0
>>> dset[10]
10
>>> dset[0:100:10]
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])

Other options
pickle - Python standard serialization format. See also shelve.
tinydb - local document-oriented database (good for NLP tasks)
sqlalchemy - Heavy-duty SQL to relational mapper.

Section 7
Conclusion

Things I haven’t covered:
Python data structures: dicts, lists
Python - R interoperability: RPy
Parallel computing: IPython.parallel, pyspark
Optimizing python code: Cython, numba, numexpr
Hope you’ve enjoyed. Feel free to get in touch: amatos@lambdatree.com

Python for R developers and data scientists

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie Python for R developers and data scientists

Ähnlich wie Python for R developers and data scientists (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Python for R developers and data scientists