4. Python: Data Types, Assignment, Operators
Both " and ' can be used to denote strings. If the apostrophe character should be part of the
string, use " as outer boundaries:
"Barack's last name is Obama"
Alternatively, the backslash \ can be used as an escape character: 'Barack\'s last name is Obama'
Integers (int)
In [1]:
# Integers
a = 2
b = 239
Floating point numbers (float)
In [2]:
# Floats
c = 2.1
d = 239.0
Strings (str)
In [3]:
e = 'Hello world!'
my_text = 'This is Future Lab'
Boolean (bool)
In [4]:
x = True
y = False
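The built-in type() function shows which of these types a variable has; a quick sketch using the variables defined above:

```python
# Check the type of each variable with the built-in type() function
a = 2
c = 2.1
e = 'Hello world!'
x = True

print(type(a))  # <class 'int'>
print(type(c))  # <class 'float'>
print(type(e))  # <class 'str'>
print(type(x))  # <class 'bool'>
```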
Standard calculator-like operations
Most basic operations on integers and
floats such as addition, subtraction,
multiplication work as one would expect:
In [5]: 2 * 4
Out[5]: 8
In [6]: 2 / 5
Out[6]: 0.4
In [7]: 3.1 + 7.4
Out[7]: 10.5
Exponents
Exponents are denoted by **:
In [8]: 2**3
Out[8]: 8
Floor division
Floor division is denoted by //. It returns
the integer part of a division result
(removes decimals after division):
In [9]: 10 // 3
Out[9]: 3
Modulo
Modulo is denoted by %. It returns the
remainder after a division:
In [10]: 10 % 3
Out[10]: 1
Operations on strings
Strings can be added (concatenated) by
use of the addition operator +:
In [11]: 'Bruce' + ' ' + 'Wayne'
Out[11]: 'Bruce Wayne'
Multiplication is also allowed:
In [12]: 'a' * 3
Out[12]: 'aaa'
5. Python: Control Flow
In Python, code blocks are separated by use of indentation. See the definition of an if-statement below:
Syntax of conditional blocks
if condition:
    # Code goes here (must be indented!)
    # Otherwise, IndentationError will be thrown
# Code placed here is outside of the if-statement
Where evaluation of condition must return a boolean (True or False).
Remember:
1. The : must be present after condition.
2. The line immediately after : must be indented.
3. The if-statement is exited by reverting the indentation as shown
above.
This is how Python interprets the code as a block.
The same indentation rules are required for all types of code blocks, the if-block above is just an
example. Examples of other types of code blocks are for and while loops, functions etc.
Most editors will automatically add the indentation upon hitting Enter after the :, so it doesn't take long to get used to this.
if-statements
An if-statement has the following syntax:
In [13]:
x = 2
if x > 1:
    print('x is larger than 1')
if / else-statements
In [14]:
y = 1
if y > 1:
    print('y is larger than 1')
else:
    print('y is less than or equal to 1')
if / elif / else
In [15]:
z = 0
if z > 1:
    print('z is larger than 1')
elif z < 1:
    print('z is less than 1')
else:
    print('z is equal to 1')
6. Python: Data Structures
Data structures are constructs that can contain one or more variables. They are containers that can store many values in a single entity.
Python's four basic data structures are:
● Lists
● Dictionaries
● Tuples
● Sets
Lists
Lists are defined by square brackets [] with elements separated by commas. They
can have elements of any data type.
Lists are arguably the most used data structure in Python.
List syntax
L = [item_1, item_2, ..., item_n]
Mutability
Lists are mutable. They can be changed after creation.
List
In [1]:
# List with integers
a = [10, 20, 30, 40]
# Multiple data types in the same list
b = [1, True, 'Hi!', 4.3]
# List of lists
c = [['Nested', 'lists'], ['are', 'possible']]
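Since lists are mutable, items can be replaced, added and removed after creation; a small sketch:

```python
# Lists are mutable: items can be replaced, appended and removed
a = [10, 20, 30, 40]
a[1] = 99          # replace the second element
a.append(50)       # add an element at the end
a.remove(10)       # remove the first occurrence of 10
print(a)           # [99, 30, 40, 50]
```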
7. Python: Data Structures
Dictionaries
Dictionaries have key/value pairs which are enclosed in curly brackets {}. A value can be fetched by querying the corresponding key. Referring to the data via logically named keys instead of list indexes makes the code more readable.
Dictionary syntax
d = {key1: value1, key2: value2, ..., key_n: value_n}
Note that values can be of any data type like floats, strings etc., but
they can also be lists or other data structures.
Keys must be unique within the dictionary. Otherwise it would be
hard to extract the value by calling out a certain key, see the section
about indexing and slicing below.
Keys also must be of an immutable type.
Mutability
Dictionaries are mutable. They can be changed after creation.
# Strings as keys and numbers as values
d1 = {'axial_force': 319.2, 'moment': 74, 'shear': 23}
# Strings as keys and lists as values
d2 = {'Point1': [1.3, 51, 10.6], 'Point2': [7.1, 11, 6.7]}
# Keys of different types (int and str, don't do this!)
d3 = {1: True, 'hej': 23}
The first two dictionaries above follow a consistent pattern. For d1 the keys are strings and the values are numbers. For d2 the keys are strings and the values are lists. These are well-structured dictionaries.
However, d3 has keys of mixed types! The first key is an integer and the second is a string. This is valid syntax, but not a good idea.
As with much in Python, the flexibility is nice, but it can also be confusing to have many different types mixed in the same data structure. To make code more readable, it is often preferable to keep the same pattern throughout the dictionary, i.e. all keys of the same type and all values of the same type, as in d1 and d2.
The keys and values can be extracted separately by the methods dict.keys() and
dict.values():
In [3]: d1.keys()
Out[3]: dict_keys(['axial_force', 'moment', 'shear'])
In [4]: d1.values()
Out[4]: dict_values([319.2, 74, 23])
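A minimal sketch of fetching and adding values by key, using the d1 dictionary from above:

```python
d1 = {'axial_force': 319.2, 'moment': 74, 'shear': 23}

# Fetch a value by its key
print(d1['moment'])        # 74

# Add or overwrite a key/value pair (dictionaries are mutable)
d1['torsion'] = 12.5
print(d1['torsion'])       # 12.5

# .get() returns a chosen default instead of raising KeyError for missing keys
print(d1.get('deflection', 0.0))   # 0.0
```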
8. Python: Data Structures
Tuples
Tuples are very comparable to lists, but they are defined by parentheses (). The most notable difference from lists is that tuples are immutable.
Tuple syntax
t = (item_1, item_2, ..., item_n)
Mutability
Tuples are immutable. They cannot be changed after creation.
Tuple examples
In [5]:
# Simple tuple of integers
t1 = (1, 24, 56)
# Multiple types as tuple elements
t2 = (1, 1.62, '12', [1, 2, 3])
# Tuple of tuples
points = ((4, 5), (12, 6), (14, 9))
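A short sketch of tuple immutability: reading by index works just like for lists, but assignment raises a TypeError:

```python
t1 = (1, 24, 56)

# Reading by index works just like for lists
print(t1[0])   # 1

# But assignment fails because tuples are immutable
try:
    t1[0] = 99
except TypeError as err:
    print(err)   # 'tuple' object does not support item assignment
```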
Sets
Sets are defined with curly brackets {}. They are unordered and don't have an index (see the description of indexing further down). Set items are also unique.
Set syntax
s = {item_1, item_2, ..., item_n}
The primary idea about sets is the ability to perform set operations.
These are known from mathematics and can determine the union,
intersection, difference etc. of two given sets.
A list, string or tuple can be converted to a set by set(sequence_to_convert). Since sets only hold unique items, the resulting set has the same values as the input sequence, but with duplicates removed. This can be a way to create a list with only unique elements.
For example:
# Convert list to set and back to list again, now with only unique elements
list_uniques = list(set(list_with_duplicates))
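A minimal sketch of the set operations mentioned above (union, intersection, difference), plus the duplicate-removal trick:

```python
s1 = {1, 2, 3, 4}
s2 = {3, 4, 5, 6}

print(s1 | s2)   # union: {1, 2, 3, 4, 5, 6}
print(s1 & s2)   # intersection: {3, 4}
print(s1 - s2)   # difference: {1, 2}

# Removing duplicates from a list via a set
list_with_duplicates = [1, 1, 2, 3, 3, 3]
list_uniques = list(set(list_with_duplicates))
print(sorted(list_uniques))   # [1, 2, 3]
```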
9. Python: Functions
A function is a block of code that is first defined, and thereafter can be
called to run as many times as needed. A function might have
arguments, some of which can be optional if a default value is
specified.
A function is called by parentheses: function_name(). Arguments are placed inside the parentheses and comma-separated if there is more than one, similar to f(x, y) from mathematics.
A function can return one or more values to the caller. The values to return are put in the return statement. When the code hits a return statement, the function terminates. If no return statement is given, the function returns None.
def function_name(arg1, arg2, default_arg1=0, default_arg2=None):
    '''This is the docstring.

    The docstring explains what the function does, so it is like a multiline comment. It does not
    have to be here, but it is good practice to use docstrings to document the code. They are
    especially useful for more complicated functions, although functions should in general be
    kept as simple as possible.
    Arguments could be explained together with their types (e.g. strings, lists, dicts etc.).
    '''
    # Function code goes here

    # Possible 'return' statement terminating the function. If 'return' is not specified,
    # the function returns None.
    return return_val1, return_val2
If multiple values are to be returned, they can be separated by commas as shown. The returned entity will by default be a tuple.
Note that when using default arguments, it is good practice to only use immutable types. An example further below will demonstrate why this is recommended.
In [5]:
def say_hello_to(name):
    ''' Say hello to the input name '''
    print(f'Hello {name}')

say_hello_to('Anders')      # <--- Calling the function prints 'Hello Anders'

r = say_hello_to('Anders')  # <--- Prints 'Hello Anders' and assigns None to r
print(r)                    # <--- Prints None, since the function has no return statement
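The note above recommends immutable types for default arguments. A minimal sketch of why, using a hypothetical append_item function (the names are made up for illustration): a mutable default is created once, at function definition, and shared between all calls.

```python
def append_item(item, target=None):
    '''Append item to target; create a new list if none is given.'''
    if target is None:
        target = []        # a fresh list on every call
    target.append(item)
    return target

# The None-default version behaves as expected
print(append_item(1))   # [1]
print(append_item(2))   # [2]

def append_item_bad(item, target=[]):
    '''The mutable default is created ONCE and shared between calls!'''
    target.append(item)
    return target

print(append_item_bad(1))   # [1]
print(append_item_bad(2))   # [1, 2]  <-- surprise: the previous call leaked in
```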
14. Numpy: Data Types
Numerical types:
• integers (int)
• unsigned integers (uint)
• floating point (float)
• complex
Other data types:
• booleans (bool)
• string
• datetime
• Python object
Data Type Description
bool_ Boolean (True or False) stored as a byte
int8 Byte (-128 to 127)
int16 Integer (-32768 to 32767)
int32 Integer (-2.15E+9 to 2.15E+9)
int64 Integer (-9.22E+18 to 9.22E+18)
uint8 Unsigned integer (0 to 255)
uint16 Unsigned integer (0 to 65535)
uint32 Unsigned integer (0 to 4.29E+9)
uint64 Unsigned integer (0 to 1.84E+19)
float16 Half precision signed float
float32 Single precision signed float
float64 Double precision signed float
complex64 Complex number: two 32-bit floats (real and imaginary components)
complex128 Complex number: two 64-bit floats (real and imaginary components)
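A short sketch of choosing and converting dtypes; the overflow example at the end shows why the value ranges in the table matter:

```python
import numpy as np

# The dtype can be chosen explicitly when creating an array
a = np.array([1, 2, 3], dtype=np.int32)
b = np.array([1, 2, 3], dtype=np.float64)
print(a.dtype)   # int32
print(b.dtype)   # float64

# astype() converts an existing array (returns a copy)
c = a.astype(np.float32)
print(c.dtype)   # float32

# Beware of overflow with small integer types: uint8 wraps around at 256
d = np.array([250], dtype=np.uint8)
print(d + np.uint8(10))   # [4]
```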
16. Numpy: Indexing & Slicing
One-dimensional arrays are simple; on the surface they act similarly to Python lists:
import numpy as np

arr = np.arange(10)
print(arr)       # [0 1 2 3 4 5 6 7 8 9]
print(arr[5])    # 5
print(arr[5:8])  # [5 6 7]
arr[5:8] = 12
print(arr)       # [ 0  1  2  3  4 12 12 12  8  9]
As you can see, if you assign a scalar value to a slice, as in
arr[5:8] = 12, the value is propagated (or broadcasted) to
the entire selection.
An important first distinction from Python’s built-in lists is
that array slices are views on the original array.
This means that the data is not copied, and any
modifications to the view will be reflected in the source
array.
arr = np.arange(10)
print(arr)        # [0 1 2 3 4 5 6 7 8 9]

arr_slice = arr[5:8]
print(arr_slice)  # [5 6 7]

arr_slice[1] = 12345
print(arr)        # [    0     1     2     3     4     5 12345     7     8     9]

arr_slice[:] = 64
print(arr)        # [ 0  1  2  3  4 64 64 64  8  9]
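If a view is not wanted, an explicit copy() detaches the slice from the source array; a minimal sketch:

```python
import numpy as np

arr = np.arange(10)

# An explicit copy detaches the slice from the source array
arr_copy = arr[5:8].copy()
arr_copy[:] = 64
print(arr)        # [0 1 2 3 4 5 6 7 8 9]  <-- unchanged
print(arr_copy)   # [64 64 64]
```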
19. Pandas: Dataframe Methods & Attributes
df.attribute description
dtypes list the types of the columns
columns list the column names
axes list the row labels and column names
ndim number of dimensions
size number of elements
shape return a tuple representing the dimensionality
values numpy representation of the data
df.method() description
head( [n] ), tail( [n] ) first/last n rows
describe() generate descriptive statistics (for numeric columns only)
max(), min() return max/min values for all numeric columns
mean(), median() return mean/median values for all numeric columns
std() standard deviation
sample([n]) returns a random sample of the data frame
dropna() drop all the records with missing values
Unlike attributes, Python methods have parentheses.
All attributes and methods can be listed with the dir() function: dir(df)
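A quick sketch of the attribute/method distinction on a small made-up DataFrame (the column names and values are invented for the example):

```python
import pandas as pd

# A small illustrative DataFrame
df = pd.DataFrame({'name': ['Ann', 'Bob', 'Eva'],
                   'salary': [95_000, 120_000, 130_000]})

print(df.shape)              # (3, 2)  <-- attribute: no parentheses
print(df.columns.tolist())   # ['name', 'salary']
print(df.head(2))            # method: parentheses, first 2 rows
print(df['salary'].mean())   # 115000.0
```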
21. Pandas: Dataframe Data Types
Pandas Type Native Python Type Description
object string The most general dtype. Will be assigned to your column if the column has mixed types (numbers and strings).
int64 int Numeric values. 64 refers to the number of bits allocated to hold this value.
float64 float Numeric values with decimals. If a column contains numbers and NaNs (see below), pandas will default to float64, in case your missing value has a decimal.
datetime64, timedelta[ns] N/A (but see the datetime module in Python’s standard library) Values meant to hold time data. Look into these for time series experiments.
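A short sketch of these dtypes in practice, using a small made-up DataFrame; note how a NaN forces the column to float64:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [1.5, np.nan, 3.0],
                   'c': ['x', 'y', 'z']})

print(df.dtypes)
# a      int64    <- whole numbers
# b    float64    <- the NaN forces float64
# c     object    <- strings

# astype() converts a column explicitly
df['a'] = df['a'].astype('float64')
print(df['a'].dtype)   # float64
```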
22. Pandas: Dataframe Group By
Using the "group by" method we can:
- Split the data into groups based on some criteria
- Calculate statistics (or apply a function) for each group
Once a groupby object is created, we can calculate various statistics for each group.
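A minimal groupby sketch on made-up data (the dept/salary columns are invented for illustration):

```python
import pandas as pd

# Hypothetical data: employees, their department and salary
df = pd.DataFrame({'dept':   ['IT', 'IT', 'HR', 'HR', 'HR'],
                   'salary': [100, 120, 80, 90, 85]})

grouped = df.groupby('dept')

print(grouped['salary'].mean())
# dept
# HR     85.0
# IT    110.0

print(grouped.size())
# dept
# HR    3
# IT    2
```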
23. Pandas: Dataframe Filtering
To subset the data we can apply Boolean indexing.
This indexing is commonly known as a filter.
For example, if we want to subset the rows in which the salary value is greater than $120K:
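A minimal sketch of such a filter on a made-up DataFrame (the names and salary figures are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'name':   ['Ann', 'Bob', 'Eva'],
                   'salary': [95_000, 125_000, 130_000]})

# Boolean mask: True where the condition holds
mask = df['salary'] > 120_000
print(mask.tolist())   # [False, True, True]

# Using the mask as a filter keeps only the matching rows
print(df[mask])
#   name  salary
# 1  Bob  125000
# 2  Eva  130000
```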
Any Boolean operator can be used to subset the data:
> greater; >= greater or equal;
< less; <= less or equal;
== equal; != not equal;
24. Pandas: Dataframe Selecting Rows
If we need to select a range of rows, we can specify the range using ":"
Notice that the first row has position 0, and the last value in the range is excluded:
So for the 0:10 range, the first 10 rows are returned, with positions starting at 0 and ending at 9.
The iloc method
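A short iloc sketch on a made-up single-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'x': range(100, 120)})   # 20 rows, positions 0..19

# iloc selects by integer position; the end of the range is excluded
print(df.iloc[0:10].shape)   # (10, 1)  <-- rows at positions 0..9
print(df.iloc[0]['x'])       # 100      <-- first row
print(df.iloc[9]['x'])       # 109      <-- last row of the 0:10 range
```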
25. Pandas: Dataframe Common Aggregates
Aggregation - computing a summary statistic about each group, e.g.:
- compute group sums or means
- compute group sizes/counts
Common aggregation functions:
- min, max
- count, sum, prod
- mean, median, mode, mad
- std, var
df.method() description
describe Basic statistics (count, mean, std, min, quantiles, max)
min, max Minimum and maximum values
mean, median, mode Arithmetic average, median and mode
var, std Variance and standard deviation
sem Standard error of mean
skew Sample skewness
kurt Sample kurtosis
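A minimal sketch combining groupby with several of the aggregation methods above via agg() (the data is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'dept':   ['IT', 'IT', 'HR', 'HR'],
                   'salary': [100, 120, 80, 90]})

# agg() applies several aggregation functions at once per group
summary = df.groupby('dept')['salary'].agg(['min', 'max', 'mean', 'count'])
print(summary)
#       min  max   mean  count
# dept
# HR     80   90   85.0      2
# IT    100  120  110.0      2
```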
28. Intro ML
Arthur Samuel, a pioneer in the field of artificial intelligence and computer gaming, coined the term “Machine Learning”, defining it as the “field of study that gives computers the capability to learn without being explicitly programmed”.
How it differs from traditional programming:
- In traditional programming, we feed in the input and the program logic, and run the program to get the output.
- In machine learning, we feed in the input and the output and run it on the machine during training; the machine creates its own logic, which is then evaluated during testing.
29. Intro ML: Terminology
Terminologies that one should know before starting Machine Learning:
- Model: A model is a specific representation learned from data by applying some
machine learning algorithm. A model is also called a hypothesis.
- Feature: A feature is an individual measurable property of our data. A set of numeric
features can be conveniently described by a feature vector. Feature vectors are fed as
input to the model. For example, in order to predict a fruit, there may be features like
color, smell, taste, etc.
- Target(Label): A target variable or label is the value to be predicted by our model. For
the fruit example discussed in the features section, the label with each set of input
would be the name of the fruit like apple, orange, banana, etc.
- Training: The idea is to give a set of inputs (features) and their expected outputs (labels), so after training, we will have a model (hypothesis) that will then map new data to one of the categories trained on.
- Prediction: Once our model is ready, it can be fed a set of inputs to which it will provide
a predicted output(label).
30. Intro ML: Type of Learning
- Supervised Learning
- Unsupervised Learning
- Semi-Supervised Learning
1. Supervised Learning:
Supervised learning is when the model is trained on a labelled dataset. A labelled dataset is one which has both input and output parameters. In this type of learning, both training and validation datasets are labelled, as shown in the figures below.
Types of Supervised Learning:
- Classification
- Regression
31. Intro ML: Type of Learning
2. Unsupervised Learning:
Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. Here the task of the machine is to group unsorted information according to similarities, patterns and differences without any prior training on the data. Unsupervised machine learning is more challenging than supervised learning due to the absence of labels.
Types of Unsupervised Learning:
- Clustering
- Association
3. Semi-supervised machine learning:
To counter the disadvantages of the previous approaches, the concept
of Semi-Supervised Learning was introduced.
In this type of learning, the algorithm is trained
upon a combination of labeled and unlabeled
data. Typically, this combination will contain a
very small amount of labeled data and a very
large amount of unlabeled data.
38. Intro ML: Classification
Formally, given a training set (x_i, y_i) for i = 1…n, we want to create a
classification model f that can predict the label y for a new x.
The machine learning algorithm will create the function f.
The predicted label y for a new x is sign(f(x)).
Classification?
- Yes/No questions – binary classification
- automatic handwriting recognition, speech recognition, biometrics, document
classification, spam detection, predicting credit default risk, detecting credit
card fraud, predicting customer churn, predicting medical outcomes (strokes,
side effects, etc.)
41. Intro Scikit-Learn: Why?
A. Simple and efficient tools for predictive data analysis
- Machine Learning methods
- Data processing
- Visualization
B. Accessible to everybody, and reusable in various contexts
- Documented API with lots of examples
- Not bound to Training frameworks (e.g. Tensorflow, Pytorch)
- Building blocks for your data analysis
C. Built on NumPy, SciPy, and matplotlib
- No own data types (unlike Pandas)
- Benefit from NumPy and SciPy optimizations
- Extends the most common visualization tool
Open source, commercially usable - BSD license
Version 1.0 since September 2021
•https://scikit-learn.org/stable/
42. Intro Scikit-Learn: Tools
A. Classification:
Categorizing objects to one or more classes.
- Support Vector Machines (SVM)
- Nearest Neighbors
- Random Forest
- . . .
B. Regression:
Prediction of one (uni-) or more (multi-variate) continuous-valued attributes.
- Support Vector Regression (SVR)
- Nearest Neighbors
- Random Forest
- . . .
C. Clustering:
Group objects of a set.
- k-Means
- Spectral Clustering
- Mean-Shift
- . . .
D. Dimensionality reduction:
Reducing the number of random variables.
- Principal Component Analysis (PCA)
- Feature Selection
- Non-negative Matrix Factorization
- . . .
E. Model selection:
Compare, validate and choose parameters/models.
- Grid Search
- Cross Validation
- . . .
F. Pre-Processing:
Prepare/transform data before training models.
- Conversion
- Normalization
- Feature Extraction
43. Intro Scikit-Learn: Supervised ML Flow
Easy install via pip or conda for Windows, macOS and Linux, e.g.:
$ pip install scikit-learn or
$ conda install -c intel scikit-learn
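A minimal fit/predict sketch of the supervised flow, using scikit-learn's k-nearest-neighbours classifier on the bundled iris dataset (the model choice is just for illustration):

```python
# Minimal supervised ML flow with scikit-learn: split, fit, score
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)              # train on the labelled training set
accuracy = model.score(X_test, y_test)   # evaluate on the held-out test set
print(round(accuracy, 2))                # typically well above 0.9 on iris
```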
•In semi-supervised learning, labelled data is used to learn a model; that model is then used to label the unlabeled data (called pseudo-labelling), and finally the whole dataset is used to train the model for further use.
If we want to teach the computer to recognize images of chairs, then we give the computer a whole bunch of images, and tell it which ones are chairs and which are not, and then it’s supposed to learn to recognize chairs, even ones it hasn’t seen before. It’s not like we tell the computer how to recognize a chair, we don’t tell it “a chair has 4 legs and a back and a flat surface to sit on and so on”, we just give it a lot of examples.
Machine learning has close ties to statistics, in fact it’s hard to say what’s different about predictive statistics and machine learning, and these fields are very closely linked right now.
The problem I just told you about is a classification problem where we are trying to identify chairs. The way we set the problem up is that we have a training set and a test set.
We use the training set to learn a model of what a chair is. The test set are images that are not in the training set, and we want to be able to make predictions on those, as to whether or not each image is a chair.
It could be that some of the labels on the training set are noisy. That could happen. In fact, one of these labels is noisy. That’s ok, because as long as there isn’t too much noise, we should still be able to learn a model for a chair. It just won’t be able to classify perfectly, and that happens. Some prediction problems are harder than others, but that’s ok, we just do the best we can from the training data. And in terms of the size of the training data, the more the merrier. We want as much data as we can to train these models.
How do we represent an image of a chair, or a flower, or whatever, in the training set? I just zoomed in on a piece of this image over here, and you can see the pixels in the image. We can represent each pixel according to its RGB values (red, green, blue), so we get three numbers representing each pixel. So you can represent the whole image as a collection of RGB values. So the image becomes this very large vector of numbers. And in general, when doing machine learning, we need to represent each observation in the training and test sets as a vector of numbers. The label is also represented by a number. Here the number is -1 because the image is not a chair. The chairs would all get label +1.
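The pixel-to-vector idea can be sketched with NumPy (the image here is random, just to show the shapes):

```python
import numpy as np

# A tiny hypothetical 4x4 RGB "image": 4 x 4 pixels, 3 channel values each
image = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)

# Flattening turns it into one long feature vector: 4 * 4 * 3 = 48 numbers
x = image.reshape(-1)
print(x.shape)   # (48,)
```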
Here’s another example. This is a problem that comes from NYC’s power company, where they wanted to predict which manholes were going to have a fire. So we would represent each manhole as a vector, and here are the components in the vector. The first component might be, say, the number of cables in the manhole.
In general, the first step is to figure out how to represent your data as a vector. You can make the vector very large, you can include lots of factors if you like, that’s fine. Computationally things are easier if you use fewer features, but then you risk leaving out information. So there’s a tradeoff right there that you will have to worry about, and we’ll talk more about that later. But in any case, you can’t do ML if you don’t have your data represented this way, so that’s the first step.
You’d think that manholes with more cables, more recent serious events, etc. would be more prone to explosions and fires in the future. But what combination of them would give you the best predictor? How do you combine them together? You could add them all up but that might not be the best thing. You could give them all weights and add them up, but how do you know the weights? That’s what ML does for you. It tells you what combinations to use to get the best predictors.
The features are also called attributes, predictors or covariates, so you can choose whatever terminology you like.
Let’s take a simple version of the manhole example where we have only two features, say the number of cables and the number of past serious events. So each observation can be represented as a point on a 2D graph, which means I can plot the whole dataset.