2. Ammar Chalifah
Teknik Biomedis ITB 2017
@ammarchalifah
ammarchalifah.com
● Data Scientist at Mekari (Jun 2021-Present)
● Data Science Intern at Mekari (Jan 2021-Apr 2021)
● Part-time AI Researcher at Bisa AI (Aug 2020-Nov 2020)
● AI Engineer Intern at Bisa AI (Apr 2020-Aug 2020)
5. Top 15 Emerging Jobs in the US
1. Artificial Intelligence Specialist (74% annual growth)
2. Robotics Engineer
3. Data Scientist (37% annual growth)
4. Full Stack Engineer
5. Site Reliability Engineer
6. Customer Success Specialist
7. Sales Development Representative
8. Data Engineer (33% annual growth)
9. Behavioral Health Technician
10. Cybersecurity Specialist
11. Back End Developer
12. Chief Revenue Officer
13. Cloud Engineer
14. JavaScript Developer
15. Product Owner
LinkedIn. (2020). 2020 Emerging Jobs Report. https://business.linkedin.com/content/dam/me/business/en-us/talent-solutions/emerging-jobs-report/Emerging_Jobs_Report_U.S._FINAL.pdf
6. Demand for Data Science Skills
High demand, low supply: demand is growing quickly, with a big opportunity in Asia Pacific.
Between 2013 and 2015, demand for data-related skills increased by 59%, 50%, 69%, and 88% in the ICT, Media and Entertainment, Professional Services, and Financial Services industries, respectively.
However, Asia Pacific's proficiency in key data science skills is lagging behind other regions.
World Economic Forum. (2019). Data Science in the New Economy, Insight Report. http://www3.weforum.org/docs/WEF_Data_Science_In_the_New_Economy.pdf
7. What Drives Demand in Data-related Jobs?
Demand for data-related skills is growing because these skills can be used to extract value from data.
Competition: companies that use data to make data-driven decisions will win and take market share from the laggards.
Almost every industry shows growing demand for data-related skills (WEF report), and the demand is expected to keep growing over the next several years.
8. “Wait, what is data?”
It is just a collection of meaningless, raw facts.
9. DIKW Pyramid
● Data: raw facts, unprocessed, unorganized
● Information: organized, processed data; meaningful
● Knowledge: contextual; a mix of values and experiences
● Wisdom: evaluated understanding; integrated knowledge
Data is only valuable if we can extract value from it, by processing it to create information, knowledge, and wisdom.
"Yeah, ok. But this concept is too abstract. What is data? What values can we get from exploiting it? How can we extract those values?"
10. Types of Data
Data can be categorized along several axes:
● Structured vs. unstructured
● Quantitative vs. qualitative
● Discrete vs. continuous
● Nominal vs. ordinal
● Binary vs. multi-class
Data is just random facts. To get value, data must be processed. Wait, but how do we get the data that we need?
UNSW Sydney. (2020). Types of data and scales of measurement. https://studyonline.unsw.edu.au/blog/types-of-data
Allen, Richard. What are the types of big data? https://www.selecthub.com/big-data-analytics/types-of-big-data-analytics/
11. Data Science Hierarchy of Needs
Top-of-the-pyramid products (AI, deep learning, A/B testing, etc.) can only be built on top of a strong foundation.
https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
12. Analytic Ascendancy Model
Data analytics has different levels based on difficulty and value: descriptive, diagnostic, predictive, and prescriptive. These are the values that we want to get from data.
13. Recap
● Data-related skills are growing in demand, while supply is inadequate. Opportunities everywhere!
● Demand is growing because value can be extracted from data, giving the upper hand to those who believe in data-driven decision making.
● Data is just a collection of meaningless, raw facts. Data needs to be processed to yield useful information.
● Data comes in different types, which require different approaches to process.
● Data science has a hierarchy of needs. A strong foundation in the data environment is needed before value can be extracted.
● Value from data has different levels based on impact and difficulty.
For the sake of efficiency, we will jump directly to predictive analysis. Suppose we have a collected dataset; then there are three steps left before we can extract predictive value from our data: (1) exploratory data analysis; (2) feature engineering; and (3) modelling.
16. Goals of EDA
Look at the data before making any assumptions:
● Check the size, number of columns, and data types
● Understand the context
● Look at the data distribution and identify outliers
● Build a descriptive understanding (centrality, variability)
● Analyze correlations
● And many more!
19. EDA 1: Load CSV to DataFrame
Unzip the downloaded data. Open Google Colaboratory, then upload the heart.csv file to session storage. Execute the code snippet below to load the CSV file into a pandas DataFrame.
EDA tip number 1: read the relevant information from the data source (readme files, column descriptions) and display your data. You can refer to the UCI archive page for the full documentation of the data (https://archive.ics.uci.edu/ml/datasets/Heart+Disease). The df.head() line displays the first 5 rows of your data.

import pandas as pd

# Load the CSV file into a DataFrame and preview the first 5 rows
file_name = "heart.csv"
df = pd.read_csv(file_name)
df.head()
20. Observing the Table and Reading the Docs
Display the data. Look at the column names and the data types. Then browse the documentation; the heart disease docs can be found on the university's archive page: https://archive.ics.uci.edu/ml/datasets/Heart+Disease
After reading the docs and inspecting the table, you will see that this dataset has 13 feature columns and 1 target column. The objective of this predictive analysis is to predict the target value based on the feature values.
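As a quick sanity check, here is a minimal sketch (assuming the Kaggle heart.csv layout, with a label column named 'target') that lists the columns and the class balance of the label:

# List all column names; expect 13 features plus 'target'
print(df.columns.tolist())
# Count how many rows fall into each target class
print(df['target'].value_counts())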
21. EDA 2: Data Shape, Types, and Non-null Count
Next, you want to know the size of your data, the exact data type of each column, and whether there are any empty data points in your dataset. Pandas provides easy-to-use functions to do just that in a few lines of code. If you have null values, you need an extra step to impute or otherwise handle them.

# Number of rows and columns
df.shape
# Column data types and non-null counts
df.info()

Here: 303 rows, 14 columns. All your data is numerical, with no null values.
22. EDA 3: Descriptive Statistics
Next, descriptive statistics will help you understand the centrality and variability of each numerical feature.

df.describe()
23. EDA 4: Data Visualization
Freely explore the data. Use data visualization to make your exploration more intuitive.

import matplotlib.pyplot as plt

# Scatter plot of age vs. maximum heart rate (thalach), colored by target class
fig, ax = plt.subplots(figsize=(6, 6))
colors = {1: 'red', 0: 'blue'}
grouped = df.groupby('target')
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='age', y='thalach', label=key, color=colors[key])
plt.show()
24. EDA 5: Correlation Analysis
Besides visualizing our data, we also need to check the correlations between features and the target. It is possible that one feature is so heavily correlated with the target that a machine learning approach becomes unnecessary. It is also possible that several features are heavily correlated with each other, making it redundant to use those features together.

import seaborn as sns

# Heatmap of pairwise correlations between all columns
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10, 10))
sns.heatmap(df.corr(), annot=True, ax=ax)
25. EDA 6: Outlier Checking
Check for outliers. Outliers may cause biased models. A box and whisker plot makes them easy to spot.

# One box plot per column; points beyond the whiskers are potential outliers
fig, ax = plt.subplots(nrows=1, ncols=len(df.columns), figsize=(20, 5))
for i, c in enumerate(df.columns):
    sns.boxplot(data=df, y=c, ax=ax[i])
fig.tight_layout()
26. EDA 7: Linearity and Distribution Check
A linearity and distribution check gives us a better understanding of how the data points in each feature are distributed, and of each feature's skewness.

# One histogram per column to inspect each feature's distribution and skewness
fig, ax = plt.subplots(nrows=1, ncols=len(df.columns), figsize=(20, 5))
for i, c in enumerate(df.columns):
    sns.histplot(data=df, y=c, ax=ax[i])
fig.tight_layout()
28. Goals of Feature Engineering
Clean and process the data to help analysis/modelling:
● Rescale numeric values
● Clean missing values (by dropping or imputing data)
● Combine multiple features
● Decode data (categorical to numerical, numerical to ordinal, etc.)
● Handle outliers
● And many more!
29. Predictive Machine Learning Model
The EDA we have done so far only gives us descriptive and inferential statistics. To extract more value from data, the next level up in the analytic ascendancy model is predictive analysis. One popular way to do predictive analysis is the machine learning approach, i.e. letting the machine learn by providing inputs (features) and outputs (labels), with the goal of finding the underlying rules that transform inputs into outputs.
[Diagram: input (features) → ML model → output (labels)]
In our hands-on exercise, we have 13 features (inputs), and we want to know the output (whether the patient has a heart problem or not) based on those features. Most of the time, we need to process our inputs using feature engineering.
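Concretely, here is a minimal sketch of separating the features from the label and holding out a test set (the column name 'target', the 80/20 split, and the random seed are illustrative assumptions):

from sklearn.model_selection import train_test_split

# Features (the 13 input columns) and label (the 'target' column)
X = df.drop(columns=['target'])
y = df['target']
# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=17)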
30. Why?
We are lucky to have clean, non-null, all-numeric data. Sometimes you will need to analyze not-so-ideal datasets: ones with null values, extreme outliers, or nominal data (e.g. strings). We can't pump such data directly into our machine learning model, so feature engineering becomes an important part of the data science process.
Besides missing values and nominal data, sometimes we also need to process our numerical data: standardize, normalize, threshold, etc. Different machine learning models require different input characteristics.
Now we will explore several important feature engineering techniques, and later on we will apply some of them to our data.
31. FE 1: Handling Missing Values
● Drop: drop rows or columns with missing values. Easy to do, but may cause significant data loss.
● Numerical imputation: fill with another numerical value, such as 0 or the median (depends on the case).
● Categorical imputation: fill with another categorical value, such as the most frequent value or a new category (e.g. 'Others').

# Drop rows with missing values
df = df.dropna()
# Drop columns with missing values
df = df.dropna(axis=1)
# Impute with 0
df = df.fillna(0)
# Impute with each column's median
df = df.fillna(df.median())
# Impute with a new category
df = df.fillna('Others')
# Impute with the most frequent value
df['column_name'] = df['column_name'].fillna(df['column_name'].value_counts().idxmax())
32. FE 2: Handling Outliers
Outliers can be detected with a standard deviation rule or with percentiles, and handled by either dropping or capping them.

# Drop the outlier rows using the standard deviation rule
factor = 3
upper_lim = df['column'].mean() + df['column'].std() * factor
lower_lim = df['column'].mean() - df['column'].std() * factor
df = df[(df['column'] < upper_lim) & (df['column'] > lower_lim)]

# Drop the outlier rows using percentiles
upper_lim = df['column'].quantile(.95)
lower_lim = df['column'].quantile(.05)
df = df[(df['column'] < upper_lim) & (df['column'] > lower_lim)]

# Cap the outlier rows using percentiles
upper_lim = df['column'].quantile(.95)
lower_lim = df['column'].quantile(.05)
df.loc[df['column'] > upper_lim, 'column'] = upper_lim
df.loc[df['column'] < lower_lim, 'column'] = lower_lim
33. FE 3: Binning
Binning makes the model more robust by sacrificing information to create more general (or regularized) categories. It helps prevent overfitting, but at a cost in performance.
Rençberoğlu, Emre (2019). Fundamental Techniques of Feature Engineering for Machine Learning. https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114
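The slide itself shows no code, so here is a minimal sketch of binning with pandas, using a hypothetical standalone 'age' column binned into three labeled ranges:

import pandas as pd

df_bin = pd.DataFrame({'age': [12, 25, 41, 67, 80]})
# Bin the continuous ages into three coarser, more general categories
df_bin['age_bin'] = pd.cut(df_bin['age'], bins=[0, 18, 60, 120], labels=['young', 'adult', 'senior'])
print(df_bin)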
34. FE 4: One-hot Encoding
One-hot encoding encodes categorical data into multiple columns of binary numerical data.

Before:
User ID | Major
1       | Biomedical Engineering
2       | Electrical Engineering
3       | Electrical Engineering

After:
User ID | Biomedical Engineering | Electrical Engineering
1       | 1                      | 0
2       | 0                      | 1
3       | 0                      | 1
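A minimal sketch of this exact transformation with pandas' get_dummies (the DataFrame below just mirrors the example table):

import pandas as pd

users = pd.DataFrame({'User ID': [1, 2, 3],
                      'Major': ['Biomedical Engineering', 'Electrical Engineering', 'Electrical Engineering']})
# Expand the 'Major' column into one binary column per category
encoded = pd.get_dummies(users, columns=['Major'], prefix='', prefix_sep='')
print(encoded)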
35. FE 5: Scaling
Rescale numerical data. The two most popular scaling methods are min-max normalization and standardization. Min-max normalization scales all values to a range between 0 and 1. Standardization scales all values to a new distribution with mean 0 and standard deviation 1.

# Min-max normalization: (x - min) / (max - min)
df['normalized'] = (df['value'] - df['value'].min()) / (df['value'].max() - df['value'].min())

# Standardization: (x - mean) / std
df['standardized'] = (df['value'] - df['value'].mean()) / df['value'].std()
38. Generally, There Are Two Kinds of Prediction
● Regression: the predicted value is a continuous numerical value. Performance is measured by error.
● Classification: the predicted value is a categorical value. Performance is measured by accuracy.
https://www.javatpoint.com/regression-vs-classification-in-machine-learning
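To make the two performance measures concrete, here is a minimal sketch using scikit-learn metrics on made-up predictions:

from sklearn.metrics import mean_squared_error, accuracy_score

# Regression: error between continuous predictions and true values
print(mean_squared_error([3.0, -0.5, 2.0], [2.5, 0.0, 2.1]))
# Classification: fraction of correctly predicted class labels
print(accuracy_score([1, 0, 1, 1], [1, 0, 0, 1]))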
39. Which Feature Engineering Methods Suit Our Needs?
Machine learning model development is an iterative process of successive trial and error. We may end up needing to try a bunch of different feature engineering methods, but we can make an educated guess for our first trial.
● First, we don't need to process binary numerical data.
● Second, we know there are no outliers, based on the histograms in the linearity and distribution check.
● Third, several numerical columns are neither normalized nor standardized. We may need to rescale these columns (see the sketch below).
● Lastly, there are no missing values or categorical values in the data.
The choice of feature engineering depends heavily on which machine learning algorithm we'll use. So, let's jump to the last phase of this workshop: picking our machine learning model!
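As a sketch of that rescaling step, assuming the Kaggle heart.csv layout where the non-binary numerical columns are age, trestbps, chol, thalach, and oldpeak:

from sklearn.preprocessing import StandardScaler

numeric_cols = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
scaler = StandardScaler()
# Standardize the continuous columns to mean 0 and standard deviation 1
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])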
40. Trivia 101
What is the function on the graph above?
41. Regression Prediction
Regression maps input to a continuous output variable.
Main idea: given the regression function h_θ(x) = θ_1·x + θ_0, choose θ_0 and θ_1 so that h_θ(x) is close to y for our training examples (x, y).
Questions that can be answered by regression:
● How expensive is this house?
● How many tonnes of product will be delivered next month?
Examples of machine learning regression algorithms:
● Linear regression
Interestingly, an ordinal classification problem can be framed as a regression problem (for example, 3 classes with ordered severity can be treated as a regression target).
Src: Machine Learning, Andrew Ng, Stanford
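A minimal sketch of fitting that hypothesis with scikit-learn on toy data (all values below are made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training examples (x, y) roughly following y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.1, 4.9, 7.2, 8.8])
reg = LinearRegression().fit(X, y)
# The fitted theta_1 (slope) and theta_0 (intercept)
print(reg.coef_[0], reg.intercept_)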
42. Classification Prediction
Classification maps input variables to probabilities of output classes. Classification may be binary or multi-class.
Questions that can be answered by classification:
● What animal is this?
● What kind of disease is this?
Examples of machine learning classification algorithms:
● Logistic regression
● Naive Bayes classification
● k-Nearest Neighbours
● Decision Tree
● Random Forest
Interestingly, a classification algorithm can be used to solve regression problems by framing them as multi-class classification with many classes!
Src: Machine Learning, Andrew Ng, Stanford
43. Logistic Regression: Accuracy and Feature Importance

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# X_train, X_test, y_train, y_test are assumed to come from an earlier train/test split
logit = LogisticRegression(random_state=17)
logit.fit(X_train, y_train)
# Accuracy on the held-out test set
print(accuracy_score(logit.predict(X_test), y_test))

# Summarize feature importance (logistic regression coefficients)
importance = logit.coef_
for x, v in zip(X_train.columns, importance[0]):
    print('Feature: {}, Score: {:.5f}'.format(x, v))
# Plot feature importance as a bar chart
plt.bar([x for x in range(len(importance[0]))], importance[0])
plt.show()
44. So, how about our heart disease data?
It's up to you! Just run some experiments to find the optimal model. For now, let's try to frame it as a classification problem.
46. References
[1] Patil, Prasad (2018). What is Exploratory Data Analysis? https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15
[2] Rençberoğlu, Emre (2019). Fundamental Techniques of Feature Engineering for Machine Learning. https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114
47. Contributors
Ramadhita Umitaibatin
Teknik Biomedis ITB 2017
@ramadhitau
Ramadhita Umitaibatin (LinkedIn)
● Data Analyst Intern at Moving Walls (Apr-Jul 2021)
● Researcher Intern at NCIRI (Jul-Sep 2020)
● Backend Developer Intern at Bangunindo (Dec 2019-Jan 2020)
48. Thank you~
For further inquiries, please don't hesitate to contact me :)
CREDITS: This presentation template was created by Slidesgo, including icons by Flaticon, and infographics & images by Freepik.