1. DSA – 105 Introduction to
Data Science
Week 3 – Steps involved in Data Science
Ferdin Joe John Joseph, PhD
Faculty of Information Technology
Thai-Nichi Institute of Technology
2. Week 3
Agenda
• Steps involved in Data Science
3. Process in Data Science Life Cycle (DSLC)
5. Business Understanding
Use data science to answer five types of questions:
• How much or how many? (regression)
• Which category? (classification)
• Which group? (clustering)
• Is this weird? (anomaly detection)
• Which option should be taken? (recommendation)
6. Data Mining
Decide on database usage:
• Data collection strategies and processes
• Using SQL queries
• Using dataframe packages such as pandas
• Using JSON
• Using software to store and manage data
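A minimal sketch of two of the options above, SQL queries and JSON, using only the Python standard library. The table name, columns, and rows are made up for illustration; a real project would query an existing database and might load the result into pandas instead.

```python
import json
import sqlite3

# Hypothetical in-memory database standing in for a real data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE medals (country TEXT, year INTEGER, gold INTEGER)")
conn.executemany(
    "INSERT INTO medals VALUES (?, ?, ?)",
    [("A", 2000, 3), ("B", 2000, 1), ("A", 2004, 2)],  # made-up rows
)

# SQL query: total gold medals per country.
rows = conn.execute(
    "SELECT country, SUM(gold) FROM medals GROUP BY country ORDER BY country"
).fetchall()
totals = {country: total for country, total in rows}

# JSON: serialize the result for exchange, then parse it back.
payload = json.dumps(totals)
restored = json.loads(payload)

conn.close()
print(totals)
```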
7. Data Cleaning
• Also known as “data janitor” work. The most important component.
• The cleaner the data, the better the decisions.
• It typically consumes at least 50% of the time spent on the entire process.
• E.g., manage the data types of the values and convert wherever needed,
such as numerical values stored inconsistently as integers or strings.
• E.g., use a consistent format and spelling for categorical data,
such as ‘Male’ vs. ‘male’.
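The two examples above can be sketched as follows. The helper names `to_number` and `normalize_category` and the sample records are hypothetical, and treating non-numeric entries as missing (`None`) is an assumption rather than a rule.

```python
def to_number(value):
    """Convert an int, float, or numeric string to float; return None otherwise."""
    try:
        return float(str(value).strip())
    except ValueError:
        return None  # assumption: non-numeric entries are treated as missing


def normalize_category(value):
    """Give categorical values one consistent spelling (trimmed, lower case)."""
    return str(value).strip().lower()


# Made-up records mixing strings and integers, with inconsistent spelling.
raw = [
    {"age": "42", "gender": "Male"},
    {"age": 17, "gender": " male"},
    {"age": "n/a", "gender": "FEMALE"},
]

clean = [
    {"age": to_number(r["age"]), "gender": normalize_category(r["gender"])}
    for r in raw
]
print(clean)
```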
8. Data Exploration
• Brainstorm what to do with the ‘cleaned’ data
• Understand the biases and patterns in the data
• Analyze a random subset of the data and visualize it
• Look for anomalies and outliers in the data’s patterns
• Form hypotheses about the data and the problem to guide how the
solution should be built
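A small sketch of two of the steps above: drawing a random subset and looking for outliers. The data values are made up, and the 1.5 × IQR rule used here is one common convention, not the only way to flag outliers.

```python
import random
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up measurements with one suspicious value.
data = [10, 12, 11, 13, 12, 11, 10, 95, 12, 13]

random.seed(0)                       # fixed seed so the subset is repeatable
subset = random.sample(data, k=5)    # random subset for a quick look

summary = {"mean": statistics.mean(data), "median": statistics.median(data)}
outliers = iqr_outliers(data)
print(summary, outliers)
```

A large gap between the mean and the median, as here, is itself a hint that outliers are present.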
9. Feature Engineering
• A feature is a measurable property or attribute of a phenomenon
being observed.
• Feature engineering is the process of using domain knowledge to
transform your raw data into informative features that represent the
business problem you are trying to solve.
• There are two tasks in feature engineering:
• Feature Selection
• Feature Construction
10. Feature Selection
• Feature selection is the process of cutting down the features that add
more noise than information.
• This reduces the complexity that comes with high-dimensional feature spaces.
• There are three families of methods:
• Filter methods (apply a statistical measure to assign a score to each feature)
• Wrapper methods (frame the selection of features as a search problem and
use a heuristic to perform the search)
• Embedded methods (perform selection as part of model training, so the
learning algorithm itself determines which features contribute most to accuracy)
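As an illustration of a filter method, the sketch below scores each feature by the absolute Pearson correlation with the target and keeps the top-scoring ones. The function names, feature names, and data are hypothetical, and Pearson correlation is only one of several statistical measures that could be used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def filter_select(features, target, top_k):
    """Score every feature against the target and keep the top_k names."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores


# Made-up feature columns: one informative, one noisy, one constant.
features = {
    "signal": [1, 2, 3, 4, 5, 6],
    "noise": [3, 1, 4, 1, 5, 2],
    "constant": [7, 7, 7, 7, 7, 7],
}
target = [2, 4, 6, 8, 10, 12]

selected, scores = filter_select(features, target, top_k=1)
print(selected, scores)
```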
11. Feature Construction
• Involves creating new features from the ones that are already available.
• For example, if you have a feature for age, but your model only cares
about if a person is an adult or minor, you could threshold it at 18,
and assign different categories to instances above and below that
threshold.
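The age example can be sketched in a few lines. The category labels and the choice to count exactly 18 as an adult are assumptions made for illustration.

```python
ADULT_AGE = 18  # threshold from the example above


def age_group(age):
    """Construct a categorical feature from a numeric age."""
    # Assumption: an age of exactly 18 is counted as adult.
    return "adult" if age >= ADULT_AGE else "minor"


ages = [12, 18, 35, 17]            # made-up values of the original feature
groups = [age_group(a) for a in ages]
print(groups)
```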
12. Predictive Modeling
• Predictive modeling is where the machine learning finally comes into
your data science project.
• Based on the questions you asked in the business understanding
stage, this is where you decide which model to pick for your problem.
• The model that you end up training will be dependent on the size,
type and quality of your data, how much time and computational
resources you are willing to invest, and the type of output you intend
to derive.
• The trained model needs to be evaluated for its accuracy using validation
techniques such as k-fold cross-validation.
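A minimal sketch of k-fold cross-validation, assuming a toy 1-nearest-neighbour classifier on one-dimensional, made-up data. The function names are hypothetical; in practice a library routine would normally handle the splitting.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def predict_1nn(train, x):
    """Predict the label of the training point closest to x."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


def cross_validate(data, k, seed=0):
    """Hold out each fold once, train on the rest, and average the accuracy."""
    accuracies = []
    for fold in k_fold_indices(len(data), k, seed):
        held_out = set(fold)
        train = [data[i] for i in range(len(data)) if i not in held_out]
        test = [data[i] for i in fold]
        correct = sum(predict_1nn(train, x) == y for x, y in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k, accuracies


# Made-up, well-separated data: (feature, label) pairs.
data = [(x, "low") for x in range(0, 10)] + [(x, "high") for x in range(20, 30)]

mean_acc, fold_acc = cross_validate(data, k=5)
print(mean_acc, fold_acc)
```

Every observation is used for testing exactly once, so the averaged accuracy is less dependent on one lucky or unlucky split.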
13. Predictive Modeling
• The percentage of correctly classified instances measures the accuracy of
a classification model.
• ROC curves plot the true positive rate against the false positive rate
across classification thresholds.
• The coefficient of determination (R²), mean squared error (MSE), and mean
absolute error (MAE) measure how well a regression model fits.
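The measures above can be written out directly. This is a plain-Python sketch on made-up labels and values; a full ROC curve is traced by repeating `roc_point` for many decision thresholds, whereas a single set of predictions gives only one point.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_point(y_true, y_pred, positive=1):
    """One (false positive rate, true positive rate) point of a ROC curve."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return fp / (fp + tn), tp / (tp + fn)


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up classification labels and regression values.
y_true, y_pred = [1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1]
actual, predicted = [3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0]

print(accuracy(y_true, y_pred), roc_point(y_true, y_pred))
print(mse(actual, predicted), mae(actual, predicted), r_squared(actual, predicted))
```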
14. Data Visualisation
• Combines the fields of communication, psychology, statistics, and art.
• Communicates the data in a simple yet effective and visually pleasing
way.
• Jupyter notebooks can use many Python packages for visualization, e.g.
Matplotlib.
• Drag-and-drop tools such as Tableau and Plotly are also available.
15. Goals of Data Science Process
• The goal of this process is to continue to move a data science project
forward towards a clear engagement end point.
• We recognize that data science is a research activity and that progress
often entails an approach that moves two steps forward and one step
(or worse) backwards.
• Being able to clearly communicate this to customers can help avoid
misunderstanding and frustration for all parties involved, and increase
the odds of success.
16. Activity
• Perform the data science process on the Olympic medal tally for events
after World War II.
17. Tools and Technologies in Data Science